From: Georgy Kirichenko
Subject: Re: [tarantool-patches] [PATCH] Relay logs from wal thread
Date: Thu, 31 Jan 2019 10:15:49 +0300
To: Vladimir Davydov
Cc: tarantool-patches@freelists.org

On Tuesday, January 29, 2019 1:44:16 PM MSK Vladimir Davydov wrote:
> On Tue, Jan 29, 2019 at 12:49:53PM +0300, Georgy Kirichenko wrote:
> > On Tuesday, January 29, 2019 11:53:18 AM MSK Vladimir Davydov wrote:
> > > On Tue, Jan 22, 2019 at 05:18:36PM +0300, Georgy Kirichenko wrote:
> > > > Use the wal thread to relay rows. Relay writer and reader fibers
> > > > live in the wal thread. The writer fiber uses wal memory to send
> > > > rows to the peer when possible, or spawns cords to recover rows
> > > > from a file.
> > > >
> > > > The wal writer stores rows into two memory buffers, swapping them
> > > > at a buffer threshold. Memory buffers are split into chunks with
> > > > the same server id, issued at a chunk threshold. In order to
> > > > search for a position in wal memory, all chunks are indexed by
> > > > the wal_mem_index array. Each wal_mem_index entry contains the
> > > > corresponding replica_id and the vclock just before the first row
> > > > in the chunk, as well as the chunk's buffer number, position and
> > > > size in memory.
> > > >
> > > > Closes: #3794
> > >
> > > It's too early to comment on the code.
> > > Here are some general issues that
> > > I think need to be addressed first:
> > >  - Moving relay code to WAL looks ugly as it turns the code into a
> > >    messy bundle. Let's try to hide relay behind an interface.
> > >    Currently, we only need one method - write() - but in future we
> > >    might need more, for instance prepare() or commit(). Maybe we
> > >    could reuse the journal struct for this?
> >
> > It was done because WAL should control relays' ACKs in the case of
> > synchronous replication. So, in the likely case, forwarding wals and
> > reading ACKs should be done in the wal thread.
>
> I don't deny that.
>
> > And if I left this code in the relay then wal-relay communication
> > would be too complicated (exposing internal structures like sockets,
> > memory buffers, vclocks, etc.).
>
> Not necessarily. You could hide it all behind a vtab. For now, the vtab
> would be trivial - we only need a write() virtual method (ACKs can be
> received by a fiber started by this method when the relay switches to
> sync). Later on, we could extend it with methods that would allow WAL
> to wait for remote commit.

I do not think wal should wait for remote commit, but the relay should
inform wal about its state. For example, wal should track the state of a
relay even if the relay is reading from a file in a different cord (and
might exit at any time). So we would also add some v-functions to request
memory chunks, promote the in-wal relay vclock, or attach to sync mode as
you suggested.

>
> > Also I planned some future changes to move all wal-processing logic
> > into wal.c too.
>
> Please don't. We try to split the code into modules so that it's easier
> to understand. Mixing relay/recovery/WAL in one module doesn't sound
> like a good idea to me.
>
> > >  - The memory index looks confusing to me. Why would we need it at
> > >    all?
> > >    AFAIU it is only used to quickly position the relay after
> > >    switching to the memory buffer, but it's not a big deal to
> > >    linearly scan the buffer instead when that happens - it's a rare
> > >    event and scanning memory should be fast enough. At least, we
> > >    should start with that, because it would be clear and simple,
> > >    and build any indexing later in separate patches provided we
> > >    realize that we really need it, which I doubt.
> >
> > It allows us not to decode all logs as many times as we have relays
> > connected.
>
> I don't understand. When the relay is in sync, we should be simply
> writing a stream of rows to a socket. No decoding or positioning should
> be necessary. Am I missing something?

If the relay blocks for some time then you stop writing, and later resume
reading from the position the relay had reached before; this position can
be identified only by a vclock, because the memory buffer might be
reallocated at any time while the relay sleeps.

Also, let's imagine that switching a relay to wal takes time dT, during
which wal writes N rows. So, if you do not store logs in memory, you
would not be able to switch this relay to sync mode. Hence we should keep
some rows backed up in memory and allow relays to navigate in it.

Another reason why I rejected direct writing to the relay: in that case
wal would be in charge of tracking the relays' state, which looks ugly to
me.

>
> > Also it eliminates following the corresponding vclock for each row.
>
> But you know the vclock of each row in WAL and you can pass it to the
> relay. No reason to decode anything AFAIU.

I know it only when I write an entry to wal. If the relay yielded for
some time, the only possibility is to store the memory buffer's starting
vclock and then decode each row one-by-one to follow the vclock.

>
> > >  - We shouldn't stop/start the relay thread when switching to/from
> > >    the memory source.
> > >    Instead, on SUBSCRIBE we should always start a relay thread, as
> > >    we do now, and implement some machinery to suspend/resume it and
> > >    switch to memory/disk when it falls in/out of the memory buffer.
> >
> > In the good case (we do not have any outdated replicas) we do not
> > even give the relay thread a chance to work.
>
> Starting/stopping the relay thread on every lag looks just ugly. We
> have to start it anyway when a replica subscribes (because it's highly
> unlikely to fall into the in-memory buffer). I don't see any reason to
> stop it once we are in sync - we can switch to async anytime.

The relay cord is not started on every lag, but only if the relay falls
out of the wal memory buffer. The buffer should be large enough to
compensate for all hardware fluctuations, and this guarantees that any
replica able to process all rows from the master will live in wal memory
without spawning threads.

>
> > > All that machinery should live in relay.cc.
> >
> > This leads to too complicated dependencies between wal and relay. At
> > least wal would have to expose its internal structures in a header
> > file.
>
> How's that? WAL should only know about the write() method. The relay
> registers the method in WAL. Everything else is done by the relay,
> including switching between sync and async mode.

The write method is not enough at all.

>
> > >  - When in sync mode, WAL should write directly to the relay
> > >    socket, without using any intermediate buffer. A buffer should
> > >    only be used if the socket is blocked or the relay is running in
> > >    asynchronous mode. There's no need for extra memory copying
> > >    otherwise.
> >
> > Please explain why. I do not think this can improve anything. Also I
> > think
>
> That would eliminate extra memory copying in the best case scenario.

Let me explain - we will form in-memory data that will be used both for
writes to file and as a relay source.

>
> > that this memory buffer should be used for writing files too.
> > In any case we should have a memory buffer to store wals because you
> > do not know whether the relay will block or not. But the biggest
> > issue is to transfer state from
>
> As I suggested below, let's try to maintain a buffer per each relay,
> not a single buffer in WAL. Then we wouldn't need to write anything to
> memory buffers if all relays are in sync and no socket is blocked. This
> is the best case scenario, but we should converge to it pretty quickly
> provided the network doesn't lag much. This would also allow us to hide
> this logic in relay.cc.

Above I described why it is not possible not to have a wal memory buffer:
under constant load a relay would not be able to attach in sync mode as
you suggested. So we have to maintain the wal memory buffer.

There are some situations in which your model would not be functional.
For example, if wal writes a lot (more than the tcp write buffer within
some time slot), each call will block the socket and push the relay out
of sync.

And if we maintain per-relay buffers, then we have to copy the data as
many times as we have relays, even for file-reading relays, to give them
a chance to catch up with the wal.

>
> > a just-blocked sync relay to a detached memory reader.
> >
> > >  - Maintaining a single memory buffer in WAL would complicate the
> > >    relay sync interface implementation. Maybe we could use a
> > >    separate buffer per each relay? That would be less efficient,
> > >    obviously, because it would mean extra memory copying and usage,
> > >    but it would simplify the code a great deal. Besides, we would
> > >    need to use the buffers only for async relays - once a relay is
> > >    switched to sync mode, we could write to its socket directly. So
> > >    I think we should at least consider this option.
> > >
> > >  - You removed the code that tried to advance GC consumers as
> > >    rarely as possible, namely only when a WAL file is closed.
> > >    I'd try to save it somehow, because without it every ACK would
> > >    result in GC going to WAL if the relay is running in async mode
> > >    (because only WAL knows about WAL files; for others it's just a
> > >    continuous stream). When a relay is running in sync mode,
> > >    advancing GC on each ACK seems to be OK, because WAL won't be
> > >    invoked then (as we never remove WAL files created after the
> > >    last checkpoint).
> >
> > I think we should move wal gc into the wal thread.
>
> You can try, but until then I think we should try to leave that code
> intact.
>
> > >  - I really don't like code duplication. I think we should reuse
> > >    the code used for sending rows and receiving ACKs between sync
> > >    and async modes. Hiding the sync relay implementation behind an
> > >    interface would allow us to do that.
> > >
> > >  - I don't see how this patch depends on #1025 (sending rows in
> > >    batches). Let's implement relaying from memory first and do
> > >    #1025 later, when we see the full picture.
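To make the vtab idea discussed above concrete, here is a minimal sketch
of what "WAL only knows about the write() method" could look like. All
names (relay_vtab, mem_relay_write, wal_relay_row) are hypothetical
illustrations, not actual Tarantool code: WAL calls through a function
pointer, and everything else (ACK handling, sync/async switching) stays
hidden inside the relay implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct relay;

/*
 * The virtual table WAL sees. For now it is trivial - a single
 * write() method - but it could later grow prepare()/commit()
 * methods as the review suggests.
 */
struct relay_vtab {
	/* Called by WAL for every written row; returns 0 on success. */
	int (*write)(struct relay *relay, const char *row, size_t len);
};

struct relay {
	const struct relay_vtab *vtab;
	/* Implementation-private state would follow; a toy field here. */
	char last_row[128];
	size_t last_len;
};

/* A toy implementation that just remembers the last row "sent". */
static int
mem_relay_write(struct relay *relay, const char *row, size_t len)
{
	if (len >= sizeof(relay->last_row))
		return -1;
	memcpy(relay->last_row, row, len);
	relay->last_len = len;
	return 0;
}

static const struct relay_vtab mem_relay_vtab = { mem_relay_write };

/* The only call WAL-side code needs - no relay internals exposed. */
static int
wal_relay_row(struct relay *relay, const char *row, size_t len)
{
	return relay->vtab->write(relay, row, len);
}
```

With such an indirection the relay registers itself in WAL once on
SUBSCRIBE, and WAL never has to include a relay header or track relay
state directly.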
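The chunk index from the patch's commit message can also be illustrated
with a toy model. This sketch simplifies the per-chunk vclock down to a
single lsn (the real patch stores a full vclock plus replica_id); the
names and layout are assumptions for illustration only. It shows the
lookup a resuming relay would do: find the chunk containing the position
it had reached, or report that it fell behind the buffer and must switch
to reading from file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of one wal_mem_index entry: the lsn just before the
 * chunk's first row, plus where the chunk lives in the two swap
 * buffers.
 */
struct wal_mem_index_entry {
	int64_t lsn;  /* lsn just before the chunk's first row */
	int buf_no;   /* which of the two swap buffers holds the chunk */
	size_t pos;   /* offset of the chunk within that buffer */
	size_t size;  /* chunk size in bytes */
};

/*
 * Return the index of the last chunk whose starting lsn is <= the
 * relay's lsn, i.e. the chunk the relay should continue decoding
 * from. Return -1 if the relay fell behind the in-memory buffer.
 */
static int
wal_mem_index_find(const struct wal_mem_index_entry *index, int count,
		   int64_t relay_lsn)
{
	if (count == 0 || relay_lsn < index[0].lsn)
		return -1;
	int found = 0;
	for (int i = 1; i < count; i++) {
		if (index[i].lsn <= relay_lsn)
			found = i;
		else
			break;
	}
	return found;
}
```

This is also the operation the reviewer proposes replacing with a plain
linear scan of the buffer itself: either way the cost is bounded by the
buffer size, which is why the index is argued to be optional at first.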