From: Georgy Kirichenko
Subject: Re: [tarantool-patches] [PATCH] Relay logs from wal thread
Date: Thu, 31 Jan 2019 10:15:49 +0300
To: Vladimir Davydov
Cc: tarantool-patches@freelists.org

On Tuesday, January 29, 2019 1:44:16 PM MSK Vladimir Davydov wrote:
> On Tue, Jan 29, 2019 at 12:49:53PM +0300, Georgy Kirichenko wrote:
> > On Tuesday, January 29, 2019 11:53:18 AM MSK Vladimir Davydov wrote:
> > > On Tue, Jan 22, 2019 at 05:18:36PM +0300, Georgy Kirichenko wrote:
> > > > Use the wal thread to relay rows. Relay writer and reader fibers
> > > > live in the wal thread. The writer fiber uses wal memory to send
> > > > rows to the peer when possible, or spawns cords to recover rows
> > > > from a file.
> > > >
> > > > The wal writer stores rows into two memory buffers, swapping them
> > > > at a buffer threshold. Memory buffers are split into chunks with
> > > > the same server id, issued at a chunk threshold. In order to
> > > > search for a position in wal memory, all chunks are indexed by
> > > > the wal_mem_index array. Each wal_mem_index entry contains the
> > > > corresponding replica_id and the vclock just before the first row
> > > > in the chunk, as well as the chunk's buffer number, position and
> > > > size in memory.
> > > >
> > > > Closes: #3794
> > >
> > > It's too early to comment on the code.
> > > Here are some general issues that
> > > I think need to be addressed first:
> > >  - Moving relay code to WAL looks ugly as it turns the code into a
> > >    messy bundle. Let's try to hide relay behind an interface.
> > >    Currently, we only need one method - write() - but in future we
> > >    might need more, for instance prepare() or commit(). Maybe we
> > >    could reuse the journal struct for this?
> >
> > It was done because WAL should control relays' ACKs in the case of
> > synchronous replication. So, in the likely case, forwarding wals and
> > reading ACKs should be done in the wal thread.
>
> I don't deny that.
>
> > And if I left this code in the relay then wal-relay communication
> > would be too complicated (exposing internal structures like sockets,
> > memory buffers, vclocks, etc.).
>
> Not necessarily. You could hide it all behind a vtab. For now, the vtab
> would be trivial - we only need a write() virtual method (ACKs can be
> received by a fiber started by this method when the relay switches to
> sync). Later on, we could extend it with methods that would allow WAL
> to wait for remote commit.

I do not think wal should wait for remote commit, but the relay should
inform wal about its state. For example, wal should track the state of a
relay even if the relay is reading from a file in a different cord (and
might exit at any time). So we would also add some v-functions to request
memory chunks, promote the in-wal relay vclock, or attach to sync mode as
you suggested.

>
> > Also I planned some future changes to move all wal-processing logic
> > into wal.c too.
>
> Please don't. We try to split the code into modules so that it's easier
> to understand. Mixing relay/recovery/WAL in one module doesn't sound
> like a good idea to me.
>
> > >  - The memory index looks confusing to me. Why would we need it at
> > >    all?
> > >    AFAIU it is only used to quickly position the relay after
> > >    switching to the memory buffer, but it's not a big deal to
> > >    linearly scan the buffer instead when that happens - it's a rare
> > >    event and scanning memory should be fast enough. At least, we
> > >    should start with that, because it would be clear and simple,
> > >    and build any indexing later in separate patches provided we
> > >    realize that we really need it, which I doubt.
> >
> > It allows us not to decode all logs as many times as we have relays
> > connected.
>
> I don't understand. When the relay is in sync, we should be simply
> writing a stream of rows to a socket. No decoding or positioning should
> be necessary. Am I missing something?

If the relay blocks for some time then you stop writing, and later resume
reading from the position the relay had reached before; this position can
be identified only by a vclock, because the memory buffer might be
reallocated at any time while the relay sleeps.

Also, let's imagine that switching a relay to wal takes time dT, during
which wal writes N rows. So, if you do not store logs in memory, you
would not be able to switch this relay to sync mode. Hence we should keep
some rows backed up in memory and allow relays to navigate in it.

Another reason why I rejected direct writing to the relay: in that case
wal would be in charge of tracking the relays' state, which looks ugly to
me.

>
> > Also it eliminates following the corresponding vclock for each row.
>
> But you know the vclock of each row in WAL and you can pass it to the
> relay. No reason to decode anything AFAIU.

I know it only when I write an entry to wal. If the relay yielded for
some time, the only possibility is to store the memory buffer's starting
vclock and then decode each row one-by-one to follow the vclock.

>
> > >  - We shouldn't stop/start the relay thread when switching to/from
> > >    the memory source.
> > >    Instead, on SUBSCRIBE we should always start a relay thread, as
> > >    we do now, and implement some machinery to suspend/resume it and
> > >    switch to memory/disk when it falls in/out of the memory buffer.
> >
> > In the good case (we do not have any outdated replicas) we do not
> > even give the relay thread a chance to work.
>
> Starting/stopping the relay thread on every lag looks just ugly. We
> have to start it anyway when a replica subscribes (because it's highly
> unlikely to fall into the in-memory buffer). I don't see any reason to
> stop it once we are in sync - we can switch to async anytime.

The relay cord is not started on every lag, but only if the relay falls
out of the wal memory buffer. The buffer should be large enough to
compensate for all hardware fluctuations, and this guarantees that any
replica able to process all rows from the master will live in wal memory
without spawning threads.

>
> > > All that machinery should live in relay.cc.
> >
> > This leads to too complicated dependencies between wal and relay. At
> > least wal would have to expose its internal structures in a header
> > file.
>
> How's that? WAL should only know about the write() method. The relay
> registers the method in WAL. Everything else is done by the relay,
> including switching between sync and async mode.

The write method is not enough at all.

>
> > >  - When in sync mode, WAL should write directly to the relay
> > >    socket, without using any intermediate buffer. A buffer should
> > >    only be used if the socket is blocked or the relay is running in
> > >    asynchronous mode. There's no need for extra memory copying
> > >    otherwise.
> >
> > Please explain why. I do not think this can improve anything. Also I
> > think
>
> That would eliminate extra memory copying in the best case scenario.

Let me explain - we will form in-memory data that will be used both for
writes to file and as a relay source.

>
> > that this memory buffer should be used for writing files too.
> > In any case we should have a memory buffer to store wals because you
> > do not know whether the relay will block or not. But the biggest
> > issue is to transfer state from
>
> As I suggested below, let's try to maintain a buffer per each relay,
> not a single buffer in WAL. Then we wouldn't need to write anything to
> memory buffers if all relays are in sync and no socket is blocked. This
> is the best case scenario, but we should converge to it pretty quickly
> provided the network doesn't lag much. This would also allow us to hide
> this logic in relay.cc.

Above I described why it is not possible not to have a wal memory buffer:
under constant load a relay would not be able to attach in sync mode as
you suggested. So we have to maintain the wal memory buffer.

There are some situations in which your model would not be functional.
For example, if wal writes a lot (more than the tcp write buffer within
some time slot), each call will block the socket and push the relay out
of sync.

And if we maintain per-relay buffers, then we have to copy the data as
many times as we have relays, even for file-reading relays, to give them
a chance to catch up with the wal.

>
> > a just-blocked sync relay to a detached memory reader.
> >
> > >  - Maintaining a single memory buffer in WAL would complicate the
> > >    relay sync interface implementation. Maybe we could use a
> > >    separate buffer per each relay? That would be less efficient,
> > >    obviously, because it would mean extra memory copying and usage,
> > >    but it would simplify the code a great deal. Besides, we would
> > >    need to use the buffers only for async relays - once a relay is
> > >    switched to sync mode, we could write to its socket directly. So
> > >    I think we should at least consider this option.
> > >
> > >  - You removed the code that tried to advance GC consumers as
> > >    rarely as possible, namely only when a WAL file is closed.
> > >    I'd try to save it somehow, because without it every ACK would
> > >    result in GC going to WAL if the relay is running in async mode
> > >    (because only WAL knows about WAL files; for others it's just a
> > >    continuous stream). When a relay is running in sync mode,
> > >    advancing GC on each ACK seems to be OK, because WAL won't be
> > >    invoked then (as we never remove WAL files created after the
> > >    last checkpoint).
> >
> > I think we should move wal gc into the wal thread.
>
> You can try, but until then I think we should try to leave that code
> intact.
>
> > >  - I really don't like code duplication. I think we should reuse
> > >    the code used for sending rows and receiving ACKs between sync
> > >    and async modes. Hiding the sync relay implementation behind an
> > >    interface would allow us to do that.
> > >
> > >  - I don't see how this patch depends on #1025 (sending rows in
> > >    batches). Let's implement relaying from memory first and do
> > >    #1025 later, when we see the full picture.
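To make the vtab idea discussed above concrete, here is a minimal sketch
of what "WAL only knows about the write() method" could look like. All
names (relay_vtab, mem_relay_write, wal_relay_row) are hypothetical
illustrations, not actual Tarantool code: WAL calls through a function
pointer, and everything else (ACK handling, sync/async switching) stays
hidden inside the relay implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct relay;

/*
 * The virtual table WAL sees. For now it is trivial - a single
 * write() method - but it could later grow prepare()/commit()
 * methods as the review suggests.
 */
struct relay_vtab {
	/* Called by WAL for every written row; returns 0 on success. */
	int (*write)(struct relay *relay, const char *row, size_t len);
};

struct relay {
	const struct relay_vtab *vtab;
	/* Implementation-private state would follow; a toy field here. */
	char last_row[128];
	size_t last_len;
};

/* A toy implementation that just remembers the last row "sent". */
static int
mem_relay_write(struct relay *relay, const char *row, size_t len)
{
	if (len >= sizeof(relay->last_row))
		return -1;
	memcpy(relay->last_row, row, len);
	relay->last_len = len;
	return 0;
}

static const struct relay_vtab mem_relay_vtab = { mem_relay_write };

/* The only call WAL-side code needs - no relay internals exposed. */
static int
wal_relay_row(struct relay *relay, const char *row, size_t len)
{
	return relay->vtab->write(relay, row, len);
}
```

With such an indirection the relay registers itself in WAL once on
SUBSCRIBE, and WAL never has to include a relay header or track relay
state directly.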
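The chunk index from the patch's commit message can also be illustrated
with a toy model. This sketch simplifies the per-chunk vclock down to a
single lsn (the real patch stores a full vclock plus replica_id); the
names and layout are assumptions for illustration only. It shows the
lookup a resuming relay would do: find the chunk containing the position
it had reached, or report that it fell behind the buffer and must switch
to reading from file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of one wal_mem_index entry: the lsn just before the
 * chunk's first row, plus where the chunk lives in the two swap
 * buffers.
 */
struct wal_mem_index_entry {
	int64_t lsn;  /* lsn just before the chunk's first row */
	int buf_no;   /* which of the two swap buffers holds the chunk */
	size_t pos;   /* offset of the chunk within that buffer */
	size_t size;  /* chunk size in bytes */
};

/*
 * Return the index of the last chunk whose starting lsn is <= the
 * relay's lsn, i.e. the chunk the relay should continue decoding
 * from. Return -1 if the relay fell behind the in-memory buffer.
 */
static int
wal_mem_index_find(const struct wal_mem_index_entry *index, int count,
		   int64_t relay_lsn)
{
	if (count == 0 || relay_lsn < index[0].lsn)
		return -1;
	int found = 0;
	for (int i = 1; i < count; i++) {
		if (index[i].lsn <= relay_lsn)
			found = i;
		else
			break;
	}
	return found;
}
```

This is also the operation the reviewer proposes replacing with a plain
linear scan of the buffer itself: either way the cost is bounded by the
buffer size, which is why the index is argued to be optional at first.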