From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
Date: Tue, 29 Jan 2019 13:44:16 +0300
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: Re: [tarantool-patches] [PATCH] Relay logs from wal thread
Message-ID: <20190129104416.scbw5h2zcgoydvyq@esperanza>
References: <ab1c9c4f67a93cc666417cab63e00886b8749d81.1548166552.git.georgy@tarantool.org>
 <20190129085318.fggigft4p5pqb5kq@esperanza>
 <3427357.uHWZoEdiqX@home.lan>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <3427357.uHWZoEdiqX@home.lan>
To: =?utf-8?B?0JPQtdC+0YDQs9C40Lkg0JrQuNGA0LjRh9C10L3QutC+?= <georgy@tarantool.org>
Cc: tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

On Tue, Jan 29, 2019 at 12:49:53PM +0300, Георгий Кириченко wrote:
> On Tuesday, January 29, 2019 11:53:18 AM MSK Vladimir Davydov wrote:
> > On Tue, Jan 22, 2019 at 05:18:36PM +0300, Georgy Kirichenko wrote:
> > > Use wal thread to relay rows. Relay writer and reader fibers live
> > > in a wal thread. Writer fiber uses wal memory to send rows to the
> > > peer when it possible or spawns cords to recover rows from a file.
> > > 
> > > Wal writer stores rows into two memory buffers swapping by buffer
> > > threshold. Memory buffers are splitted into chunks with the
> > > same server id issued by chunk threshold. In order to search
> > > position in wal memory all chunks indexed by wal_mem_index
> > > array. Each wal_mem_index contains corresponding replica_id and
> > > vclock just before first row in the chunk as well as chunk buffer
> > > number, position and size of the chunk in memory.
> > > 
> > > Closes: #3794
> > 
> > It's too early to comment on the code. Here are some general issues that
> > I think need to be addressed first:
> > 
> >  - Moving relay code to WAL looks ugly as it turns the code into a messy
> >    bundle. Let's try to hide relay behind an interface. Currently, we
> >    only need one method - write() - but in future we might need more,
> >    for instance prepare() or commit(). May be, we could reuse journal
> >    struct for this?
> It was done because WAL should control relays' ACK in case of synchronous 
> replication. So, in likely case forwarding wals and reading ACK should be done
> in wal thread.

I don't deny that.

> And if I left this code in a relay then wal-relay communication 
> would be to complicated (expose internal structures like sockets, memory 
> buffers, vclock and etc).

Not necessarily. You could hide it all behind a vtab. For now, vtab
would be trivial - we only need write() virtual method (ACKs can be
received by a fiber started by this method when relay switches to sync).
Later on, we could extend it with methods that would allow WAL to wait
for remote commit.

> Also I planned some future changes to move all wal-
> processing logic into a wal.c to.

Please don't. We try to split the code into modules so that it's easier
for understanding. Mixing relay/recovery/WAL in one module doesn't sound
like a good idea to me.

> > 
> >  - Memory index looks confusing to me. Why would we need it at all?
> >    AFAIU it is only used to quickly position relay after switching to
> >    memory buffer, but it's not a big deal to linearly scan the buffer
> >    instead when that happens - it's a rare event and scanning memory
> >    should be fast enough. At least, we should start with it, because
> >    that would be clear and simple, and build any indexing later in
> >    separate patches provided we realize that we really need it, which I
> >    doubt.
> It allows no to decode all logs as many times as we have relays connected to. 

I don't understand. When relay is in sync, we should be simply writing a
stream of rows to a socket. No decoding or positioning should be
necessary. Am I missing something?

> Also it eliminates following of corresponding vclocks for each row.

But you know the vclock of each row in WAL and you can pass it to relay.
No reason to decode anything AFAIU.

> > 
> >  - We shouldn't stop/start relay thread when switching to/from memory
> >    source. Instead on SUBSCRIBE we should always start a relay thread,
> >    as we do now, and implement some machinery to suspend/resume it and
> >    switch to memory/disk when it falls in/out the memory buffer.
> In a good case (we do not have any outdated replicas) we do not even give any 
> chance for relay thread to work.

Starting/stopping relay thread on every lag looks just ugly. We have to
start it anyway when a replica subscribes (because it's highly unlikely
to fall in the in-memory buffer). I don't see any reason to stop it once
we are in sync - we can switch to async anytime.

> >    All that machinery should live in relay.cc.
> This leads to to complicated dependencies between wal and relay. AT least wal 
> should expose it internal structures into a header file.

How's that? WAL should only know about write() method. Relay registers
the method in WAL. Everything else is done by relay, including switching
between sync and async mode.

> 
> > 
> >  - When in sync mode, WAL should write directly to the relay socket,
> >    without using any intermediate buffer. Buffer should only be used if
> >    the socket is blocked or the relay is running in asynchronous mode.
> >    There's no need in extra memory copying otherwise.
> Please explain why. I do not hing this can improve something. Also i think 

That would eliminate extra memory copying in the best case scenario.

> that this memory buffer should be used for writing files to.
> In any case we should have a memory buffer to store wals because you do not 
> know will relay block or not. But the biggest issue is to transfer state from 

As I suggested below, let's try to maintain a buffer per each relay, not
a single buffer in WAL. Then we wouldn't need to write anything to
memory buffers if all relays are in sync and no socket is blocked. This
is the best case scenario, but we should converge to it pretty quickly
provided the network doesn't lag much. This would also allow us to hide
this logic in relay.cc.

> just blocked sync relay to detached from memory reader.
> > 
> >  - Maintaining a single memory buffer in WAL would complicate relay sync
> >    interface implementation. May be, we could use a separate buffer per
> >    each relay? That would be less efficient, obviously, because it would
> >    mean extra memory copying and usage, but that would simplify the code
> >    a great deal. Besides, we would need to use the buffers only for
> >    async relays - once a relay is switched to sync mode, we could write
> >    to its socket directly. So I think we should at least consider this
> >    option.
> > 
> >  - You removed the code that tried to advance GC consumers as rarely as
> >    possible, namely only when a WAL file is closed. I'd try to save it
> >    somehow, because without it every ACK would result in GC going to WAL
> >    if relay is running in async mode (because only WAL knows about WAL
> >    files; for others it's just a continuous stream). When a relay is
> >    running in sync mode, advancing GC on each ACK seems to be OK,
> >    because WAL won't be invoked then (as we never remove WAL files
> >    created after the last checkpoint).
> I think, we should move wal gc into a wal thread.

You can try, but until then I think we should try to leave that code
intact.

> > 
> >  - I really don't like code duplication. I think we should reuse the
> >    code used for sending rows and receiving ACKs between sync and sync
> >    modes. Hiding sync relay implementation behind an interface would
> >    allow us to do that.
> > 
> >  - I don't see how this patch depends on #1025 (sending rows in
> >    batches). Let's implement relaying from memory first and do #1025
> >    later, when we see the full picture.
>