[tarantool-patches] [PATCH] Relay logs from wal thread

Thu Jan 31 11:01:50 MSK 2019

On Thu, Jan 31, 2019 at 10:15:49AM +0300, Георгий Кириченко wrote:
> On Tuesday, January 29, 2019 1:44:16 PM MSK Vladimir Davydov wrote:
> > On Tue, Jan 29, 2019 at 12:49:53PM +0300, Георгий Кириченко wrote:
> > > On Tuesday, January 29, 2019 11:53:18 AM MSK Vladimir Davydov wrote:
> > > > On Tue, Jan 22, 2019 at 05:18:36PM +0300, Georgy Kirichenko wrote:
> > > > > Use wal thread to relay rows. Relay writer and reader fibers live
> > > > > in a wal thread. Writer fiber uses wal memory to send rows to the
> > > > > peer when it possible or spawns cords to recover rows from a file.
> > > > > 
> > > > > Wal writer stores rows into two memory buffers swapping by buffer
> > > > > threshold. Memory buffers are splitted into chunks with the
> > > > > same server id issued by chunk threshold. In order to search
> > > > > position in wal memory all chunks indexed by wal_mem_index
> > > > > array. Each wal_mem_index contains corresponding replica_id and
> > > > > vclock just before first row in the chunk as well as chunk buffer
> > > > > number, position and size of the chunk in memory.
> > > > > 
> > > > > Closes: #3794
> > > > 
> > > > It's too early to comment on the code. Here are some general issues that
> > > > 
> > > > I think need to be addressed first:
> > > >  - Moving relay code to WAL looks ugly as it turns the code into a messy
> > > >  
> > > >    bundle. Let's try to hide relay behind an interface. Currently, we
> > > >    only need one method - write() - but in future we might need more,
> > > >    for instance prepare() or commit(). May be, we could reuse journal
> > > >    struct for this?
> > > 
> > > It was done because WAL should control relays' ACK in case of synchronous
> > > replication. So, in likely case forwarding wals and reading ACK should be
> > > done in wal thread.
> > 
> > I don't deny that.
> > 
> > > And if I left this code in a relay then wal-relay communication
> > > would be to complicated (expose internal structures like sockets, memory
> > > buffers, vclock and etc).
> > 
> > Not necessarily. You could hide it all behind a vtab. For now, vtab
> > would be trivial - we only need write() virtual method (ACKs can be
> > received by a fiber started by this method when relay switches to sync).
> > Later on, we could extend it with methods that would allow WAL to wait
> > for remote commit.
> I do not think wal should wait for remote commit but relay should inform wal 
> about state. For example wal should track state of an relay even if the relay 
> is reading from file in a different cord (and might exit at any time).
> So we also add some v-functions to request memory chunks, promote in-wal relay 
> vclock or attach to sync mode as you suggested.

If we maintain a single memory buffer in WAL, then yes, I guess, we do.
I'd try to keep things simple though and use one buffer per relay, but
if you really think it's a no-go, then please try to design such a vtab
that will neatly hide all the necessary methods, because mixing relay
with WAL logic makes the code less modular and hence more difficult for
understanding.

> > 
> > > Also I planned some future changes to move all wal-
> > > processing logic into a wal.c to.
> > 
> > Please don't. We try to split the code into modules so that it's easier
> > for understanding. Mixing relay/recovery/WAL in one module doesn't sound
> > like a good idea to me.
> > 
> > > >  - Memory index looks confusing to me. Why would we need it at all?
> > > >  
> > > >    AFAIU it is only used to quickly position relay after switching to
> > > >    memory buffer, but it's not a big deal to linearly scan the buffer
> > > >    instead when that happens - it's a rare event and scanning memory
> > > >    should be fast enough. At least, we should start with it, because
> > > >    that would be clear and simple, and build any indexing later in
> > > >    separate patches provided we realize that we really need it, which I
> > > >    doubt.
> > > 
> > > It allows no to decode all logs as many times as we have relays connected
> > > to.
> > I don't understand. When relay is in sync, we should be simply writing a
> > stream of rows to a socket. No decoding or positioning should be
> > necessary. Am I missing something?
> If relay blocks for some time then you stop writing and then resume reading 
> from some position relay has reached before and this position could be 
> identified only by vclock because memory buffer might be reallocated at any time 
> when relay sleeps.
> Also, lets imagine that switching from relay to wal needs dT time in which wal 
> writes N rows. So, if you do not store logs in memory then you would not be 
> able to switch this relay in a sync mode. So we should have some rows backed 
> in memory and allow relays to navigate in it. 

I'm not arguing against maintaining a memory buffer (or buffers) -
obviously it's absolutely necessary for switching from async to sync.
I just think that optimizing for rare case scenarios, like switching
to/from memory, is pointless. Ideally, we should switch to sync mode
once and never ever return to async. If network lags cause the socket
to block frequently, we'd better never switch to sync (may be, add a
configuration option for that). That said, IMO linerarly scanning the
memory buffer when switching to sync is perfectly fine. If we find it
to be a problem, we can implement some kind of indexing later, in a
separate patch.

> 
> Other cause why I rejected direct writing to relay - in that case wal would be 
> in charge to track relays' state, for me it looks ugly.
> > 
> > > Also it eliminates following of corresponding vclocks for each row.
> > 
> > But you know the vclock of each row in WAL and you can pass it to relay.
> > No reason to decode anything AFAIU.
> I know it only when i write entry to wal, if relay yielded for some time - the 
> only possibility is to store memory buffers' starting vclock and then decode 
> each row one-by-one to follow vclock.
> > 
> > > >  - We shouldn't stop/start relay thread when switching to/from memory
> > > >  
> > > >    source. Instead on SUBSCRIBE we should always start a relay thread,
> > > >    as we do now, and implement some machinery to suspend/resume it and
> > > >    switch to memory/disk when it falls in/out the memory buffer.
> > > 
> > > In a good case (we do not have any outdated replicas) we do not even give
> > > any chance for relay thread to work.
> > 
> > Starting/stopping relay thread on every lag looks just ugly. We have to
> > start it anyway when a replica subscribes (because it's highly unlikely
> > to fall in the in-memory buffer). I don't see any reason to stop it once
> > we are in sync - we can switch to async anytime.
> Relay cord do not started on every lag but only if relay gone out of from the 
> wal memory buffer. The buffer should be large enough to compensate all hardware 
> fluctuations, and this grants that any replica able to process all rows from 
> master would live in a wal memory without spawning threads.

But it just looks ugly to start/stop it. Better use cbus to
suspend/resume IMO. We will need this channel anyway to switch
to sync so why not reuse it for switching to async. Shouldn't
be difficult to implement.

> > 
> > > >    All that machinery should live in relay.cc.
> > > 
> > > This leads to to complicated dependencies between wal and relay. AT least
> > > wal should expose it internal structures into a header file.
> > 
> > How's that? WAL should only know about write() method. Relay registers
> > the method in WAL. Everything else is done by relay, including switching
> > between sync and async mode.
> Write method is not enough at all.
> > 
> > > >  - When in sync mode, WAL should write directly to the relay socket,
> > > >  
> > > >    without using any intermediate buffer. Buffer should only be used if
> > > >    the socket is blocked or the relay is running in asynchronous mode.
> > > >    There's no need in extra memory copying otherwise.
> > > 
> > > Please explain why. I do not hing this can improve something. Also i think
> > 
> > That would eliminate extra memory copying in the best case scenario.
> Let me explain - we will form in-memory data that will be used for both to-file 
> writes and as a relay source.

On your branch there's xlog_write_entry, which writes rows to an xlog
file, and wal_encode_entry, which stores rows in a buffer. The latter
was introduced by your patch. The two functions are independent. That
is, you did introduce extra memory copying while you could write to
sockets directly. Not a big deal from performance pov, I guess, but this
still raises questions why it was done like that, why we can't write to
sockets directly, why we should first write to an in-memory buffer and
then suck in data in relay fibers. Looks kinda roundabout IMO.

> > 
> > > that this memory buffer should be used for writing files to.
> > > In any case we should have a memory buffer to store wals because you do
> > > not
> > > know will relay block or not. But the biggest issue is to transfer state
> > > from
> > As I suggested below, let's try to maintain a buffer per each relay, not
> > a single buffer in WAL. Then we wouldn't need to write anything to
> > memory buffers if all relays are in sync and no socket is blocked. This
> > is the best case scenario, but we should converge to it pretty quickly
> > provided the network doesn't lag much. This would also allow us to hide
> > this logic in relay.cc.
> Above I described why it is not possible to not to have wal memory buffer - 
> with constant load relay would not be able to attach to sync mode as you 
> suggested. So we have to maintain wal memory buffer.
> There are some situations in that you model would not be functional.
> For example if wal writes a lot (more than tcp write buffer in a some time 
> slot) so each call will block socket and push relay sync down.

I never argued against a memory buffer...

> And if we maintain per relay buffers so you we should copy data as much as 
> relays we have, even for file-reading relays do let then chance to catch the 
> wal.

Only in the worst case scenario when we're switching from async to sync.
As I said above, I don't think this should happen often. If it does, the
network is too slow for sync and the user should either tune tcp buffer
size or disable sync replication completely.