[Tarantool-patches] [PATCH] Trigger on vclock change

Georgy Kirichenko georgy at tarantool.org
Thu Nov 14 22:16:43 MSK 2019


On Thursday, November 14, 2019 8:33:38 PM MSK Konstantin Osipov wrote:
> * Georgy Kirichenko <georgy at tarantool.org> [19/11/14 20:14]:
> > > I also think it's a pair of server_id, lsn, rather than entire
> > > vclock - usually you know what you're waiting for, and it's only
> > > one component of vclock, not all of them.
> > 
> > But there are some issues
> > 1. what if we wish to have a timeout
> > 2. what if lsn are waited in non-strictly increasing order
> > 3. what if awaiting fiber is canceled
> > The approach you suggested looks to me like a bike-shed trigger
> > implementation but the implementation is limited to use only for wait for
> > lsn. So I would like to propose to ask Alexander Tikhonov to provide us
> > with a benchmark result first and then make a conclusion about
> > performance impact.
> Maybe you're right. But isn't the entire idea of wait_lsn()
> bike-shed, as you put it, because we don't have sync replication?
> 
> > > > Anyway, we will need to have such trigger in order to make applier
> > > > able to
> > > > report local replica wal and committed vclock in scope of synchronous
> > > > replication issue.
> > > 
> > > This has to happen in WAL thread, not in main thread, and has to
> > > watch relay-from-memory vclock, not async-replication vclock. And
> > > it also needs to roll back the transaction locally on failure,
> > > i.e. write some sort of undo records to the WAL.
> > 
> > This will work in an applier which lives in the TX cord, as an applier
> > processes incoming transactions through the TX. And an applier should be
> > able to answer with two vclocks - committed and written ones. Yes, WAL
> > will batch such vclock updates but this is still hundreds of events
> > per second. Unfortunately there is no point in moving an applier to the WAL
> > thread because a transaction could not be validated without TX.
> 
> OK, now I get where you're heading. I think sending acks from
> tx thread has the following disadvantages:
> - we mix up "committed" event and "written to the commit log"
>   event. They become indistinguishable in tx thread. Per RAFT, we
>   should send back acks as soon as we write to the local commit
>   log, and when the leader gets enough 'acks' from enough commit
>   logs it sends another message which makes the local transaction
>   commit. If you 'ack' when you commit the local transaction, how
>   would you be able to roll it back on leader change or majority
>   failure?
> 
>   So the event you need to be acknowledging is not the event this
>   trigger in question is capturing.
Sorry, I think you have outdated information about the synchronous replication
design. At this moment we do not implement the RAFT protocol (I made some
attempts to discuss it with you some months ago, but you ignored them all). So
let me give some technical details.
A replica state is described by 2 vclocks: the written one and the committed
one. Right now it is not an issue to report them both, as an applier submits
transactions asynchronously. In addition to these two vclocks (yes, both could
be transferred from the WAL thread) an applier will report a reject vclock,
i.e. the vclock where applying breaks, and this could be done from TX. I do not
like the idea of splitting transmission between 2 threads. The write and reject
vclocks are used to evaluate majority, whereas the commit vclock tells the
whole cluster that majority was already reached. The main point is that any
replica member could commit a transaction - this relaxes RAFT limitations and
increases the durability of the whole cluster (and it is simpler in design and
implementation, really). Also the new synchronous replication design has a lot
of advantages in comparison with RAFT, but let us discuss it in another thread.
If you are interested, please ask for details, as I do not have enough time to
write a public document right now.
Returning to the subject, I would like to conclude that the wal on_commit and
on_write triggers are a good source to initiate status transmission. And the
trigger implemented by Maria will be replaced by a replica on_commit, which
allows us not to change anything at higher levels.
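
To illustrate the majority evaluation over written vclocks mentioned above,
here is a minimal sketch in C. It is not Tarantool code: the function name
sketch_quorum_lsn, the constant SKETCH_MAX_REPLICAS and the flat array of
per-replica LSNs are all hypothetical, it looks at a single vclock component
only, and it does not model the reject vclock at all. It just answers "what is
the highest LSN that at least `quorum` replicas report as written?".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on cluster size, used by this sketch only. */
enum { SKETCH_MAX_REPLICAS = 32 };

/*
 * Return the highest LSN of one vclock component that at least
 * `quorum` replicas report as written. written_lsn[i] is the LSN
 * reported by replica i. Returns -1 if the arguments are invalid.
 */
int64_t
sketch_quorum_lsn(const int64_t *written_lsn, size_t n, size_t quorum)
{
	assert(written_lsn != NULL);
	if (n == 0 || n > SKETCH_MAX_REPLICAS || quorum == 0 || quorum > n)
		return -1;
	int64_t sorted[SKETCH_MAX_REPLICAS];
	/* Insertion sort of the reported LSNs in descending order. */
	for (size_t i = 0; i < n; i++) {
		size_t j = i;
		while (j > 0 && sorted[j - 1] < written_lsn[i]) {
			sorted[j] = sorted[j - 1];
			j--;
		}
		sorted[j] = written_lsn[i];
	}
	/* The quorum-th largest value is written by >= quorum replicas. */
	return sorted[quorum - 1];
}
```

With reported LSNs {10, 7, 9} and a quorum of 2, the sketch yields 9: two
replicas have written everything up to LSN 9, so under the assumptions above
that is the point up to which a commit vclock could advance.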

> 
> - the second issue is latency. tx/wal scheduling delay can be in
>   hundreds of microseconds, and this is close to networking
>   delays on fast networks within the same rack/data center.
>   So acknowledging commit log writes from WAL thread will
>   also speed up the leader quite a bit, since the round trip
>   will be shorter.
> 
> To sum up, I still think you should not use this trigger to
> acknowledge commit log writes. Better have a separate socket for
> this altogether, or move the write end of the existing socket to
> the wal, while keeping the read end where it is now, in
> tx/applier.
