[Tarantool-patches] [PATCH] Trigger on vclock change

Georgy Kirichenko georgy at tarantool.org
Fri Nov 15 09:02:13 MSK 2019


On Friday, November 15, 2019 4:57:45 AM MSK Konstantin Osipov wrote:
> * Georgy Kirichenko <georgy at tarantool.org> [19/11/15 04:33]:
> > On Thursday, November 14, 2019 10:48:06 PM MSK Konstantin Osipov wrote:
> > > * Georgy Kirichenko <georgy at tarantool.org> [19/11/14 22:42]:
> > > > A replica state is described by two vclocks - the written and the
> > > > committed one. Right now it is not an issue to report them both,
> > > > since an applier submits transactions asynchronously. In addition to
> > > > these two vclocks (yes, both could be transferred from the WAL
> > > > thread) the applier will report a reject vclock - the vclock at
> > > > which applying broke - and this could be done from TX. I do not like
> > > > the idea of splitting the transmission between two threads.
> > > > The write and reject vclocks are used to evaluate the majority,
> > > > whereas the commit vclock tells the whole cluster that the majority
> > > > has already been reached. The main point is that any replica member
> > > > could commit a transaction - this relaxes the RAFT limitations and
> > > > increases the durability of the whole cluster (and it is simpler in
> > > > design and implementation, really). Also, the new synchronous
> > > > replication design has a lot of advantages compared to RAFT, but let
> > > > us discuss it in another thread. If you are interested, please ask
> > > > for details, as I do not have enough time to write a public document
> > > > right now.
> > > > Returning to the subject, I would like to conclude that the WAL
> > > > on_commit and on_write triggers are a good source to initiate status
> > > > transmission, and the trigger implemented by Maria will be replaced
> > > > by a replica on_commit trigger, which allows us not to change
> > > > anything at higher levels.
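
For illustration, here is a minimal sketch of the majority rule described
above, with simplified, made-up types rather than the actual Tarantool
structures: each member reports the vclock it has written to its WAL, and a
row counts as committed once a strict majority of those written vclocks
cover it.

    /*
     * Illustrative sketch only, not real Tarantool code: a row
     * (origin_id, lsn) is considered committed once the written vclocks
     * reported by a majority of the cluster members cover it.
     */
    #include <stdbool.h>
    #include <stdint.h>

    enum { VCLOCK_MAX = 32 };

    struct vclock {
        int64_t lsn[VCLOCK_MAX];   /* last written LSN per replica id */
    };

    struct member_state {
        struct vclock written;     /* reported by the WAL on_write trigger */
        struct vclock committed;   /* reported by the WAL on_commit trigger */
        struct vclock rejected;    /* where applying broke, reported from TX */
    };

    static bool
    is_committed(const struct member_state *members, int member_count,
                 uint32_t origin_id, int64_t lsn)
    {
        int acks = 0;
        for (int i = 0; i < member_count; i++) {
            if (members[i].written.lsn[origin_id] >= lsn)
                acks++;
        }
        return acks > member_count / 2;   /* strict majority */
    }

Once any member observes the majority, its commit vclock advances and is
reported so the rest of the cluster can commit the row as well.
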
> > > 
> > > Congratulations, Georgy, maybe you even get a Turing award for
> > > inventing a new protocol.
> > > 
> > > Wait... they don't give a Turing award for "protocols" which have
> > > no proof and yield inconsistent results, or do they?
> > 
> > You do not even know the details of the protocol but still make such a
> > suggestion, so I can only repeat your last statement: "what a shame",
> > seriously. Please remember all my attempts to discuss it with you or,
> > for instance, our bi-weekly meetings, all of which (except the first
> > one) were skipped by you.
> If you want to discuss anything with me, feel free to reach out.
> 
> I am following the process as I believe it should work in a
> distributed open source project: before there is a change, there
> is a design document on which everyone can equally comment.
90% of the Tarantool core developers sit together in one room or are 
reachable by a call during the day. Also, please tell us how many RFC 
responses you have seen from somebody who is not part of the Tarantool core 
team. So, you wish to force the whole team to use only this inconvenient and 
unproductive (because of the long-latency responses) communication channel 
because of your beliefs.
I take the opposite view: each topic should first be discussed (or 
brainstormed) verbally (because we are members of the Tarantool TEAM), and 
only then can a well-designed RFC be formed (or maybe you wish to have lots 
of worthless RFCs, but I see no point in that).
> 
> > > Meanwhile, if you have a design in mind, you could send an RFC. I
> > > will respond to the RFC.
> > 
> > Anybody will be able to see the design document once this protocol
> > research is done. Yes, the research requires an implementation first.
> 
> You don't need to waste time on implementation. Your approach,
> just by the description of it, is neither consistent, nor durable:
I think you just did not understand the approach, because none of your 
further considerations relate to the protocol I am implementing. I also think 
you have not grasped even the basic points of why RAFT is not well applicable 
in the case of Tarantool.
> 
> - if you allow active-active, you can have lost writes.
> 
> Here's a simple example:
>     box.begin()
>     local a = box.space.t:select{1}[1][2]
>     box.space.t:replace{1, a + 1}
>     box.commit()
> 
> By running this transaction concurrently on two masters, you will
> get lost writes. RAFT would not let that happen.
> 
> But let's imagine for a second that this is not an issue.
> Your proposal is missing the critical parts of RAFT: neutralizing
> old leaders and completing transactions upon leader failure - i.e.
> when the new leader commits the writes accepted by the majority and
> rolls back the rest, on behalf of the deceased.
> 
> Imagine one of your active replicas fails midway:
> - it can fail after a record is written to the WAL by one of the peers
> - it can fail after a record is written to the WAL by the majority
>   of the peers, but not all
> - it can fail after a record is committed by one of the peers, but
>   not all.
> 
> Who is going to repair these replicas upon master failure, and how?
> You just threw the RAFT "longest log wins" principle into the garbage
> bin, so you would never be able to identify which specific
> transactions need repair, on which replica, and what this repair
> should do. Needless to say, you haven't stopped transaction
> processing on these replicas, so even if you knew which specific
> transactions needed completion and on which replica, the data they
> modify could easily be overwritten by the time you get to finish
> these transactions.
> 
> As to your suggestion to track the commit/WAL-write vclocks in the tx
> thread, well, this fortunately has nothing to do with correctness,
> but everything to do with efficiency and performance. There was a
> plan to move the applier out of the tx thread into the iproto thread,
> and you even wrote a patch, which wasn't finished, like many others,
> because you never addressed the review comments.
Wrong, this task was not finished because of the schema latch and yielding 
DDL, while you rejected strict applier ordering. But after two years you 
changed your mind and I was able to implement the parallel applier. The 
second reason why the applier in iproto was not implemented is the limited 
capacity of the tx fiber pool, which you still want to preserve. So I wonder 
why this slipped from your mind.
> Now you chose to
> go in the opposite direction by throwing more logic into the tx
> thread, adding to the scalability bottleneck of the
> single-threaded architecture. We discussed that before - but
> somehow it slips from your mind each time.
The only thing I wish to add to the tx cord is around ten trigger calls per 
batch, and each one just wakes an applier fiber up. This costs nothing in 
terms of added WAL latency, given that replication is synchronous. And yet 
you still suggest making the transaction path even longer 
(iproto->tx->wal->tx->iproto), well, well.
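
To make that concrete, a minimal sketch of what such an on_commit trigger
body amounts to in tx (simplified stand-ins, not the actual Tarantool
trigger/fiber API): remember the new commit signature and wake the sender
fiber, nothing else.

    /*
     * Illustrative sketch only, not real Tarantool code.  The WAL
     * on_commit trigger runs in the tx thread and merely wakes the
     * applier's status fiber; the woken fiber then reads the committed
     * vclock and reports it to its peer outside the tx hot path.
     */
    #include <stdbool.h>
    #include <stdint.h>

    struct status_fiber {
        bool awake;                    /* stand-in for a real fiber's state */
        int64_t committed_signature;   /* last committed vclock signature */
    };

    /* Stand-in for fiber_wakeup(): mark the sender fiber runnable. */
    static void
    status_fiber_wakeup(struct status_fiber *f)
    {
        f->awake = true;
    }

    /* Trigger body run in tx once per committed batch: O(1), no I/O. */
    static void
    on_wal_commit(struct status_fiber *sender, int64_t commit_signature)
    {
        sender->committed_signature = commit_signature;
        status_fiber_wakeup(sender);
    }

The actual reading and sending of the vclock happens in the woken fiber, so
the per-batch cost inside tx stays constant.
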
> 
> Meanwhile, Vlad's review of your patch for in-memory WAL is not
> addressed. You could complain that my reviews are too harsh
> and ask too much, but this is Vlad's...
I have no clue why you brought up my in-memory replication patch and Vlad.
That patch is on hold because I wish to fix the GC, which is broken by 
design, first.
