From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 15 Nov 2019 04:57:45 +0300
From: Konstantin Osipov
To: Georgy Kirichenko
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH] Trigger on vclock change
Message-ID: <20191115015745.GA23299@atlas>
In-Reply-To: <27617293.E4uLSYyink@home.lan>
References: <20191114125705.26760-1-maria.khaydich@tarantool.org>
 <2359844.DWZl6MdUWF@home.lan> <20191114194806.GA20289@atlas>
 <27617293.E4uLSYyink@home.lan>
List-Id: Tarantool development patches

* Georgy Kirichenko [19/11/15 04:33]:
> On Thursday, November 14, 2019 10:48:06 PM MSK Konstantin Osipov wrote:
> > * Georgy Kirichenko [19/11/14 22:42]:
> > > A replica state is described by two vclocks - the written and the
> > > committed one. Right now it is not an issue to report them both, as
> > > an applier submits transactions asynchronously. In addition to these
> > > two vclocks (yes, both could be transferred from the WAL thread) the
> > > applier will report a reject vclock - the vclock at which applying
> > > breaks - and this could be done from TX. I do not like the idea of
> > > splitting the transmission between two threads. The write and reject
> > > vclocks are used to evaluate the majority, whereas the commit vclock
> > > instructs the whole cluster that the majority has already been
> > > reached. The main point is that any replica member could commit a
> > > transaction - this relaxes RAFT's limitations and increases the
> > > durability of the whole cluster (and it is simpler in design and
> > > implementation, really). Also, the new synchronous replication
> > > design has a lot of advantages in comparison with RAFT, but let us
> > > discuss it in another thread. If you are interested, please ask for
> > > details, as I do not have enough time to write a public document
> > > right now.
> > > Returning to the subject, I would like to conclude that the WAL
> > > on_commit and on_write triggers are a good source from which to
> > > initiate status transmission. And the trigger implemented by Maria
> > > will be replaced by a replica on_commit trigger, which allows us not
> > > to change anything at higher levels.
> >
> > Congratulations, Georgy, maybe you will even get a Turing award for
> > inventing a new protocol.
> >
> > Wait... they don't give a Turing award for "protocols" which have no
> > proof and yield inconsistent results, or do they?
>
> You do not even know the details of the protocol, yet you make such a
> suggestion, so I can only repeat your last statement: "what a shame",
> seriously.
> Please remember all my attempts to discuss it with you - or, for
> instance, our meetings every two weeks, all of which (except the first
> one) you skipped.

If you want to discuss anything with me, feel free to reach out. I am
following the process as I believe it should work in a distributed open
source project: before there is a change, there is a design document on
which everyone can equally comment.

> > Meanwhile, if you have a design in mind, you could send an RFC. I
> > will respond to the RFC.
>
> Anybody will be able to see the design document after this protocol
> research is done. Yes, the research requires an implementation first.

You don't need to waste time on an implementation. Your approach, just
from the description of it, is neither consistent nor durable:

- if you allow active-active, you can have lost writes. Here's a simple
  example:

    box.begin()
    local a = box.space.t:get{1}[2] -- read field 2 of the tuple with key 1
    box.space.t:replace{1, a + 1}   -- write back the incremented value
    box.commit()

  By running this transaction concurrently on two masters, you will get
  lost writes. RAFT would not let that happen.
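To make the lost write concrete, here is one possible interleaving,
spelled out as Lua comments (a sketch: the space name t, the key 1 and
the initial value 0 are assumptions for illustration):

    -- Initial state: t contains {1, 0}; M1 and M2 are two active
    -- masters, each running the transaction above concurrently.
    --
    --   M1: box.begin(); a = box.space.t:get{1}[2]       --> a = 0
    --   M2: box.begin(); a = box.space.t:get{1}[2]       --> a = 0
    --   M1: box.space.t:replace{1, 0 + 1}; box.commit()  -- t is {1, 1}
    --   M2: box.space.t:replace{1, 0 + 1}; box.commit()  -- t is {1, 1}
    --
    -- Two increments ran, but the final value is 1: M1's write is lost.
    -- A single RAFT leader serializes the two transactions, so the
    -- second one reads a = 1 and writes {1, 2}.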
But let's imagine for a second that this is not an issue. Your proposal
is missing the critical parts of RAFT: neutralizing old leaders and
completing transactions upon leader failure - i.e., having the new
leader commit the writes accepted by the majority and roll back the
rest, on behalf of the deceased one.

Imagine one of your active replicas fails midway:

- it can fail after a record is written to the WAL by one of the peers;
- it can fail after a record is written to the WAL by the majority of
  the peers, but not yet committed by any of them;
- it can fail after a record is committed by one of the peers, but not
  by all of them.

Who is going to repair these replicas upon master failure, and how? You
just threw RAFT's "longest log wins" principle into the garbage bin, so
you will never be able to identify which specific transactions need
repair, on which replica, or what this repair should do. Needless to
say, you haven't stopped transaction processing on these replicas, so
even if you knew which specific transactions needed completion and on
which replica, the data they modify could easily be overwritten by the
time you got around to finishing these transactions.

As to your suggestion to track the commit/WAL-write vclock in the tx
thread: fortunately, this has nothing to do with correctness, but
everything to do with efficiency and performance. There was a plan to
move the applier out of the tx thread into the iproto thread, and you
even wrote a patch, which, like many others, wasn't finished because
you never addressed the review comments. Now you choose to go in the
opposite direction by throwing more logic into the tx thread, adding to
the scalability bottleneck of the single-threaded architecture. We
discussed this before - but somehow it slips your mind each time.

Meanwhile, Vlad's review of your patch for the in-memory WAL is not
addressed. You could complain that my reviews are too harsh and ask for
too much, but this is Vlad's...

-- 
Konstantin Osipov, Moscow, Russia
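A standalone sketch of the "longest log wins" point above: once two
replicas have each accepted writes the other has not seen, componentwise
vclock comparison yields "incomparable" - neither log is a prefix of the
other, so vclocks alone cannot tell which replica should win or which
transactions need repair. This is plain Lua, not Tarantool internals;
only the representation of a vclock as a table from replica id to LSN
(as in box.info.vclock) is borrowed from Tarantool:

    -- Compare two vclocks componentwise.
    -- Returns -1 (a < b), 1 (a > b), 0 (equal), or nil (incomparable).
    local function vclock_compare(a, b)
        local le, ge = true, true
        local ids = {}
        for id in pairs(a) do ids[id] = true end
        for id in pairs(b) do ids[id] = true end
        for id in pairs(ids) do
            local la, lb = a[id] or 0, b[id] or 0
            if la < lb then ge = false end
            if la > lb then le = false end
        end
        if le and ge then return 0 end
        if le then return -1 end
        if ge then return 1 end
        return nil -- diverged: no "longest" log exists
    end

    -- Replica 1 is ahead on component 1, replica 2 on component 2:
    print(vclock_compare({[1] = 10, [2] = 5}, {[1] = 8, [2] = 7})) -- nil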