From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
Date: Thu, 31 Jan 2019 17:54:54 +0300
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries
Message-ID: <20190131145454.hhuwbgff2hznsrkx@esperanza>
References: <cover.1548153546.git.georgy@tarantool.org>
 <3544285.z36fh00o7k@home.lan>
 <20190131081951.e5bijlb7wcqzcqiv@esperanza>
 <5221637.HvkEVp0ifG@home.lan>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <5221637.HvkEVp0ifG@home.lan>
To: =?utf-8?B?0JPQtdC+0YDQs9C40Lkg0JrQuNGA0LjRh9C10L3QutC+?= <georgy@tarantool.org>
Cc: tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

On Thu, Jan 31, 2019 at 05:25:24PM +0300, Георгий Кириченко wrote:
> On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote:
> > On Thu, Jan 31, 2019 at 10:34:43AM +0300, Георгий Кириченко wrote:
> > > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote:
> > > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, Георгий Кириченко wrote:
> > > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote:
> > > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko wrote:
> > > > > > > diff --git a/src/box/iproto_constants.h
> > > > > > > b/src/box/iproto_constants.h
> > > > > > > index 728514297..d01cdf840 100644
> > > > > > > --- a/src/box/iproto_constants.h
> > > > > > > +++ b/src/box/iproto_constants.h
> > > > > > > @@ -60,6 +60,9 @@ enum iproto_key {
> > > > > > > 
> > > > > > >  	IPROTO_SCHEMA_VERSION = 0x05,
> > > > > > >  	IPROTO_SERVER_VERSION = 0x06,
> > > > > > >  	IPROTO_GROUP_ID = 0x07,
> > > > > > > 
> > > > > > > +	IPROTO_TXN_ID = 0x08,
> > > > > > > +	IPROTO_TXN_REPLICA_ID = 0x09,
> > > > > > 
> > > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I
> > > > > > would
> > > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make sure
> > > > > > that
> > > > > > transaction identifiers from different instances don't overlap,
> > > > > > because
> > > > > > there couldn't be a "multi-instance" transaction, could there?
> > > 
> > > txn_replica_id might be differ from replica_id in case when transaction
> > > would be finished on other node, e.g. after recovery.
> > 
> > I'm kinda out of scope here. I think you need to write an RFC or
> > something explaining all the design decisions you made when you
> > introduced transaction boundaries. Or we should get together with
> > Kostja and discuss it. Or both. Neither the commit message nor
> > comments shed the light on the big picture.
> > 
> > > > > > > +	IPROTO_TXN_LAST = 0x0a,
> > > > > > 
> > > > > > I think we should instead introduce BEGIN and COMMIT commands, because:
> > > > > I completely do not like any auto-commit logic in a xlog file. You
> > > > > suggestion breaks backward compatibility because previous logs do not
> > > > > have any BEGIN/
> > > > 
> > > > It wouldn't break backward compatibility. It might break forward
> > > > compatibility, which is fine by me (we do it all the time).
> > > 
> > > Suggested xrow encoding/decoding rules means that any xrow without txn_id,
> > > txn_replica_id, txn_last should be processed as a single statement
> > > transaction as it was before.
> > > If we would require explicit begin/commit then previous logs turns into an
> > > invalid stream without autocommit semantic.
> > 
> > Yeah, we would have to add autocommit.
> I am disagreed with autocommit. There are my key points:
> 1. We could not interleave autocommit transactions with others, or we would 
> have to have special mark for this transaction but this is no autocommit after 
> all.

Do we need to support interleaving transactions in xlog? I'm not sure.
May be, for sync replication we do.

Anyway, we could add BEGIN/COMMIT *in addition* to TXN_ID stored in each
row. They would serve as a header and footer. This would give us a place
holder to store transaction-wide information if we ever need it.
Besides, it would make the stream more robust: if we loose BEGIN, we
will detect it while if we loose the first transaction row in your case
we won't. Yeah, explicit BEGIN/COMMIT will add a few bytes to each
transaction, but IMO it's not critical.

Let's discuss it f2f when we can, because we seem to repeat ourselves
without moving forward.

> 2. Applier would not able to check transaction boundaries.
> 3. There is no difference between an autocommit transaction and a single-row 
> transaction.
> 4. Autocommit is about interface behavior but not about log format.
> 
> Also let me explain xrow encoding rules:
> If we have only one row in transaction, then we do not encode txn_id and 
> txn_last. So single-row transaction (or autocommit if you wish) does not 
> change its' representation. And then while decoding we set txn_id = lsn and 
> txn_last if txn_id was not set. In that case each row from previous xlog acts 
> as one single-row transaction.
> For multi-row transaction we have only penalty in transaction identifiers for 
> each row and one commit flag for the last.
> 
> > > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, struct
> > > > > > > xrow_header **row,>
> > > > > > > 
> > > > > > >  			vclock_follow_xrow(vclock, *row);
> > > > > > >  		
> > > > > > >  		}
> > > > > > >  	
> > > > > > >  	}
> > > > > > > 
> > > > > > > +	if ((*begin)->replica_id != instance_id) {
> > > > > > > +		/*
> > > > > > > +		 * Move all local changes to the end of rows array and
> > > > > > > +		 * a fake local transaction (like an autonomous transaction)
> > > > > > > +		 * because we could not replicate the transaction back.
> > > > > > > +		 */
> > > > > > > +		struct xrow_header **row = end - 1;
> > > > > > > +		while (row >= begin) {
> > > > > > > +			if (row[0]->replica_id != instance_id) {
> > > > > > > +				--row;
> > > > > > > +				continue;
> > > > > > > +			}
> > > > > > > +			/* Local row, move it back. */
> > > > > > > +			struct xrow_header **local_row = row;
> > > > > > > +			while (local_row < end - 1 &&
> > > > > > > +			       local_row[1]->replica_id != instance_id) {
> > > > > > > +				struct xrow_header *tmp = local_row[0];
> > > > > > > +				local_row[0] = local_row[1];
> > > > > > > +				local_row[1] = tmp;
> > > > > > 
> > > > > > There's SWAP for this.
> > > > > > 
> > > > > > > +			}
> > > > > > > +			--row;
> > > > > > > +		}
> > > > > > > +		while (begin < end && begin[0]->replica_id != instance_id)
> > > > > > > +			++begin;
> > > > > > > +	}
> > > > > > 
> > > > > > I don't understand why we need to move rows generated locally (by
> > > > > > an on_replace trigger I surmise) to the end of a transaction. We
> > > > > > have TXN_ID attached to each row so we could leave the transactions
> > > > > > interleaved, couldn't we?
> > > > > 
> > > > > You are right, but in that case applier should track a lot of
> > > > > transaction
> > > > > simultaneously. Also it complicates recovery too. I hope it will be
> > > > > fixed
> > > > > while parallel applier implementing.
> > > > 
> > > > May be, we should implement parallel applier in the scope of this issue
> > > > then? Anyway, without it, sync replication won't scale with the number
> > > > of parallel transactions.
> > > 
> > > I do not think so. Parallel applier depends on that feature, but it is
> > > completely different issue.
> > 
> > But as I said, sync replication won't scale without it.
> > 
> > > Also parallel applier is ignorant on a xlog format.
> > 
> > Depends on how you implement it. For example, MySQL marks all
> > transactions that were committed in parallel on the master so that
> > they can be committed in parallel on replicas.
> In case of vinyl it is not true.

In case of vinyl we can mark transactions that overlapped in time on the
master. In fact, that's what MySQL actually does. I guess I misleaded
you by saying "committed in parallel". I meant "executed in parallel".