[tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries

Fri Feb 1 12:31:04 MSK 2019

On Thursday, January 31, 2019 5:54:54 PM MSK Vladimir Davydov wrote:
> On Thu, Jan 31, 2019 at 05:25:24PM +0300, Георгий Кириченко wrote:
> > On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote:
> > > On Thu, Jan 31, 2019 at 10:34:43AM +0300, Георгий Кириченко wrote:
> > > > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote:
> > > > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, Георгий Кириченко wrote:
> > > > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote:
> > > > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko 
wrote:
> > > > > > > > diff --git a/src/box/iproto_constants.h
> > > > > > > > b/src/box/iproto_constants.h
> > > > > > > > index 728514297..d01cdf840 100644
> > > > > > > > --- a/src/box/iproto_constants.h
> > > > > > > > +++ b/src/box/iproto_constants.h
> > > > > > > > @@ -60,6 +60,9 @@ enum iproto_key {
> > > > > > > > 
> > > > > > > >  	IPROTO_SCHEMA_VERSION = 0x05,
> > > > > > > >  	IPROTO_SERVER_VERSION = 0x06,
> > > > > > > >  	IPROTO_GROUP_ID = 0x07,
> > > > > > > > 
> > > > > > > > +	IPROTO_TXN_ID = 0x08,
> > > > > > > > +	IPROTO_TXN_REPLICA_ID = 0x09,
> > > > > > > 
> > > > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I
> > > > > > > would
> > > > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make
> > > > > > > sure
> > > > > > > that
> > > > > > > transaction identifiers from different instances don't overlap,
> > > > > > > because
> > > > > > > there couldn't be a "multi-instance" transaction, could there?
> > > > 
> > > > txn_replica_id might be differ from replica_id in case when
> > > > transaction
> > > > would be finished on other node, e.g. after recovery.
> > > 
> > > I'm kinda out of scope here. I think you need to write an RFC or
> > > something explaining all the design decisions you made when you
> > > introduced transaction boundaries. Or we should get together with
> > > Kostja and discuss it. Or both. Neither the commit message nor
> > > comments shed the light on the big picture.
> > > 
> > > > > > > > +	IPROTO_TXN_LAST = 0x0a,
> > > > > > > 
> > > > > > > I think we should instead introduce BEGIN and COMMIT commands, 
because:
> > > > > > I completely do not like any auto-commit logic in a xlog file. You
> > > > > > suggestion breaks backward compatibility because previous logs do
> > > > > > not
> > > > > > have any BEGIN/
> > > > > 
> > > > > It wouldn't break backward compatibility. It might break forward
> > > > > compatibility, which is fine by me (we do it all the time).
> > > > 
> > > > Suggested xrow encoding/decoding rules means that any xrow without
> > > > txn_id,
> > > > txn_replica_id, txn_last should be processed as a single statement
> > > > transaction as it was before.
> > > > If we would require explicit begin/commit then previous logs turns
> > > > into an
> > > > invalid stream without autocommit semantic.
> > > 
> > > Yeah, we would have to add autocommit.
> > 
> > I am disagreed with autocommit. There are my key points:
> > 1. We could not interleave autocommit transactions with others, or we
> > would
> > have to have special mark for this transaction but this is no autocommit
> > after all.
> 
> Do we need to support interleaving transactions in xlog? I'm not sure.
> May be, for sync replication we do.
> 
> Anyway, we could add BEGIN/COMMIT *in addition* to TXN_ID stored in each
> row. They would serve as a header and footer. This would give us a place
> holder to store transaction-wide information if we ever need it.
> Besides, it would make the stream more robust: if we loose BEGIN, we
> will detect it while if we loose the first transaction row in your case
> we won't. Yeah, explicit BEGIN/COMMIT will add a few bytes to each
> transaction, but IMO it's not critical.
> 
> Let's discuss it f2f when we can, because we seem to repeat ourselves
> without moving forward.
> 
> > 2. Applier would not able to check transaction boundaries.
> > 3. There is no difference between an autocommit transaction and a
> > single-row transaction.
> > 4. Autocommit is about interface behavior but not about log format.
> > 
> > Also let me explain xrow encoding rules:
> > If we have only one row in transaction, then we do not encode txn_id and
> > txn_last. So single-row transaction (or autocommit if you wish) does not
> > change its' representation. And then while decoding we set txn_id = lsn
> > and
> > txn_last if txn_id was not set. In that case each row from previous xlog
> > acts as one single-row transaction.
> > For multi-row transaction we have only penalty in transaction identifiers
> > for each row and one commit flag for the last.
> > 
> > > > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock,
> > > > > > > > struct
> > > > > > > > xrow_header **row,>
> > > > > > > > 
> > > > > > > >  			vclock_follow_xrow(vclock, *row);
> > > > > > > >  		
> > > > > > > >  		}
> > > > > > > >  	
> > > > > > > >  	}
> > > > > > > > 
> > > > > > > > +	if ((*begin)->replica_id != instance_id) {
> > > > > > > > +		/*
> > > > > > > > +		 * Move all local changes to the end of rows array 
and
> > > > > > > > +		 * a fake local transaction (like an autonomous
> > > > > > > > transaction)
> > > > > > > > +		 * because we could not replicate the transaction 
back.
> > > > > > > > +		 */
> > > > > > > > +		struct xrow_header **row = end - 1;
> > > > > > > > +		while (row >= begin) {
> > > > > > > > +			if (row[0]->replica_id != instance_id) {
> > > > > > > > +				--row;
> > > > > > > > +				continue;
> > > > > > > > +			}
> > > > > > > > +			/* Local row, move it back. */
> > > > > > > > +			struct xrow_header **local_row = row;
> > > > > > > > +			while (local_row < end - 1 &&
> > > > > > > > +			       local_row[1]->replica_id != 
instance_id) {
> > > > > > > > +				struct xrow_header *tmp = local_row[0];
> > > > > > > > +				local_row[0] = local_row[1];
> > > > > > > > +				local_row[1] = tmp;
> > > > > > > 
> > > > > > > There's SWAP for this.
> > > > > > > 
> > > > > > > > +			}
> > > > > > > > +			--row;
> > > > > > > > +		}
> > > > > > > > +		while (begin < end && begin[0]->replica_id != 
instance_id)
> > > > > > > > +			++begin;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > I don't understand why we need to move rows generated locally
> > > > > > > (by
> > > > > > > an on_replace trigger I surmise) to the end of a transaction. We
> > > > > > > have TXN_ID attached to each row so we could leave the
> > > > > > > transactions
> > > > > > > interleaved, couldn't we?
> > > > > > 
> > > > > > You are right, but in that case applier should track a lot of
> > > > > > transaction
> > > > > > simultaneously. Also it complicates recovery too. I hope it will
> > > > > > be
> > > > > > fixed
> > > > > > while parallel applier implementing.
> > > > > 
> > > > > May be, we should implement parallel applier in the scope of this
> > > > > issue
> > > > > then? Anyway, without it, sync replication won't scale with the
> > > > > number
> > > > > of parallel transactions.
> > > > 
> > > > I do not think so. Parallel applier depends on that feature, but it is
> > > > completely different issue.
> > > 
> > > But as I said, sync replication won't scale without it.
> > > 
> > > > Also parallel applier is ignorant on a xlog format.
> > > 
> > > Depends on how you implement it. For example, MySQL marks all
> > > transactions that were committed in parallel on the master so that
> > > they can be committed in parallel on replicas.
> > 
> > In case of vinyl it is not true.
> 
> In case of vinyl we can mark transactions that overlapped in time on the
> master. In fact, that's what MySQL actually does. I guess I misleaded
> you by saying "committed in parallel". I meant "executed in parallel".
The only thing we need is to send all transaction to commit in right order and 
restart a transaction in case of conflict. For example master did transactions 
t1, t2, t3. Replica starts t1, t2 and t3 in received order. But if t3 finished 
earlier that t2 then we should hold t3 commit until t2 sent to wal. After t3 
resume there are two possibilities: t3 is conflicting or not. In the second 
case we will send it to wal, in the case of conflict - just restart - we have 
all info to do that.
And we do not have to have any information about transaction batches on master 
- because all job will be done by local transaction manager.
In comparison with MySQL we also could process transaction in parallel event 
they ware done in a sequence.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20190201/f43abff1/attachment.sig>