[tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries

Георгий Кириченко georgy at tarantool.org
Thu Jan 31 17:25:24 MSK 2019


On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote:
> On Thu, Jan 31, 2019 at 10:34:43AM +0300, Георгий Кириченко wrote:
> > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote:
> > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, Георгий Кириченко wrote:
> > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote:
> > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko wrote:
> > > > > > diff --git a/src/box/iproto_constants.h
> > > > > > b/src/box/iproto_constants.h
> > > > > > index 728514297..d01cdf840 100644
> > > > > > --- a/src/box/iproto_constants.h
> > > > > > +++ b/src/box/iproto_constants.h
> > > > > > @@ -60,6 +60,9 @@ enum iproto_key {
> > > > > > 
> > > > > >  	IPROTO_SCHEMA_VERSION = 0x05,
> > > > > >  	IPROTO_SERVER_VERSION = 0x06,
> > > > > >  	IPROTO_GROUP_ID = 0x07,
> > > > > > 
> > > > > > +	IPROTO_TXN_ID = 0x08,
> > > > > > +	IPROTO_TXN_REPLICA_ID = 0x09,
> > > > > 
> > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I
> > > > > would
> > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make sure
> > > > > that
> > > > > transaction identifiers from different instances don't overlap,
> > > > > because
> > > > > there couldn't be a "multi-instance" transaction, could there?
> > 
> > txn_replica_id might be differ from replica_id in case when transaction
> > would be finished on other node, e.g. after recovery.
> 
> I'm kinda out of scope here. I think you need to write an RFC or
> something explaining all the design decisions you made when you
> introduced transaction boundaries. Or we should get together with
> Kostja and discuss it. Or both. Neither the commit message nor
> comments shed the light on the big picture.
> 
> > > > > > +	IPROTO_TXN_LAST = 0x0a,
> > > > > 
> > > > > I think we should instead introduce BEGIN and COMMIT commands, 
because:
> > > > I completely do not like any auto-commit logic in a xlog file. You
> > > > suggestion breaks backward compatibility because previous logs do not
> > > > have any BEGIN/
> > > 
> > > It wouldn't break backward compatibility. It might break forward
> > > compatibility, which is fine by me (we do it all the time).
> > 
> > Suggested xrow encoding/decoding rules means that any xrow without txn_id,
> > txn_replica_id, txn_last should be processed as a single statement
> > transaction as it was before.
> > If we would require explicit begin/commit then previous logs turns into an
> > invalid stream without autocommit semantic.
> 
> Yeah, we would have to add autocommit.
I am disagreed with autocommit. There are my key points:
1. We could not interleave autocommit transactions with others, or we would 
have to have special mark for this transaction but this is no autocommit after 
all.
2. Applier would not able to check transaction boundaries.
3. There is no difference between an autocommit transaction and a single-row 
transaction.
4. Autocommit is about interface behavior but not about log format.

Also let me explain xrow encoding rules:
If we have only one row in transaction, then we do not encode txn_id and 
txn_last. So single-row transaction (or autocommit if you wish) does not 
change its' representation. And then while decoding we set txn_id = lsn and 
txn_last if txn_id was not set. In that case each row from previous xlog acts 
as one single-row transaction.
For multi-row transaction we have only penalty in transaction identifiers for 
each row and one commit flag for the last.

> 
> > But I think, that txn_last should be renamed into txn_commit.
> > Also explicit begin operation is redundant because a new one pair txn_id/
> > txn_replica_id already means begin of an transaction.
> > 
> > > > COMMIT. Also separate BEGIN and COMMIT messages increase transaction
> > > > size.
> > > 
> > > I doubt that after compression you'll see a difference.
> > 
> > replication stream does not have any compression.
> 
> OK.
> 
> > > > It is worth noting, that IPROTO_NOP with a new txn_id or txn_last
> > > > emulates
> > > > BEGIN or COMMIT.
> > > 
> > > Yeah, but that would look weird.
> > > 
> > > > >  - We might need to attach some extra information to each
> > > > >  transaction,
> > > > >  
> > > > >    e.g. mark transactions that were committed in parallel on the
> > > > >    master
> > > > >    so that they can be committed in parallel on a replica. Attaching
> > > > >    such information to each row would be excessive.
> > > > 
> > > > The patch is not about this.
> > > 
> > > But we have to think about that in advance, don't we?
> > 
> > We do not have clear view how it should be done.
> 
> Which means we should think it through IMO. The two issues are closely
> related.
The main issue - proposed decision (track which transaction were batched) is 
not applicable, because vinyl might yield in a completely unpredictable 
manner. Also I have full-parallel solution but it does not require any 
information except transaction boundaries.

> 
> > > > >  - We will need BEGIN and COMMIT for IPROTO transactions. It would
> > > > >  be
> > > > >  
> > > > >    nice if we could share the code with them.
> > > > 
> > > > The biggest issue we could not know transaction identifier in case of
> > > > IPROTO. Iproto is single stream proto, but wal might be not as it is
> > > > multiplexing a lot of transactions in a one output, so it might be bad
> > > > be
> > > > in paradigm of universally format for both IPROTO and WAL.
> > > 
> > > OK. I think we need to discuss the options with Kostja.
> > > 
> > > > >  - Having separate BEGIN/COMMIT rows would make tarantoolctl-cat
> > > > >  output
> > > > >  
> > > > >    easier to read.
> > > > >    
> > > > > >  	/* Leave a gap for other keys in the header. */
> > > > > >  	IPROTO_SPACE_ID = 0x10,
> > > > > >  	IPROTO_INDEX_ID = 0x11,
> > > > > > 
> > > > > > diff --git a/src/box/wal.c b/src/box/wal.c
> > > > > > index 4c3537672..17ead08e7 100644
> > > > > > --- a/src/box/wal.c
> > > > > > +++ b/src/box/wal.c
> > > > > > @@ -905,9 +905,10 @@ wal_writer_begin_rollback(struct wal_writer
> > > > > > *writer)
> > > > > > 
> > > > > >  }
> > > > > >  
> > > > > >  static void
> > > > > > 
> > > > > > -wal_assign_lsn(struct vclock *vclock, struct xrow_header **row,
> > > > > > +wal_assign_lsn(struct vclock *vclock, struct xrow_header **begin,
> > > > > > 
> > > > > >  	       struct xrow_header **end)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > +	struct xrow_header **row = begin;
> > > > > > 
> > > > > >  	/** Assign LSN to all local rows. */
> > > > > >  	for ( ; row < end; row++) {
> > > > > >  	
> > > > > >  		if ((*row)->replica_id == 0) {
> > > > > > 
> > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, struct
> > > > > > xrow_header **row,>
> > > > > > 
> > > > > >  			vclock_follow_xrow(vclock, *row);
> > > > > >  		
> > > > > >  		}
> > > > > >  	
> > > > > >  	}
> > > > > > 
> > > > > > +	if ((*begin)->replica_id != instance_id) {
> > > > > > +		/*
> > > > > > +		 * Move all local changes to the end of rows array and
> > > > > > +		 * a fake local transaction (like an autonomous 
transaction)
> > > > > > +		 * because we could not replicate the transaction 
back.
> > > > > > +		 */
> > > > > > +		struct xrow_header **row = end - 1;
> > > > > > +		while (row >= begin) {
> > > > > > +			if (row[0]->replica_id != instance_id) {
> > > > > > +				--row;
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			/* Local row, move it back. */
> > > > > > +			struct xrow_header **local_row = row;
> > > > > > +			while (local_row < end - 1 &&
> > > > > > +			       local_row[1]->replica_id != instance_id) 
{
> > > > > > +				struct xrow_header *tmp = local_row[0];
> > > > > > +				local_row[0] = local_row[1];
> > > > > > +				local_row[1] = tmp;
> > > > > 
> > > > > There's SWAP for this.
> > > > > 
> > > > > > +			}
> > > > > > +			--row;
> > > > > > +		}
> > > > > > +		while (begin < end && begin[0]->replica_id != 
instance_id)
> > > > > > +			++begin;
> > > > > > +	}
> > > > > 
> > > > > I don't understand why we need to move rows generated locally (by
> > > > > an on_replace trigger I surmise) to the end of a transaction. We
> > > > > have TXN_ID attached to each row so we could leave the transactions
> > > > > interleaved, couldn't we?
> > > > 
> > > > You are right, but in that case applier should track a lot of
> > > > transaction
> > > > simultaneously. Also it complicates recovery too. I hope it will be
> > > > fixed
> > > > while parallel applier implementing.
> > > 
> > > May be, we should implement parallel applier in the scope of this issue
> > > then? Anyway, without it, sync replication won't scale with the number
> > > of parallel transactions.
> > 
> > I do not think so. Parallel applier depends on that feature, but it is
> > completely different issue.
> 
> But as I said, sync replication won't scale without it.
> 
> > Also parallel applier is ignorant on a xlog format.
> 
> Depends on how you implement it. For example, MySQL marks all
> transactions that were committed in parallel on the master so that
> they can be committed in parallel on replicas.
In case of vinyl it is not true.
> 
> > > > > > diff --git a/test/vinyl/errinj_stat.result
> > > > > > b/test/vinyl/errinj_stat.result
> > > > > > index 08801dbc6..361ddf5db 100644
> > > > > > --- a/test/vinyl/errinj_stat.result
> > > > > > +++ b/test/vinyl/errinj_stat.result
> > > > > > @@ -69,7 +69,7 @@ i:stat().disk.compaction.queue -- 30 statements
> > > > > > 
> > > > > >  - bytes_compressed: <bytes_compressed>
> > > > > >  
> > > > > >    pages: 3
> > > > > >    rows: 30
> > > > > > 
> > > > > > -  bytes: 411
> > > > > > +  bytes: 477
> > > > > > 
> > > > > >  ...
> > > > > >  i:stat().disk.compaction.queue.bytes ==
> > > > > >  box.stat.vinyl().scheduler.compaction_queue ---
> > > > > > 
> > > > > > @@ -83,7 +83,7 @@ i:stat().disk.compaction.queue -- 40 statements
> > > > > > 
> > > > > >  - bytes_compressed: <bytes_compressed>
> > > > > >  
> > > > > >    pages: 4
> > > > > >    rows: 40
> > > > > > 
> > > > > > -  bytes: 548
> > > > > > +  bytes: 636
> > > > > 
> > > > > I wouldn't expect vinyl stats to be changed by this patch.
> > > > > Why did it happen?
> > > > 
> > > > Because if rows were written in an one entry, then wal creates
> > > > transaction.
> > > 
> > > But those are vinyl files (run, index). They shouldn't be affected by
> > > this, should they?
> > 
> > vinyl uses xlog_write_entry that forms a transaction.
> 
> No, not for run/index files, it does not. Take a look at
> vy_run_dump_stmt. There's no point in writing any extra
> transaction information to run/index files.
Thanks, will check it
> 
> > I do not think that introducing a new version of xlog_write_entry
> > with/without transaction is a good idea.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20190131/b7d8db59/attachment.sig>


More information about the Tarantool-patches mailing list