[tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries

Vladimir Davydov vdavydov.dev at gmail.com
Thu Jan 31 11:19:51 MSK 2019


On Thu, Jan 31, 2019 at 10:34:43AM +0300, Георгий Кириченко wrote:
> On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote:
> > On Tue, Jan 29, 2019 at 01:09:50PM +0300, Георгий Кириченко wrote:
> > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote:
> > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko wrote:
> > > > > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> > > > > index 728514297..d01cdf840 100644
> > > > > --- a/src/box/iproto_constants.h
> > > > > +++ b/src/box/iproto_constants.h
> > > > > @@ -60,6 +60,9 @@ enum iproto_key {
> > > > > 
> > > > >  	IPROTO_SCHEMA_VERSION = 0x05,
> > > > >  	IPROTO_SERVER_VERSION = 0x06,
> > > > >  	IPROTO_GROUP_ID = 0x07,
> > > > > 
> > > > > +	IPROTO_TXN_ID = 0x08,
> > > > > +	IPROTO_TXN_REPLICA_ID = 0x09,
> > > > 
> > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I would
> > > > expect TXN_ID to be enough - we could use REPLICA_ID to make sure that
> > > > transaction identifiers from different instances don't overlap, because
> > > > there couldn't be a "multi-instance" transaction, could there?
> txn_replica_id might be differ from replica_id in case when transaction would 
> be finished on other node, e.g. after recovery.

I'm kinda out of scope here. I think you need to write an RFC or
something explaining all the design decisions you made when you
introduced transaction boundaries. Or we should get together with
Kostja and discuss it. Or both. Neither the commit message nor
comments shed the light on the big picture.

> > > > 
> > > > > +	IPROTO_TXN_LAST = 0x0a,
> > > > 
> > > > I think we should instead introduce BEGIN and COMMIT commands, because:
> > > I completely do not like any auto-commit logic in a xlog file. You
> > > suggestion breaks backward compatibility because previous logs do not
> > > have any BEGIN/
> > It wouldn't break backward compatibility. It might break forward
> > compatibility, which is fine by me (we do it all the time).
> Suggested xrow encoding/decoding rules means that any xrow without txn_id, 
> txn_replica_id, txn_last should be processed as a single statement transaction 
> as it was before.
> If we would require explicit begin/commit then previous logs turns into an 
> invalid stream without autocommit semantic.

Yeah, we would have to add autocommit.

> But I think, that txn_last should be renamed into txn_commit.
> Also explicit begin operation is redundant because a new one pair txn_id/
> txn_replica_id already means begin of an transaction. 
> > 
> > > COMMIT. Also separate BEGIN and COMMIT messages increase transaction size.
> > 
> > I doubt that after compression you'll see a difference.
> replication stream does not have any compression.

OK.

> > 
> > > It is worth noting, that IPROTO_NOP with a new txn_id or txn_last emulates
> > > BEGIN or COMMIT.
> > 
> > Yeah, but that would look weird.
> > 
> > > >  - We might need to attach some extra information to each transaction,
> > > >  
> > > >    e.g. mark transactions that were committed in parallel on the master
> > > >    so that they can be committed in parallel on a replica. Attaching
> > > >    such information to each row would be excessive.
> > > 
> > > The patch is not about this.
> > 
> > But we have to think about that in advance, don't we?
> We do not have clear view how it should be done.

Which means we should think it through IMO. The two issues are closely
related.

> > 
> > > >  - We will need BEGIN and COMMIT for IPROTO transactions. It would be
> > > >  
> > > >    nice if we could share the code with them.
> > > 
> > > The biggest issue we could not know transaction identifier in case of
> > > IPROTO. Iproto is single stream proto, but wal might be not as it is
> > > multiplexing a lot of transactions in a one output, so it might be bad be
> > > in paradigm of universally format for both IPROTO and WAL.
> > 
> > OK. I think we need to discuss the options with Kostja.
> > 
> > > >  - Having separate BEGIN/COMMIT rows would make tarantoolctl-cat output
> > > >  
> > > >    easier to read.
> > > >    
> > > > >  	/* Leave a gap for other keys in the header. */
> > > > >  	IPROTO_SPACE_ID = 0x10,
> > > > >  	IPROTO_INDEX_ID = 0x11,
> > > > > 
> > > > > diff --git a/src/box/wal.c b/src/box/wal.c
> > > > > index 4c3537672..17ead08e7 100644
> > > > > --- a/src/box/wal.c
> > > > > +++ b/src/box/wal.c
> > > > > @@ -905,9 +905,10 @@ wal_writer_begin_rollback(struct wal_writer
> > > > > *writer)
> > > > > 
> > > > >  }
> > > > >  
> > > > >  static void
> > > > > 
> > > > > -wal_assign_lsn(struct vclock *vclock, struct xrow_header **row,
> > > > > +wal_assign_lsn(struct vclock *vclock, struct xrow_header **begin,
> > > > > 
> > > > >  	       struct xrow_header **end)
> > > > >  
> > > > >  {
> > > > > 
> > > > > +	struct xrow_header **row = begin;
> > > > > 
> > > > >  	/** Assign LSN to all local rows. */
> > > > >  	for ( ; row < end; row++) {
> > > > >  	
> > > > >  		if ((*row)->replica_id == 0) {
> > > > > 
> > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, struct
> > > > > xrow_header **row,>
> > > > > 
> > > > >  			vclock_follow_xrow(vclock, *row);
> > > > >  		
> > > > >  		}
> > > > >  	
> > > > >  	}
> > > > > 
> > > > > +	if ((*begin)->replica_id != instance_id) {
> > > > > +		/*
> > > > > +		 * Move all local changes to the end of rows array and
> > > > > +		 * a fake local transaction (like an autonomous transaction)
> > > > > +		 * because we could not replicate the transaction back.
> > > > > +		 */
> > > > > +		struct xrow_header **row = end - 1;
> > > > > +		while (row >= begin) {
> > > > > +			if (row[0]->replica_id != instance_id) {
> > > > > +				--row;
> > > > > +				continue;
> > > > > +			}
> > > > > +			/* Local row, move it back. */
> > > > > +			struct xrow_header **local_row = row;
> > > > > +			while (local_row < end - 1 &&
> > > > > +			       local_row[1]->replica_id != instance_id) {
> > > > > +				struct xrow_header *tmp = local_row[0];
> > > > > +				local_row[0] = local_row[1];
> > > > > +				local_row[1] = tmp;
> > > > 
> > > > There's SWAP for this.
> > > > 
> > > > > +			}
> > > > > +			--row;
> > > > > +		}
> > > > > +		while (begin < end && begin[0]->replica_id != instance_id)
> > > > > +			++begin;
> > > > > +	}
> > > > 
> > > > I don't understand why we need to move rows generated locally (by
> > > > an on_replace trigger I surmise) to the end of a transaction. We
> > > > have TXN_ID attached to each row so we could leave the transactions
> > > > interleaved, couldn't we?
> > > 
> > > You are right, but in that case applier should track a lot of transaction
> > > simultaneously. Also it complicates recovery too. I hope it will be fixed
> > > while parallel applier implementing.
> > 
> > May be, we should implement parallel applier in the scope of this issue
> > then? Anyway, without it, sync replication won't scale with the number
> > of parallel transactions.
> I do not think so. Parallel applier depends on that feature, but it is 
> completely different issue.

But as I said, sync replication won't scale without it.

> Also parallel applier is ignorant on a xlog format.

Depends on how you implement it. For example, MySQL marks all
transactions that were committed in parallel on the master so that
they can be committed in parallel on replicas.

> > > > > diff --git a/test/vinyl/errinj_stat.result
> > > > > b/test/vinyl/errinj_stat.result
> > > > > index 08801dbc6..361ddf5db 100644
> > > > > --- a/test/vinyl/errinj_stat.result
> > > > > +++ b/test/vinyl/errinj_stat.result
> > > > > @@ -69,7 +69,7 @@ i:stat().disk.compaction.queue -- 30 statements
> > > > > 
> > > > >  - bytes_compressed: <bytes_compressed>
> > > > >  
> > > > >    pages: 3
> > > > >    rows: 30
> > > > > 
> > > > > -  bytes: 411
> > > > > +  bytes: 477
> > > > > 
> > > > >  ...
> > > > >  i:stat().disk.compaction.queue.bytes ==
> > > > >  box.stat.vinyl().scheduler.compaction_queue ---
> > > > > 
> > > > > @@ -83,7 +83,7 @@ i:stat().disk.compaction.queue -- 40 statements
> > > > > 
> > > > >  - bytes_compressed: <bytes_compressed>
> > > > >  
> > > > >    pages: 4
> > > > >    rows: 40
> > > > > 
> > > > > -  bytes: 548
> > > > > +  bytes: 636
> > > > 
> > > > I wouldn't expect vinyl stats to be changed by this patch.
> > > > Why did it happen?
> > > 
> > > Because if rows were written in an one entry, then wal creates
> > > transaction.
> > 
> > But those are vinyl files (run, index). They shouldn't be affected by
> > this, should they?
> vinyl uses xlog_write_entry that forms a transaction.

No, not for run/index files, it does not. Take a look at
vy_run_dump_stmt. There's no point in writing any extra
transaction information to run/index files.

> I do not think that introducing a new version of xlog_write_entry
> with/without transaction is a good idea.



More information about the Tarantool-patches mailing list