Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries

Tarantool development patches archive
 help / color / mirror / Atom feed

From: "Георгий Кириченко" <georgy@tarantool.org>
To: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries
Date: Fri, 01 Feb 2019 12:31:04 +0300	[thread overview]
Message-ID: <1612584.9ML3igCmdI@home.lan> (raw)
In-Reply-To: <20190131145454.hhuwbgff2hznsrkx@esperanza>

[-- Attachment #1: Type: text/plain, Size: 8757 bytes --]

On Thursday, January 31, 2019 5:54:54 PM MSK Vladimir Davydov wrote:
> On Thu, Jan 31, 2019 at 05:25:24PM +0300, Георгий Кириченко wrote:
> > On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote:
> > > On Thu, Jan 31, 2019 at 10:34:43AM +0300, Георгий Кириченко wrote:
> > > > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote:
> > > > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, Георгий Кириченко wrote:
> > > > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote:
> > > > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko 
wrote:
> > > > > > > > diff --git a/src/box/iproto_constants.h
> > > > > > > > b/src/box/iproto_constants.h
> > > > > > > > index 728514297..d01cdf840 100644
> > > > > > > > --- a/src/box/iproto_constants.h
> > > > > > > > +++ b/src/box/iproto_constants.h
> > > > > > > > @@ -60,6 +60,9 @@ enum iproto_key {
> > > > > > > > 
> > > > > > > >  	IPROTO_SCHEMA_VERSION = 0x05,
> > > > > > > >  	IPROTO_SERVER_VERSION = 0x06,
> > > > > > > >  	IPROTO_GROUP_ID = 0x07,
> > > > > > > > 
> > > > > > > > +	IPROTO_TXN_ID = 0x08,
> > > > > > > > +	IPROTO_TXN_REPLICA_ID = 0x09,
> > > > > > > 
> > > > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I
> > > > > > > would
> > > > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make
> > > > > > > sure
> > > > > > > that
> > > > > > > transaction identifiers from different instances don't overlap,
> > > > > > > because
> > > > > > > there couldn't be a "multi-instance" transaction, could there?
> > > > 
> > > > txn_replica_id might be differ from replica_id in case when
> > > > transaction
> > > > would be finished on other node, e.g. after recovery.
> > > 
> > > I'm kinda out of scope here. I think you need to write an RFC or
> > > something explaining all the design decisions you made when you
> > > introduced transaction boundaries. Or we should get together with
> > > Kostja and discuss it. Or both. Neither the commit message nor
> > > comments shed the light on the big picture.
> > > 
> > > > > > > > +	IPROTO_TXN_LAST = 0x0a,
> > > > > > > 
> > > > > > > I think we should instead introduce BEGIN and COMMIT commands, 
because:
> > > > > > I completely do not like any auto-commit logic in a xlog file. You
> > > > > > suggestion breaks backward compatibility because previous logs do
> > > > > > not
> > > > > > have any BEGIN/
> > > > > 
> > > > > It wouldn't break backward compatibility. It might break forward
> > > > > compatibility, which is fine by me (we do it all the time).
> > > > 
> > > > Suggested xrow encoding/decoding rules means that any xrow without
> > > > txn_id,
> > > > txn_replica_id, txn_last should be processed as a single statement
> > > > transaction as it was before.
> > > > If we would require explicit begin/commit then previous logs turns
> > > > into an
> > > > invalid stream without autocommit semantic.
> > > 
> > > Yeah, we would have to add autocommit.
> > 
> > I am disagreed with autocommit. There are my key points:
> > 1. We could not interleave autocommit transactions with others, or we
> > would
> > have to have special mark for this transaction but this is no autocommit
> > after all.
> 
> Do we need to support interleaving transactions in xlog? I'm not sure.
> May be, for sync replication we do.
> 
> Anyway, we could add BEGIN/COMMIT *in addition* to TXN_ID stored in each
> row. They would serve as a header and footer. This would give us a place
> holder to store transaction-wide information if we ever need it.
> Besides, it would make the stream more robust: if we loose BEGIN, we
> will detect it while if we loose the first transaction row in your case
> we won't. Yeah, explicit BEGIN/COMMIT will add a few bytes to each
> transaction, but IMO it's not critical.
> 
> Let's discuss it f2f when we can, because we seem to repeat ourselves
> without moving forward.
> 
> > 2. Applier would not able to check transaction boundaries.
> > 3. There is no difference between an autocommit transaction and a
> > single-row transaction.
> > 4. Autocommit is about interface behavior but not about log format.
> > 
> > Also let me explain xrow encoding rules:
> > If we have only one row in transaction, then we do not encode txn_id and
> > txn_last. So single-row transaction (or autocommit if you wish) does not
> > change its' representation. And then while decoding we set txn_id = lsn
> > and
> > txn_last if txn_id was not set. In that case each row from previous xlog
> > acts as one single-row transaction.
> > For multi-row transaction we have only penalty in transaction identifiers
> > for each row and one commit flag for the last.
> > 
> > > > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock,
> > > > > > > > struct
> > > > > > > > xrow_header **row,>
> > > > > > > > 
> > > > > > > >  			vclock_follow_xrow(vclock, *row);
> > > > > > > >  		
> > > > > > > >  		}
> > > > > > > >  	
> > > > > > > >  	}
> > > > > > > > 
> > > > > > > > +	if ((*begin)->replica_id != instance_id) {
> > > > > > > > +		/*
> > > > > > > > +		 * Move all local changes to the end of rows array 
and
> > > > > > > > +		 * a fake local transaction (like an autonomous
> > > > > > > > transaction)
> > > > > > > > +		 * because we could not replicate the transaction 
back.
> > > > > > > > +		 */
> > > > > > > > +		struct xrow_header **row = end - 1;
> > > > > > > > +		while (row >= begin) {
> > > > > > > > +			if (row[0]->replica_id != instance_id) {
> > > > > > > > +				--row;
> > > > > > > > +				continue;
> > > > > > > > +			}
> > > > > > > > +			/* Local row, move it back. */
> > > > > > > > +			struct xrow_header **local_row = row;
> > > > > > > > +			while (local_row < end - 1 &&
> > > > > > > > +			       local_row[1]->replica_id != 
instance_id) {
> > > > > > > > +				struct xrow_header *tmp = local_row[0];
> > > > > > > > +				local_row[0] = local_row[1];
> > > > > > > > +				local_row[1] = tmp;
> > > > > > > 
> > > > > > > There's SWAP for this.
> > > > > > > 
> > > > > > > > +			}
> > > > > > > > +			--row;
> > > > > > > > +		}
> > > > > > > > +		while (begin < end && begin[0]->replica_id != 
instance_id)
> > > > > > > > +			++begin;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > I don't understand why we need to move rows generated locally
> > > > > > > (by
> > > > > > > an on_replace trigger I surmise) to the end of a transaction. We
> > > > > > > have TXN_ID attached to each row so we could leave the
> > > > > > > transactions
> > > > > > > interleaved, couldn't we?
> > > > > > 
> > > > > > You are right, but in that case applier should track a lot of
> > > > > > transaction
> > > > > > simultaneously. Also it complicates recovery too. I hope it will
> > > > > > be
> > > > > > fixed
> > > > > > while parallel applier implementing.
> > > > > 
> > > > > May be, we should implement parallel applier in the scope of this
> > > > > issue
> > > > > then? Anyway, without it, sync replication won't scale with the
> > > > > number
> > > > > of parallel transactions.
> > > > 
> > > > I do not think so. Parallel applier depends on that feature, but it is
> > > > completely different issue.
> > > 
> > > But as I said, sync replication won't scale without it.
> > > 
> > > > Also parallel applier is ignorant on a xlog format.
> > > 
> > > Depends on how you implement it. For example, MySQL marks all
> > > transactions that were committed in parallel on the master so that
> > > they can be committed in parallel on replicas.
> > 
> > In case of vinyl it is not true.
> 
> In case of vinyl we can mark transactions that overlapped in time on the
> master. In fact, that's what MySQL actually does. I guess I misleaded
> you by saying "committed in parallel". I meant "executed in parallel".
The only thing we need is to send all transaction to commit in right order and 
restart a transaction in case of conflict. For example master did transactions 
t1, t2, t3. Replica starts t1, t2 and t3 in received order. But if t3 finished 
earlier that t2 then we should hold t3 commit until t2 sent to wal. After t3 
resume there are two possibilities: t3 is conflicting or not. In the second 
case we will send it to wal, in the case of conflict - just restart - we have 
all info to do that.
And we do not have to have any information about transaction batches on master 
- because all job will be done by local transaction manager.
In comparison with MySQL we also could process transaction in parallel event 
they ware done in a sequence.

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

next prev parent reply	other threads:[~2019-02-01  9:31 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-22 10:57 [tarantool-patches] [PATCH v2 0/2] Transaction boundaries in replication protocol Georgy Kirichenko
2019-01-22 10:57 ` [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries Georgy Kirichenko
2019-01-28 12:58   ` Vladimir Davydov
2019-01-29 10:09     ` Георгий Кириченко
2019-01-29 11:00       ` Vladimir Davydov
2019-01-31  7:34         ` Георгий Кириченко
2019-01-31  8:19           ` Vladimir Davydov
2019-01-31 14:25             ` Георгий Кириченко
2019-01-31 14:54               ` Vladimir Davydov
2019-02-01  9:31                 ` Георгий Кириченко [this message]
2019-01-28 13:00   ` Vladimir Davydov
2019-01-28 13:08     ` [tarantool-patches] " Vladislav Shpilevoy
2019-02-08 16:56   ` Konstantin Osipov
2019-01-22 10:57 ` [tarantool-patches] [PATCH v2 2/2] Transaction support for applier Georgy Kirichenko
2019-01-28 13:35   ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1612584.9ML3igCmdI@home.lan \
    --to=georgy@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --cc=vdavydov.dev@gmail.com \
    --subject='Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox