From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: =?utf-8?B?0JPQtdC+0YDQs9C40Lkg0JrQuNGA0LjRh9C10L3QutC+?= Subject: Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries Date: Fri, 01 Feb 2019 12:31:04 +0300 Message-ID: <1612584.9ML3igCmdI@home.lan> In-Reply-To: <20190131145454.hhuwbgff2hznsrkx@esperanza> References: <5221637.HvkEVp0ifG@home.lan> <20190131145454.hhuwbgff2hznsrkx@esperanza> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart2351800.UXdekoGaJD"; micalg="pgp-sha256"; protocol="application/pgp-signature" To: Vladimir Davydov Cc: tarantool-patches@freelists.org List-ID: --nextPart2351800.UXdekoGaJD Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="UTF-8" On Thursday, January 31, 2019 5:54:54 PM MSK Vladimir Davydov wrote: > On Thu, Jan 31, 2019 at 05:25:24PM +0300, =D0=93=D0=B5=D0=BE=D1=80=D0=B3= =D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrote: > > On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote: > > > On Thu, Jan 31, 2019 at 10:34:43AM +0300, =D0=93=D0=B5=D0=BE=D1=80=D0= =B3=D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrot= e: > > > > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote: > > > > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, =D0=93=D0=B5=D0=BE=D1= =80=D0=B3=D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0= =BE wrote: > > > > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wro= te: > > > > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko=20 wrote: > > > > > > > > diff --git a/src/box/iproto_constants.h > > > > > > > > b/src/box/iproto_constants.h > > > > > > > > index 728514297..d01cdf840 100644 > > > > > > > > --- a/src/box/iproto_constants.h > > > > > > > > +++ b/src/box/iproto_constants.h > > > > > > > > @@ -60,6 +60,9 @@ enum iproto_key { > > > > > > > >=20 > > > > > > > > IPROTO_SCHEMA_VERSION =3D 0x05, > > > > > > > > IPROTO_SERVER_VERSION =3D 0x06, > > > > > > > > IPROTO_GROUP_ID =3D 0x07, > > > > > > > >=20 > > > > > > > > + IPROTO_TXN_ID =3D 0x08, > > > > > > > > + IPROTO_TXN_REPLICA_ID =3D 0x09, > > > > > > >=20 > > > > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID= ? I > > > > > > > would > > > > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make > > > > > > > sure > > > > > > > that > > > > > > > transaction identifiers from different instances don't overla= p, > > > > > > > because > > > > > > > there couldn't be a "multi-instance" transaction, could there? > > > >=20 > > > > txn_replica_id might be differ from replica_id in case when > > > > transaction > > > > would be finished on other node, e.g. after recovery. > > >=20 > > > I'm kinda out of scope here. I think you need to write an RFC or > > > something explaining all the design decisions you made when you > > > introduced transaction boundaries. Or we should get together with > > > Kostja and discuss it. Or both. Neither the commit message nor > > > comments shed the light on the big picture. > > >=20 > > > > > > > > + IPROTO_TXN_LAST =3D 0x0a, > > > > > > >=20 > > > > > > > I think we should instead introduce BEGIN and COMMIT commands= ,=20 because: > > > > > > I completely do not like any auto-commit logic in a xlog file. = You > > > > > > suggestion breaks backward compatibility because previous logs = do > > > > > > not > > > > > > have any BEGIN/ > > > > >=20 > > > > > It wouldn't break backward compatibility. It might break forward > > > > > compatibility, which is fine by me (we do it all the time). > > > >=20 > > > > Suggested xrow encoding/decoding rules means that any xrow without > > > > txn_id, > > > > txn_replica_id, txn_last should be processed as a single statement > > > > transaction as it was before. > > > > If we would require explicit begin/commit then previous logs turns > > > > into an > > > > invalid stream without autocommit semantic. > > >=20 > > > Yeah, we would have to add autocommit. > >=20 > > I am disagreed with autocommit. There are my key points: > > 1. We could not interleave autocommit transactions with others, or we > > would > > have to have special mark for this transaction but this is no autocommit > > after all. >=20 > Do we need to support interleaving transactions in xlog? I'm not sure. > May be, for sync replication we do. >=20 > Anyway, we could add BEGIN/COMMIT *in addition* to TXN_ID stored in each > row. They would serve as a header and footer. This would give us a place > holder to store transaction-wide information if we ever need it. > Besides, it would make the stream more robust: if we loose BEGIN, we > will detect it while if we loose the first transaction row in your case > we won't. Yeah, explicit BEGIN/COMMIT will add a few bytes to each > transaction, but IMO it's not critical. >=20 > Let's discuss it f2f when we can, because we seem to repeat ourselves > without moving forward. >=20 > > 2. Applier would not able to check transaction boundaries. > > 3. There is no difference between an autocommit transaction and a > > single-row transaction. > > 4. Autocommit is about interface behavior but not about log format. > >=20 > > Also let me explain xrow encoding rules: > > If we have only one row in transaction, then we do not encode txn_id and > > txn_last. So single-row transaction (or autocommit if you wish) does not > > change its' representation. And then while decoding we set txn_id =3D l= sn > > and > > txn_last if txn_id was not set. In that case each row from previous xlog > > acts as one single-row transaction. > > For multi-row transaction we have only penalty in transaction identifie= rs > > for each row and one commit flag for the last. > >=20 > > > > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, > > > > > > > > struct > > > > > > > > xrow_header **row,> > > > > > > > >=20 > > > > > > > > vclock_follow_xrow(vclock, *row); > > > > > > > > =09 > > > > > > > > } > > > > > > > > =09 > > > > > > > > } > > > > > > > >=20 > > > > > > > > + if ((*begin)->replica_id !=3D instance_id) { > > > > > > > > + /* > > > > > > > > + * Move all local changes to the end of rows array=20 and > > > > > > > > + * a fake local transaction (like an autonomous > > > > > > > > transaction) > > > > > > > > + * because we could not replicate the transaction=20 back. > > > > > > > > + */ > > > > > > > > + struct xrow_header **row =3D end - 1; > > > > > > > > + while (row >=3D begin) { > > > > > > > > + if (row[0]->replica_id !=3D instance_id) { > > > > > > > > + --row; > > > > > > > > + continue; > > > > > > > > + } > > > > > > > > + /* Local row, move it back. */ > > > > > > > > + struct xrow_header **local_row =3D row; > > > > > > > > + while (local_row < end - 1 && > > > > > > > > + local_row[1]->replica_id !=3D=20 instance_id) { > > > > > > > > + struct xrow_header *tmp =3D local_row[0]; > > > > > > > > + local_row[0] =3D local_row[1]; > > > > > > > > + local_row[1] =3D tmp; > > > > > > >=20 > > > > > > > There's SWAP for this. > > > > > > >=20 > > > > > > > > + } > > > > > > > > + --row; > > > > > > > > + } > > > > > > > > + while (begin < end && begin[0]->replica_id !=3D=20 instance_id) > > > > > > > > + ++begin; > > > > > > > > + } > > > > > > >=20 > > > > > > > I don't understand why we need to move rows generated locally > > > > > > > (by > > > > > > > an on_replace trigger I surmise) to the end of a transaction.= We > > > > > > > have TXN_ID attached to each row so we could leave the > > > > > > > transactions > > > > > > > interleaved, couldn't we? > > > > > >=20 > > > > > > You are right, but in that case applier should track a lot of > > > > > > transaction > > > > > > simultaneously. Also it complicates recovery too. I hope it will > > > > > > be > > > > > > fixed > > > > > > while parallel applier implementing. > > > > >=20 > > > > > May be, we should implement parallel applier in the scope of this > > > > > issue > > > > > then? Anyway, without it, sync replication won't scale with the > > > > > number > > > > > of parallel transactions. > > > >=20 > > > > I do not think so. Parallel applier depends on that feature, but it= is > > > > completely different issue. > > >=20 > > > But as I said, sync replication won't scale without it. > > >=20 > > > > Also parallel applier is ignorant on a xlog format. > > >=20 > > > Depends on how you implement it. For example, MySQL marks all > > > transactions that were committed in parallel on the master so that > > > they can be committed in parallel on replicas. > >=20 > > In case of vinyl it is not true. >=20 > In case of vinyl we can mark transactions that overlapped in time on the > master. In fact, that's what MySQL actually does. I guess I misleaded > you by saying "committed in parallel". I meant "executed in parallel". The only thing we need is to send all transaction to commit in right order = and=20 restart a transaction in case of conflict. For example master did transacti= ons=20 t1, t2, t3. Replica starts t1, t2 and t3 in received order. But if t3 finis= hed=20 earlier that t2 then we should hold t3 commit until t2 sent to wal. After t= 3=20 resume there are two possibilities: t3 is conflicting or not. In the second= =20 case we will send it to wal, in the case of conflict - just restart - we ha= ve=20 all info to do that. And we do not have to have any information about transaction batches on mas= ter=20 =2D because all job will be done by local transaction manager. In comparison with MySQL we also could process transaction in parallel even= t=20 they ware done in a sequence. --nextPart2351800.UXdekoGaJD Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEFB+nbqWGnp59Rk9ZFSyY70x8X3sFAlxUEdgACgkQFSyY70x8 X3u2uQf/TZCeHsum7Sa1nAPDFKan/diR5EfwzhwubcCMYizs+H5DBGYSvg2At4xD Ud7HPEC8OqNJCOZSBwzU8tJMnxIUHHiyYlhQQjKLlhB2h97pVhtnONkqUUDuU6KZ knXXj7rk6tEjzwtBaJC00IsU+bxMkr5zT++nQPexrCoD14W6aId2qOxtFNHoG/XW PB80KaCgWkG9G+i0t6w8PRq6IuZENN73GUYpHZRjV4ZWj43Iz8mU2A3XRA69ExKG h402H7KTjMwI24qGA8i8FQVi/kiZ6zoG+kmZDWI5kNRtq+GESfOto1F7OMT7Duqz RapCrdmo4u9vCNg8iMvKNFSD8MepSg== =Vk+5 -----END PGP SIGNATURE----- --nextPart2351800.UXdekoGaJD--