From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: =?utf-8?B?0JPQtdC+0YDQs9C40Lkg0JrQuNGA0LjRh9C10L3QutC+?= Subject: Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries Date: Thu, 31 Jan 2019 17:25:24 +0300 Message-ID: <5221637.HvkEVp0ifG@home.lan> In-Reply-To: <20190131081951.e5bijlb7wcqzcqiv@esperanza> References: <3544285.z36fh00o7k@home.lan> <20190131081951.e5bijlb7wcqzcqiv@esperanza> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart2787205.CJaNPQNBCu"; micalg="pgp-sha256"; protocol="application/pgp-signature" To: Vladimir Davydov Cc: tarantool-patches@freelists.org List-ID: --nextPart2787205.CJaNPQNBCu Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="UTF-8" On Thursday, January 31, 2019 11:19:51 AM MSK Vladimir Davydov wrote: > On Thu, Jan 31, 2019 at 10:34:43AM +0300, =D0=93=D0=B5=D0=BE=D1=80=D0=B3= =D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrote: > > On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote: > > > On Tue, Jan 29, 2019 at 01:09:50PM +0300, =D0=93=D0=B5=D0=BE=D1=80=D0= =B3=D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrot= e: > > > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote: > > > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko wrote: > > > > > > diff --git a/src/box/iproto_constants.h > > > > > > b/src/box/iproto_constants.h > > > > > > index 728514297..d01cdf840 100644 > > > > > > --- a/src/box/iproto_constants.h > > > > > > +++ b/src/box/iproto_constants.h > > > > > > @@ -60,6 +60,9 @@ enum iproto_key { > > > > > >=20 > > > > > > IPROTO_SCHEMA_VERSION =3D 0x05, > > > > > > IPROTO_SERVER_VERSION =3D 0x06, > > > > > > IPROTO_GROUP_ID =3D 0x07, > > > > > >=20 > > > > > > + IPROTO_TXN_ID =3D 0x08, > > > > > > + IPROTO_TXN_REPLICA_ID =3D 0x09, > > > > >=20 > > > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I > > > > > would > > > > > expect TXN_ID to be enough - we could use REPLICA_ID to make sure > > > > > that > > > > > transaction identifiers from different instances don't overlap, > > > > > because > > > > > there couldn't be a "multi-instance" transaction, could there? > >=20 > > txn_replica_id might be differ from replica_id in case when transaction > > would be finished on other node, e.g. after recovery. >=20 > I'm kinda out of scope here. I think you need to write an RFC or > something explaining all the design decisions you made when you > introduced transaction boundaries. Or we should get together with > Kostja and discuss it. Or both. Neither the commit message nor > comments shed the light on the big picture. >=20 > > > > > > + IPROTO_TXN_LAST =3D 0x0a, > > > > >=20 > > > > > I think we should instead introduce BEGIN and COMMIT commands,=20 because: > > > > I completely do not like any auto-commit logic in a xlog file. You > > > > suggestion breaks backward compatibility because previous logs do n= ot > > > > have any BEGIN/ > > >=20 > > > It wouldn't break backward compatibility. It might break forward > > > compatibility, which is fine by me (we do it all the time). > >=20 > > Suggested xrow encoding/decoding rules means that any xrow without txn_= id, > > txn_replica_id, txn_last should be processed as a single statement > > transaction as it was before. > > If we would require explicit begin/commit then previous logs turns into= an > > invalid stream without autocommit semantic. >=20 > Yeah, we would have to add autocommit. I am disagreed with autocommit. There are my key points: 1. We could not interleave autocommit transactions with others, or we would= =20 have to have special mark for this transaction but this is no autocommit af= ter=20 all. 2. Applier would not able to check transaction boundaries. 3. There is no difference between an autocommit transaction and a single-ro= w=20 transaction. 4. Autocommit is about interface behavior but not about log format. Also let me explain xrow encoding rules: If we have only one row in transaction, then we do not encode txn_id and=20 txn_last. So single-row transaction (or autocommit if you wish) does not=20 change its' representation. And then while decoding we set txn_id =3D lsn a= nd=20 txn_last if txn_id was not set. In that case each row from previous xlog ac= ts=20 as one single-row transaction. =46or multi-row transaction we have only penalty in transaction identifiers= for=20 each row and one commit flag for the last. >=20 > > But I think, that txn_last should be renamed into txn_commit. > > Also explicit begin operation is redundant because a new one pair txn_i= d/ > > txn_replica_id already means begin of an transaction. > >=20 > > > > COMMIT. Also separate BEGIN and COMMIT messages increase transaction > > > > size. > > >=20 > > > I doubt that after compression you'll see a difference. > >=20 > > replication stream does not have any compression. >=20 > OK. >=20 > > > > It is worth noting, that IPROTO_NOP with a new txn_id or txn_last > > > > emulates > > > > BEGIN or COMMIT. > > >=20 > > > Yeah, but that would look weird. > > >=20 > > > > > - We might need to attach some extra information to each > > > > > transaction, > > > > > =20 > > > > > e.g. mark transactions that were committed in parallel on the > > > > > master > > > > > so that they can be committed in parallel on a replica. Attach= ing > > > > > such information to each row would be excessive. > > > >=20 > > > > The patch is not about this. > > >=20 > > > But we have to think about that in advance, don't we? > >=20 > > We do not have clear view how it should be done. >=20 > Which means we should think it through IMO. The two issues are closely > related. The main issue - proposed decision (track which transaction were batched) i= s=20 not applicable, because vinyl might yield in a completely unpredictable=20 manner. Also I have full-parallel solution but it does not require any=20 information except transaction boundaries. >=20 > > > > > - We will need BEGIN and COMMIT for IPROTO transactions. It would > > > > > be > > > > > =20 > > > > > nice if we could share the code with them. > > > >=20 > > > > The biggest issue we could not know transaction identifier in case = of > > > > IPROTO. Iproto is single stream proto, but wal might be not as it is > > > > multiplexing a lot of transactions in a one output, so it might be = bad > > > > be > > > > in paradigm of universally format for both IPROTO and WAL. > > >=20 > > > OK. I think we need to discuss the options with Kostja. > > >=20 > > > > > - Having separate BEGIN/COMMIT rows would make tarantoolctl-cat > > > > > output > > > > > =20 > > > > > easier to read. > > > > > =20 > > > > > > /* Leave a gap for other keys in the header. */ > > > > > > IPROTO_SPACE_ID =3D 0x10, > > > > > > IPROTO_INDEX_ID =3D 0x11, > > > > > >=20 > > > > > > diff --git a/src/box/wal.c b/src/box/wal.c > > > > > > index 4c3537672..17ead08e7 100644 > > > > > > --- a/src/box/wal.c > > > > > > +++ b/src/box/wal.c > > > > > > @@ -905,9 +905,10 @@ wal_writer_begin_rollback(struct wal_writer > > > > > > *writer) > > > > > >=20 > > > > > > } > > > > > > =20 > > > > > > static void > > > > > >=20 > > > > > > -wal_assign_lsn(struct vclock *vclock, struct xrow_header **row, > > > > > > +wal_assign_lsn(struct vclock *vclock, struct xrow_header **beg= in, > > > > > >=20 > > > > > > struct xrow_header **end) > > > > > > =20 > > > > > > { > > > > > >=20 > > > > > > + struct xrow_header **row =3D begin; > > > > > >=20 > > > > > > /** Assign LSN to all local rows. */ > > > > > > for ( ; row < end; row++) { > > > > > > =09 > > > > > > if ((*row)->replica_id =3D=3D 0) { > > > > > >=20 > > > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, stru= ct > > > > > > xrow_header **row,> > > > > > >=20 > > > > > > vclock_follow_xrow(vclock, *row); > > > > > > =09 > > > > > > } > > > > > > =09 > > > > > > } > > > > > >=20 > > > > > > + if ((*begin)->replica_id !=3D instance_id) { > > > > > > + /* > > > > > > + * Move all local changes to the end of rows array and > > > > > > + * a fake local transaction (like an autonomous=20 transaction) > > > > > > + * because we could not replicate the transaction=20 back. > > > > > > + */ > > > > > > + struct xrow_header **row =3D end - 1; > > > > > > + while (row >=3D begin) { > > > > > > + if (row[0]->replica_id !=3D instance_id) { > > > > > > + --row; > > > > > > + continue; > > > > > > + } > > > > > > + /* Local row, move it back. */ > > > > > > + struct xrow_header **local_row =3D row; > > > > > > + while (local_row < end - 1 && > > > > > > + local_row[1]->replica_id !=3D instance_id)=20 { > > > > > > + struct xrow_header *tmp =3D local_row[0]; > > > > > > + local_row[0] =3D local_row[1]; > > > > > > + local_row[1] =3D tmp; > > > > >=20 > > > > > There's SWAP for this. > > > > >=20 > > > > > > + } > > > > > > + --row; > > > > > > + } > > > > > > + while (begin < end && begin[0]->replica_id !=3D=20 instance_id) > > > > > > + ++begin; > > > > > > + } > > > > >=20 > > > > > I don't understand why we need to move rows generated locally (by > > > > > an on_replace trigger I surmise) to the end of a transaction. We > > > > > have TXN_ID attached to each row so we could leave the transactio= ns > > > > > interleaved, couldn't we? > > > >=20 > > > > You are right, but in that case applier should track a lot of > > > > transaction > > > > simultaneously. Also it complicates recovery too. I hope it will be > > > > fixed > > > > while parallel applier implementing. > > >=20 > > > May be, we should implement parallel applier in the scope of this iss= ue > > > then? Anyway, without it, sync replication won't scale with the number > > > of parallel transactions. > >=20 > > I do not think so. Parallel applier depends on that feature, but it is > > completely different issue. >=20 > But as I said, sync replication won't scale without it. >=20 > > Also parallel applier is ignorant on a xlog format. >=20 > Depends on how you implement it. For example, MySQL marks all > transactions that were committed in parallel on the master so that > they can be committed in parallel on replicas. In case of vinyl it is not true. >=20 > > > > > > diff --git a/test/vinyl/errinj_stat.result > > > > > > b/test/vinyl/errinj_stat.result > > > > > > index 08801dbc6..361ddf5db 100644 > > > > > > --- a/test/vinyl/errinj_stat.result > > > > > > +++ b/test/vinyl/errinj_stat.result > > > > > > @@ -69,7 +69,7 @@ i:stat().disk.compaction.queue -- 30 statemen= ts > > > > > >=20 > > > > > > - bytes_compressed: > > > > > > =20 > > > > > > pages: 3 > > > > > > rows: 30 > > > > > >=20 > > > > > > - bytes: 411 > > > > > > + bytes: 477 > > > > > >=20 > > > > > > ... > > > > > > i:stat().disk.compaction.queue.bytes =3D=3D > > > > > > box.stat.vinyl().scheduler.compaction_queue --- > > > > > >=20 > > > > > > @@ -83,7 +83,7 @@ i:stat().disk.compaction.queue -- 40 statemen= ts > > > > > >=20 > > > > > > - bytes_compressed: > > > > > > =20 > > > > > > pages: 4 > > > > > > rows: 40 > > > > > >=20 > > > > > > - bytes: 548 > > > > > > + bytes: 636 > > > > >=20 > > > > > I wouldn't expect vinyl stats to be changed by this patch. > > > > > Why did it happen? > > > >=20 > > > > Because if rows were written in an one entry, then wal creates > > > > transaction. > > >=20 > > > But those are vinyl files (run, index). They shouldn't be affected by > > > this, should they? > >=20 > > vinyl uses xlog_write_entry that forms a transaction. >=20 > No, not for run/index files, it does not. Take a look at > vy_run_dump_stmt. There's no point in writing any extra > transaction information to run/index files. Thanks, will check it >=20 > > I do not think that introducing a new version of xlog_write_entry > > with/without transaction is a good idea. --nextPart2787205.CJaNPQNBCu Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEFB+nbqWGnp59Rk9ZFSyY70x8X3sFAlxTBVQACgkQFSyY70x8 X3vL4wf9HPa/CjcYSQ4HAVmDVyS3FCeMuyddNMB8bUm54b3vRxGvIfr6cLzWhXue QNmobUE0eGI5nPfHzTu1z8ZcglcRC53fxGr+NA76HhjHOx4etjA+9uawcnT4UKnj qLRaScz4f6yH/58spd2ysj16mQJwf44jtauVHWWqbiFJOMjnm0uCpp2Qk3VB6+gJ P/8mkgTYlpRbn8rEoYtL65yc4ERNHiiiL4rf/csTn+OXXWMaZmW+CG+E700pRRTN Bmfc7L/PG6IFaFgGhd5eIMBTDDfXnKs3uAn5xBArN815kywUHWyqNj1GJEMXlv2F x2X5TOaXmM7T+0SWO6WTjUut+BPDKg== =auh7 -----END PGP SIGNATURE----- --nextPart2787205.CJaNPQNBCu--