From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: =?utf-8?B?0JPQtdC+0YDQs9C40Lkg0JrQuNGA0LjRh9C10L3QutC+?= Subject: Re: [tarantool-patches] [PATCH v2 1/2] Journal transaction boundaries Date: Thu, 31 Jan 2019 10:34:43 +0300 Message-ID: <3544285.z36fh00o7k@home.lan> In-Reply-To: <20190129110010.qav5cedodjfs4ju4@esperanza> References: <3229189.u9recYUF4a@home.lan> <20190129110010.qav5cedodjfs4ju4@esperanza> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart8380479.LmgfT5MsT3"; micalg="pgp-sha256"; protocol="application/pgp-signature" To: Vladimir Davydov Cc: tarantool-patches@freelists.org List-ID: --nextPart8380479.LmgfT5MsT3 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="UTF-8" On Tuesday, January 29, 2019 2:00:10 PM MSK Vladimir Davydov wrote: > On Tue, Jan 29, 2019 at 01:09:50PM +0300, =D0=93=D0=B5=D0=BE=D1=80=D0=B3= =D0=B8=D0=B9 =D0=9A=D0=B8=D1=80=D0=B8=D1=87=D0=B5=D0=BD=D0=BA=D0=BE wrote: > > On Monday, January 28, 2019 3:58:59 PM MSK Vladimir Davydov wrote: > > > On Tue, Jan 22, 2019 at 01:57:36PM +0300, Georgy Kirichenko wrote: > > > > Append txn_id, txn_replica_id and txn_last to xrow_header structure. > > > > txn_replica_id identifies replica where transaction was started and > > > > txn_id identifies transaction id on that replica. As transaction id > > > > a lsn of the first row in this transaction is used. > > > > txn_last set to true if it is the last row in a transaction, so we > > > > could commit transaction with last row or use additional NOP reques= ts > > > > with txn_last =3D true ans valid txn_id and txn_replica_id. > > > > For replication all local changes moved to xrows array tail to form > > > > a separate transaction (like autonomous transaction) because it is = not > > > > possible to replicate such transaction back to it's creator. > > > >=20 > > > > As encoding/deconding rules assumed: > > > > 1. txn_replica_id is encoded only if it is not equal with replica > > > > =20 > > > > id. This might have point because of replication trigger > > > > =20 > > > > 2. txn_id and txn_last are encoded only for multi-row transaction. > > > > =20 > > > > So if we do not have txn_id in a xstream then this means that it > > > > is a single row transaction. > > > >=20 > > > > This rules provides compatibility with previous xlog handling. > > > >=20 > > > > Needed for: 2798 > > > > --- > > > >=20 > > > > src/box/iproto_constants.h | 3 ++ > > >=20 > > > Looks like you forgot to update iproto_constants.c. > > >=20 > > > > src/box/wal.c | 36 +++++++++++++++- > > > > src/box/xrow.c | 38 +++++++++++++++++ > > > > src/box/xrow.h | 5 ++- > > > > test/unit/xrow.cc | 3 ++ > > > > test/vinyl/errinj_stat.result | 8 ++-- > > > > test/vinyl/layout.result | 24 +++++------ > > > > test/vinyl/stat.result | 78 > > > > +++++++++++++++++------------------ > > > > 8 files changed, 138 insertions(+), 57 deletions(-) > > > >=20 > > > > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h > > > > index 728514297..d01cdf840 100644 > > > > --- a/src/box/iproto_constants.h > > > > +++ b/src/box/iproto_constants.h > > > > @@ -60,6 +60,9 @@ enum iproto_key { > > > >=20 > > > > IPROTO_SCHEMA_VERSION =3D 0x05, > > > > IPROTO_SERVER_VERSION =3D 0x06, > > > > IPROTO_GROUP_ID =3D 0x07, > > > >=20 > > > > + IPROTO_TXN_ID =3D 0x08, > > > > + IPROTO_TXN_REPLICA_ID =3D 0x09, > > >=20 > > > Do we really need to introduce both TXN_ID and TXN_REPLICA_ID? I would > > > expect TXN_ID to be enough - we could use REPLICA_ID to make sure that > > > transaction identifiers from different instances don't overlap, becau= se > > > there couldn't be a "multi-instance" transaction, could there? txn_replica_id might be differ from replica_id in case when transaction wou= ld=20 be finished on other node, e.g. after recovery. > > >=20 > > > > + IPROTO_TXN_LAST =3D 0x0a, > > >=20 > > > I think we should instead introduce BEGIN and COMMIT commands, becaus= e: > > I completely do not like any auto-commit logic in a xlog file. You > > suggestion breaks backward compatibility because previous logs do not > > have any BEGIN/ > It wouldn't break backward compatibility. It might break forward > compatibility, which is fine by me (we do it all the time). Suggested xrow encoding/decoding rules means that any xrow without txn_id,= =20 txn_replica_id, txn_last should be processed as a single statement transact= ion=20 as it was before. If we would require explicit begin/commit then previous logs turns into an= =20 invalid stream without autocommit semantic. But I think, that txn_last should be renamed into txn_commit. Also explicit begin operation is redundant because a new one pair txn_id/ txn_replica_id already means begin of an transaction.=20 >=20 > > COMMIT. Also separate BEGIN and COMMIT messages increase transaction si= ze. >=20 > I doubt that after compression you'll see a difference. replication stream does not have any compression. >=20 > > It is worth noting, that IPROTO_NOP with a new txn_id or txn_last emula= tes > > BEGIN or COMMIT. >=20 > Yeah, but that would look weird. >=20 > > > - We might need to attach some extra information to each transaction, > > > =20 > > > e.g. mark transactions that were committed in parallel on the mast= er > > > so that they can be committed in parallel on a replica. Attaching > > > such information to each row would be excessive. > >=20 > > The patch is not about this. >=20 > But we have to think about that in advance, don't we? We do not have clear view how it should be done. >=20 > > > - We will need BEGIN and COMMIT for IPROTO transactions. It would be > > > =20 > > > nice if we could share the code with them. > >=20 > > The biggest issue we could not know transaction identifier in case of > > IPROTO. Iproto is single stream proto, but wal might be not as it is > > multiplexing a lot of transactions in a one output, so it might be bad = be > > in paradigm of universally format for both IPROTO and WAL. >=20 > OK. I think we need to discuss the options with Kostja. >=20 > > > - Having separate BEGIN/COMMIT rows would make tarantoolctl-cat outp= ut > > > =20 > > > easier to read. > > > =20 > > > > /* Leave a gap for other keys in the header. */ > > > > IPROTO_SPACE_ID =3D 0x10, > > > > IPROTO_INDEX_ID =3D 0x11, > > > >=20 > > > > diff --git a/src/box/wal.c b/src/box/wal.c > > > > index 4c3537672..17ead08e7 100644 > > > > --- a/src/box/wal.c > > > > +++ b/src/box/wal.c > > > > @@ -905,9 +905,10 @@ wal_writer_begin_rollback(struct wal_writer > > > > *writer) > > > >=20 > > > > } > > > > =20 > > > > static void > > > >=20 > > > > -wal_assign_lsn(struct vclock *vclock, struct xrow_header **row, > > > > +wal_assign_lsn(struct vclock *vclock, struct xrow_header **begin, > > > >=20 > > > > struct xrow_header **end) > > > > =20 > > > > { > > > >=20 > > > > + struct xrow_header **row =3D begin; > > > >=20 > > > > /** Assign LSN to all local rows. */ > > > > for ( ; row < end; row++) { > > > > =09 > > > > if ((*row)->replica_id =3D=3D 0) { > > > >=20 > > > > @@ -917,6 +918,39 @@ wal_assign_lsn(struct vclock *vclock, struct > > > > xrow_header **row,> > > > >=20 > > > > vclock_follow_xrow(vclock, *row); > > > > =09 > > > > } > > > > =09 > > > > } > > > >=20 > > > > + if ((*begin)->replica_id !=3D instance_id) { > > > > + /* > > > > + * Move all local changes to the end of rows array and > > > > + * a fake local transaction (like an autonomous transaction) > > > > + * because we could not replicate the transaction back. > > > > + */ > > > > + struct xrow_header **row =3D end - 1; > > > > + while (row >=3D begin) { > > > > + if (row[0]->replica_id !=3D instance_id) { > > > > + --row; > > > > + continue; > > > > + } > > > > + /* Local row, move it back. */ > > > > + struct xrow_header **local_row =3D row; > > > > + while (local_row < end - 1 && > > > > + local_row[1]->replica_id !=3D instance_id) { > > > > + struct xrow_header *tmp =3D local_row[0]; > > > > + local_row[0] =3D local_row[1]; > > > > + local_row[1] =3D tmp; > > >=20 > > > There's SWAP for this. > > >=20 > > > > + } > > > > + --row; > > > > + } > > > > + while (begin < end && begin[0]->replica_id !=3D instance_id) > > > > + ++begin; > > > > + } > > >=20 > > > I don't understand why we need to move rows generated locally (by > > > an on_replace trigger I surmise) to the end of a transaction. We > > > have TXN_ID attached to each row so we could leave the transactions > > > interleaved, couldn't we? > >=20 > > You are right, but in that case applier should track a lot of transacti= on > > simultaneously. Also it complicates recovery too. I hope it will be fix= ed > > while parallel applier implementing. >=20 > May be, we should implement parallel applier in the scope of this issue > then? Anyway, without it, sync replication won't scale with the number > of parallel transactions. I do not think so. Parallel applier depends on that feature, but it is=20 completely different issue. Also parallel applier is ignorant on a xlog for= mat. >=20 > > > > + /* Setup txn_id and tnx_replica_id for localy generated rows. */ > > > > + row =3D begin; > > > > + while (row < end) { > > > > + row[0]->txn_id =3D begin[0]->lsn; > > > > + row[0]->txn_replica_id =3D instance_id; > > > > + row[0]->txn_last =3D row =3D=3D end - 1 ? 1 : 0; > > > > + ++row; > > > > + } > > >=20 > > > I think we better use txn->id for TXN_ID rather than LSN. > > > Why do you think LSN should be used? I don't see any rationale > > > for that anywhere in the comments. Also, setting TXN_ID looks > > > like a job that should be done by txn_add_redo... > > >=20 > > > > } > > > > =20 > > > > static void > > > >=20 > > > > diff --git a/src/box/xrow.c b/src/box/xrow.c > > > > index ef3f81add..db524b3c8 100644 > > > > --- a/src/box/xrow.c > > > > +++ b/src/box/xrow.c > > > >=20 > > > > @@ -133,12 +133,32 @@ error: > > > > case IPROTO_SCHEMA_VERSION: > > > > header->schema_version =3D mp_decode_uint(pos); > > > > break; > > > >=20 > > > > + case IPROTO_TXN_ID: > > > > + header->txn_id =3D mp_decode_uint(pos); > > > > + break; > > > > + case IPROTO_TXN_REPLICA_ID: > > > > + header->txn_replica_id =3D mp_decode_uint(pos); > > > > + break; > > > > + case IPROTO_TXN_LAST: > > > > + header->txn_last =3D mp_decode_uint(pos); > > >=20 > > > Should be bool? > > >=20 > > > > + break; > > > >=20 > > > > default: > > > > /* unknown header */ > > > > mp_next(pos); > > > > =09 > > > > } > > > > =09 > > > > } > > > > assert(*pos <=3D end); > > > >=20 > > > > + if (header->txn_id =3D=3D 0) { > > > > + /* > > > > + * Transaction id is not set so it is a single statement > > > > + * transaction. > > > > + */ > > > > + header->txn_id =3D header->lsn; > > > > + header->txn_last =3D true; > > > > + } > > > > + if (header->txn_replica_id =3D=3D 0) > > > > + header->txn_replica_id =3D header->replica_id; > > > > + > > > >=20 > > > > /* Nop requests aren't supposed to have a body. */ > > > > if (*pos < end && header->type !=3D IPROTO_NOP) { > > > > =09 > > > > const char *body =3D *pos; > > > >=20 > > > > @@ -223,6 +243,24 @@ xrow_header_encode(const struct xrow_header > > > > *header, > > > > uint64_t sync,> > > > >=20 > > > > d =3D mp_encode_double(d, header->tm); > > > > map_size++; > > > > =09 > > > > } > > > >=20 > > > > + if (header->txn_id !=3D header->lsn || header->txn_last =3D=3D 0)= { > > > > + /* Encode txn id for multi row transaction members. */ > > > > + d =3D mp_encode_uint(d, IPROTO_TXN_ID); > > > > + d =3D mp_encode_uint(d, header->txn_id); > > > > + map_size++; > > > > + } > > > > + if (header->txn_replica_id !=3D header->replica_id) { > > > > + d =3D mp_encode_uint(d, IPROTO_TXN_REPLICA_ID); > > > > + d =3D mp_encode_uint(d, header->txn_replica_id); > > > > + map_size++; > > > > + } > > > > + if (header->txn_last && !(header->txn_id =3D=3D header->lsn && > > > > + header->txn_replica_id =3D=3D header->replica_id)) > >=20 > > { > >=20 > > > > + /* Set last row for multi row transaction. */ > > > > + d =3D mp_encode_uint(d, IPROTO_TXN_LAST); > > > > + d =3D mp_encode_uint(d, header->txn_last); > > > > + map_size++; > > > > + } > > > >=20 > > > > assert(d <=3D data + XROW_HEADER_LEN_MAX); > > > > mp_encode_map(data, map_size); > > > > out->iov_len =3D d - (char *) out->iov_base; > > > >=20 > > > > diff --git a/src/box/xrow.h b/src/box/xrow.h > > > > index 6bab0a1fd..4acd84d56 100644 > > > > --- a/src/box/xrow.h > > > > +++ b/src/box/xrow.h > > > > @@ -47,7 +47,7 @@ enum { > > > >=20 > > > > XROW_HEADER_IOVMAX =3D 1, > > > > XROW_BODY_IOVMAX =3D 2, > > > > XROW_IOVMAX =3D XROW_HEADER_IOVMAX + XROW_BODY_IOVMAX, > > > >=20 > > > > - XROW_HEADER_LEN_MAX =3D 40, > > > > + XROW_HEADER_LEN_MAX =3D 60, > > > >=20 > > > > XROW_BODY_LEN_MAX =3D 128, > > > > IPROTO_HEADER_LEN =3D 28, > > > > /** 7 =3D sizeof(iproto_body_bin). */ > > > >=20 > > > > @@ -69,6 +69,9 @@ struct xrow_header { > > > >=20 > > > > uint64_t sync; > > > > int64_t lsn; /* LSN must be signed for correct comparison */ > > > > double tm; > > > >=20 > > > > + int64_t txn_id; > > > > + uint32_t txn_replica_id; > > > > + uint32_t txn_last; > > > >=20 > > > > int bodycnt; > > > > uint32_t schema_version; > > > >=20 > > > > diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc > > > > index 165a543cf..0b796d728 100644 > > > > --- a/test/unit/xrow.cc > > > > +++ b/test/unit/xrow.cc > > > > @@ -215,6 +215,9 @@ test_xrow_header_encode_decode() > > > >=20 > > > > header.lsn =3D 400; > > > > header.tm =3D 123.456; > > > > header.bodycnt =3D 0; > > > >=20 > > > > + header.txn_id =3D header.lsn; > > > > + header.txn_replica_id =3D header.replica_id; > > > > + header.txn_last =3D true; > > > >=20 > > > > uint64_t sync =3D 100500; > > > > struct iovec vec[1]; > > > > is(1, xrow_header_encode(&header, sync, vec, 200), "encode"); > > > >=20 > > > > diff --git a/test/vinyl/errinj_stat.result > > > > b/test/vinyl/errinj_stat.result > > > > index 08801dbc6..361ddf5db 100644 > > > > --- a/test/vinyl/errinj_stat.result > > > > +++ b/test/vinyl/errinj_stat.result > > > > @@ -69,7 +69,7 @@ i:stat().disk.compaction.queue -- 30 statements > > > >=20 > > > > - bytes_compressed: > > > > =20 > > > > pages: 3 > > > > rows: 30 > > > >=20 > > > > - bytes: 411 > > > > + bytes: 477 > > > >=20 > > > > ... > > > > i:stat().disk.compaction.queue.bytes =3D=3D > > > > box.stat.vinyl().scheduler.compaction_queue --- > > > >=20 > > > > @@ -83,7 +83,7 @@ i:stat().disk.compaction.queue -- 40 statements > > > >=20 > > > > - bytes_compressed: > > > > =20 > > > > pages: 4 > > > > rows: 40 > > > >=20 > > > > - bytes: 548 > > > > + bytes: 636 > > >=20 > > > I wouldn't expect vinyl stats to be changed by this patch. > > > Why did it happen? > >=20 > > Because if rows were written in an one entry, then wal creates > > transaction. >=20 > But those are vinyl files (run, index). They shouldn't be affected by > this, should they? vinyl uses xlog_write_entry that forms a transaction. I do not think that=20 introducing a new version of xlog_write_entry with/without transaction is a= =20 good idea. --nextPart8380479.LmgfT5MsT3 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- iQEyBAABCAAdFiEEFB+nbqWGnp59Rk9ZFSyY70x8X3sFAlxSpRMACgkQFSyY70x8 X3vOEQf0DsOWsHdYzKy9GqToqHV7EmdemOpiajOIcuLYJsCzyxwq5akv70urjTCo mIYLaNwUTFXqqhsbAqz5mJc3sYYDvQJcpExAi0r22SY09ByMRp6pfiIGOmGNFhl0 EDApkkPCOoldhNapRz5OCaCTUdiBVdsVRl0x3TPM8Wn9lXs9NL005+2c7bFtbN60 DAZfLR5cOdV1rhdK2/73D5NtvLl7qRSOOOOGr2CXnVcem/iEZfgeNPP7thGGheU+ l1wfR3KDs3invLsTWEcBc/ugLiDfGD5pox8LKPrYGsFJT4JEmFbm5IAOxlspY4bN 3Q9m+BKT0h5aDeG/sHbqzCh2PQRT =N+Qo -----END PGP SIGNATURE----- --nextPart8380479.LmgfT5MsT3--