Please take a look once time permits. I'm mostly worried about exporting
limbo internals, but without it we can't test the order of promotion
tracking.

branch gorcunov/gh-6036-rollback-confirm-17
issue https://github.com/tarantool/tarantool/issues/6036
previous series https://lists.tarantool.org/tarantool-patches/YUso7ImbeNg82tfv@grain/T/#t

Cyrill Gorcunov (5):
  latch: add latch_is_locked helper
  qsync: order access to the limbo terms
  qsync: track confirmed_lsn upon CONFIRM packet
  qsync: export more details on promote tracking
  test: add gh-6036-term-order test

 src/box/applier.cc                           |  16 +-
 src/box/box.cc                               |  71 +++---
 src/box/lua/info.c                           |  11 +-
 src/box/memtx_engine.cc                      |   3 +-
 src/box/txn_limbo.c                          |  48 ++++-
 src/box/txn_limbo.h                          |  49 ++++-
 src/lib/core/latch.h                         |  11 +
 test/replication/gh-6036-order-master.lua    |   1 +
 test/replication/gh-6036-order-node.lua      |  60 ++++++
 test/replication/gh-6036-order-replica1.lua  |   1 +
 test/replication/gh-6036-order-replica2.lua  |   1 +
 test/replication/gh-6036-term-order.result   | 216 +++++++++++++++++++
 test/replication/gh-6036-term-order.test.lua |  94 ++++++++
 test/replication/suite.cfg                   |   1 +
 test/replication/suite.ini                   |   2 +-
 15 files changed, 531 insertions(+), 54 deletions(-)
 create mode 120000 test/replication/gh-6036-order-master.lua
 create mode 100644 test/replication/gh-6036-order-node.lua
 create mode 120000 test/replication/gh-6036-order-replica1.lua
 create mode 120000 test/replication/gh-6036-order-replica2.lua
 create mode 100644 test/replication/gh-6036-term-order.result
 create mode 100644 test/replication/gh-6036-term-order.test.lua

base-commit: e509054ccd9a6ccea8a0c201a4ebd12b07826026
--
2.31.1
To test if latch is locked. Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/lib/core/latch.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/src/lib/core/latch.h b/src/lib/core/latch.h index 49c59cf63..0aaa8b634 100644 --- a/src/lib/core/latch.h +++ b/src/lib/core/latch.h @@ -95,6 +95,17 @@ latch_owner(struct latch *l) return l->owner; } +/** + * Return true if the latch is locked. + * + * @param l - latch to be tested. + */ +static inline bool +latch_is_locked(const struct latch *l) +{ + return l->owner != NULL; +} + /** * Lock a latch. If the latch is already locked by another fiber, * waits for timeout. -- 2.31.1
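For reference, the helper's semantics boil down to a NULL check on the owner pointer. A minimal self-contained model of that (this is an illustrative stand-in only — the real `struct latch` also keeps a fiber wait queue, and `latch_lock()` yields until the owner releases it rather than asserting):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fiber; /* opaque, as in the real code */

/* Simplified latch: ownership is just a non-NULL owner pointer. */
struct latch {
	struct fiber *owner; /* NULL when unlocked */
};

static inline bool
latch_is_locked(const struct latch *l)
{
	return l->owner != NULL;
}

static inline void
latch_lock(struct latch *l, struct fiber *self)
{
	/* The real latch waits here; this toy model assumes no contention. */
	assert(l->owner == NULL);
	l->owner = self;
}

static inline void
latch_unlock(struct latch *l)
{
	l->owner = NULL;
}
```

This is exactly the invariant the new `latch_is_locked()` assertions in the later patches rely on: locked between lock and unlock, unlocked otherwise.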
Limbo term tracking is shared between appliers: while one applier is
waiting for a write to complete inside the journal_write() routine,
another may need to read the current term value to figure out whether a
promote request is valid to apply. Due to cooperative multitasking,
access to the terms is not consistent, so we need to make sure that
other fibers read up-to-date terms (i.e. terms already written to the
WAL). For this we use a latching mechanism: while one fiber holds the
lock for an update, other readers wait until the operation is complete.

For example, here is a call graph of two appliers:

applier 1
---------
applier_apply_tx
  (promote term = 3, current max term = 2)
  applier_synchro_filter_tx
  apply_synchro_row
    journal_write (sleeping)

At this moment another applier comes in with obsolete data and term 2:

applier 2
---------
applier_apply_tx
  (term 2)
  applier_synchro_filter_tx
    txn_limbo_is_replica_outdated -> false
  journal_write (sleep)

applier 1
---------
journal wakes up
  apply_synchro_row_cb
    set max term to 3

So applier 2 did not notice that term 3 had already been seen, and wrote
obsolete data. With locking, applier 2 waits until applier 1 has
finished its write.

We introduce the following helpers:

1) txn_limbo_process_begin: takes the lock (and will validate the
   request in the future, for the sake of split-brain detection);

2) txn_limbo_process_commit and txn_limbo_process_rollback: both simply
   release the lock, but have different names for better semantics;

3) txn_limbo_process: a general function which uses the _begin and
   _commit helpers internally.

Since txn_limbo_process_begin() can fail, we had to change all callers
to return an error. In particular, the box.ctl.promote() and
box.ctl.demote() commands may now fail. Internally they write a promote
or demote packet to the WAL and process the synchro queue, so to
eliminate code duplication we use the new box_issue_synchro() helper,
which respects the locking mechanism.
Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/applier.cc | 16 +++++++--- src/box/box.cc | 71 +++++++++++++++++++---------------------- src/box/memtx_engine.cc | 3 +- src/box/txn_limbo.c | 34 ++++++++++++++++++-- src/box/txn_limbo.h | 49 +++++++++++++++++++++++++--- 5 files changed, 121 insertions(+), 52 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index b981bd436..f0751b68a 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier) struct synchro_request req; if (xrow_decode_synchro(&row, &req) != 0) diag_raise(); - txn_limbo_process(&txn_limbo, &req); + if (txn_limbo_process(&txn_limbo, &req) != 0) + diag_raise(); } else if (iproto_type_is_raft_request(row.type)) { struct raft_request req; if (xrow_decode_raft(&row, &req, NULL) != 0) @@ -857,7 +858,7 @@ apply_synchro_row_cb(struct journal_entry *entry) applier_rollback_by_wal_io(entry->res); } else { replica_txn_wal_write_cb(synchro_entry->rcb); - txn_limbo_process(&txn_limbo, synchro_entry->req); + txn_limbo_process_run(&txn_limbo, synchro_entry->req); trigger_run(&replicaset.applier.on_wal_write, NULL); } fiber_wakeup(synchro_entry->owner); @@ -873,6 +874,9 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) if (xrow_decode_synchro(row, &req) != 0) goto err; + if (txn_limbo_process_begin(&txn_limbo, &req) != 0) + goto err; + struct replica_cb_data rcb_data; struct synchro_entry entry; /* @@ -910,12 +914,16 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) * transactions side, including the async ones. 
*/ if (journal_write(&entry.base) != 0) - goto err; + goto err_rollback; if (entry.base.res < 0) { diag_set_journal_res(entry.base.res); - goto err; + goto err_rollback; } + txn_limbo_process_commit(&txn_limbo); return 0; + +err_rollback: + txn_limbo_process_rollback(&txn_limbo); err: diag_log(); return -1; diff --git a/src/box/box.cc b/src/box/box.cc index 7b11d56d6..19e67b186 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -424,8 +424,7 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row) say_error("couldn't decode a synchro request"); return -1; } - txn_limbo_process(&txn_limbo, &syn_req); - return 0; + return txn_limbo_process(&txn_limbo, &syn_req); } static int @@ -1670,48 +1669,43 @@ box_wait_limbo_acked(double timeout) return wait_lsn; } -/** Write and process a PROMOTE request. */ -static void -box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) +/** Write and process PROMOTE or DEMOTE request. */ +static int +box_issue_synchro(uint16_t type, uint32_t prev_leader_id, int64_t promote_lsn) { struct raft *raft = box_raft(); + + assert(type == IPROTO_RAFT_PROMOTE || + type == IPROTO_RAFT_DEMOTE); assert(raft->volatile_term == raft->term); assert(promote_lsn >= 0); - txn_limbo_write_promote(&txn_limbo, promote_lsn, - raft->term); + struct synchro_request req = { - .type = IPROTO_RAFT_PROMOTE, - .replica_id = prev_leader_id, - .origin_id = instance_id, - .lsn = promote_lsn, - .term = raft->term, + .type = type, + .replica_id = prev_leader_id, + .origin_id = instance_id, + .lsn = promote_lsn, + .term = raft->term, }; - txn_limbo_process(&txn_limbo, &req); + + if (txn_limbo_process_begin(&txn_limbo, &req) != 0) + return -1; + + if (type == IPROTO_RAFT_PROMOTE) + txn_limbo_write_promote(&txn_limbo, req.lsn, req.term); + else + txn_limbo_write_demote(&txn_limbo, req.lsn, req.term); + + txn_limbo_process_run(&txn_limbo, &req); assert(txn_limbo_is_empty(&txn_limbo)); + + txn_limbo_process_commit(&txn_limbo); + return 0; } 
/** A guard to block multiple simultaneous promote()/demote() invocations. */ static bool is_in_box_promote = false; -/** Write and process a DEMOTE request. */ -static void -box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn) -{ - assert(box_raft()->volatile_term == box_raft()->term); - assert(promote_lsn >= 0); - txn_limbo_write_demote(&txn_limbo, promote_lsn, - box_raft()->term); - struct synchro_request req = { - .type = IPROTO_RAFT_DEMOTE, - .replica_id = prev_leader_id, - .origin_id = instance_id, - .lsn = promote_lsn, - .term = box_raft()->term, - }; - txn_limbo_process(&txn_limbo, &req); - assert(txn_limbo_is_empty(&txn_limbo)); -} - int box_promote_qsync(void) { @@ -1732,8 +1726,8 @@ box_promote_qsync(void) diag_set(ClientError, ER_NOT_LEADER, raft->leader); return -1; } - box_issue_promote(txn_limbo.owner_id, wait_lsn); - return 0; + return box_issue_synchro(IPROTO_RAFT_PROMOTE, + txn_limbo.owner_id, wait_lsn); } int @@ -1789,9 +1783,8 @@ box_promote(void) if (wait_lsn < 0) return -1; - box_issue_promote(txn_limbo.owner_id, wait_lsn); - - return 0; + return box_issue_synchro(IPROTO_RAFT_PROMOTE, + txn_limbo.owner_id, wait_lsn); } int @@ -1826,8 +1819,8 @@ box_demote(void) int64_t wait_lsn = box_wait_limbo_acked(replication_synchro_timeout); if (wait_lsn < 0) return -1; - box_issue_demote(txn_limbo.owner_id, wait_lsn); - return 0; + return box_issue_synchro(IPROTO_RAFT_DEMOTE, + txn_limbo.owner_id, wait_lsn); } int diff --git a/src/box/memtx_engine.cc b/src/box/memtx_engine.cc index de918c335..09f1d671c 100644 --- a/src/box/memtx_engine.cc +++ b/src/box/memtx_engine.cc @@ -238,8 +238,7 @@ memtx_engine_recover_synchro(const struct xrow_header *row) * because all its rows have a zero replica_id. 
*/ req.origin_id = req.replica_id; - txn_limbo_process(&txn_limbo, &req); - return 0; + return txn_limbo_process(&txn_limbo, &req); } static int diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 70447caaf..eb9aa7780 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -47,6 +47,7 @@ txn_limbo_create(struct txn_limbo *limbo) vclock_create(&limbo->vclock); vclock_create(&limbo->promote_term_map); limbo->promote_greatest_term = 0; + latch_create(&limbo->promote_latch); limbo->confirmed_lsn = 0; limbo->rollback_count = 0; limbo->is_in_rollback = false; @@ -724,11 +725,14 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout) } void -txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) +txn_limbo_process_run(struct txn_limbo *limbo, + const struct synchro_request *req) { + assert(latch_is_locked(&limbo->promote_latch)); + uint64_t term = req->term; uint32_t origin = req->origin_id; - if (txn_limbo_replica_term(limbo, origin) < term) { + if (vclock_get(&limbo->promote_term_map, origin) < (int64_t)term) { vclock_follow(&limbo->promote_term_map, origin, term); if (term > limbo->promote_greatest_term) limbo->promote_greatest_term = term; @@ -786,6 +790,32 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) return; } +int +txn_limbo_process_begin(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + latch_lock(&limbo->promote_latch); + /* + * FIXME: For now we take a lock only but idea + * is to verify incoming requests to detect split + * brain situation. Thus we need to change the code + * semantics in advance. 
+ */ + (void)req; + return 0; +} + +int +txn_limbo_process(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + if (txn_limbo_process_begin(limbo, req) != 0) + return -1; + txn_limbo_process_run(limbo, req); + txn_limbo_process_commit(limbo); + return 0; +} + void txn_limbo_on_parameters_change(struct txn_limbo *limbo) { diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index 53e52f676..afe9b429f 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -31,6 +31,7 @@ */ #include "small/rlist.h" #include "vclock/vclock.h" +#include "latch.h" #include <stdint.h> @@ -147,6 +148,10 @@ struct txn_limbo { * limbo and raft are in sync and the terms are the same. */ uint64_t promote_greatest_term; + /** + * To order access to the promote data. + */ + struct latch promote_latch; /** * Maximal LSN gathered quorum and either already confirmed in WAL, or * whose confirmation is in progress right now. Any attempt to confirm @@ -216,9 +221,12 @@ txn_limbo_last_entry(struct txn_limbo *limbo) * @a replica_id. */ static inline uint64_t -txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id) +txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id) { - return vclock_get(&limbo->promote_term_map, replica_id); + latch_lock(&limbo->promote_latch); + uint64_t v = vclock_get(&limbo->promote_term_map, replica_id); + latch_unlock(&limbo->promote_latch); + return v; } /** @@ -226,11 +234,14 @@ txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id) * data from it. The check is only valid when elections are enabled. 
*/ static inline bool -txn_limbo_is_replica_outdated(const struct txn_limbo *limbo, +txn_limbo_is_replica_outdated(struct txn_limbo *limbo, uint32_t replica_id) { - return txn_limbo_replica_term(limbo, replica_id) < - limbo->promote_greatest_term; + latch_lock(&limbo->promote_latch); + uint64_t v = vclock_get(&limbo->promote_term_map, replica_id); + bool res = v < limbo->promote_greatest_term; + latch_unlock(&limbo->promote_latch); + return res; } /** @@ -300,8 +311,36 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn); int txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry); +/** + * Initiate execution of a synchronous replication request. + */ +int +txn_limbo_process_begin(struct txn_limbo *limbo, + const struct synchro_request *req); + +/** Commit a synchronous replication request. */ +static inline void +txn_limbo_process_commit(struct txn_limbo *limbo) +{ + assert(latch_is_locked(&limbo->promote_latch)); + latch_unlock(&limbo->promote_latch); +} + +/** Rollback a synchronous replication request. */ +static inline void +txn_limbo_process_rollback(struct txn_limbo *limbo) +{ + assert(latch_is_locked(&limbo->promote_latch)); + latch_unlock(&limbo->promote_latch); +} + /** Execute a synchronous replication request. */ void +txn_limbo_process_run(struct txn_limbo *limbo, + const struct synchro_request *req); + +/** Process a synchronous replication request. */ +int txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req); /** -- 2.31.1
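The call protocol introduced by this patch — lock in `_begin()`, update the in-memory terms only after the journal write succeeded, release the lock in `_commit()`/`_rollback()` — can be sketched with toy stand-ins. Everything below is illustrative: the latch is reduced to a flag (single fiber, no waiting), `journal_write()` is replaced by a return code passed in, and only the greatest-term bookkeeping survives from `txn_limbo_process_run()`:

```c
#include <assert.h>
#include <stdbool.h>

static bool promote_latch_locked;
static unsigned long long promote_greatest_term;

static int
toy_process_begin(void)
{
	/* latch_lock(&limbo->promote_latch) in the real code. */
	assert(!promote_latch_locked);
	promote_latch_locked = true;
	return 0; /* future home of split-brain validation */
}

static void
toy_process_run(unsigned long long term)
{
	assert(promote_latch_locked);
	if (term > promote_greatest_term)
		promote_greatest_term = term;
}

static void
toy_process_commit(void)
{
	promote_latch_locked = false;
}

static void
toy_process_rollback(void)
{
	promote_latch_locked = false;
}

/* Shape of apply_synchro_row(): terms move only after a good write. */
static int
toy_apply_synchro_row(unsigned long long term, int journal_write_rc)
{
	if (toy_process_begin() != 0)
		return -1;
	if (journal_write_rc != 0) {
		toy_process_rollback();
		return -1;
	}
	toy_process_run(term);
	toy_process_commit();
	return 0;
}
```

The key property the patch is after: a failed (or still-sleeping) WAL write can never advance the visible max term, and concurrent appliers block in `_begin()` until the writer is done.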
While investigating various cluster split-brain scenarios and trying to
determine valid domains and ranges for incoming synchro requests, we
discovered that limbo::confirmed_lsn is not updated densely enough to
cover our needs. In particular, this variable is always updated by the
limbo owner upon the write of a synchro entry (to a journal), while a
replica just reads such a record without updating confirmed_lsn. So
when the replica becomes the cluster leader, it sends a promote request
back to the former leader, and the packet carries a zero LSN instead of
the previous confirmed_lsn, so validation of such a packet won't pass.

Note that packet validation is not yet implemented in this patch; this
is rather preparatory work for the future.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/txn_limbo.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index eb9aa7780..959811309 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -774,6 +774,20 @@ txn_limbo_process_run(struct txn_limbo *limbo,
 	switch (req->type) {
 	case IPROTO_RAFT_CONFIRM:
 		txn_limbo_read_confirm(limbo, lsn);
+		/*
+		 * We have to adjust confirmed_lsn according
+		 * to the LSN coming from the request, because
+		 * we will need to report it as the old limbo
+		 * owner's LSN inside PROMOTE requests (if the
+		 * administrator or election engine makes us so).
+		 *
+		 * We could update confirmed_lsn on every
+		 * txn_limbo_read_confirm call, but this function
+		 * usually comes paired with txn_limbo_write_confirm,
+		 * so to eliminate redundant updates we do it
+		 * once here, explicitly.
+		 */
+		limbo->confirmed_lsn = req->lsn;
 		break;
 	case IPROTO_RAFT_ROLLBACK:
 		txn_limbo_read_rollback(limbo, lsn);
--
2.31.1
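The shape of the patched switch can be modeled in a few lines. This is a toy stand-in (limbo reduced to one field, request types to an enum) just to show the new invariant: CONFIRM now records the confirmed LSN on the reading side too, while ROLLBACK leaves it alone:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the request types and the limbo. */
enum toy_req_type { TOY_CONFIRM, TOY_ROLLBACK };

struct toy_limbo {
	int64_t confirmed_lsn;
};

/*
 * On CONFIRM the replica now remembers the confirmed LSN it read,
 * so a later PROMOTE it issues can carry the former leader's
 * confirmed_lsn instead of zero. ROLLBACK does not move it.
 */
static void
toy_limbo_process(struct toy_limbo *limbo, enum toy_req_type type,
		  int64_t lsn)
{
	switch (type) {
	case TOY_CONFIRM:
		/* txn_limbo_read_confirm(limbo, lsn); */
		limbo->confirmed_lsn = lsn; /* the new assignment */
		break;
	case TOY_ROLLBACK:
		/* txn_limbo_read_rollback(limbo, lsn); */
		break;
	}
}
```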
The patch introduces a `promote` leaf in the `box.info.synchro` table:

 | tarantool> box.info.synchro
 | ---
 | - queue:
 |     len: 0
 |     owner: 1
 |   quorum: 1
 |   promote:
 |     term_max: 4
 |     term_map: {1: 4}
 | ...

The idea is to be able to track changes in the requests seen so far.
Since this is an internal implementation detail, I prefer not to
document it. Actually, it would be better to mark it as non-API somehow.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/lua/info.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 040af306a..144fba12d 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -607,7 +607,7 @@ lbox_info_election(struct lua_State *L)
 static int
 lbox_info_synchro(struct lua_State *L)
 {
-	lua_createtable(L, 0, 2);
+	lua_createtable(L, 0, 3);

 	/* Quorum value may be evaluated via formula */
 	lua_pushinteger(L, replication_synchro_quorum);
@@ -622,6 +622,15 @@ lbox_info_synchro(struct lua_State *L)
 	lua_setfield(L, -2, "owner");
 	lua_setfield(L, -2, "queue");

+	/* Promote related info. Suitable for debug. */
+	lua_createtable(L, 0, 2);
+	lua_pushnumber(L, queue->promote_greatest_term);
+	lua_setfield(L, -2, "term_max");
+	lua_pushstring(L, "term_map");
+	lbox_pushvclock(L, &queue->promote_term_map);
+	lua_settable(L, -3);
+	lua_setfield(L, -2, "promote");
+
 	return 1;
 }
--
2.31.1
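The two exported values mirror `promote_term_map` and `promote_greatest_term` from the limbo. A toy model of how they evolve (the term map reduced to a plain array indexed by replica id instead of a real vclock; illustrative only):

```c
#include <stdint.h>

#define TOY_REPLICA_MAX 8

/* Per-origin promote terms plus the running maximum, mirroring
 * promote_term_map (exported as term_map) and
 * promote_greatest_term (exported as term_max). */
struct toy_promote_info {
	uint64_t term_map[TOY_REPLICA_MAX];
	uint64_t term_max;
};

/* What applying a PROMOTE/DEMOTE from 'origin' with 'term' does
 * to the exported values: follow the per-origin term and keep
 * the maximum; stale terms are ignored. */
static void
toy_promote_follow(struct toy_promote_info *p, uint32_t origin,
		   uint64_t term)
{
	if (p->term_map[origin] < term) {
		p->term_map[origin] = term;
		if (term > p->term_max)
			p->term_max = term;
	}
}
```

This is also what the gh-6036 test below observes: `term_max` stays frozen while WAL writes are delayed and catches up once they complete.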
To test that promotion requests are handled only when appropriate write to WAL completes, because we update memory data before the write finishes. Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- test/replication/gh-6036-order-master.lua | 1 + test/replication/gh-6036-order-node.lua | 60 ++++++ test/replication/gh-6036-order-replica1.lua | 1 + test/replication/gh-6036-order-replica2.lua | 1 + test/replication/gh-6036-term-order.result | 216 +++++++++++++++++++ test/replication/gh-6036-term-order.test.lua | 94 ++++++++ test/replication/suite.cfg | 1 + test/replication/suite.ini | 2 +- 8 files changed, 375 insertions(+), 1 deletion(-) create mode 120000 test/replication/gh-6036-order-master.lua create mode 100644 test/replication/gh-6036-order-node.lua create mode 120000 test/replication/gh-6036-order-replica1.lua create mode 120000 test/replication/gh-6036-order-replica2.lua create mode 100644 test/replication/gh-6036-term-order.result create mode 100644 test/replication/gh-6036-term-order.test.lua diff --git a/test/replication/gh-6036-order-master.lua b/test/replication/gh-6036-order-master.lua new file mode 120000 index 000000000..82a6073a1 --- /dev/null +++ b/test/replication/gh-6036-order-master.lua @@ -0,0 +1 @@ +gh-6036-order-node.lua \ No newline at end of file diff --git a/test/replication/gh-6036-order-node.lua b/test/replication/gh-6036-order-node.lua new file mode 100644 index 000000000..b22a7cb4c --- /dev/null +++ b/test/replication/gh-6036-order-node.lua @@ -0,0 +1,60 @@ +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-order%-(.+)%.lua") + +local function unix_socket(name) + return "unix/:./" .. name .. 
'.sock'; +end + +require('console').listen(os.getenv('ADMIN')) + +if INSTANCE_ID == "master" then + box.cfg({ + listen = unix_socket(INSTANCE_ID), + replication = { + unix_socket(INSTANCE_ID), + unix_socket("replica1"), + unix_socket("replica2"), + }, + replication_connect_quorum = 1, + replication_synchro_quorum = 1, + replication_synchro_timeout = 10000, + replication_sync_timeout = 5, + read_only = false, + election_mode = "off", + }) +elseif INSTANCE_ID == "replica1" then + box.cfg({ + listen = unix_socket(INSTANCE_ID), + replication = { + unix_socket("master"), + unix_socket(INSTANCE_ID), + unix_socket("replica2"), + }, + replication_connect_quorum = 1, + replication_synchro_quorum = 1, + replication_synchro_timeout = 10000, + replication_sync_timeout = 5, + read_only = false, + election_mode = "off", + }) +else + assert(INSTANCE_ID == "replica2") + box.cfg({ + listen = unix_socket(INSTANCE_ID), + replication = { + unix_socket("master"), + unix_socket("replica1"), + unix_socket(INSTANCE_ID), + }, + replication_connect_quorum = 1, + replication_synchro_quorum = 1, + replication_synchro_timeout = 10000, + replication_sync_timeout = 5, + read_only = true, + election_mode = "off", + }) +end + +--box.ctl.wait_rw() +box.once("bootstrap", function() + box.schema.user.grant('guest', 'super') +end) diff --git a/test/replication/gh-6036-order-replica1.lua b/test/replication/gh-6036-order-replica1.lua new file mode 120000 index 000000000..82a6073a1 --- /dev/null +++ b/test/replication/gh-6036-order-replica1.lua @@ -0,0 +1 @@ +gh-6036-order-node.lua \ No newline at end of file diff --git a/test/replication/gh-6036-order-replica2.lua b/test/replication/gh-6036-order-replica2.lua new file mode 120000 index 000000000..82a6073a1 --- /dev/null +++ b/test/replication/gh-6036-order-replica2.lua @@ -0,0 +1 @@ +gh-6036-order-node.lua \ No newline at end of file diff --git a/test/replication/gh-6036-term-order.result b/test/replication/gh-6036-term-order.result new file mode 100644 
index 000000000..6b19fc2c8 --- /dev/null +++ b/test/replication/gh-6036-term-order.result @@ -0,0 +1,216 @@ +-- test-run result file version 2 +-- +-- gh-6036: verify that terms are locked when we're inside journal +-- write routine, because parallel appliers may ignore the fact that +-- the term is updated already but not yet written leading to data +-- inconsistency. +-- +test_run = require('test_run').new() + | --- + | ... + +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"') + | --- + | - true + | ... +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"') + | --- + | - true + | ... +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"') + | --- + | - true + | ... + +test_run:cmd('start server master with wait=False') + | --- + | - true + | ... +test_run:cmd('start server replica1 with wait=False') + | --- + | - true + | ... +test_run:cmd('start server replica2 with wait=False') + | --- + | - true + | ... + +test_run:wait_fullmesh({"master", "replica1", "replica2"}) + | --- + | ... + +test_run:switch("master") + | --- + | - true + | ... +box.ctl.demote() + | --- + | ... + +test_run:switch("replica1") + | --- + | - true + | ... +box.ctl.demote() + | --- + | ... + +test_run:switch("replica2") + | --- + | - true + | ... +box.ctl.demote() + | --- + | ... + +-- +-- Drop connection between master and replica1. +test_run:switch("master") + | --- + | - true + | ... +box.cfg({ \ + replication = { \ + "unix/:./master.sock", \ + "unix/:./replica2.sock", \ + }, \ +}) + | --- + | ... +test_run:switch("replica1") + | --- + | - true + | ... +box.cfg({ \ + replication = { \ + "unix/:./replica1.sock", \ + "unix/:./replica2.sock", \ + }, \ +}) + | --- + | ... + +-- +-- Initiate disk delay and remember the max term seen so far. +test_run:switch("replica2") + | --- + | - true + | ... +box.error.injection.set('ERRINJ_WAL_DELAY', true) + | --- + | - ok + | ... 
+term_max_replica2 = box.info.synchro.promote.term_max + | --- + | ... + +-- +-- Ping-pong the promote action between master and +-- replica1 nodes, the term updates get queued on +-- replica2 because of disk being disabled. +test_run:switch("master") + | --- + | - true + | ... +box.ctl.promote() + | --- + | ... +box.ctl.demote() + | --- + | ... + +test_run:switch("replica1") + | --- + | - true + | ... +box.ctl.promote() + | --- + | ... +box.ctl.demote() + | --- + | ... + +test_run:switch("master") + | --- + | - true + | ... +box.ctl.promote() + | --- + | ... + +-- +-- Since we're guarding max promote term make sure that +-- 1) The max term has not yet been updated because WAL +-- is in sleeping state. +-- 2) Max term on master and replica1 nodes are greater +-- than we have now because terms update is locked. +-- 3) Once WAL is unlocked we make sure that terms has +-- reached us. +test_run:switch("replica2") + | --- + | - true + | ... +assert(term_max_replica2 == box.info.synchro.promote.term_max) + | --- + | - true + | ... + +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1] + | --- + | ... +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1] + | --- + | ... +assert(term_max_master > term_max_replica2) + | --- + | - true + | ... +assert(term_max_replica1 > term_max_replica2) + | --- + | - true + | ... +term_max_wait4 = term_max_master + | --- + | ... +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end + | --- + | ... + +box.error.injection.set('ERRINJ_WAL_DELAY', false) + | --- + | - ok + | ... +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end) + | --- + | - true + | ... + +test_run:switch("default") + | --- + | - true + | ... +test_run:cmd('stop server master') + | --- + | - true + | ... +test_run:cmd('stop server replica1') + | --- + | - true + | ... +test_run:cmd('stop server replica2') + | --- + | - true + | ... 
+ +test_run:cmd('delete server master') + | --- + | - true + | ... +test_run:cmd('delete server replica1') + | --- + | - true + | ... +test_run:cmd('delete server replica2') + | --- + | - true + | ... diff --git a/test/replication/gh-6036-term-order.test.lua b/test/replication/gh-6036-term-order.test.lua new file mode 100644 index 000000000..01d73ac55 --- /dev/null +++ b/test/replication/gh-6036-term-order.test.lua @@ -0,0 +1,94 @@ +-- +-- gh-6036: verify that terms are locked when we're inside journal +-- write routine, because parallel appliers may ignore the fact that +-- the term is updated already but not yet written leading to data +-- inconsistency. +-- +test_run = require('test_run').new() + +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"') +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"') +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"') + +test_run:cmd('start server master with wait=False') +test_run:cmd('start server replica1 with wait=False') +test_run:cmd('start server replica2 with wait=False') + +test_run:wait_fullmesh({"master", "replica1", "replica2"}) + +test_run:switch("master") +box.ctl.demote() + +test_run:switch("replica1") +box.ctl.demote() + +test_run:switch("replica2") +box.ctl.demote() + +-- +-- Drop connection between master and replica1. +test_run:switch("master") +box.cfg({ \ + replication = { \ + "unix/:./master.sock", \ + "unix/:./replica2.sock", \ + }, \ +}) +test_run:switch("replica1") +box.cfg({ \ + replication = { \ + "unix/:./replica1.sock", \ + "unix/:./replica2.sock", \ + }, \ +}) + +-- +-- Initiate disk delay and remember the max term seen so far. 
+test_run:switch("replica2") +box.error.injection.set('ERRINJ_WAL_DELAY', true) +term_max_replica2 = box.info.synchro.promote.term_max + +-- +-- Ping-pong the promote action between master and +-- replica1 nodes, the term updates get queued on +-- replica2 because of disk being disabled. +test_run:switch("master") +box.ctl.promote() +box.ctl.demote() + +test_run:switch("replica1") +box.ctl.promote() +box.ctl.demote() + +test_run:switch("master") +box.ctl.promote() + +-- +-- Since we're guarding max promote term make sure that +-- 1) The max term has not yet been updated because WAL +-- is in sleeping state. +-- 2) Max term on master and replica1 nodes are greater +-- than we have now because terms update is locked. +-- 3) Once WAL is unlocked we make sure that terms has +-- reached us. +test_run:switch("replica2") +assert(term_max_replica2 == box.info.synchro.promote.term_max) + +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1] +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1] +assert(term_max_master > term_max_replica2) +assert(term_max_replica1 > term_max_replica2) +term_max_wait4 = term_max_master +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end + +box.error.injection.set('ERRINJ_WAL_DELAY', false) +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end) + +test_run:switch("default") +test_run:cmd('stop server master') +test_run:cmd('stop server replica1') +test_run:cmd('stop server replica2') + +test_run:cmd('delete server master') +test_run:cmd('delete server replica1') +test_run:cmd('delete server replica2') diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 3eee0803c..ac2bedfd9 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -59,6 +59,7 @@ "gh-6094-rs-uuid-mismatch.test.lua": {}, "gh-6127-election-join-new.test.lua": {}, "gh-6035-applier-filter.test.lua": {}, + 
"gh-6036-term-order.test.lua": {}, "election-candidate-promote.test.lua": {}, "*": { "memtx": {"engine": "memtx"}, diff --git a/test/replication/suite.ini b/test/replication/suite.ini index 77eb95f49..16840e01f 100644 --- a/test/replication/suite.ini +++ b/test/replication/suite.ini @@ -3,7 +3,7 @@ core = tarantool script = master.lua description = tarantool/box, replication disabled = consistent.test.lua -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua 
gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-term-order.test.lua config = suite.cfg lua_libs = lua/fast_replica.lua lua/rlimit.lua use_unix_sockets = True -- 2.31.1
Hi! Thanks for the patch!

I consider this series an independent patchset which fixes the
ordering, not the split-brain, like you said. But now it is not
independent. The main problem is that you just blended in a few changes
from the split-brain patches. I point them out in my comments.

See 4 comments below.

> src/box/applier.cc | 16 +++++++---
> src/box/box.cc | 71 +++++++++++++++++++----------------
> src/box/memtx_engine.cc | 3 +-
> src/box/txn_limbo.c | 34 ++++++++++++++++++--
> src/box/txn_limbo.h | 49 +++++++++++++++++++++++++---
> 5 files changed, 121 insertions(+), 52 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index b981bd436..f0751b68a 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier)
> 			struct synchro_request req;
> 			if (xrow_decode_synchro(&row, &req) != 0)
> 				diag_raise();
> -			txn_limbo_process(&txn_limbo, &req);
> +			if (txn_limbo_process(&txn_limbo, &req) != 0)
> +				diag_raise();

1. How can txn_limbo_process() fail? You just fixed the ordering.
Essentially, you added a few yields in some places. You didn't add any
validation or any new errors in this patchset. Please drop the empty
'return 0's from this patchset. They can't be and are not tested here
anyway.

The addition of a lock-unlock pair to txn_limbo_process() didn't affect
whether it can fail. It just blocks the execution for a while sometimes.

> 		} else if (iproto_type_is_raft_request(row.type)) {
> 			struct raft_request req;
> 			if (xrow_decode_raft(&row, &req, NULL) != 0)
> @@ -857,7 +858,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
> 		applier_rollback_by_wal_io(entry->res);
> 	} else {
> 		replica_txn_wal_write_cb(synchro_entry->rcb);
> -		txn_limbo_process(&txn_limbo, synchro_entry->req);
> +		txn_limbo_process_run(&txn_limbo, synchro_entry->req);

2. _run is usually used for infinite loops in fibers, or for handling a
sequence of something, like trigger_run(). Here you handle a single
request.
The only difference from process() is that the lock is taken. I propose to rename it to _do or _ex or _in_tx or something. > diff --git a/src/box/box.cc b/src/box/box.cc > index 7b11d56d6..19e67b186 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -1670,48 +1669,43 @@ box_wait_limbo_acked(double timeout) > return wait_lsn; > } > > -/** Write and process a PROMOTE request. */ > -static void > -box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) > +/** Write and process PROMOTE or DEMOTE request. */ > +static int > +box_issue_synchro(uint16_t type, uint32_t prev_leader_id, int64_t promote_lsn) > { > struct raft *raft = box_raft(); > + > + assert(type == IPROTO_RAFT_PROMOTE || > + type == IPROTO_RAFT_DEMOTE); > assert(raft->volatile_term == raft->term); > assert(promote_lsn >= 0); > - txn_limbo_write_promote(&txn_limbo, promote_lsn, > - raft->term); > + > struct synchro_request req = { > - .type = IPROTO_RAFT_PROMOTE, > - .replica_id = prev_leader_id, > - .origin_id = instance_id, > - .lsn = promote_lsn, > - .term = raft->term, > + .type = type, > + .replica_id = prev_leader_id, > + .origin_id = instance_id, > + .lsn = promote_lsn, > + .term = raft->term, > }; > - txn_limbo_process(&txn_limbo, &req); > + > + if (txn_limbo_process_begin(&txn_limbo, &req) != 0) > + return -1; > + > + if (type == IPROTO_RAFT_PROMOTE) > + txn_limbo_write_promote(&txn_limbo, req.lsn, req.term); > + else > + txn_limbo_write_demote(&txn_limbo, req.lsn, req.term); > + > + txn_limbo_process_run(&txn_limbo, &req); > assert(txn_limbo_is_empty(&txn_limbo)); > + > + txn_limbo_process_commit(&txn_limbo); > + return 0; > } > > /** A guard to block multiple simultaneous promote()/demote() invocations. */ > static bool is_in_box_promote = false; > > -/** Write and process a DEMOTE request. 
*/ > -static void > -box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn) > -{ > - assert(box_raft()->volatile_term == box_raft()->term); > - assert(promote_lsn >= 0); > - txn_limbo_write_demote(&txn_limbo, promote_lsn, > - box_raft()->term); > - struct synchro_request req = { > - .type = IPROTO_RAFT_DEMOTE, > - .replica_id = prev_leader_id, > - .origin_id = instance_id, > - .lsn = promote_lsn, > - .term = box_raft()->term, > - }; > - txn_limbo_process(&txn_limbo, &req); > - assert(txn_limbo_is_empty(&txn_limbo)); 3. Why did you merge these 2 functions? AFAIR, their split was deliberate. To make each of them simpler to understand and maintain. > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index 70447caaf..eb9aa7780 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -786,6 +790,32 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) > return; > } > > +int > +txn_limbo_process_begin(struct txn_limbo *limbo, > + const struct synchro_request *req) > +{ > + latch_lock(&limbo->promote_latch); > + /* > + * FIXME: For now we take a lock only but idea > + * is to verify incoming requests to detect split > + * brain situation. Thus we need to change the code > + * semantics in advance. > + */ > + (void)req; > + return 0; 4. Return value is a part of the split-brain patch, not of the ordering patch. It is clearly seen from this patchset, because this series never changes `return 0` to anything else. I get that you want to merge something. Hence we are working on this independent issue of reodering. But then lets make it truly independent.
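For readers following the begin/run/commit discussion above, the pairing can be sketched in a few lines. This is a simplified single-threaded model, not tarantool's actual source: the real latch suspends the calling fiber instead of asserting, and the patch's `txn_limbo_process_begin()` also receives the synchro request as scaffolding for future validation. All bodies below are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy latch: owner is whichever "fiber" holds it, NULL when free. */
struct latch {
	void *owner;
};

static inline bool
latch_is_locked(const struct latch *l)
{
	return l->owner != NULL;
}

static inline void
latch_lock(struct latch *l, void *owner)
{
	/* The real latch would yield and wait here instead of asserting. */
	assert(l->owner == NULL);
	l->owner = owner;
}

static inline void
latch_unlock(struct latch *l)
{
	l->owner = NULL;
}

struct txn_limbo {
	struct latch promote_latch;
};

/* Take the promote latch before touching limbo terms... */
static void
txn_limbo_process_begin(struct txn_limbo *limbo, void *fiber)
{
	latch_lock(&limbo->promote_latch, fiber);
}

/* ...and release it once the request has been written and applied. */
static void
txn_limbo_process_commit(struct txn_limbo *limbo)
{
	latch_unlock(&limbo->promote_latch);
}
```

The point of the pairing is exactly what the review notes: between begin and commit the term state cannot be observed half-updated by a parallel applier, while the lock itself introduces no new failure mode.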
How is this patch related to the ordering issue? Your test passes without it.
Thanks for the patch! See 6 comments below. > diff --git a/test/replication/gh-6036-order-node.lua b/test/replication/gh-6036-order-node.lua > new file mode 100644 > index 000000000..b22a7cb4c > --- /dev/null > +++ b/test/replication/gh-6036-order-node.lua > @@ -0,0 +1,60 @@ > +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-order%-(.+)%.lua") > + > +local function unix_socket(name) > + return "unix/:./" .. name .. '.sock'; > +end > + > +require('console').listen(os.getenv('ADMIN')) > + > +if INSTANCE_ID == "master" then > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket(INSTANCE_ID), > + unix_socket("replica1"), > + unix_socket("replica2"), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, 1. Why do you need sync_timeout 5? > + read_only = false, > + election_mode = "off", > + }) > +elseif INSTANCE_ID == "replica1" then > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket("master"), > + unix_socket(INSTANCE_ID), > + unix_socket("replica2"), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, > + read_only = false, > + election_mode = "off", > + }) > +else > + assert(INSTANCE_ID == "replica2") > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket("master"), > + unix_socket("replica1"), > + unix_socket(INSTANCE_ID), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, > + read_only = true, > + election_mode = "off", > + }) > +end > + > +--box.ctl.wait_rw() 2. Please, remove commented out code. 
> +box.once("bootstrap", function() > + box.schema.user.grant('guest', 'super') > +end) > diff --git a/test/replication/gh-6036-term-order.result b/test/replication/gh-6036-term-order.result > new file mode 100644 > index 000000000..6b19fc2c8 > --- /dev/null > +++ b/test/replication/gh-6036-term-order.result 3. Please, use prefix gh-####-qsync to be consistent with other qsync tests. Having 'qsync' in the test name helps to run all qsync tests in a single command python test-run.py qsync > @@ -0,0 +1,216 @@ > +-- test-run result file version 2 > +-- > +-- gh-6036: verify that terms are locked when we're inside journal > +-- write routine, because parallel appliers may ignore the fact that > +-- the term is updated already but not yet written leading to data > +-- inconsistency. > +-- > +test_run = require('test_run').new() > + | --- > + | ... > + > +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"') > + | --- > + | - true > + | ... > + > +test_run:cmd('start server master with wait=False') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica1 with wait=False') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica2 with wait=False') > + | --- > + | - true > + | ... > + > +test_run:wait_fullmesh({"master", "replica1", "replica2"}) > + | --- > + | ... > + > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +box.ctl.demote() 4. I dropped all 3 demotes and the test passed. Why do you need them? > + | --- > + | ... 
> + > +-- > +-- Drop connection between master and replica1. > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./master.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > + | --- > + | ... > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./replica1.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > + | --- > + | ... > + > +-- > +-- Initiate disk delay and remember the max term seen so far. > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_DELAY', true) > + | --- > + | - ok > + | ... > +term_max_replica2 = box.info.synchro.promote.term_max > + | --- > + | ... > + > +-- > +-- Ping-pong the promote action between master and > +-- replica1 nodes, the term updates get queued on > +-- replica2 because of disk being disabled. > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > + > +-- > +-- Since we're guarding max promote term make sure that > +-- 1) The max term has not yet been updated because WAL > +-- is in sleeping state. > +-- 2) Max term on master and replica1 nodes are greater > +-- than we have now because terms update is locked. > +-- 3) Once WAL is unlocked we make sure that terms has > +-- reached us. > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +assert(term_max_replica2 == box.info.synchro.promote.term_max) > + | --- > + | - true > + | ... > + > +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1] > + | --- > + | ... 
> +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1] > + | --- > + | ... > +assert(term_max_master > term_max_replica2) > + | --- > + | - true > + | ... > +assert(term_max_replica1 > term_max_replica2) > + | --- > + | - true > + | ... > +term_max_wait4 = term_max_master > + | --- > + | ... > +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end 5. How is it possible? The master did more promotes, it should have a bigger term for sure. > + | --- > + | ... > + > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end) > + | --- > + | - true > + | ... > + > +test_run:switch("default") > + | --- > + | - true > + | ... > +test_run:cmd('stop server master') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica2') > + | --- > + | - true > + | ... > + > +test_run:cmd('delete server master') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica2') 6. Can you add some data to the test? Which before the patches was applied, and now is rejected/nopified. Otherwise you added a lock, tested the lock, but if I move the data filtering before the lock, your test still will pass.
22.09.2021 16:05, Cyrill Gorcunov пишет: > The patch introduces `promote` leaf to `box.info.synchro` table. > > | tarantool> box.info.synchro > | --- > | - queue: > | len: 0 > | owner: 1 > | quorum: 1 > | promote: > | term_max: 4 > | term_map: {1: 4} > | ... > > An idea is to be able to track changes of seen requests. Since it is > internal implementation details I prefer to not document it. Actually > better to mark is as non-API somehow. I think this info might be useful, so maybe document it as well? I'd call it `journal`, probably. box.info.synchro.journal.term - what you call term_max box.info.synchro.journal.term_map > > Part-of #6036 > > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> > --- > src/box/lua/info.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/src/box/lua/info.c b/src/box/lua/info.c > index 040af306a..144fba12d 100644 > --- a/src/box/lua/info.c > +++ b/src/box/lua/info.c > @@ -607,7 +607,7 @@ lbox_info_election(struct lua_State *L) > static int > lbox_info_synchro(struct lua_State *L) > { > - lua_createtable(L, 0, 2); > + lua_createtable(L, 0, 3); > > /* Quorum value may be evaluated via formula */ > lua_pushinteger(L, replication_synchro_quorum); > @@ -622,6 +622,15 @@ lbox_info_synchro(struct lua_State *L) > lua_setfield(L, -2, "owner"); > lua_setfield(L, -2, "queue"); > > + /* Promote related info. Suitable for debug. */ > + lua_createtable(L, 0, 2); > + lua_pushnumber(L, queue->promote_greatest_term); > + lua_setfield(L, -2, "term_max"); > + lua_pushstring(L, "term_map"); > + lbox_pushvclock(L, &queue->promote_term_map); > + lua_settable(L, -3); > + lua_setfield(L, -2, "promote"); > + > return 1; > } > -- Serge Petrenko
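The bookkeeping behind `term_max` and `term_map` exported above is easy to model. The following is a hypothetical sketch only: the real code keeps the map in a `struct vclock`, and the array bound, struct name, and function name here are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_VCLOCK_MAX 32

/* Per-origin greatest PROMOTE term seen, plus a cached maximum over
 * all origins (what box.info.synchro.promote reports as term_max). */
struct promote_tracker {
	uint64_t term_map[SKETCH_VCLOCK_MAX];
	uint64_t term_max;
};

/* Account a PROMOTE request from origin_id carrying the given term.
 * Stale terms (not greater than what we already saw) are ignored. */
static void
promote_tracker_account(struct promote_tracker *t, uint32_t origin_id,
			uint64_t term)
{
	assert(origin_id < SKETCH_VCLOCK_MAX);
	if (term > t->term_map[origin_id])
		t->term_map[origin_id] = term;
	if (term > t->term_max)
		t->term_max = term;
}
```

With the locking from the earlier patch, the map is only advanced after the corresponding WAL write completes, which is exactly what the gh-6036 test observes through `term_max`.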
22.09.2021 16:05, Cyrill Gorcunov wrote:
> While been investigating various cluster split-brain scenarios and
> trying to gather valid incoming synchro request domains and ranges
> we've discovered that limbo::confirmed_lsn is not updated densely enough
> to cover our needs.
>
> In particular, this variable is always updated by the limbo owner upon
> the write of a synchro entry (to a journal), while a replica just reads such
> a record without updating confirmed_lsn. So when the replica becomes a cluster
> leader, it sends a PROMOTE request back to the former leader, and the
> packet carries a zero LSN instead of the previous confirmed_lsn, so validation
> of such a packet won't pass.
>
> Note that packet validation is not yet implemented in this patch, so it
> is rather preparatory work for the future.
>
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
> src/box/txn_limbo.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index eb9aa7780..959811309 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -774,6 +774,20 @@ txn_limbo_process_run(struct txn_limbo *limbo,
> switch (req->type) {
> case IPROTO_RAFT_CONFIRM:
> txn_limbo_read_confirm(limbo, lsn);
> + /*
> > + * We have to adjust confirmed_lsn according
> > + * to the LSN coming from the request, because
> > + * we will need to report it as the old limbo
> > + * owner's LSN inside PROMOTE requests (if the
> > + * administrator or the election engine makes
> > + * us the new leader).
> > + *
> > + * We could update confirmed_lsn on every
> > + * txn_limbo_read_confirm call, but that function
> > + * usually comes paired with txn_limbo_write_confirm,
> > + * so to avoid redundant updates we do it once, explicitly.
> + */
> + limbo->confirmed_lsn = req->lsn;
> break;
> case IPROTO_RAFT_ROLLBACK:
> txn_limbo_read_rollback(limbo, lsn);
LGTM.
--
Serge Petrenko
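A rough model of the behavioral change in this patch: applying a read CONFIRM now also records the confirmed LSN, which a replica previously skipped, so a later PROMOTE it issues as a new leader carries the real value instead of zero. All type and function names below are illustrative stand-ins, not the actual txn_limbo definitions.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_synchro_type {
	SKETCH_RAFT_CONFIRM,
	SKETCH_RAFT_ROLLBACK,
};

struct sketch_synchro_request {
	enum sketch_synchro_type type;
	int64_t lsn;
};

struct sketch_limbo {
	int64_t confirmed_lsn;
};

/* Apply a synchro request read from the journal or replication stream.
 * The CONFIRM branch now also remembers the confirmed LSN; ROLLBACK
 * leaves it untouched. */
static void
sketch_limbo_process(struct sketch_limbo *limbo,
		     const struct sketch_synchro_request *req)
{
	switch (req->type) {
	case SKETCH_RAFT_CONFIRM:
		/* txn_limbo_read_confirm(limbo, req->lsn) would run here. */
		limbo->confirmed_lsn = req->lsn;
		break;
	case SKETCH_RAFT_ROLLBACK:
		/* txn_limbo_read_rollback(limbo, req->lsn) would run here. */
		break;
	}
}
```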
22.09.2021 16:05, Cyrill Gorcunov пишет: > To test that promotion requests are handled only when appropriate > write to WAL completes, because we update memory data before the > write finishes. > > Part-of #6036 > > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> > --- > test/replication/gh-6036-order-master.lua | 1 + > test/replication/gh-6036-order-node.lua | 60 ++++++ > test/replication/gh-6036-order-replica1.lua | 1 + > test/replication/gh-6036-order-replica2.lua | 1 + > test/replication/gh-6036-term-order.result | 216 +++++++++++++++++++ > test/replication/gh-6036-term-order.test.lua | 94 ++++++++ > test/replication/suite.cfg | 1 + > test/replication/suite.ini | 2 +- > 8 files changed, 375 insertions(+), 1 deletion(-) > create mode 120000 test/replication/gh-6036-order-master.lua > create mode 100644 test/replication/gh-6036-order-node.lua > create mode 120000 test/replication/gh-6036-order-replica1.lua > create mode 120000 test/replication/gh-6036-order-replica2.lua > create mode 100644 test/replication/gh-6036-term-order.result > create mode 100644 test/replication/gh-6036-term-order.test.lua > > diff --git a/test/replication/gh-6036-order-master.lua b/test/replication/gh-6036-order-master.lua > new file mode 120000 > index 000000000..82a6073a1 > --- /dev/null > +++ b/test/replication/gh-6036-order-master.lua > @@ -0,0 +1 @@ > +gh-6036-order-node.lua > \ No newline at end of file > diff --git a/test/replication/gh-6036-order-node.lua b/test/replication/gh-6036-order-node.lua > new file mode 100644 > index 000000000..b22a7cb4c > --- /dev/null > +++ b/test/replication/gh-6036-order-node.lua > @@ -0,0 +1,60 @@ > +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-order%-(.+)%.lua") > + > +local function unix_socket(name) > + return "unix/:./" .. name .. 
'.sock'; > +end > + > +require('console').listen(os.getenv('ADMIN')) > + > +if INSTANCE_ID == "master" then > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket(INSTANCE_ID), > + unix_socket("replica1"), > + unix_socket("replica2"), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, > + read_only = false, > + election_mode = "off", > + }) > +elseif INSTANCE_ID == "replica1" then > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket("master"), > + unix_socket(INSTANCE_ID), > + unix_socket("replica2"), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, > + read_only = false, > + election_mode = "off", > + }) > +else > + assert(INSTANCE_ID == "replica2") > + box.cfg({ > + listen = unix_socket(INSTANCE_ID), > + replication = { > + unix_socket("master"), > + unix_socket("replica1"), > + unix_socket(INSTANCE_ID), > + }, > + replication_connect_quorum = 1, > + replication_synchro_quorum = 1, > + replication_synchro_timeout = 10000, > + replication_sync_timeout = 5, > + read_only = true, > + election_mode = "off", > + }) > +end I think it would be simpler to do something like: cfg_tbl = {every common parameter} if INSTANCE_ID == 'replica2' then cfg_tbl.read_only = true box.cfg{} calls for master and replica 1 are identical, replica2 only differs in read_only = true. 
> + > +--box.ctl.wait_rw() > +box.once("bootstrap", function() > + box.schema.user.grant('guest', 'super') > +end) > diff --git a/test/replication/gh-6036-order-replica1.lua b/test/replication/gh-6036-order-replica1.lua > new file mode 120000 > index 000000000..82a6073a1 > --- /dev/null > +++ b/test/replication/gh-6036-order-replica1.lua > @@ -0,0 +1 @@ > +gh-6036-order-node.lua > \ No newline at end of file > diff --git a/test/replication/gh-6036-order-replica2.lua b/test/replication/gh-6036-order-replica2.lua > new file mode 120000 > index 000000000..82a6073a1 > --- /dev/null > +++ b/test/replication/gh-6036-order-replica2.lua > @@ -0,0 +1 @@ > +gh-6036-order-node.lua > \ No newline at end of file > diff --git a/test/replication/gh-6036-term-order.result b/test/replication/gh-6036-term-order.result > new file mode 100644 > index 000000000..6b19fc2c8 > --- /dev/null > +++ b/test/replication/gh-6036-term-order.result > @@ -0,0 +1,216 @@ > +-- test-run result file version 2 > +-- > +-- gh-6036: verify that terms are locked when we're inside journal > +-- write routine, because parallel appliers may ignore the fact that > +-- the term is updated already but not yet written leading to data > +-- inconsistency. > +-- > +test_run = require('test_run').new() > + | --- > + | ... > + > +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"') > + | --- > + | - true > + | ... > + > +test_run:cmd('start server master with wait=False') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica1 with wait=False') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica2 with wait=False') > + | --- > + | - true > + | ... 
> + > +test_run:wait_fullmesh({"master", "replica1", "replica2"}) > + | --- > + | ... > + > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +-- > +-- Drop connection between master and replica1. > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./master.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > + | --- > + | ... > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./replica1.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > + | --- > + | ... > + > +-- > +-- Initiate disk delay and remember the max term seen so far. > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_DELAY', true) > + | --- > + | - ok > + | ... > +term_max_replica2 = box.info.synchro.promote.term_max > + | --- > + | ... > + > +-- > +-- Ping-pong the promote action between master and > +-- replica1 nodes, the term updates get queued on > +-- replica2 because of disk being disabled. > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("replica1") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.ctl.demote() > + | --- > + | ... > + > +test_run:switch("master") > + | --- > + | - true > + | ... > +box.ctl.promote() > + | --- > + | ... > + > +-- > +-- Since we're guarding max promote term make sure that > +-- 1) The max term has not yet been updated because WAL > +-- is in sleeping state. > +-- 2) Max term on master and replica1 nodes are greater > +-- than we have now because terms update is locked. 
> +-- 3) Once WAL is unlocked we make sure that terms has > +-- reached us. > +test_run:switch("replica2") > + | --- > + | - true > + | ... > +assert(term_max_replica2 == box.info.synchro.promote.term_max) > + | --- > + | - true > + | ... > + > +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1] > + | --- > + | ... > +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1] > + | --- > + | ... > +assert(term_max_master > term_max_replica2) > + | --- > + | - true > + | ... > +assert(term_max_replica1 > term_max_replica2) > + | --- > + | - true > + | ... > +term_max_wait4 = term_max_master > + | --- > + | ... > +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end > + | --- > + | ... > + > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end) > + | --- > + | - true > + | ... > + > +test_run:switch("default") > + | --- > + | - true > + | ... > +test_run:cmd('stop server master') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica2') > + | --- > + | - true > + | ... > + > +test_run:cmd('delete server master') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica2') > + | --- > + | - true > + | ... > diff --git a/test/replication/gh-6036-term-order.test.lua b/test/replication/gh-6036-term-order.test.lua > new file mode 100644 > index 000000000..01d73ac55 > --- /dev/null > +++ b/test/replication/gh-6036-term-order.test.lua > @@ -0,0 +1,94 @@ > +-- > +-- gh-6036: verify that terms are locked when we're inside journal > +-- write routine, because parallel appliers may ignore the fact that > +-- the term is updated already but not yet written leading to data > +-- inconsistency. 
> +-- > +test_run = require('test_run').new() > + > +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"') > +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"') > +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"') > + > +test_run:cmd('start server master with wait=False') > +test_run:cmd('start server replica1 with wait=False') > +test_run:cmd('start server replica2 with wait=False') > + > +test_run:wait_fullmesh({"master", "replica1", "replica2"}) > + > +test_run:switch("master") > +box.ctl.demote() > + > +test_run:switch("replica1") > +box.ctl.demote() > + > +test_run:switch("replica2") > +box.ctl.demote() > + > +-- > +-- Drop connection between master and replica1. > +test_run:switch("master") > +box.cfg({ \ > + replication = { \ > + "unix/:./master.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > +test_run:switch("replica1") > +box.cfg({ \ > + replication = { \ > + "unix/:./replica1.sock", \ > + "unix/:./replica2.sock", \ > + }, \ > +}) > + > +-- > +-- Initiate disk delay and remember the max term seen so far. > +test_run:switch("replica2") > +box.error.injection.set('ERRINJ_WAL_DELAY', true) > +term_max_replica2 = box.info.synchro.promote.term_max > + > +-- > +-- Ping-pong the promote action between master and > +-- replica1 nodes, the term updates get queued on > +-- replica2 because of disk being disabled. > +test_run:switch("master") > +box.ctl.promote() > +box.ctl.demote() > + > +test_run:switch("replica1") > +box.ctl.promote() > +box.ctl.demote() > + > +test_run:switch("master") > +box.ctl.promote() > + > +-- > +-- Since we're guarding max promote term make sure that > +-- 1) The max term has not yet been updated because WAL > +-- is in sleeping state. > +-- 2) Max term on master and replica1 nodes are greater > +-- than we have now because terms update is locked. 
> +-- 3) Once WAL is unlocked we make sure that terms has > +-- reached us. > +test_run:switch("replica2") > +assert(term_max_replica2 == box.info.synchro.promote.term_max) > + > +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1] > +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1] > +assert(term_max_master > term_max_replica2) > +assert(term_max_replica1 > term_max_replica2) > +term_max_wait4 = term_max_master > +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end > + > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end) > + > +test_run:switch("default") > +test_run:cmd('stop server master') > +test_run:cmd('stop server replica1') > +test_run:cmd('stop server replica2') > + > +test_run:cmd('delete server master') > +test_run:cmd('delete server replica1') > +test_run:cmd('delete server replica2') > diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg > index 3eee0803c..ac2bedfd9 100644 > --- a/test/replication/suite.cfg > +++ b/test/replication/suite.cfg > @@ -59,6 +59,7 @@ > "gh-6094-rs-uuid-mismatch.test.lua": {}, > "gh-6127-election-join-new.test.lua": {}, > "gh-6035-applier-filter.test.lua": {}, > + "gh-6036-term-order.test.lua": {}, > "election-candidate-promote.test.lua": {}, > "*": { > "memtx": {"engine": "memtx"}, > diff --git a/test/replication/suite.ini b/test/replication/suite.ini > index 77eb95f49..16840e01f 100644 > --- a/test/replication/suite.ini > +++ b/test/replication/suite.ini > @@ -3,7 +3,7 @@ core = tarantool > script = master.lua > description = tarantool/box, replication > disabled = consistent.test.lua > -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua 
long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua > +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-term-order.test.lua > config = suite.cfg > lua_libs = lua/fast_replica.lua lua/rlimit.lua > use_unix_sockets = True -- Serge Petrenko
On Mon, Sep 27, 2021 at 10:13:32AM +0300, Serge Petrenko wrote:
>
> I think it would be simpler to do something like:
>
> cfg_tbl = {every common parameter}
>
> if INSTANCE_ID == 'replica2' then cfg_tbl.read_only = true
>
> box.cfg{} calls for master and replica 1 are identical, replica2 only
> differs in read_only = true.
Good point! Will do, thanks!
Cyrill
On Sun, Sep 26, 2021 at 04:29:13PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patch!

Hi! Thanks for review!

> I consider this series as an independent patchset which fixes
> the ordering, not split-brain. Like you said.
>
> But now it is not independent. The main problem is that you
> just blended in a few changes from the split-brain patches. I
> point them out in my comments.

It can't be completely separate and self-consistent because of the
locks we introduce to order operations. In the split-brain series we
agreed that locking should be hidden inside txn_process_begin(), which
can fail. As a result, a plain txn_process() call becomes a wrapper
over begin/commit|rollback, and since we're changing the architecture
anyway, I think altering the semantics immediately will make the next
patches less intrusive. If you prefer, I can do it the other way, but
I don't think that is a good idea, though I don't have a strong
preference here.

> > diff --git a/src/box/applier.cc b/src/box/applier.cc
> > index b981bd436..f0751b68a 100644
> > --- a/src/box/applier.cc
> > +++ b/src/box/applier.cc
> > @@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier)
> >  			struct synchro_request req;
> >  			if (xrow_decode_synchro(&row, &req) != 0)
> >  				diag_raise();
> > -			txn_limbo_process(&txn_limbo, &req);
> > +			if (txn_limbo_process(&txn_limbo, &req) != 0)
> > +				diag_raise();
>
> 1. How txn_limbo_process() can fail? You just fixed ordering. Essentially, added
> a few yields in some places. You didn't add any validation, any new errors in
> this patchset. Please, drop the empty 'return 0's from this patchset. They
> can't be and are not tested here anyway.

It is not just order fixing but also scaffolding for the second series
of patches on top. Anyway, I will do it the way you're asking.

> Addition of a lock-unlock pair to txn_limbo_process didn't affect whether it
> can fail. It just blocks the execution sometimes for a while.

I changed the semantics early on purpose; I will rework it to fit your
request.

> >  		} else if (iproto_type_is_raft_request(row.type)) {
> >  			struct raft_request req;
> >  			if (xrow_decode_raft(&row, &req, NULL) != 0)
> > @@ -857,7 +858,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
> >  		applier_rollback_by_wal_io(entry->res);
> >  	} else {
> >  		replica_txn_wal_write_cb(synchro_entry->rcb);
> > -		txn_limbo_process(&txn_limbo, synchro_entry->req);
> > +		txn_limbo_process_run(&txn_limbo, synchro_entry->req);
>
> 2. _run is usually used for infinite loops in fibers, or for handling a
> sequence of something. Like trigger_run(). Here you handle a single
> request. The only difference from process() is that the lock is taken.
> I propose to rename it to _do or _ex or _in_tx or something.

OK

> > diff --git a/src/box/box.cc b/src/box/box.cc
> > index 7b11d56d6..19e67b186 100644
> > --- a/src/box/box.cc
> > +++ b/src/box/box.cc
> > @@ -1670,48 +1669,43 @@ box_wait_limbo_acked(double timeout)
> >  	return wait_lsn;
> >  }
> >
> > -/** Write and process a PROMOTE request. */
> > -static void
> > -box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
> > +/** Write and process PROMOTE or DEMOTE request. */
> > +static int
> > +box_issue_synchro(uint16_t type, uint32_t prev_leader_id, int64_t promote_lsn)
> >  {
> >  	struct raft *raft = box_raft();
> > +
> > +	assert(type == IPROTO_RAFT_PROMOTE ||
> > +	       type == IPROTO_RAFT_DEMOTE);
> >  	assert(raft->volatile_term == raft->term);
> >  	assert(promote_lsn >= 0);
> > -	txn_limbo_write_promote(&txn_limbo, promote_lsn,
> > -				raft->term);
> > +
> >  	struct synchro_request req = {
> > -		.type = IPROTO_RAFT_PROMOTE,
> > -		.replica_id = prev_leader_id,
> > -		.origin_id = instance_id,
> > -		.lsn = promote_lsn,
> > -		.term = raft->term,
> > +		.type = type,
> > +		.replica_id = prev_leader_id,
> > +		.origin_id = instance_id,
> > +		.lsn = promote_lsn,
> > +		.term = raft->term,
> >  	};
> > -	txn_limbo_process(&txn_limbo, &req);
> > +
> > +	if (txn_limbo_process_begin(&txn_limbo, &req) != 0)
> > +		return -1;
> > +
> > +	if (type == IPROTO_RAFT_PROMOTE)
> > +		txn_limbo_write_promote(&txn_limbo, req.lsn, req.term);
> > +	else
> > +		txn_limbo_write_demote(&txn_limbo, req.lsn, req.term);
> > +
> > +	txn_limbo_process_run(&txn_limbo, &req);
> >  	assert(txn_limbo_is_empty(&txn_limbo));
> > +
> > +	txn_limbo_process_commit(&txn_limbo);
> > +	return 0;
> >  }
> >
> >  /** A guard to block multiple simultaneous promote()/demote() invocations. */
> >  static bool is_in_box_promote = false;
> >
> > -/** Write and process a DEMOTE request. */
> > -static void
> > -box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
> > -{
> > -	assert(box_raft()->volatile_term == box_raft()->term);
> > -	assert(promote_lsn >= 0);
> > -	txn_limbo_write_demote(&txn_limbo, promote_lsn,
> > -			       box_raft()->term);
> > -	struct synchro_request req = {
> > -		.type = IPROTO_RAFT_DEMOTE,
> > -		.replica_id = prev_leader_id,
> > -		.origin_id = instance_id,
> > -		.lsn = promote_lsn,
> > -		.term = box_raft()->term,
> > -	};
> > -	txn_limbo_process(&txn_limbo, &req);
> > -	assert(txn_limbo_is_empty(&txn_limbo));
>
> 3. Why did you merge these 2 functions? AFAIR, their split was
> deliberate. To make each of them simpler to understand and maintain.

To eliminate code duplication. These two functions are _completely_
identical in terms of operations: they write a PROMOTE/DEMOTE packet,
and the two packets are the same except for the type. So I don't
follow how the split is easier to understand and maintain: if the
functions are the same, better to have one instance, no? But since
you're asking, I'll move these two functions back, no problem.

> > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> > index 70447caaf..eb9aa7780 100644
> > --- a/src/box/txn_limbo.c
> > +++ b/src/box/txn_limbo.c
> > @@ -786,6 +790,32 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> >  		return;
> >  	}
> >
> > +int
> > +txn_limbo_process_begin(struct txn_limbo *limbo,
> > +			const struct synchro_request *req)
> > +{
> > +	latch_lock(&limbo->promote_latch);
> > +	/*
> > +	 * FIXME: For now we take a lock only but idea
> > +	 * is to verify incoming requests to detect split
> > +	 * brain situation. Thus we need to change the code
> > +	 * semantics in advance.
> > +	 */
> > +	(void)req;
> > +	return 0;
>
> 4. Return value is a part of the split-brain patch, not of the
> ordering patch. It is clearly seen from this patchset, because
> this series never changes `return 0` to anything else.
>
> I get that you want to merge something. Hence we are working on this
> independent issue of reordering. But then let's make it truly
> independent.

OK. Thanks for comments!

	Cyrill
On Mon, Sep 27, 2021 at 10:00:28AM +0300, Serge Petrenko wrote:
>
>
> 22.09.2021 16:05, Cyrill Gorcunov wrote:
> > The patch introduces `promote` leaf to `box.info.synchro` table.
> >
> > | tarantool> box.info.synchro
> > | ---
> > | - queue:
> > |     len: 0
> > |     owner: 1
> > |   quorum: 1
> > |   promote:
> > |     term_max: 4
> > |     term_map: {1: 4}
> > | ...
> >
> > The idea is to be able to track changes of seen requests. Since these
> > are internal implementation details I prefer not to document them.
> > Actually, better to mark this as non-API somehow.
>
> I think this info might be useful, so maybe document it as well?
>
> I'd call it `journal`, probably.
> box.info.synchro.journal.term - what you call term_max
> box.info.synchro.journal.term_map
You know, I don't mind :) We need to choose the names of "info" leaves
and branches very carefully though, because they become API. So I guess
we could consider a journal branch where we would gather not only
synchro-related information but also extend the output for debug/stat
purposes in the future?
box.info
  journal:
    synchro: (replication related info)
      term: 4
      term_map: {1: 4}
    queue: (flush queue related info)
      len: 12
    stat:
      wrote: 400 bytes
I don't have a strong opinion on which form is better, I'm fine with
either, just sharing the idea.
On Sun, Sep 26, 2021 at 04:30:38PM +0200, Vladislav Shpilevoy wrote:
> > +if INSTANCE_ID == "master" then
> > +    box.cfg({
> > +        listen = unix_socket(INSTANCE_ID),
> > +        replication = {
> > +            unix_socket(INSTANCE_ID),
> > +            unix_socket("replica1"),
> > +            unix_socket("replica2"),
> > +        },
> > +        replication_connect_quorum = 1,
> > +        replication_synchro_quorum = 1,
> > +        replication_synchro_timeout = 10000,
> > +        replication_sync_timeout = 5,
>
> 1. Why do you need sync_timeout 5?

To make sure it has some sane short value; our default of 300 seconds
is too big, I think.

> > +
> > +--box.ctl.wait_rw()
>
> 2. Please, remove commented out code.

OK

> > +box.once("bootstrap", function()
> > +    box.schema.user.grant('guest', 'super')
> > +end)

> > diff --git a/test/replication/gh-6036-term-order.result b/test/replication/gh-6036-term-order.result
> > new file mode 100644
> > index 000000000..6b19fc2c8
> > --- /dev/null
> > +++ b/test/replication/gh-6036-term-order.result
>
> 3. Please, use prefix gh-####-qsync to be consistent with other qsync tests. Having
> 'qsync' in the test name helps to run all qsync tests in a single command
>
>     python test-run.py qsync

Sure, will do.

> > +test_run:switch("replica2")
> > + | ---
> > + | - true
> > + | ...
> > +box.ctl.demote()
>
> 4. I dropped all 3 demotes and the test passed. Why do you need them?

To make sure none of the freshly booted nodes owns the limbo, even if
something changes inside the test-run engine in the future. The test
engine is not a stable API, while our demote() operation is part of
the API, so I don't have to care how exactly the nodes were started:
after this command none of them owns the limbo.

> > +term_max_wait4 = term_max_master
> > + | ---
> > + | ...
> > +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end
>
> 5. How is it possible? The master did more promotes, it should have a
> bigger term for sure.

IIRC I was hunting a racy testcase and left this snippet untouched, so
it simply sneaked in. It is harmless; I'll clean it up, thanks!

> > +test_run:cmd('delete server replica2')
>
> 6. Can you add some data to the test, which before the patches was applied and
> now is rejected/nopified? Otherwise you added a lock, tested the lock, but if I
> move the data filtering before the lock, your test still will pass.

Will try, thanks!