Changes in v4:
 - various cleanups and fixes as per review from Vlad:
   * refactor box_promote()/box_demote()
 - reorder the patches for a more logical progression

Changes in v3:
 - change demote() behaviour as discussed with Vlad:
   * make it work only on the current leader
   * make it demote the current leader and always bump the term
 - change how limbo and raft snapshots are sent in response to JOIN:
   * encode replica's version in JOIN request
   * introduce a special stage: JOIN_META with raft and limbo snapshots.
     Send it based on replica's version.

https://github.com/tarantool/tarantool/issues/5438
https://github.com/tarantool/tarantool/issues/6034
https://github.com/tarantool/tarantool/tree/sp/gh-6034-empty-limbo-transition

Serge Petrenko (16):
  replication: always send raft state to subscribers
  txn_limbo: fix promote term filtering
  txn_limbo: persist the latest effective promote in snapshot
  replication: encode version in JOIN request
  replication: add META stage to JOIN
  replication: send latest effective promote in initial join
  replication: send current Raft term in join response
  raft: refactor raft_new_term()
  box: split promote() into reasonable parts
  box: make promote always bump the term
  box: make promote on the current leader a no-op
  box: fix an assertion failure after a spurious wakeup in promote
  box: allow calling promote on a candidate
  box: extract promote() settings to a separate method
  replication: forbid implicit limbo owner transition
  box: introduce `box.ctl.demote`

 src/box/applier.cc                            |  27 +-
 src/box/box.cc                                | 435 ++++++++++++------
 src/box/box.h                                 |   3 +
 src/box/errcode.h                             |   2 +
 src/box/iproto_constants.h                    |  12 +-
 src/box/lua/ctl.c                             |   9 +
 src/box/lua/info.c                            |   4 +-
 src/box/memtx_engine.c                        |  32 ++
 src/box/raft.c                                |  48 +-
 src/box/raft.h                                |   4 +
 src/box/relay.cc                              |  36 +-
 src/box/relay.h                               |   4 +-
 src/box/txn_limbo.c                           |  90 ++--
 src/box/txn_limbo.h                           |  14 +
 src/box/xrow.c                                |   4 +-
 src/box/xrow.h                                |  25 +-
 src/lib/raft/raft.c                           |   3 +-
 test/box/alter.result                         |   2 +-
 test/box/error.result                         |   2 +
 test/replication/election_basic.result        |   3 +
 test/replication/election_basic.test.lua      |   1 +
 test/replication/election_qsync.result        |   3 +
 test/replication/election_qsync.test.lua      |   1 +
 .../gh-3055-promote-wakeup-crash.result       |  43 ++
 .../gh-3055-promote-wakeup-crash.test.lua     |  20 +
 .../gh-4114-local-space-replication.result    |   7 +-
 .../gh-4114-local-space-replication.test.lua  |   4 +-
 .../gh-5140-qsync-casc-rollback.result        |   6 +
 .../gh-5140-qsync-casc-rollback.test.lua      |   2 +
 .../gh-5144-qsync-dup-confirm.result          |   6 +
 .../gh-5144-qsync-dup-confirm.test.lua        |   2 +
 .../gh-5163-qsync-restart-crash.result        |   6 +
 .../gh-5163-qsync-restart-crash.test.lua      |   2 +
 .../gh-5167-qsync-rollback-snap.result        |   6 +
 .../gh-5167-qsync-rollback-snap.test.lua      |   2 +
 .../gh-5195-qsync-replica-write.result        |  10 +-
 .../gh-5195-qsync-replica-write.test.lua      |   6 +-
 .../gh-5213-qsync-applier-order-3.result      |   9 +
 .../gh-5213-qsync-applier-order-3.test.lua    |   3 +
 .../gh-5213-qsync-applier-order.result        |   6 +
 .../gh-5213-qsync-applier-order.test.lua      |   2 +
 .../replication/gh-5288-qsync-recovery.result |   6 +
 .../gh-5288-qsync-recovery.test.lua           |   2 +
 .../gh-5298-qsync-recovery-snap.result        |   6 +
 .../gh-5298-qsync-recovery-snap.test.lua      |   2 +
 .../gh-5426-election-on-off.result            |   3 +
 .../gh-5426-election-on-off.test.lua          |   1 +
 .../gh-5433-election-restart-recovery.result  |   3 +
 ...gh-5433-election-restart-recovery.test.lua |   1 +
 ...sync-clear-synchro-queue-commit-all.result |   3 +
 ...nc-clear-synchro-queue-commit-all.test.lua |   1 +
 .../replication/gh-5438-election-state.result |  66 +++
 .../gh-5438-election-state.test.lua           |  29 ++
 test/replication/gh-5440-qsync-ro.result      | 133 ------
 test/replication/gh-5440-qsync-ro.test.lua    |  53 ---
 .../gh-5446-qsync-eval-quorum.result          |   6 +
 .../gh-5446-qsync-eval-quorum.test.lua        |   2 +
 .../gh-5506-election-on-off.result            |   3 +
 .../gh-5506-election-on-off.test.lua          |   1 +
 .../gh-5566-final-join-synchro.result         |   6 +
 .../gh-5566-final-join-synchro.test.lua       |   2 +
 .../gh-5874-qsync-txn-recovery.result         |   6 +
 .../gh-5874-qsync-txn-recovery.test.lua       |   2 +
 .../gh-6032-promote-wal-write.result          |   3 +
 .../gh-6032-promote-wal-write.test.lua        |   1 +
 .../gh-6034-election-candidate-promote.result |  84 ++++
 ...h-6034-election-candidate-promote.test.lua |  42 ++
 .../gh-6034-election-promote-bump-term.result |  26 ++
 ...h-6034-election-promote-bump-term.test.lua |  12 +
 .../gh-6034-qsync-limbo-ownership.result      | 186 ++++++++
 .../gh-6034-qsync-limbo-ownership.test.lua    |  68 +++
 .../gh-6057-qsync-confirm-async-no-wal.result |   7 +
 ...h-6057-qsync-confirm-async-no-wal.test.lua |   3 +
 test/replication/hang_on_synchro_fail.result  |   6 +
 .../replication/hang_on_synchro_fail.test.lua |   2 +
 test/replication/qsync_advanced.result        |  12 +
 test/replication/qsync_advanced.test.lua      |   4 +
 test/replication/qsync_basic.result           |  33 +-
 test/replication/qsync_basic.test.lua         |  16 +-
 test/replication/qsync_errinj.result          |   6 +
 test/replication/qsync_errinj.test.lua        |   2 +
 test/replication/qsync_snapshots.result       |   6 +
 test/replication/qsync_snapshots.test.lua     |   2 +
 test/replication/qsync_with_anon.result       |   6 +
 test/replication/qsync_with_anon.test.lua     |   2 +
 test/replication/replica_rejoin.result        |  77 ++--
 test/replication/replica_rejoin.test.lua      |  50 +-
 test/replication/suite.cfg                    |   6 +-
 test/unit/raft.c                              |  15 +-
 test/unit/raft.result                         |   3 +-
 90 files changed, 1436 insertions(+), 487 deletions(-)
 create mode 100644 test/replication/gh-3055-promote-wakeup-crash.result
 create mode 100644 test/replication/gh-3055-promote-wakeup-crash.test.lua
 create mode 100644 test/replication/gh-5438-election-state.result
 create mode 100644 test/replication/gh-5438-election-state.test.lua
 delete mode 100644 test/replication/gh-5440-qsync-ro.result
 delete mode 100644 test/replication/gh-5440-qsync-ro.test.lua
 create mode 100644 test/replication/gh-6034-election-candidate-promote.result
 create mode 100644 test/replication/gh-6034-election-candidate-promote.test.lua
 create mode 100644 test/replication/gh-6034-election-promote-bump-term.result
 create mode 100644 test/replication/gh-6034-election-promote-bump-term.test.lua
 create mode 100644 test/replication/gh-6034-qsync-limbo-ownership.result
 create mode 100644 test/replication/gh-6034-qsync-limbo-ownership.test.lua

-- 
2.30.1 (Apple Git-130)
Tarantool used to send out raft state on subscribe only when raft was enabled. This was a safeguard against partially-upgraded clusters, where some nodes had no clue about Raft messages and couldn't handle them properly. Actually, Raft state should be sent out always. For example, promote will be changed to bump Raft term even when Raft is disabled, and it's important that everyone in cluster has the same term for the sake of promote at least. So, send out Raft state to every subscriber with version >= 2.6.0 (that's when Raft was introduced). Do the same for Raft broadcasts. They should be sent only to replicas with version >= 2.6.0 Closes #5438 --- src/box/box.cc | 11 ++-- src/box/relay.cc | 4 +- .../replication/gh-5438-election-state.result | 63 +++++++++++++++++++ .../gh-5438-election-state.test.lua | 28 +++++++++ test/replication/suite.cfg | 1 + 5 files changed, 100 insertions(+), 7 deletions(-) create mode 100644 test/replication/gh-5438-election-state.result create mode 100644 test/replication/gh-5438-election-state.test.lua diff --git a/src/box/box.cc b/src/box/box.cc index eeb57b04e..5dcf5b460 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -82,6 +82,7 @@ #include "msgpack.h" #include "raft.h" #include "trivia/util.h" +#include "version.h" enum { IPROTO_THREADS_MAX = 1000, @@ -2831,13 +2832,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) tt_uuid_str(&replica_uuid), sio_socketname(io->fd)); say_info("remote vclock %s local vclock %s", vclock_to_string(&replica_clock), vclock_to_string(&vclock)); - if (raft_is_enabled(box_raft())) { + if (replica_version_id >= version_id(2, 6, 0) && !anon) { /* * Send out the current raft state of the instance. Don't do - * that if Raft is disabled. It can be that a part of the - * cluster still contains old versions, which can't handle Raft - * messages. So when it is disabled, its network footprint - * should be 0. + * that if the remote instance is old. 
It can be that a part of + * the cluster still contains old versions, which can't handle + * Raft messages. Raft's network footprint should be 0 as seen + * by such instances. */ struct raft_request req; box_raft_checkpoint_remote(&req); diff --git a/src/box/relay.cc b/src/box/relay.cc index 115037fc3..60f527b7f 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -800,7 +800,7 @@ relay_subscribe_f(va_list ap) &relay->relay_pipe, NULL, NULL, cbus_process); struct relay_is_raft_enabled_msg raft_enabler; - if (!relay->replica->anon) + if (!relay->replica->anon && relay->version_id >= version_id(2, 6, 0)) relay_send_is_raft_enabled(relay, &raft_enabler, true); /* @@ -883,7 +883,7 @@ relay_subscribe_f(va_list ap) cpipe_push(&relay->tx_pipe, &relay->status_msg.msg); } - if (!relay->replica->anon) + if (!relay->replica->anon && relay->version_id >= version_id(2, 6, 0)) relay_send_is_raft_enabled(relay, &raft_enabler, false); /* diff --git a/test/replication/gh-5438-election-state.result b/test/replication/gh-5438-election-state.result new file mode 100644 index 000000000..6985f026a --- /dev/null +++ b/test/replication/gh-5438-election-state.result @@ -0,0 +1,63 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +-- +-- gh-5428 send out Raft state to subscribers, even when Raft is disabled. +-- +-- Bump Raft term while the replica's offline. +term = box.info.election.term + | --- + | ... +old_election_mode = box.cfg.election_mode + | --- + | ... +box.cfg{election_mode = 'candidate'} + | --- + | ... +test_run:wait_cond(function() return box.info.election.term > term end) + | --- + | - true + | ... + +-- Make sure the replica receives new term on subscribe. +box.cfg{election_mode = 'off'} + | --- + | ... + +box.schema.user.grant('guest', 'replication') + | --- + | ... +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') + | --- + | - true + | ... 
+test_run:cmd('start server replica') + | --- + | - true + | ... +test_run:wait_cond(function()\ + return test_run:eval('replica', 'return box.info.election.term')[1] ==\ + box.info.election.term\ +end) + | --- + | - true + | ... + +-- Cleanup. +box.cfg{election_mode = old_election_mode} + | --- + | ... +test_run:cmd('stop server replica') + | --- + | - true + | ... +test_run:cmd('delete server replica') + | --- + | - true + | ... +box.schema.user.revoke('guest', 'replication') + | --- + | ... diff --git a/test/replication/gh-5438-election-state.test.lua b/test/replication/gh-5438-election-state.test.lua new file mode 100644 index 000000000..60c3366c1 --- /dev/null +++ b/test/replication/gh-5438-election-state.test.lua @@ -0,0 +1,28 @@ +test_run = require('test_run').new() + +-- +-- gh-5428 send out Raft state to subscribers, even when Raft is disabled. +-- +-- Bump Raft term while the replica's offline. +term = box.info.election.term +old_election_mode = box.cfg.election_mode +box.cfg{election_mode = 'candidate'} +test_run:wait_cond(function() return box.info.election.term > term end) + +-- Make sure the replica receives new term on subscribe. +box.cfg{election_mode = 'off'} + +box.schema.user.grant('guest', 'replication') +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') +test_run:cmd('start server replica') +test_run:wait_cond(function()\ + return test_run:eval('replica', 'return box.info.election.term')[1] ==\ + box.info.election.term\ +end) + +-- Cleanup. 
+box.cfg{election_mode = old_election_mode} +test_run:cmd('stop server replica') +test_run:cmd('delete server replica') +box.schema.user.revoke('guest', 'replication') diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 69f2f3511..ae146c366 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -19,6 +19,7 @@ "gh-5213-qsync-applier-order-3.test.lua": {}, "gh-5426-election-on-off.test.lua": {}, "gh-5433-election-restart-recovery.test.lua": {}, + "gh-5438-election-state.test.lua": {}, "gh-5445-leader-inconsistency.test.lua": {}, "gh-5506-election-on-off.test.lua": {}, "once.test.lua": {}, -- 2.30.1 (Apple Git-130)
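The version gate above is easy to get wrong in either direction (anonymous replicas, pre-2.6 peers), so here is a minimal standalone sketch of the predicate. The `version_id()` packing (one byte each for major, minor, patch) is an assumption mirroring what Tarantool's macro of the same name is expected to do, and `relay_wants_raft()` is a hypothetical helper name, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing: one byte per version component, as in 2.6.0 -> 0x020600. */
static uint32_t
version_id(uint32_t major, uint32_t minor, uint32_t patch)
{
	return (major << 16) | (minor << 8) | patch;
}

/* Raft rows go out only to non-anonymous replicas of version >= 2.6.0. */
static bool
relay_wants_raft(uint32_t replica_version_id, bool anon)
{
	return !anon && replica_version_id >= version_id(2, 6, 0);
}
```

Comparing packed version ids with `>=` works because the byte layout makes numeric order match version order.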
txn_limbo_process() used to filter out promote requests whose term was
equal to the greatest term seen. This wasn't correct for PROMOTE entries
with term 1. Such entries appear after box.ctl.promote() is issued on an
instance with disabled elections. Every PROMOTE entry from such an
instance has term 1, but should still be applied. Fix this in the patch.

Also, when an outdated PROMOTE entry with term smaller than already
applied from some replica arrived, it wasn't filtered at all. Such a
situation shouldn't be possible, but fix it as well.

Part-of #6034
---
 src/box/txn_limbo.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index fdea287c7..6e5d6d04e 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -707,15 +707,18 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	uint32_t origin = req->origin_id;
 	if (txn_limbo_replica_term(limbo, origin) < term) {
 		vclock_follow(&limbo->promote_term_map, origin, term);
-		if (term > limbo->promote_greatest_term) {
+		if (term > limbo->promote_greatest_term)
 			limbo->promote_greatest_term = term;
-		} else if (req->type == IPROTO_PROMOTE) {
-			/*
-			 * PROMOTE for outdated term. Ignore.
-			 */
-			return;
-		}
+	} else if (req->type == IPROTO_PROMOTE &&
+		   limbo->promote_greatest_term > 1) {
+		/* PROMOTE for outdated term. Ignore. */
+		say_info("RAFT: ignoring PROMOTE request from instance "
+			 "id %u for term %llu. Greatest term seen "
+			 "before (%llu) is bigger.", origin, (long long)term,
+			 (long long)limbo->promote_greatest_term);
+		return;
 	}
+
 	int64_t lsn = req->lsn;
 	if (req->replica_id == REPLICA_ID_NIL) {
 		/*
-- 
2.30.1 (Apple Git-130)
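The filtering rule can be modelled in isolation. The sketch below is a simplified stand-in for the decision in txn_limbo_process(), with a plain array replacing the promote_term_map vclock; the `limbo_model_*` names are hypothetical. A PROMOTE is applied when it advances the origin's known term, and an out-of-date one is dropped only once some term greater than 1 has been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MODEL_REPLICAS = 8 };

/* Hypothetical trimmed-down stand-in for the limbo's term bookkeeping. */
struct limbo_model {
	uint64_t term_map[MODEL_REPLICAS]; /* per-origin terms (vclock stand-in) */
	uint64_t greatest_term;            /* max over term_map */
};

/*
 * Return true when a PROMOTE from the given origin with the given term
 * must be applied. Mirrors the patched logic: advance the maps on a
 * newer term; otherwise drop the request, unless the greatest term seen
 * is still 1, i.e. elections are disabled everywhere.
 */
static bool
limbo_model_accept_promote(struct limbo_model *l, uint32_t origin,
			   uint64_t term)
{
	assert(origin < MODEL_REPLICAS);
	if (l->term_map[origin] < term) {
		l->term_map[origin] = term;
		if (term > l->greatest_term)
			l->greatest_term = term;
		return true;
	}
	return l->greatest_term <= 1;
}
```

The `greatest_term <= 1` escape hatch is exactly the bugfix: repeated term-1 PROMOTEs from an election-disabled leader keep being applied instead of being filtered as duplicates.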
Previously PROMOTE entries, just like CONFIRM and ROLLBACK were only stored in WALs. This is because snapshots consist solely of confirmed transactions, so there's nothing to CONFIRM or ROLLBACK. PROMOTE has gained additional meaning recently: it pins limbo ownership to a specific instance, rendering everyone else read-only. So now PROMOTE information must be stored in snapshots as well. Save the latest limbo state (owner id and latest confirmed lsn) to the snapshot as a PROMOTE request. Prerequisite #6034 --- src/box/memtx_engine.c | 32 ++++++++++++++++++++++++++++++++ src/box/txn_limbo.c | 10 ++++++++++ src/box/txn_limbo.h | 7 +++++++ 3 files changed, 49 insertions(+) diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c index c85dc6af3..0b06e5e63 100644 --- a/src/box/memtx_engine.c +++ b/src/box/memtx_engine.c @@ -50,6 +50,7 @@ #include "schema.h" #include "gc.h" #include "raft.h" +#include "txn_limbo.h" /* sync snapshot every 16MB */ #define SNAP_SYNC_INTERVAL (1 << 24) @@ -225,6 +226,22 @@ memtx_engine_recover_raft(const struct xrow_header *row) return 0; } +static int +memtx_engine_recover_synchro(const struct xrow_header *row) +{ + assert(row->type == IPROTO_PROMOTE); + struct synchro_request req; + if (xrow_decode_synchro(row, &req) != 0) + return -1; + /* + * Origin id cannot be deduced from row.replica_id in a checkpoint, + * because all its rows have a zero replica_id. 
+ */ + req.origin_id = req.replica_id; + txn_limbo_process(&txn_limbo, &req); + return 0; +} + static int memtx_engine_recover_snapshot_row(struct memtx_engine *memtx, struct xrow_header *row, int *is_space_system) @@ -233,6 +250,8 @@ memtx_engine_recover_snapshot_row(struct memtx_engine *memtx, if (row->type != IPROTO_INSERT) { if (row->type == IPROTO_RAFT) return memtx_engine_recover_raft(row); + if (row->type == IPROTO_PROMOTE) + return memtx_engine_recover_synchro(row); diag_set(ClientError, ER_UNKNOWN_REQUEST_TYPE, (uint32_t) row->type); return -1; @@ -546,6 +565,7 @@ struct checkpoint { struct vclock vclock; struct xdir dir; struct raft_request raft; + struct synchro_request synchro_state; /** * Do nothing, just touch the snapshot file - the * checkpoint already exists. @@ -571,6 +591,7 @@ checkpoint_new(const char *snap_dirname, uint64_t snap_io_rate_limit) xdir_create(&ckpt->dir, snap_dirname, SNAP, &INSTANCE_UUID, &opts); vclock_create(&ckpt->vclock); box_raft_checkpoint_local(&ckpt->raft); + txn_limbo_checkpoint(&txn_limbo, &ckpt->synchro_state); ckpt->touch = false; return ckpt; } @@ -659,6 +680,15 @@ finish: return rc; } +static int +checkpoint_write_synchro(struct xlog *l, const struct synchro_request *req) +{ + struct xrow_header row; + char body[XROW_SYNCHRO_BODY_LEN_MAX]; + xrow_encode_synchro(&row, body, req); + return checkpoint_write_row(l, &row); +} + static int checkpoint_f(va_list ap) { @@ -696,6 +726,8 @@ checkpoint_f(va_list ap) } if (checkpoint_write_raft(&snap, &ckpt->raft) != 0) goto fail; + if (checkpoint_write_synchro(&snap, &ckpt->synchro_state) != 0) + goto fail; if (xlog_flush(&snap) < 0) goto fail; diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 6e5d6d04e..991c47698 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -306,6 +306,16 @@ complete: return 0; } +void +txn_limbo_checkpoint(const struct txn_limbo *limbo, + struct synchro_request *req) +{ + req->type = IPROTO_PROMOTE; + req->replica_id = 
limbo->owner_id; + req->lsn = limbo->confirmed_lsn; + req->term = limbo->promote_greatest_term; +} + static void txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn, uint64_t term) diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index 7debbc0b9..7151843f4 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -315,6 +315,13 @@ txn_limbo_wait_confirm(struct txn_limbo *limbo); int txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout); +/** + * Persist limbo state to a given synchro request. + */ +void +txn_limbo_checkpoint(const struct txn_limbo *limbo, + struct synchro_request *req); + /** * Write a PROMOTE request, which has the same effect as CONFIRM(@a lsn) and * ROLLBACK(@a lsn + 1) combined. -- 2.30.1 (Apple Git-130)
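For illustration, the checkpoint step reduces to copying three fields of limbo state into a synchro request. The structs and the `MODEL_IPROTO_PROMOTE` constant below are trimmed-down, hypothetical mirrors of the real ones (the constant's value is a placeholder, not necessarily the real IPROTO_PROMOTE code); only the field mapping follows the patch.

```c
#include <assert.h>
#include <stdint.h>

enum { MODEL_IPROTO_PROMOTE = 31 }; /* placeholder request-type code */

/* Hypothetical mirror of struct synchro_request. */
struct synchro_request_model {
	uint16_t type;       /* always PROMOTE for a checkpoint */
	uint32_t replica_id; /* current limbo owner */
	int64_t lsn;         /* latest confirmed lsn */
	uint64_t term;       /* greatest promote term seen */
};

/* Hypothetical mirror of the limbo fields the checkpoint reads. */
struct limbo_model {
	uint32_t owner_id;
	int64_t confirmed_lsn;
	uint64_t promote_greatest_term;
};

/* Mirrors txn_limbo_checkpoint(): snapshot the limbo state as a PROMOTE. */
static void
limbo_model_checkpoint(const struct limbo_model *limbo,
		       struct synchro_request_model *req)
{
	req->type = MODEL_IPROTO_PROMOTE;
	req->replica_id = limbo->owner_id;
	req->lsn = limbo->confirmed_lsn;
	req->term = limbo->promote_greatest_term;
}
```

On recovery the inverse mapping runs through txn_limbo_process(), which is why memtx_engine_recover_synchro() in the patch copies replica_id into origin_id before processing.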
The replica's version will be needed once sending limbo and election state snapshot is implemented. Prerequisite #6034 @TarantoolBot document Title: New field in JOIN request JOIN request now comes with a new field: replica's version. The field uses IPROTO_SERVER_VERSION key. --- src/box/box.cc | 7 +++++-- src/box/xrow.c | 4 +++- src/box/xrow.h | 25 +++++++++++++++---------- 3 files changed, 23 insertions(+), 13 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 5dcf5b460..6d5516682 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2495,7 +2495,9 @@ box_process_register(struct ev_io *io, struct xrow_header *header) struct tt_uuid instance_uuid = uuid_nil; struct vclock replica_vclock; - xrow_decode_register_xc(header, &instance_uuid, &replica_vclock); + uint32_t replica_version_id; + xrow_decode_register_xc(header, &instance_uuid, &replica_vclock, + &replica_version_id); if (!is_box_configured) tnt_raise(ClientError, ER_LOADING); @@ -2620,7 +2622,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header) /* Decode JOIN request */ struct tt_uuid instance_uuid = uuid_nil; - xrow_decode_join_xc(header, &instance_uuid); + uint32_t replica_version_id; + xrow_decode_join_xc(header, &instance_uuid, &replica_version_id); /* Check that bootstrap has been finished */ if (!is_box_configured) diff --git a/src/box/xrow.c b/src/box/xrow.c index 16cb2484c..400e0456f 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -1639,10 +1639,12 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid) return -1; } char *data = buf; - data = mp_encode_map(data, 1); + data = mp_encode_map(data, 2); data = mp_encode_uint(data, IPROTO_INSTANCE_UUID); /* Greet the remote replica with our replica UUID */ data = xrow_encode_uuid(data, instance_uuid); + data = mp_encode_uint(data, IPROTO_SERVER_VERSION); + data = mp_encode_uint(data, tarantool_version_id()); assert(data <= buf + size); row->body[0].iov_base = buf; diff --git a/src/box/xrow.h 
b/src/box/xrow.h index 9f61bfd47..6b06db5c1 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -471,15 +471,17 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid); * Decode JOIN command. * @param row Row to decode. * @param[out] instance_uuid. + * @param[out] version_id. * * @retval 0 Success. * @retval -1 Memory or format error. */ static inline int -xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid) +xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid, + uint32_t *version_id) { - return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL, NULL, - NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, version_id, + NULL, NULL); } /** @@ -487,15 +489,17 @@ xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid) * @param row Row to decode. * @param[out] instance_uuid Instance uuid. * @param[out] vclock Instance vclock. + * @param[out] version_id Replica version id. + * * @retval 0 Success. * @retval -1 Memory or format error. */ static inline int xrow_decode_register(struct xrow_header *row, struct tt_uuid *instance_uuid, - struct vclock *vclock) + struct vclock *vclock, uint32_t *version_id) { - return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, NULL, - NULL, NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, + version_id, NULL, NULL); } /** @@ -960,18 +964,19 @@ xrow_encode_join_xc(struct xrow_header *row, /** @copydoc xrow_decode_join. */ static inline void -xrow_decode_join_xc(struct xrow_header *row, struct tt_uuid *instance_uuid) +xrow_decode_join_xc(struct xrow_header *row, struct tt_uuid *instance_uuid, + uint32_t *version_id) { - if (xrow_decode_join(row, instance_uuid) != 0) + if (xrow_decode_join(row, instance_uuid, version_id) != 0) diag_raise(); } /** @copydoc xrow_decode_register. 
*/ static inline void xrow_decode_register_xc(struct xrow_header *row, struct tt_uuid *instance_uuid, - struct vclock *vclock) + struct vclock *vclock, uint32_t *version_id) { - if (xrow_decode_register(row, instance_uuid, vclock) != 0) + if (xrow_decode_register(row, instance_uuid, vclock, version_id) != 0) diag_raise(); } -- 2.30.1 (Apple Git-130)
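The extended JOIN body is an ordinary MessagePack map. The real code uses msgpuck's `mp_encode_*` helpers; the sketch below hand-rolls just enough of the format to show the shape of the new field. The key value `MODEL_IPROTO_SERVER_VERSION = 0x06` and the `model_version_id()` packing are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed key code for IPROTO_SERVER_VERSION. */
enum { MODEL_IPROTO_SERVER_VERSION = 0x06 };

/* Assumed (major, minor, patch) packing, one byte per component. */
static uint32_t
model_version_id(uint32_t major, uint32_t minor, uint32_t patch)
{
	return (major << 16) | (minor << 8) | patch;
}

/* MessagePack fixmap header: 0x80 | size, for maps of up to 15 pairs. */
static uint8_t *
mp_model_encode_map(uint8_t *p, uint8_t size)
{
	assert(size <= 15);
	*p++ = (uint8_t)(0x80 | size);
	return p;
}

/*
 * Unsigned int: positive fixint for small values, otherwise a big-endian
 * "uint 32" (0xce). Not the shortest form for 0x80..0xffff, but still
 * valid MessagePack and enough for this sketch.
 */
static uint8_t *
mp_model_encode_uint(uint8_t *p, uint32_t v)
{
	if (v <= 0x7f) {
		*p++ = (uint8_t)v;
	} else {
		*p++ = 0xce;
		*p++ = (uint8_t)(v >> 24);
		*p++ = (uint8_t)(v >> 16);
		*p++ = (uint8_t)(v >> 8);
		*p++ = (uint8_t)v;
	}
	return p;
}
```

Encoding just the new pair this way yields seven bytes, `81 06 ce 00 02 06 00` for version 2.6.0; the real JOIN body also carries IPROTO_INSTANCE_UUID, hence `mp_encode_map(data, 2)` in the patch.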
The new META stage is part of server's response to a join request. It's marked by IPROTO_JOIN_META and IPROTO_JOIN_SNAPSHOT requests and goes before the actual snapshot data. Prerequisite #6034 @TarantoolBot document Title: new protocol stage during JOIN A new stage is added to the stream of JOIN rows coming from master. The stage is marked with a bodyless row with type IPROTO_JOIN_META = 71 Once all the rows from the stage are sent out, the JOIN continues as before (as a stream of snapshot rows). The end of META stage is marked with a row of type IPROTO_JOIN_SNAPSHOT = 72 The stage contains the rows that are necessary for instance initialization (current Raft term, current state of synchronous transaction queue), but do not belong to any system space. --- src/box/applier.cc | 17 ++++++++++++++++- src/box/box.cc | 5 +++-- src/box/iproto_constants.h | 2 ++ src/box/relay.cc | 19 ++++++++++++++++++- src/box/relay.h | 4 +++- 5 files changed, 42 insertions(+), 5 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index 07fe7f5c7..0f81b7cc4 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -447,12 +447,26 @@ applier_wait_snapshot(struct applier *applier) xrow_decode_vclock_xc(&row, &replicaset.vclock); } + coio_read_xrow(coio, ibuf, &row); + if (row.type == IPROTO_JOIN_META) { + /* Read additional metadata. Empty at the moment. */ + do { + coio_read_xrow(coio, ibuf, &row); + if (iproto_type_is_error(row.type)) { + xrow_decode_error_xc(&row); + } else if (row.type != IPROTO_JOIN_SNAPSHOT) { + tnt_raise(ClientError, ER_UNKNOWN_REQUEST_TYPE, + (uint32_t)row.type); + } + } while (row.type != IPROTO_JOIN_SNAPSHOT); + coio_read_xrow(coio, ibuf, &row); + } + /* * Receive initial data. 
*/ uint64_t row_count = 0; while (true) { - coio_read_xrow(coio, ibuf, &row); applier->last_row_time = ev_monotonic_now(loop()); if (iproto_type_is_dml(row.type)) { if (apply_snapshot_row(&row) != 0) @@ -477,6 +491,7 @@ applier_wait_snapshot(struct applier *applier) tnt_raise(ClientError, ER_UNKNOWN_REQUEST_TYPE, (uint32_t) row.type); } + coio_read_xrow(coio, ibuf, &row); } return row_count; diff --git a/src/box/box.cc b/src/box/box.cc index 6d5516682..8c695686e 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2474,7 +2474,7 @@ box_process_fetch_snapshot(struct ev_io *io, struct xrow_header *header) /* Send the snapshot data to the instance. */ struct vclock start_vclock; - relay_initial_join(io->fd, header->sync, &start_vclock); + relay_initial_join(io->fd, header->sync, &start_vclock, 0); say_info("read-view sent."); /* Remember master's vclock after the last request */ @@ -2672,7 +2672,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header) * Initial stream: feed replica with dirty data from engines. */ struct vclock start_vclock; - relay_initial_join(io->fd, header->sync, &start_vclock); + relay_initial_join(io->fd, header->sync, &start_vclock, + replica_version_id); say_info("initial data sent."); /** diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index 137bee9da..e913801a8 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -261,6 +261,8 @@ enum iproto_type { IPROTO_FETCH_SNAPSHOT = 69, /** REGISTER request to leave anonymous replication. 
*/ IPROTO_REGISTER = 70, + IPROTO_JOIN_META = 71, + IPROTO_JOIN_SNAPSHOT = 72, /** Vinyl run info stored in .index file */ VY_INDEX_RUN_INFO = 100, diff --git a/src/box/relay.cc b/src/box/relay.cc index 60f527b7f..4ebe0fb06 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -392,7 +392,8 @@ relay_set_cord_name(int fd) } void -relay_initial_join(int fd, uint64_t sync, struct vclock *vclock) +relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, + uint32_t replica_version_id) { struct relay *relay = relay_new(NULL); if (relay == NULL) @@ -432,6 +433,22 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock) row.sync = sync; coio_write_xrow(&relay->io, &row); + /* + * Version is present starting with 2.7.3, 2.8.2, 2.9.1 + * All these versions know of additional META stage of initial join. + */ + if (replica_version_id > 0) { + /* Mark the beginning of the metadata stream. */ + row.type = IPROTO_JOIN_META; + coio_write_xrow(&relay->io, &row); + + /* Empty at the moment. */ + + /* Mark the end of the metadata stream. */ + row.type = IPROTO_JOIN_SNAPSHOT; + coio_write_xrow(&relay->io, &row); + } + /* Send read view to the replica. */ engine_join_xc(&ctx, &relay->stream); } diff --git a/src/box/relay.h b/src/box/relay.h index 615ffb75d..112428ae8 100644 --- a/src/box/relay.h +++ b/src/box/relay.h @@ -116,9 +116,11 @@ relay_push_raft(struct relay *relay, const struct raft_request *req); * @param fd client connection * @param sync sync from incoming JOIN request * @param vclock[out] vclock of the read view sent to the replica + * @param replica_version_id peer's version */ void -relay_initial_join(int fd, uint64_t sync, struct vclock *vclock); +relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, + uint32_t replica_version_id); /** * Send final JOIN rows to the replica. -- 2.30.1 (Apple Git-130)
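The framing can be checked with a toy validator: an initial-join stream is either plain snapshot data (old master) or a META section bracketed by row types 71 and 72 followed by snapshot data. This models the reader loop in applier_wait_snapshot() but is not the real parser; `MODEL_IPROTO_INSERT` is a placeholder standing in for any ordinary data row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* JOIN_META/JOIN_SNAPSHOT codes are from the patch; INSERT is a stand-in. */
enum {
	MODEL_IPROTO_INSERT = 1,
	MODEL_IPROTO_JOIN_META = 71,
	MODEL_IPROTO_JOIN_SNAPSHOT = 72,
};

/*
 * Check that a sequence of row types is a well-formed initial-join
 * payload: an optional META section (71 ... 72) followed by data rows.
 */
static bool
join_stream_is_valid(const uint32_t *rows, size_t n)
{
	size_t i = 0;
	if (n > 0 && rows[0] == MODEL_IPROTO_JOIN_META) {
		/* Skip metadata rows until the end-of-META marker. */
		for (i = 1; i < n && rows[i] != MODEL_IPROTO_JOIN_SNAPSHOT; i++)
			;
		if (i == n)
			return false; /* META section never terminated */
		i++; /* step past JOIN_SNAPSHOT */
	}
	/* The rest must be plain snapshot data, no stray markers. */
	for (; i < n; i++) {
		if (rows[i] == MODEL_IPROTO_JOIN_META ||
		    rows[i] == MODEL_IPROTO_JOIN_SNAPSHOT)
			return false;
	}
	return true;
}
```

Making the META section optional is what keeps the change backward compatible: a master only emits it when the replica reported a version, and an old master's stream still parses as pure snapshot data.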
A joining instance may never receive the latest PROMOTE request, which is the only source of information about the limbo owner. Send out the latest limbo state (e.g. the latest applied PROMOTE request) together with the initial join snapshot. Prerequisite #6034 --- src/box/applier.cc | 5 ++ src/box/relay.cc | 9 ++- test/replication/replica_rejoin.result | 77 ++++++++++++++---------- test/replication/replica_rejoin.test.lua | 50 +++++++-------- 4 files changed, 85 insertions(+), 56 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index 0f81b7cc4..4088fcc21 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -454,6 +454,11 @@ applier_wait_snapshot(struct applier *applier) coio_read_xrow(coio, ibuf, &row); if (iproto_type_is_error(row.type)) { xrow_decode_error_xc(&row); + } else if (iproto_type_is_promote_request(row.type)) { + struct synchro_request req; + if (xrow_decode_synchro(&row, &req) != 0) + diag_raise(); + txn_limbo_process(&txn_limbo, &req); } else if (row.type != IPROTO_JOIN_SNAPSHOT) { tnt_raise(ClientError, ER_UNKNOWN_REQUEST_TYPE, (uint32_t)row.type); diff --git a/src/box/relay.cc b/src/box/relay.cc index 4ebe0fb06..4b102a777 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -427,6 +427,9 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, if (txn_limbo_wait_confirm(&txn_limbo) != 0) diag_raise(); + struct synchro_request req; + txn_limbo_checkpoint(&txn_limbo, &req); + /* Respond to the JOIN request with the current vclock. */ struct xrow_header row; xrow_encode_vclock_xc(&row, vclock); @@ -442,7 +445,11 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, row.type = IPROTO_JOIN_META; coio_write_xrow(&relay->io, &row); - /* Empty at the moment. */ + char body[XROW_SYNCHRO_BODY_LEN_MAX]; + xrow_encode_synchro(&row, body, &req); + row.replica_id = req.replica_id; + row.sync = sync; + coio_write_xrow(&relay->io, &row); /* Mark the end of the metadata stream. 
*/ row.type = IPROTO_JOIN_SNAPSHOT; diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result index 843333a19..e489c150a 100644 --- a/test/replication/replica_rejoin.result +++ b/test/replication/replica_rejoin.result @@ -7,10 +7,19 @@ test_run = env.new() log = require('log') --- ... -engine = test_run:get_cfg('engine') +test_run:cmd("create server master with script='replication/master1.lua'") --- +- true ... -test_run:cleanup_cluster() +test_run:cmd("start server master") +--- +- true +... +test_run:switch("master") +--- +- true +... +engine = test_run:get_cfg('engine') --- ... -- @@ -43,7 +52,7 @@ _ = box.space.test:insert{3} --- ... -- Join a replica, then stop it. -test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'") +test_run:cmd("create server replica with rpl_master=master, script='replication/replica_rejoin.lua'") --- - true ... @@ -65,7 +74,7 @@ box.space.test:select() - [2] - [3] ... -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... @@ -75,7 +84,7 @@ test_run:cmd("stop server replica") ... -- Restart the server to purge the replica from -- the garbage collection state. -test_run:cmd("restart server default") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} --- ... @@ -146,7 +155,7 @@ box.space.test:select() - [20] - [30] ... -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... @@ -154,7 +163,7 @@ test_run:cmd("switch default") for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end --- ... -vclock = test_run:get_vclock('default') +vclock = test_run:get_vclock('master') --- ... vclock[0] = nil @@ -191,7 +200,7 @@ box.space.test:replace{1, 2, 3} -- bumps LSN on the replica --- - [1, 2, 3] ... -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... @@ -199,7 +208,7 @@ test_run:cmd("stop server replica") --- - true ... 
-test_run:cmd("restart server default") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} --- ... @@ -253,7 +262,7 @@ box.space.test:select() -- from the replica. -- -- Bootstrap a new replica. -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... @@ -295,7 +304,7 @@ box.cfg{replication = ''} --- ... -- Bump vclock on the master. -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... @@ -317,15 +326,15 @@ vclock = test_run:get_vclock('replica') vclock[0] = nil --- ... -_ = test_run:wait_vclock('default', vclock) +_ = test_run:wait_vclock('master', vclock) --- ... -- Restart the master and force garbage collection. -test_run:cmd("switch default") +test_run:cmd("switch master") --- - true ... -test_run:cmd("restart server default") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} --- ... @@ -373,7 +382,7 @@ vclock = test_run:get_vclock('replica') vclock[0] = nil --- ... -_ = test_run:wait_vclock('default', vclock) +_ = test_run:wait_vclock('master', vclock) --- ... -- Restart the replica. It should successfully rebootstrap. @@ -396,38 +405,42 @@ test_run:cmd("switch default") --- - true ... -box.cfg{replication = ''} +test_run:cmd("stop server replica") --- +- true ... -test_run:cmd("stop server replica") +test_run:cmd("delete server replica") --- - true ... -test_run:cmd("cleanup server replica") +test_run:cmd("stop server master") --- - true ... -test_run:cmd("delete server replica") +test_run:cmd("delete server master") --- - true ... -test_run:cleanup_cluster() +-- +-- gh-4107: rebootstrap fails if the replica was deleted from +-- the cluster on the master. +-- +test_run:cmd("create server master with script='replication/master1.lua'") --- +- true ... -box.space.test:drop() +test_run:cmd("start server master") --- +- true ... -box.schema.user.revoke('guest', 'replication') +test_run:switch("master") --- +- true ... 
--- --- gh-4107: rebootstrap fails if the replica was deleted from --- the cluster on the master. --- box.schema.user.grant('guest', 'replication') --- ... -test_run:cmd("create server replica with rpl_master=default, script='replication/replica_uuid.lua'") +test_run:cmd("create server replica with rpl_master=master, script='replication/replica_uuid.lua'") --- - true ... @@ -462,11 +475,11 @@ box.space._cluster:get(2) ~= nil --- - true ... -test_run:cmd("stop server replica") +test_run:switch("default") --- - true ... -test_run:cmd("cleanup server replica") +test_run:cmd("stop server replica") --- - true ... @@ -474,9 +487,11 @@ test_run:cmd("delete server replica") --- - true ... -box.schema.user.revoke('guest', 'replication') +test_run:cmd("stop server master") --- +- true ... -test_run:cleanup_cluster() +test_run:cmd("delete server master") --- +- true ... diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua index c3ba9bf3f..2563177cf 100644 --- a/test/replication/replica_rejoin.test.lua +++ b/test/replication/replica_rejoin.test.lua @@ -1,9 +1,11 @@ env = require('test_run') test_run = env.new() log = require('log') -engine = test_run:get_cfg('engine') -test_run:cleanup_cluster() +test_run:cmd("create server master with script='replication/master1.lua'") +test_run:cmd("start server master") +test_run:switch("master") +engine = test_run:get_cfg('engine') -- -- gh-5806: this replica_rejoin test relies on the wal cleanup fiber @@ -23,17 +25,17 @@ _ = box.space.test:insert{2} _ = box.space.test:insert{3} -- Join a replica, then stop it. 
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'") +test_run:cmd("create server replica with rpl_master=master, script='replication/replica_rejoin.lua'") test_run:cmd("start server replica") test_run:cmd("switch replica") box.info.replication[1].upstream.status == 'follow' or log.error(box.info) box.space.test:select() -test_run:cmd("switch default") +test_run:cmd("switch master") test_run:cmd("stop server replica") -- Restart the server to purge the replica from -- the garbage collection state. -test_run:cmd("restart server default") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} -- Make some checkpoints to remove old xlogs. @@ -58,11 +60,11 @@ box.info.replication[2].downstream.vclock ~= nil or log.error(box.info) test_run:cmd("switch replica") box.info.replication[1].upstream.status == 'follow' or log.error(box.info) box.space.test:select() -test_run:cmd("switch default") +test_run:cmd("switch master") -- Make sure the replica follows new changes. for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end -vclock = test_run:get_vclock('default') +vclock = test_run:get_vclock('master') vclock[0] = nil _ = test_run:wait_vclock('replica', vclock) test_run:cmd("switch replica") @@ -76,9 +78,9 @@ box.space.test:select() -- Check that rebootstrap is NOT initiated unless the replica -- is strictly behind the master. box.space.test:replace{1, 2, 3} -- bumps LSN on the replica -test_run:cmd("switch default") +test_run:cmd("switch master") test_run:cmd("stop server replica") -test_run:cmd("restart server default") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} checkpoint_count = box.cfg.checkpoint_count box.cfg{checkpoint_count = 1} @@ -99,7 +101,7 @@ box.space.test:select() -- -- Bootstrap a new replica. 
-test_run:cmd("switch default") +test_run:cmd("switch master") test_run:cmd("stop server replica") test_run:cmd("cleanup server replica") test_run:cleanup_cluster() @@ -113,17 +115,17 @@ box.cfg{replication = replica_listen} test_run:cmd("switch replica") box.cfg{replication = ''} -- Bump vclock on the master. -test_run:cmd("switch default") +test_run:cmd("switch master") box.space.test:replace{1} -- Bump vclock on the replica. test_run:cmd("switch replica") for i = 1, 10 do box.space.test:replace{2} end vclock = test_run:get_vclock('replica') vclock[0] = nil -_ = test_run:wait_vclock('default', vclock) +_ = test_run:wait_vclock('master', vclock) -- Restart the master and force garbage collection. -test_run:cmd("switch default") -test_run:cmd("restart server default") +test_run:cmd("switch master") +test_run:cmd("restart server master") box.cfg{wal_cleanup_delay = 0} replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'") replica_listen ~= nil @@ -139,7 +141,7 @@ test_run:cmd("switch replica") for i = 1, 10 do box.space.test:replace{2} end vclock = test_run:get_vclock('replica') vclock[0] = nil -_ = test_run:wait_vclock('default', vclock) +_ = test_run:wait_vclock('master', vclock) -- Restart the replica. It should successfully rebootstrap. test_run:cmd("restart server replica with args='true'") box.space.test:select() @@ -148,20 +150,20 @@ box.space.test:replace{2} -- Cleanup. test_run:cmd("switch default") -box.cfg{replication = ''} test_run:cmd("stop server replica") -test_run:cmd("cleanup server replica") test_run:cmd("delete server replica") -test_run:cleanup_cluster() -box.space.test:drop() -box.schema.user.revoke('guest', 'replication') +test_run:cmd("stop server master") +test_run:cmd("delete server master") -- -- gh-4107: rebootstrap fails if the replica was deleted from -- the cluster on the master. 
-- +test_run:cmd("create server master with script='replication/master1.lua'") +test_run:cmd("start server master") +test_run:switch("master") box.schema.user.grant('guest', 'replication') -test_run:cmd("create server replica with rpl_master=default, script='replication/replica_uuid.lua'") +test_run:cmd("create server replica with rpl_master=master, script='replication/replica_uuid.lua'") start_cmd = string.format("start server replica with args='%s'", require('uuid').new()) box.space._cluster:get(2) == nil test_run:cmd(start_cmd) @@ -170,8 +172,8 @@ test_run:cmd("cleanup server replica") box.space._cluster:delete(2) ~= nil test_run:cmd(start_cmd) box.space._cluster:get(2) ~= nil +test_run:switch("default") test_run:cmd("stop server replica") -test_run:cmd("cleanup server replica") test_run:cmd("delete server replica") -box.schema.user.revoke('guest', 'replication') -test_run:cleanup_cluster() +test_run:cmd("stop server master") +test_run:cmd("delete server master") -- 2.30.1 (Apple Git-130)
Make Raft nodes send out their latest persisted term to joining replicas. This is needed to avoid a situation when the txn_limbo-managed 'promote greatest term' is greater than the current Raft term. Otherwise the following may happen: a replica joins off some instance and receives its latest limbo state. The state includes the "greatest term seen" and makes the limbo filter out any data coming from instances with smaller terms. Imagine that the master this replica has joined from dies before the replica has a chance to subscribe to it. Then the replica doesn't receive the master's current Raft term and starts elections at the smallest term possible, 2 (when there are no suitable Raft nodes besides the replica). Once elections are won in such a small term, a ton of problems arise: starting with filtering out PROMOTE requests for an "old" term and nop-ifying any data coming from terms smaller than the "greatest term seen". Prerequisite #6034 --- src/box/applier.cc | 5 +++++ src/box/relay.cc | 6 ++++++ 2 files changed, 11 insertions(+) diff --git a/src/box/applier.cc b/src/box/applier.cc index 4088fcc21..92ec088ea 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -459,6 +459,11 @@ applier_wait_snapshot(struct applier *applier) if (xrow_decode_synchro(&row, &req) != 0) diag_raise(); txn_limbo_process(&txn_limbo, &req); + } else if (iproto_type_is_raft_request(row.type)) { + struct raft_request req; + if (xrow_decode_raft(&row, &req, NULL) != 0) + diag_raise(); + box_raft_recover(&req); } else if (row.type != IPROTO_JOIN_SNAPSHOT) { tnt_raise(ClientError, ER_UNKNOWN_REQUEST_TYPE, (uint32_t)row.type); diff --git a/src/box/relay.cc b/src/box/relay.cc index 4b102a777..70f1a045b 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -428,7 +428,9 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, diag_raise(); struct synchro_request req; + struct raft_request raft_req; txn_limbo_checkpoint(&txn_limbo, &req); + box_raft_checkpoint_local(&raft_req); /* Respond to the JOIN request with the
current vclock. */ struct xrow_header row; @@ -451,6 +453,10 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, row.sync = sync; coio_write_xrow(&relay->io, &row); + xrow_encode_raft(&row, &fiber()->gc, &raft_req); + row.sync = sync; + coio_write_xrow(&relay->io, &row); + /* Mark the end of the metadata stream. */ row.type = IPROTO_JOIN_SNAPSHOT; coio_write_xrow(&relay->io, &row); -- 2.30.1 (Apple Git-130)
Make raft_new_term() always bump the current term, even when Raft is disabled. Part-of #6034 --- src/box/box.cc | 2 +- src/lib/raft/raft.c | 3 +-- test/unit/raft.c | 15 ++++++++++++++- test/unit/raft.result | 3 ++- 4 files changed, 18 insertions(+), 5 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 8c695686e..86370514a 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -3513,7 +3513,7 @@ box_cfg_xc(void) if (!is_bootstrap_leader) { replicaset_sync(); - } else { + } else if (raft_is_enabled(box_raft())) { /* * When the cluster is just bootstrapped and this instance is a * leader, it makes no sense to wait for a leader appearance. diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index ef11ef89f..22f38b330 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -988,8 +988,7 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock) void raft_new_term(struct raft *raft) { - if (raft->is_enabled) - raft_sm_schedule_new_term(raft, raft->volatile_term + 1); + raft_sm_schedule_new_term(raft, raft->volatile_term + 1); } static void diff --git a/test/unit/raft.c b/test/unit/raft.c index b9bc49b78..9637abf43 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1184,7 +1184,7 @@ raft_test_death_timeout(void) static void raft_test_enable_disable(void) { - raft_start_test(10); + raft_start_test(11); struct raft_node node; raft_node_create(&node); @@ -1276,7 +1276,20 @@ raft_test_enable_disable(void) "{0: 2}" /* Vclock. */ ), "nothing changed"); + /* Disabled node still bumps the term when needed. */ + raft_node_new_term(&node); + + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 4 /* Term. */, + 0 /* Vote. */, + 4 /* Volatile term. */, + 0 /* Volatile vote. */, + "{0: 3}" /* Vclock. 
*/ + ), "term bump when disabled"); raft_node_destroy(&node); + raft_finish_test(); } diff --git a/test/unit/raft.result b/test/unit/raft.result index f89cd1658..41463db69 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -205,7 +205,7 @@ ok 10 - subtests ok 11 - subtests *** raft_test_death_timeout: done *** *** raft_test_enable_disable *** - 1..10 + 1..11 ok 1 - accepted a leader notification ok 2 - leader is seen ok 3 - death timeout passed @@ -216,6 +216,7 @@ ok 11 - subtests ok 8 - became leader ok 9 - resigned from leader state ok 10 - nothing changed + ok 11 - term bump when disabled ok 12 - subtests *** raft_test_enable_disable: done *** *** raft_test_too_long_wal_write *** -- 2.30.1 (Apple Git-130)
box_promote() is a monster. It does a lot of different things based on flags: try_wait and run_elections. The flags themselves depend on the node's Raft state and the lunar calendar. Moreover, there are multiple cancellation points and places where external state may have changed and needs a re-check. Things are going to get even worse with the introduction of box.ctl.demote(). So it's time to split up box_promote() into reasonable parts, each doing exactly one thing. This commit mostly addresses the multiple cancellation points issue, so that promote() doesn't look like a huge pile of if(something_changed) blocks. Some other functions will look like that instead. Part of #6034 --- src/box/box.cc | 269 ++++++++++++++++++++++++++++--------------------- 1 file changed, 155 insertions(+), 114 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 86370514a..445875f8f 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1527,6 +1527,147 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum, return 0; } +/** + * A helper to start new Raft election round and wait until the election results + * are known. + * Returns 0 in case this instance has won the elections, -1 otherwise. + */ +static int +box_run_elections(void) +{ + assert(box_raft()->is_enabled); + assert(box_election_mode != ELECTION_MODE_VOTER); + /* + * Make this instance a candidate and run until some leader, not + * necessarily this instance, emerges. + */ + raft_start_candidate(box_raft()); + /* + * Trigger new elections without waiting for an old leader to + * disappear. 
+ */ + raft_new_term(box_raft()); + int rc = box_raft_wait_leader_found(); + + if (box_election_mode == ELECTION_MODE_MANUAL) + raft_stop_candidate(box_raft(), false); + if (rc != 0) + return -1; + if (!box_raft()->is_enabled) { + diag_set(ClientError, ER_RAFT_DISABLED); + return -1; + } + if (box_raft()->state != RAFT_STATE_LEADER) { + diag_set(ClientError, ER_INTERFERING_PROMOTE, + box_raft()->leader); + return -1; + } + + return 0; +} + +/** + * Check whether the greatest promote term has changed since it was last read. + * IOW check that a foreign PROMOTE arrived while we were sleeping. + */ +static int +box_check_promote_term_changed(uint64_t promote_term) +{ + if (txn_limbo.promote_greatest_term != promote_term) { + diag_set(ClientError, ER_INTERFERING_PROMOTE, + txn_limbo.owner_id); + return -1; + } + return 0; +} + +/** Try waiting until limbo is emptied up to given timeout. */ +static int +box_try_wait_confirm(double timeout) +{ + uint64_t promote_term = txn_limbo.promote_greatest_term; + txn_limbo_wait_empty(&txn_limbo, timeout); + return box_check_promote_term_changed(promote_term); +} + +/** + * A helper to wait until all limbo entries are ready to be confirmed, i.e. + * written to WAL and have gathered a quorum of ACKs from replicas. + * Return lsn of the last limbo entry on success, -1 on error. + */ +static int64_t +box_wait_limbo_acked(void) +{ + if (txn_limbo_is_empty(&txn_limbo)) + return txn_limbo.confirmed_lsn; + + uint64_t promote_term = txn_limbo.promote_greatest_term; + int quorum = replication_synchro_quorum; + struct txn_limbo_entry *last_entry; + last_entry = txn_limbo_last_synchro_entry(&txn_limbo); + /* Wait for the last entries WAL write. 
*/ + if (last_entry->lsn < 0) { + int64_t tid = last_entry->txn->id; + + if (wal_sync(NULL) < 0) + return -1; + + if (box_check_promote_term_changed(promote_term) < 0) + return -1; + if (txn_limbo_is_empty(&txn_limbo)) + return txn_limbo.confirmed_lsn; + if (tid != txn_limbo_last_synchro_entry(&txn_limbo)->txn->id) { + diag_set(ClientError, ER_QUORUM_WAIT, quorum, + "new synchronous transactions appeared"); + return -1; + } + } + assert(last_entry->lsn > 0); + int64_t wait_lsn = last_entry->lsn; + + if (box_wait_quorum(txn_limbo.owner_id, wait_lsn, quorum, + replication_synchro_timeout) < 0) + return -1; + + if (box_check_promote_term_changed(promote_term) < 0) + return -1; + + if (txn_limbo_is_empty(&txn_limbo)) + return txn_limbo.confirmed_lsn; + + if (quorum < replication_synchro_quorum) { + diag_set(ClientError, ER_QUORUM_WAIT, quorum, + "quorum was increased while waiting"); + return -1; + } + if (wait_lsn < txn_limbo_last_synchro_entry(&txn_limbo)->lsn) { + diag_set(ClientError, ER_QUORUM_WAIT, quorum, + "new synchronous transactions appeared"); + return -1; + } + + return wait_lsn; +} + +/** Write and process a PROMOTE request. 
*/ +static void +box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) +{ + assert(box_raft()->volatile_term == box_raft()->term); + assert(promote_lsn >= 0); + txn_limbo_write_promote(&txn_limbo, promote_lsn, + box_raft()->term); + struct synchro_request req = { + .type = IPROTO_PROMOTE, + .replica_id = prev_leader_id, + .origin_id = instance_id, + .lsn = promote_lsn, + .term = box_raft()->term, + }; + txn_limbo_process(&txn_limbo, &req); + assert(txn_limbo_is_empty(&txn_limbo)); +} + int box_promote(void) { @@ -1537,6 +1678,10 @@ box_promote(void) "simultaneous invocations"); return -1; } + in_promote = true; + auto promote_guard = make_scoped_guard([&] { + in_promote = false; + }); /* * Do nothing when box isn't configured and when PROMOTE was already @@ -1582,122 +1727,18 @@ box_promote(void) unreachable(); } - uint32_t former_leader_id = txn_limbo.owner_id; - int64_t wait_lsn = txn_limbo.confirmed_lsn; - int rc = 0; - int quorum = replication_synchro_quorum; - in_promote = true; - auto promote_guard = make_scoped_guard([&] { - in_promote = false; - }); - - if (run_elections) { - /* - * Make this instance a candidate and run until some leader, not - * necessarily this instance, emerges. - */ - raft_start_candidate(box_raft()); - /* - * Trigger new elections without waiting for an old leader to - * disappear. - */ - raft_new_term(box_raft()); - rc = box_raft_wait_leader_found(); - /* - * Do not reset raft mode if it was changed while running the - * elections. 
- */ - if (box_election_mode == ELECTION_MODE_MANUAL) - raft_stop_candidate(box_raft(), false); - if (rc != 0) - return -1; - if (!box_raft()->is_enabled) { - diag_set(ClientError, ER_RAFT_DISABLED); - return -1; - } - if (box_raft()->state != RAFT_STATE_LEADER) { - diag_set(ClientError, ER_INTERFERING_PROMOTE, - box_raft()->leader); - return -1; - } - } - - if (txn_limbo_is_empty(&txn_limbo)) - goto promote; + int64_t wait_lsn = -1; - if (try_wait) { - /* Wait until pending confirmations/rollbacks reach us. */ - double timeout = 2 * replication_synchro_timeout; - txn_limbo_wait_empty(&txn_limbo, timeout); - /* - * Our mission was to clear the limbo from former leader's - * transactions. Exit in case someone did that for us. - */ - if (former_leader_id != txn_limbo.owner_id) { - diag_set(ClientError, ER_INTERFERING_PROMOTE, - txn_limbo.owner_id); - return -1; - } - if (txn_limbo_is_empty(&txn_limbo)) { - wait_lsn = txn_limbo.confirmed_lsn; - goto promote; - } - } - - struct txn_limbo_entry *last_entry; - last_entry = txn_limbo_last_synchro_entry(&txn_limbo); - /* Wait for the last entries WAL write. 
*/ - if (last_entry->lsn < 0) { - int64_t tid = last_entry->txn->id; - if (wal_sync(NULL) < 0) - return -1; - if (former_leader_id != txn_limbo.owner_id) { - diag_set(ClientError, ER_INTERFERING_PROMOTE, - txn_limbo.owner_id); - return -1; - } - if (txn_limbo_is_empty(&txn_limbo)) { - wait_lsn = txn_limbo.confirmed_lsn; - goto promote; - } - if (tid != txn_limbo_last_synchro_entry(&txn_limbo)->txn->id) { - diag_set(ClientError, ER_QUORUM_WAIT, quorum, - "new synchronous transactions appeared"); - return -1; - } - } - wait_lsn = last_entry->lsn; - assert(wait_lsn > 0); + if (run_elections && box_run_elections() < 0) + return -1; + if (try_wait && + box_try_wait_confirm(2 * replication_synchro_timeout) < 0) + return -1; + if ((wait_lsn = box_wait_limbo_acked()) < 0) + return -1; - rc = box_wait_quorum(former_leader_id, wait_lsn, quorum, - replication_synchro_timeout); - if (rc == 0) { - if (quorum < replication_synchro_quorum) { - diag_set(ClientError, ER_QUORUM_WAIT, quorum, - "quorum was increased while waiting"); - rc = -1; - } else if (wait_lsn < txn_limbo_last_synchro_entry(&txn_limbo)->lsn) { - diag_set(ClientError, ER_QUORUM_WAIT, quorum, - "new synchronous transactions appeared"); - rc = -1; - } else { -promote: - /* We cannot possibly get here in a volatile state. */ - assert(box_raft()->volatile_term == box_raft()->term); - txn_limbo_write_promote(&txn_limbo, wait_lsn, - box_raft()->term); - struct synchro_request req = { - .type = IPROTO_PROMOTE, - .replica_id = former_leader_id, - .origin_id = instance_id, - .lsn = wait_lsn, - .term = box_raft()->term, - }; - txn_limbo_process(&txn_limbo, &req); - assert(txn_limbo_is_empty(&txn_limbo)); - } - } - return rc; + box_issue_promote(txn_limbo.owner_id, wait_lsn); + return 0; } int -- 2.30.1 (Apple Git-130)
When called without elections, promote resulted in multiple PROMOTE entries for the same term. This is not correct, because all the promotions for the same term except the first one would be ignored as already seen. Part-of #6034 --- src/box/box.cc | 20 ++++++++-- src/box/raft.c | 40 +++++++++++++++++++ src/box/raft.h | 4 ++ .../gh-4114-local-space-replication.result | 7 ++-- .../gh-4114-local-space-replication.test.lua | 4 +- .../gh-6034-election-promote-bump-term.result | 21 ++++++++++ ...h-6034-election-promote-bump-term.test.lua | 9 +++++ test/replication/suite.cfg | 1 + 8 files changed, 97 insertions(+), 9 deletions(-) create mode 100644 test/replication/gh-6034-election-promote-bump-term.result create mode 100644 test/replication/gh-6034-election-promote-bump-term.test.lua diff --git a/src/box/box.cc b/src/box/box.cc index 445875f8f..ac6c487e4 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1581,6 +1581,17 @@ box_check_promote_term_changed(uint64_t promote_term) return 0; } +/** Trigger a new election round but don't wait for its result. */ +static int +box_trigger_elections(void) +{ + uint64_t promote_term = txn_limbo.promote_greatest_term; + raft_new_term(box_raft()); + if (box_raft_wait_term_persisted() < 0) + return -1; + return box_check_promote_term_changed(promote_term); +} + /** Try waiting until limbo is emptied up to given timeout. 
*/ static int box_try_wait_confirm(double timeout) @@ -1731,9 +1742,12 @@ box_promote(void) if (run_elections && box_run_elections() < 0) return -1; - if (try_wait && - box_try_wait_confirm(2 * replication_synchro_timeout) < 0) - return -1; + if (try_wait) { + if (box_try_wait_confirm(2 * replication_synchro_timeout) < 0) + return -1; + if (box_trigger_elections() < 0) + return -1; + } if ((wait_lsn = box_wait_limbo_acked()) < 0) return -1; diff --git a/src/box/raft.c b/src/box/raft.c index 7f787c0c5..b04932cd9 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -354,6 +354,46 @@ box_raft_wait_leader_found(void) return 0; } +struct raft_wait_persisted_data { + struct fiber *waiter; + uint64_t term; +}; + +static int +box_raft_wait_term_persisted_f(struct trigger *trig, void *event) +{ + struct raft *raft = event; + struct raft_wait_persisted_data *data = trig->data; + if (raft->term >= data->term) + fiber_wakeup(data->waiter); + return 0; +} + +int +box_raft_wait_term_persisted(void) +{ + if (box_raft()->term == box_raft()->volatile_term) + return 0; + struct raft_wait_persisted_data data = { + .waiter = fiber(), + .term = box_raft()->volatile_term, + }; + struct trigger trig; + trigger_create(&trig, box_raft_wait_term_persisted_f, &data, NULL); + raft_on_update(box_raft(), &trig); + + do { + fiber_yield(); + } while (box_raft()->term < data.term && !fiber_is_cancelled()); + + trigger_clear(&trig); + if (fiber_is_cancelled()) { + diag_set(FiberIsCancelled); + return -1; + } + return 0; +} + void box_raft_init(void) { diff --git a/src/box/raft.h b/src/box/raft.h index 6b6136510..4c9c7306d 100644 --- a/src/box/raft.h +++ b/src/box/raft.h @@ -101,6 +101,10 @@ box_raft_process(struct raft_request *req, uint32_t source); int box_raft_wait_leader_found(); +/** Block this fiber until the current volatile term is persisted. 
*/ +int +box_raft_wait_term_persisted(void); + void box_raft_init(void); diff --git a/test/replication/gh-4114-local-space-replication.result b/test/replication/gh-4114-local-space-replication.result index 9b63a4b99..e71eb60a8 100644 --- a/test/replication/gh-4114-local-space-replication.result +++ b/test/replication/gh-4114-local-space-replication.result @@ -45,9 +45,8 @@ test_run:cmd('switch replica') | --- | - true | ... -box.info.vclock[0] +a = box.info.vclock[0] or 0 | --- - | - null | ... box.cfg{checkpoint_count=1} | --- @@ -77,9 +76,9 @@ box.space.test:insert{3} | - [3] | ... -box.info.vclock[0] +assert(box.info.vclock[0] == a + 3) | --- - | - 3 + | - true | ... test_run:cmd('switch default') diff --git a/test/replication/gh-4114-local-space-replication.test.lua b/test/replication/gh-4114-local-space-replication.test.lua index c18fb3b10..65fef3bf6 100644 --- a/test/replication/gh-4114-local-space-replication.test.lua +++ b/test/replication/gh-4114-local-space-replication.test.lua @@ -18,7 +18,7 @@ for i = 1,10 do box.space.test:insert{i} end box.info.vclock[0] == a + 10 or box.info.vclock[0] - a test_run:cmd('switch replica') -box.info.vclock[0] +a = box.info.vclock[0] or 0 box.cfg{checkpoint_count=1} box.space.test:select{} box.space.test:insert{1} @@ -27,7 +27,7 @@ box.space.test:insert{2} box.snapshot() box.space.test:insert{3} -box.info.vclock[0] +assert(box.info.vclock[0] == a + 3) test_run:cmd('switch default') diff --git a/test/replication/gh-6034-election-promote-bump-term.result b/test/replication/gh-6034-election-promote-bump-term.result new file mode 100644 index 000000000..8be4e8243 --- /dev/null +++ b/test/replication/gh-6034-election-promote-bump-term.result @@ -0,0 +1,21 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +-- gh-6034: test that every box.ctl.promote() bumps +-- the instance's term. Even when elections are disabled. +box.cfg{election_mode='off'} + | --- + | ... 
+ +term = box.info.election.term + | --- + | ... +box.ctl.promote() + | --- + | ... +assert(box.info.election.term == term + 1) + | --- + | - true + | ... diff --git a/test/replication/gh-6034-election-promote-bump-term.test.lua b/test/replication/gh-6034-election-promote-bump-term.test.lua new file mode 100644 index 000000000..1e814bf5d --- /dev/null +++ b/test/replication/gh-6034-election-promote-bump-term.test.lua @@ -0,0 +1,9 @@ +test_run = require('test_run').new() + +-- gh-6034: test that every box.ctl.promote() bumps +-- the instance's term. Even when elections are disabled. +box.cfg{election_mode='off'} + +term = box.info.election.term +box.ctl.promote() +assert(box.info.election.term == term + 1) diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index ae146c366..7f9014b22 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -48,6 +48,7 @@ "gh-5613-bootstrap-prefer-booted.test.lua": {}, "gh-6027-applier-error-show.test.lua": {}, "gh-6032-promote-wal-write.test.lua": {}, + "gh-6034-election-promote-bump-term.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "gh-6094-rs-uuid-mismatch.test.lua": {}, "gh-6127-election-join-new.test.lua": {}, -- 2.30.1 (Apple Git-130)
It was allowed to call promote on any instance, even when it is already the limbo owner (i.e. the Raft leader, when elections are enabled). This doesn't break anything when elections are disabled, but is rather strange. When elections are enabled, on the contrary, calling promote() should be a no-op on the leader. Otherwise it would make the leader read-only until it wins the next election round, which is quite inconvenient. Part-of #6034 --- src/box/box.cc | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index ac6c487e4..6a534952f 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1701,6 +1701,9 @@ box_promote(void) */ if (!is_box_configured) return 0; + if (txn_limbo_replica_term(&txn_limbo, instance_id) == + box_raft()->term) + return 0; bool run_elections = false; bool try_wait = false; @@ -1729,10 +1732,6 @@ box_promote(void) "'candidate'", "manual elections"); return -1; } - if (txn_limbo_replica_term(&txn_limbo, instance_id) == - box_raft()->term) - return 0; - break; default: unreachable(); -- 2.30.1 (Apple Git-130)
Follow-up #3055 --- src/box/raft.c | 8 +++- .../gh-3055-promote-wakeup-crash.result | 43 +++++++++++++++++++ .../gh-3055-promote-wakeup-crash.test.lua | 20 +++++++++ test/replication/suite.cfg | 1 + 4 files changed, 70 insertions(+), 2 deletions(-) create mode 100644 test/replication/gh-3055-promote-wakeup-crash.result create mode 100644 test/replication/gh-3055-promote-wakeup-crash.test.lua diff --git a/src/box/raft.c b/src/box/raft.c index b04932cd9..d16ec952a 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -344,13 +344,17 @@ box_raft_wait_leader_found(void) struct trigger trig; trigger_create(&trig, box_raft_wait_leader_found_f, fiber(), NULL); raft_on_update(box_raft(), &trig); - fiber_yield(); + + do { + fiber_yield(); + } while (box_raft()->is_enabled && !fiber_is_cancelled() && + box_raft()->leader == REPLICA_ID_NIL); + trigger_clear(&trig); if (fiber_is_cancelled()) { diag_set(FiberIsCancelled); return -1; } - assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled); return 0; } diff --git a/test/replication/gh-3055-promote-wakeup-crash.result b/test/replication/gh-3055-promote-wakeup-crash.result new file mode 100644 index 000000000..e508611e5 --- /dev/null +++ b/test/replication/gh-3055-promote-wakeup-crash.result @@ -0,0 +1,43 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... +-- +-- gh-3055 follow-up: box.ctl.promote() could crash on an assertion after a +-- spurious wakeup. +-- +_ = box.space._cluster:insert{2, require('uuid').str()} + | --- + | ... +box.cfg{election_mode='manual',\ + replication_synchro_quorum=2,\ + election_timeout=1000} + | --- + | ... + +fiber = require('fiber') + | --- + | ... +f = fiber.create(function() box.ctl.promote() end) + | --- + | ... +f:set_joinable(true) + | --- + | ... +f:wakeup() + | --- + | ... +fiber.yield() + | --- + | ... + +-- Cleanup. +f:cancel() + | --- + | ... +box.cfg{election_mode='off'} + | --- + | ... +test_run:cleanup_cluster() + | --- + | ... 
diff --git a/test/replication/gh-3055-promote-wakeup-crash.test.lua b/test/replication/gh-3055-promote-wakeup-crash.test.lua new file mode 100644 index 000000000..2ac901b08 --- /dev/null +++ b/test/replication/gh-3055-promote-wakeup-crash.test.lua @@ -0,0 +1,20 @@ +test_run = require('test_run').new() +-- +-- gh-3055 follow-up: box.ctl.promote() could crash on an assertion after a +-- spurious wakeup. +-- +_ = box.space._cluster:insert{2, require('uuid').str()} +box.cfg{election_mode='manual',\ + replication_synchro_quorum=2,\ + election_timeout=1000} + +fiber = require('fiber') +f = fiber.create(function() box.ctl.promote() end) +f:set_joinable(true) +f:wakeup() +fiber.yield() + +-- Cleanup. +f:cancel() +box.cfg{election_mode='off'} +test_run:cleanup_cluster() diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 7f9014b22..8b2204e2a 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -3,6 +3,7 @@ "anon_register_gap.test.lua": {}, "gh-2991-misc-asserts-on-update.test.lua": {}, "gh-3055-election-promote.test.lua": {}, + "gh-3055-promote-wakeup-crash.test.lua": {}, "gh-3111-misc-rebootstrap-from-ro-master.test.lua": {}, "gh-3160-misc-heartbeats-on-master-changes.test.lua": {}, "gh-3247-misc-iproto-sequence-value-not-replicated.test.lua": {}, -- 2.30.1 (Apple Git-130)
Part of #6034 --- src/box/box.cc | 15 +--- .../gh-6034-election-candidate-promote.result | 84 +++++++++++++++++++ ...h-6034-election-candidate-promote.test.lua | 42 ++++++++++ test/replication/suite.cfg | 1 + 4 files changed, 128 insertions(+), 14 deletions(-) create mode 100644 test/replication/gh-6034-election-candidate-promote.result create mode 100644 test/replication/gh-6034-election-candidate-promote.test.lua diff --git a/src/box/box.cc b/src/box/box.cc index 6a534952f..9130fc322 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1717,21 +1717,8 @@ box_promote(void) "manual elections"); return -1; case ELECTION_MODE_MANUAL: - if (box_raft()->state == RAFT_STATE_LEADER) - return 0; - run_elections = true; - break; case ELECTION_MODE_CANDIDATE: - /* - * Leader elections are enabled, and this instance is allowed to - * promote only if it's already an elected leader. No manual - * elections. - */ - if (box_raft()->state != RAFT_STATE_LEADER) { - diag_set(ClientError, ER_UNSUPPORTED, "election_mode=" - "'candidate'", "manual elections"); - return -1; - } + run_elections = box_raft()->state != RAFT_STATE_LEADER; break; default: unreachable(); diff --git a/test/replication/gh-6034-election-candidate-promote.result b/test/replication/gh-6034-election-candidate-promote.result new file mode 100644 index 000000000..76a54d5a6 --- /dev/null +++ b/test/replication/gh-6034-election-candidate-promote.result @@ -0,0 +1,84 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'} + | --- + | ... +test_run:create_cluster(SERVERS, 'replication') + | --- + | ... +test_run:wait_fullmesh(SERVERS) + | --- + | ... +is_leader_cmd = 'return box.info.election.state == \'leader\'' + | --- + | ... + +test_run:cmd("setopt delimiter ';'") + | --- + | - true + | ... 
+function get_leader_nr() + local leader_nr = 0 + test_run:wait_cond(function() + for nr = 1,3 do + local is_leader = test_run:eval('election_replica'..nr, is_leader_cmd)[1] + if is_leader then + leader_nr = nr + return true + end + end + return false + end) + return leader_nr +end; + | --- + | ... +test_run:cmd("setopt delimiter ''"); + | --- + | - true + | ... + +leader_nr = get_leader_nr() + | --- + | ... + +assert(leader_nr ~= 0) + | --- + | - true + | ... + +term = test_run:eval('election_replica'..leader_nr,\ + 'return box.info.election.term')[1] + | --- + | ... + +next_nr = leader_nr % 3 + 1 + | --- + | ... +-- Someone else may become a leader, thus promote may fail. But we're testing +-- that it takes effect at all, so that's fine. +_ = pcall(test_run:eval('election_replica'..next_nr, 'box.ctl.promote()')) + | --- + | ... +new_term = test_run:eval('election_replica'..next_nr,\ + 'return box.info.election.term')[1] + | --- + | ... +assert(new_term > term) + | --- + | - true + | ... +leader_nr = get_leader_nr() + | --- + | ... +assert(leader_nr ~= 0) + | --- + | - true + | ... + +test_run:drop_cluster(SERVERS) + | --- + | ... 
diff --git a/test/replication/gh-6034-election-candidate-promote.test.lua b/test/replication/gh-6034-election-candidate-promote.test.lua new file mode 100644 index 000000000..24c57f4cb --- /dev/null +++ b/test/replication/gh-6034-election-candidate-promote.test.lua @@ -0,0 +1,42 @@ +test_run = require('test_run').new() + +SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'} +test_run:create_cluster(SERVERS, 'replication') +test_run:wait_fullmesh(SERVERS) +is_leader_cmd = 'return box.info.election.state == \'leader\'' + +test_run:cmd("setopt delimiter ';'") +function get_leader_nr() + local leader_nr = 0 + test_run:wait_cond(function() + for nr = 1,3 do + local is_leader = test_run:eval('election_replica'..nr, is_leader_cmd)[1] + if is_leader then + leader_nr = nr + return true + end + end + return false + end) + return leader_nr +end; +test_run:cmd("setopt delimiter ''"); + +leader_nr = get_leader_nr() + +assert(leader_nr ~= 0) + +term = test_run:eval('election_replica'..leader_nr,\ + 'return box.info.election.term')[1] + +next_nr = leader_nr % 3 + 1 +-- Someone else may become a leader, thus promote may fail. But we're testing +-- that it takes effect at all, so that's fine. +_ = pcall(test_run:eval('election_replica'..next_nr, 'box.ctl.promote()')) +new_term = test_run:eval('election_replica'..next_nr,\ + 'return box.info.election.term')[1] +assert(new_term > term) +leader_nr = get_leader_nr() +assert(leader_nr ~= 0) + +test_run:drop_cluster(SERVERS) diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 8b2204e2a..6f42db081 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -53,6 +53,7 @@ "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "gh-6094-rs-uuid-mismatch.test.lua": {}, "gh-6127-election-join-new.test.lua": {}, + "gh-6034-election-candidate-promote.test.lua": {}, "gh-6035-applier-filter.test.lua": {}, "*": { "memtx": {"engine": "memtx"}, -- 2.30.1 (Apple Git-130)
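After this patch, `box.ctl.promote()` on a non-leader instance running with `election_mode='candidate'` triggers new elections, the same way it already does for `'manual'`, instead of failing with ER_UNSUPPORTED. A rough console sketch of the observable effect (assumptions: a healthy cluster like the one in the test above; output values are illustrative, not taken from the patch):

```lua
-- On a 'candidate' instance that is currently a follower:
box.info.election.state            -- 'follower'
term = box.info.election.term
-- This used to raise ER_UNSUPPORTED for election_mode='candidate'.
-- Now it starts elections, so the term is bumped even if this
-- instance ends up losing them to another candidate.
pcall(box.ctl.promote)
assert(box.info.election.term > term)
```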
Extract the switch(box_election_mode) { } block to a separate method, box_check_promote_election_mode(), and make it decide whether the instance may issue a promote() and how the promote() has to be performed. This will simplify the code a bit once demote() is introduced. Part-of #6034 --- src/box/box.cc | 44 +++++++++++++++++++++++++++++--------------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 9130fc322..dcfaab29e 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1679,6 +1679,34 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) assert(txn_limbo_is_empty(&txn_limbo)); } +/** + * Check whether this instance may run a promote() and set promote parameters + * according to its election mode. + */ +static int +box_check_promote_election_mode(bool *try_wait, bool *run_elections) +{ + switch (box_election_mode) { + case ELECTION_MODE_OFF: + if (try_wait != NULL) + *try_wait = true; + break; + case ELECTION_MODE_VOTER: + assert(box_raft()->state == RAFT_STATE_FOLLOWER); + diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'", + "manual elections"); + return -1; + case ELECTION_MODE_MANUAL: + case ELECTION_MODE_CANDIDATE: + if (run_elections != NULL) + *run_elections = box_raft()->state != RAFT_STATE_LEADER; + break; + default: + unreachable(); + } + return 0; +} + int box_promote(void) { @@ -1707,22 +1735,8 @@ box_promote(void) bool run_elections = false; bool try_wait = false; - switch (box_election_mode) { - case ELECTION_MODE_OFF: - try_wait = true; - break; - case ELECTION_MODE_VOTER: - assert(box_raft()->state == RAFT_STATE_FOLLOWER); - diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'", - "manual elections"); + if (box_check_promote_election_mode(&try_wait, &run_elections) < 0) return -1; - case ELECTION_MODE_MANUAL: - case ELECTION_MODE_CANDIDATE: - run_elections = box_raft()->state != RAFT_STATE_LEADER; - break; - default: - unreachable(); - } int64_t wait_lsn = -1; --
2.30.1 (Apple Git-130)
Forbid limbo ownership transition without an explicit promote. Make it so that synchronous transactions may be committed only after the queue is claimed by some instance via a PROMOTE request. Make everyone but the limbo owner read-only even when the limbo is empty. Part-of #6034 @TarantoolBot document Title: synchronous replication changes `box.info.synchro.queue` receives a new field: `owner`. It is the replica id of the instance owning the synchronous transaction queue. Once some instance owns the queue, every other instance becomes read-only. When the queue is unclaimed, i.e. `box.info.synchro.queue.owner` is `0`, every instance may be writable, but none may create synchronous transactions. In order to claim or re-claim the queue, you have to issue `box.ctl.promote()` on the instance you wish to promote. When elections are enabled, the instance issues `box.ctl.promote()` automatically once it wins the elections; no additional actions are required. --- src/box/errcode.h | 2 + src/box/lua/info.c | 4 +- src/box/txn_limbo.c | 32 ++--- test/box/alter.result | 2 +- test/box/error.result | 2 + test/replication/gh-5440-qsync-ro.result | 133 --------------------- test/replication/gh-5440-qsync-ro.test.lua | 53 -------- test/replication/suite.cfg | 1 - 8 files changed, 19 insertions(+), 210 deletions(-) delete mode 100644 test/replication/gh-5440-qsync-ro.result delete mode 100644 test/replication/gh-5440-qsync-ro.test.lua diff --git a/src/box/errcode.h b/src/box/errcode.h index 49aec4bf6..d42b64ef4 100644 --- a/src/box/errcode.h +++ b/src/box/errcode.h @@ -278,6 +278,8 @@ struct errcode_record { /*223 */_(ER_INTERFERING_PROMOTE, "Instance with replica id %u was promoted first") \ /*224 */_(ER_RAFT_DISABLED, "Elections were turned off while running box.ctl.promote()")\ /*225 */_(ER_TXN_ROLLBACK, "Transaction was rolled back") \ + /*226 */_(ER_SYNC_QUEUE_UNCLAIMED, "The synchronous transaction queue doesn't belong to any instance")\ + /*227 */_(ER_SYNC_QUEUE_FOREIGN, "The synchronous
transaction queue belongs to other instance with id %u")\ /* * !IMPORTANT! Please follow instructions at start of the file diff --git a/src/box/lua/info.c b/src/box/lua/info.c index 1d8fe7938..ef3d6f5b2 100644 --- a/src/box/lua/info.c +++ b/src/box/lua/info.c @@ -615,9 +615,11 @@ lbox_info_synchro(struct lua_State *L) /* Queue information. */ struct txn_limbo *queue = &txn_limbo; - lua_createtable(L, 0, 1); + lua_createtable(L, 0, 2); lua_pushnumber(L, queue->len); lua_setfield(L, -2, "len"); + lua_pushnumber(L, queue->owner_id); + lua_setfield(L, -2, "owner"); lua_setfield(L, -2, "queue"); return 1; diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 991c47698..c8c4f587c 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -55,7 +55,8 @@ txn_limbo_create(struct txn_limbo *limbo) bool txn_limbo_is_ro(struct txn_limbo *limbo) { - return limbo->owner_id != instance_id && !txn_limbo_is_empty(limbo); + return limbo->owner_id != REPLICA_ID_NIL && + limbo->owner_id != instance_id; } struct txn_limbo_entry * @@ -95,18 +96,18 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn) } if (id == 0) id = instance_id; - bool make_ro = false; - if (limbo->owner_id != id) { - if (rlist_empty(&limbo->queue)) { - limbo->owner_id = id; - limbo->confirmed_lsn = 0; - if (id != instance_id) - make_ro = true; + if (limbo->owner_id == REPLICA_ID_NIL) { + diag_set(ClientError, ER_SYNC_QUEUE_UNCLAIMED); + return NULL; + } else if (limbo->owner_id != id) { + if (txn_limbo_is_empty(limbo)) { + diag_set(ClientError, ER_SYNC_QUEUE_FOREIGN, + limbo->owner_id); } else { diag_set(ClientError, ER_UNCOMMITTED_FOREIGN_SYNC_TXNS, limbo->owner_id); - return NULL; } + return NULL; } size_t size; struct txn_limbo_entry *e = region_alloc_object(&txn->region, @@ -122,12 +123,6 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn) e->is_rollback = false; rlist_add_tail_entry(&limbo->queue, e, in_queue); limbo->len++; - /* - * We added new entries 
from a remote instance to an empty limbo. - * Time to make this instance read-only. - */ - if (make_ro) - box_update_ro_summary(); return e; } @@ -442,9 +437,6 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) assert(e->txn->signature >= 0); txn_complete_success(e->txn); } - /* Update is_ro once the limbo is clear. */ - if (txn_limbo_is_empty(limbo)) - box_update_ro_summary(); } /** @@ -492,9 +484,6 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn) if (e == last_rollback) break; } - /* Update is_ro once the limbo is clear. */ - if (txn_limbo_is_empty(limbo)) - box_update_ro_summary(); } void @@ -525,6 +514,7 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id, txn_limbo_read_rollback(limbo, lsn + 1); assert(txn_limbo_is_empty(&txn_limbo)); limbo->owner_id = replica_id; + box_update_ro_summary(); limbo->confirmed_lsn = 0; } diff --git a/test/box/alter.result b/test/box/alter.result index a7bffce10..6a64f6b84 100644 --- a/test/box/alter.result +++ b/test/box/alter.result @@ -1464,7 +1464,7 @@ assert(s.is_sync) ... s:replace{1} --- -- error: Quorum collection for a synchronous transaction is timed out +- error: The synchronous transaction queue doesn't belong to any instance ... -- When not specified or nil - ignored. s:alter({is_sync = nil}) diff --git a/test/box/error.result b/test/box/error.result index 062a90399..8f783f927 100644 --- a/test/box/error.result +++ b/test/box/error.result @@ -444,6 +444,8 @@ t; | 223: box.error.INTERFERING_PROMOTE | 224: box.error.RAFT_DISABLED | 225: box.error.TXN_ROLLBACK + | 226: box.error.SYNC_QUEUE_UNCLAIMED + | 227: box.error.SYNC_QUEUE_FOREIGN | ... 
test_run:cmd("setopt delimiter ''"); diff --git a/test/replication/gh-5440-qsync-ro.result b/test/replication/gh-5440-qsync-ro.result deleted file mode 100644 index 1ece26a42..000000000 --- a/test/replication/gh-5440-qsync-ro.result +++ /dev/null @@ -1,133 +0,0 @@ --- test-run result file version 2 --- --- gh-5440 everyone but the limbo owner is read-only on non-empty limbo. --- -env = require('test_run') - | --- - | ... -test_run = env.new() - | --- - | ... -fiber = require('fiber') - | --- - | ... - -box.schema.user.grant('guest', 'replication') - | --- - | ... -test_run:cmd('create server replica with rpl_master=default, script="replication/replica.lua"') - | --- - | - true - | ... -test_run:cmd('start server replica with wait=True, wait_load=True') - | --- - | - true - | ... - -_ = box.schema.space.create('test', {is_sync=true}) - | --- - | ... -_ = box.space.test:create_index('pk') - | --- - | ... - -old_synchro_quorum = box.cfg.replication_synchro_quorum - | --- - | ... -old_synchro_timeout = box.cfg.replication_synchro_timeout - | --- - | ... - --- Make sure that the master stalls on commit leaving the limbo non-empty. -box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000} - | --- - | ... - -f = fiber.new(function() box.space.test:insert{1} end) - | --- - | ... -f:status() - | --- - | - suspended - | ... - --- Wait till replica's limbo is non-empty. -test_run:wait_lsn('replica', 'default') - | --- - | ... -test_run:cmd('switch replica') - | --- - | - true - | ... - -box.info.ro - | --- - | - true - | ... -box.space.test:insert{2} - | --- - | - error: Can't modify data because this instance is in read-only mode. - | ... -success = false - | --- - | ... -f = require('fiber').new(function() box.ctl.wait_rw() success = true end) - | --- - | ... -f:status() - | --- - | - suspended - | ... - -test_run:cmd('switch default') - | --- - | - true - | ... - --- Empty the limbo. -box.cfg{replication_synchro_quorum=2} - | --- - | ... 
- -test_run:cmd('switch replica') - | --- - | - true - | ... - -test_run:wait_cond(function() return success end) - | --- - | - true - | ... -box.info.ro - | --- - | - false - | ... --- Should succeed now. -box.space.test:insert{2} - | --- - | - [2] - | ... - --- Cleanup. -test_run:cmd('switch default') - | --- - | - true - | ... -box.cfg{replication_synchro_quorum=old_synchro_quorum,\ - replication_synchro_timeout=old_synchro_timeout} - | --- - | ... -box.space.test:drop() - | --- - | ... -test_run:cmd('stop server replica') - | --- - | - true - | ... -test_run:cmd('delete server replica') - | --- - | - true - | ... -box.schema.user.revoke('guest', 'replication') - | --- - | ... diff --git a/test/replication/gh-5440-qsync-ro.test.lua b/test/replication/gh-5440-qsync-ro.test.lua deleted file mode 100644 index d63ec9c1e..000000000 --- a/test/replication/gh-5440-qsync-ro.test.lua +++ /dev/null @@ -1,53 +0,0 @@ --- --- gh-5440 everyone but the limbo owner is read-only on non-empty limbo. --- -env = require('test_run') -test_run = env.new() -fiber = require('fiber') - -box.schema.user.grant('guest', 'replication') -test_run:cmd('create server replica with rpl_master=default, script="replication/replica.lua"') -test_run:cmd('start server replica with wait=True, wait_load=True') - -_ = box.schema.space.create('test', {is_sync=true}) -_ = box.space.test:create_index('pk') - -old_synchro_quorum = box.cfg.replication_synchro_quorum -old_synchro_timeout = box.cfg.replication_synchro_timeout - --- Make sure that the master stalls on commit leaving the limbo non-empty. -box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000} - -f = fiber.new(function() box.space.test:insert{1} end) -f:status() - --- Wait till replica's limbo is non-empty. 
-test_run:wait_lsn('replica', 'default') -test_run:cmd('switch replica') - -box.info.ro -box.space.test:insert{2} -success = false -f = require('fiber').new(function() box.ctl.wait_rw() success = true end) -f:status() - -test_run:cmd('switch default') - --- Empty the limbo. -box.cfg{replication_synchro_quorum=2} - -test_run:cmd('switch replica') - -test_run:wait_cond(function() return success end) -box.info.ro --- Should succeed now. -box.space.test:insert{2} - --- Cleanup. -test_run:cmd('switch default') -box.cfg{replication_synchro_quorum=old_synchro_quorum,\ - replication_synchro_timeout=old_synchro_timeout} -box.space.test:drop() -test_run:cmd('stop server replica') -test_run:cmd('delete server replica') -box.schema.user.revoke('guest', 'replication') diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 6f42db081..7a6fd7052 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -42,7 +42,6 @@ "gh-4739-vclock-assert.test.lua": {}, "gh-4730-applier-rollback.test.lua": {}, "gh-4928-tx-boundaries.test.lua": {}, - "gh-5440-qsync-ro.test.lua": {}, "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {}, "gh-5536-wal-limit.test.lua": {}, "gh-5566-final-join-synchro.test.lua": {}, -- 2.30.1 (Apple Git-130)
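The new ownership rules are easiest to see from the console. A minimal sketch (assumptions: a freshly bootstrapped single instance with default synchro settings; the space name `s` is illustrative):

```lua
s = box.schema.space.create('s', {is_sync = true})
_ = s:create_index('pk')
box.info.synchro.queue.owner  -- 0: the queue is unclaimed
s:replace{1}                  -- fails with ER_SYNC_QUEUE_UNCLAIMED
box.ctl.promote()             -- claim the queue for this instance
box.info.synchro.queue.owner  -- now equals box.info.id
s:replace{1}                  -- succeeds
```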
Introduce a new journal entry, DEMOTE. The entry has the same meaning as PROMOTE, with the only difference that it clears limbo ownership instead of transferring it to the issuer. Introduce `box.ctl.demote`, a counterpart to `box.ctl.promote`, which clears the limbo ownership (when elections are off) by writing a DEMOTE entry, or simply makes this instance step down from the leader's role (when elections are enabled in any mode). A new request was necessary instead of simply writing PROMOTE(origin_id = 0), because origin_id is deduced from row.replica_id, which cannot be 0 for replicated rows (it's always equal to instance_id of the row originator). Closes #6034 @TarantoolBot document Title: box.ctl.demote `box.ctl.demote()` is a counterpart to `box.ctl.promote()` which works as follows: - when `box.cfg.election_mode` is not 'off': make the instance give up leadership. - when `box.cfg.election_mode` is 'off': write a DEMOTE entry to WAL. `box.ctl.demote()` may only be issued on the synchronous transaction queue owner (i.e. the leader when elections are enabled). A DEMOTE request (DEMOTE = 32) copies PROMOTE behaviour (it clears the limbo as well), but clears the synchronous transaction queue ownership instead of assigning it to a new instance.
--- src/box/box.cc | 99 ++++++++-- src/box/box.h | 3 + src/box/iproto_constants.h | 10 +- src/box/lua/ctl.c | 9 + src/box/txn_limbo.c | 37 +++- src/box/txn_limbo.h | 7 + test/replication/election_basic.result | 3 + test/replication/election_basic.test.lua | 1 + test/replication/election_qsync.result | 3 + test/replication/election_qsync.test.lua | 1 + .../gh-5140-qsync-casc-rollback.result | 6 + .../gh-5140-qsync-casc-rollback.test.lua | 2 + .../gh-5144-qsync-dup-confirm.result | 6 + .../gh-5144-qsync-dup-confirm.test.lua | 2 + .../gh-5163-qsync-restart-crash.result | 6 + .../gh-5163-qsync-restart-crash.test.lua | 2 + .../gh-5167-qsync-rollback-snap.result | 6 + .../gh-5167-qsync-rollback-snap.test.lua | 2 + .../gh-5195-qsync-replica-write.result | 10 +- .../gh-5195-qsync-replica-write.test.lua | 6 +- .../gh-5213-qsync-applier-order-3.result | 9 + .../gh-5213-qsync-applier-order-3.test.lua | 3 + .../gh-5213-qsync-applier-order.result | 6 + .../gh-5213-qsync-applier-order.test.lua | 2 + .../replication/gh-5288-qsync-recovery.result | 6 + .../gh-5288-qsync-recovery.test.lua | 2 + .../gh-5298-qsync-recovery-snap.result | 6 + .../gh-5298-qsync-recovery-snap.test.lua | 2 + .../gh-5426-election-on-off.result | 3 + .../gh-5426-election-on-off.test.lua | 1 + .../gh-5433-election-restart-recovery.result | 3 + ...gh-5433-election-restart-recovery.test.lua | 1 + ...sync-clear-synchro-queue-commit-all.result | 3 + ...nc-clear-synchro-queue-commit-all.test.lua | 1 + .../replication/gh-5438-election-state.result | 3 + .../gh-5438-election-state.test.lua | 1 + .../gh-5446-qsync-eval-quorum.result | 6 + .../gh-5446-qsync-eval-quorum.test.lua | 2 + .../gh-5506-election-on-off.result | 3 + .../gh-5506-election-on-off.test.lua | 1 + .../gh-5566-final-join-synchro.result | 6 + .../gh-5566-final-join-synchro.test.lua | 2 + .../gh-5874-qsync-txn-recovery.result | 6 + .../gh-5874-qsync-txn-recovery.test.lua | 2 + .../gh-6032-promote-wal-write.result | 3 + 
.../gh-6032-promote-wal-write.test.lua | 1 + .../gh-6034-election-promote-bump-term.result | 5 + ...h-6034-election-promote-bump-term.test.lua | 3 + .../gh-6034-qsync-limbo-ownership.result | 186 ++++++++++++++++++ .../gh-6034-qsync-limbo-ownership.test.lua | 68 +++++++ .../gh-6057-qsync-confirm-async-no-wal.result | 7 + ...h-6057-qsync-confirm-async-no-wal.test.lua | 3 + test/replication/hang_on_synchro_fail.result | 6 + .../replication/hang_on_synchro_fail.test.lua | 2 + test/replication/qsync_advanced.result | 12 ++ test/replication/qsync_advanced.test.lua | 4 + test/replication/qsync_basic.result | 33 ++-- test/replication/qsync_basic.test.lua | 16 +- test/replication/qsync_errinj.result | 6 + test/replication/qsync_errinj.test.lua | 2 + test/replication/qsync_snapshots.result | 6 + test/replication/qsync_snapshots.test.lua | 2 + test/replication/qsync_with_anon.result | 6 + test/replication/qsync_with_anon.test.lua | 2 + test/replication/suite.cfg | 1 + 65 files changed, 623 insertions(+), 52 deletions(-) create mode 100644 test/replication/gh-6034-qsync-limbo-ownership.result create mode 100644 test/replication/gh-6034-qsync-limbo-ownership.test.lua diff --git a/src/box/box.cc b/src/box/box.cc index dcfaab29e..f68fffcab 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1679,6 +1679,25 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) assert(txn_limbo_is_empty(&txn_limbo)); } +/** Write and process a DEMOTE request. 
*/ +static void +box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn) +{ + assert(box_raft()->volatile_term == box_raft()->term); + assert(promote_lsn >= 0); + txn_limbo_write_demote(&txn_limbo, promote_lsn, + box_raft()->term); + struct synchro_request req = { + .type = IPROTO_DEMOTE, + .replica_id = prev_leader_id, + .origin_id = instance_id, + .lsn = promote_lsn, + .term = box_raft()->term, + }; + txn_limbo_process(&txn_limbo, &req); + assert(txn_limbo_is_empty(&txn_limbo)); +} + /** * Check whether this instance may run a promote() and set promote parameters * according to its election mode. @@ -1707,31 +1726,37 @@ box_check_promote_election_mode(bool *try_wait, bool *run_elections) return 0; } +/* A guard to block multiple simultaneous promote()/demote() invocations. */ +static bool box_in_promote = false; + int box_promote(void) { - /* A guard to block multiple simultaneous function invocations. */ - static bool in_promote = false; - if (in_promote) { + if (box_in_promote) { diag_set(ClientError, ER_UNSUPPORTED, "box.ctl.promote", "simultaneous invocations"); return -1; } - in_promote = true; + box_in_promote = true; auto promote_guard = make_scoped_guard([&] { - in_promote = false; + box_in_promote = false; }); - /* - * Do nothing when box isn't configured and when PROMOTE was already - * written for this term (synchronous replication and leader election - * are in sync, and both chose this node as a leader). - */ if (!is_box_configured) return 0; - if (txn_limbo_replica_term(&txn_limbo, instance_id) == - box_raft()->term) + + /* + * Currently active leader (the instance that is seen as leader by both + * raft and txn_limbo) can't issue another PROMOTE. 
+ */ + bool is_leader = txn_limbo_replica_term(&txn_limbo, instance_id) == + box_raft()->term && txn_limbo.owner_id == instance_id; + if (box_election_mode != ELECTION_MODE_OFF) + is_leader = is_leader && box_raft()->state == RAFT_STATE_LEADER; + + if (is_leader) return 0; + bool run_elections = false; bool try_wait = false; @@ -1752,6 +1777,56 @@ box_promote(void) return -1; box_issue_promote(txn_limbo.owner_id, wait_lsn); + + return 0; +} + +int +box_demote(void) +{ + if (box_in_promote) { + diag_set(ClientError, ER_UNSUPPORTED, "box.ctl.demote", + "simultaneous invocations"); + return -1; + } + box_in_promote = true; + auto promote_guard = make_scoped_guard([&] { + box_in_promote = false; + }); + + if (!is_box_configured) + return 0; + + /* Currently active leader is the only one who can issue a DEMOTE. */ + bool is_leader = txn_limbo_replica_term(&txn_limbo, instance_id) == + box_raft()->term && txn_limbo.owner_id == instance_id; + if (box_election_mode != ELECTION_MODE_OFF) + is_leader = is_leader && box_raft()->state == RAFT_STATE_LEADER; + if (!is_leader) + return 0; + + bool try_wait = false; + + if (box_check_promote_election_mode(&try_wait, NULL) < 0) + return -1; + + int64_t wait_lsn = -1; + + if (box_trigger_elections() < 0) + return -1; + + if (box_election_mode != ELECTION_MODE_OFF) + return 0; + + if (try_wait && + box_try_wait_confirm(2 * replication_synchro_timeout) < 0) + return -1; + + if ((wait_lsn = box_wait_limbo_acked()) < 0) + return -1; + + box_issue_demote(txn_limbo.owner_id, wait_lsn); + return 0; } diff --git a/src/box/box.h b/src/box/box.h index ecf32240d..aaf20d9dd 100644 --- a/src/box/box.h +++ b/src/box/box.h @@ -276,6 +276,9 @@ typedef struct tuple box_tuple_t; int box_promote(void); +int +box_demote(void); + /* box_select is private and used only by FFI */ API_EXPORT int box_select(uint32_t space_id, uint32_t index_id, diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index e913801a8..247ca6f37 100644 --- 
a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -241,6 +241,8 @@ enum iproto_type { IPROTO_RAFT = 30, /** PROMOTE request. */ IPROTO_PROMOTE = 31, + /** DEMOTE request. */ + IPROTO_DEMOTE = 32, /** A confirmation message for synchronous transactions. */ IPROTO_CONFIRM = 40, @@ -312,6 +314,8 @@ iproto_type_name(uint16_t type) return "RAFT"; case IPROTO_PROMOTE: return "PROMOTE"; + case IPROTO_DEMOTE: + return "DEMOTE"; case IPROTO_CONFIRM: return "CONFIRM"; case IPROTO_ROLLBACK: @@ -366,14 +370,14 @@ static inline bool iproto_type_is_synchro_request(uint16_t type) { return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK || - type == IPROTO_PROMOTE; + type == IPROTO_PROMOTE || type == IPROTO_DEMOTE; } -/** PROMOTE entry (synchronous replication and leader elections). */ +/** PROMOTE/DEMOTE entry (synchronous replication and leader elections). */ static inline bool iproto_type_is_promote_request(uint32_t type) { - return type == IPROTO_PROMOTE; + return type == IPROTO_PROMOTE || type == IPROTO_DEMOTE; } static inline bool diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c index 368b9ab60..a613c4111 100644 --- a/src/box/lua/ctl.c +++ b/src/box/lua/ctl.c @@ -89,6 +89,14 @@ lbox_ctl_promote(struct lua_State *L) return 0; } +static int +lbox_ctl_demote(struct lua_State *L) +{ + if (box_demote() != 0) + return luaT_error(L); + return 0; +} + static int lbox_ctl_is_recovery_finished(struct lua_State *L) { @@ -127,6 +135,7 @@ static const struct luaL_Reg lbox_ctl_lib[] = { {"promote", lbox_ctl_promote}, /* An old alias. 
*/ {"clear_synchro_queue", lbox_ctl_promote}, +	{"demote", lbox_ctl_demote}, {"is_recovery_finished", lbox_ctl_is_recovery_finished}, {"set_on_shutdown_timeout", lbox_ctl_set_on_shutdown_timeout}, {NULL, NULL} diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index c8c4f587c..570f77c46 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -518,6 +518,29 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id, limbo->confirmed_lsn = 0; } +void +txn_limbo_write_demote(struct txn_limbo *limbo, int64_t lsn, uint64_t term) +{ + limbo->confirmed_lsn = lsn; + limbo->is_in_rollback = true; + struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo); + assert(e == NULL || e->lsn <= lsn); + (void)e; + txn_limbo_write_synchro(limbo, IPROTO_DEMOTE, lsn, term); + limbo->is_in_rollback = false; +} + +/** + * Process a DEMOTE request, which is like PROMOTE, but clears the limbo + * ownership. + * @sa txn_limbo_read_promote. + */ +static void +txn_limbo_read_demote(struct txn_limbo *limbo, int64_t lsn) +{ + return txn_limbo_read_promote(limbo, REPLICA_ID_NIL, lsn); +} + void txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn) { @@ -709,12 +732,13 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) vclock_follow(&limbo->promote_term_map, origin, term); if (term > limbo->promote_greatest_term) limbo->promote_greatest_term = term; - } else if (req->type == IPROTO_PROMOTE && + } else if (iproto_type_is_promote_request(req->type) && limbo->promote_greatest_term > 1) { /* PROMOTE for outdated term. Ignore. */ - say_info("RAFT: ignoring PROMOTE request from instance " + say_info("RAFT: ignoring %s request from instance " "id %u for term %llu.
Greatest term seen " - "before (%llu) is bigger.", origin, (long long)term, + "before (%llu) is bigger.", + iproto_type_name(req->type), origin, (long long)term, (long long)limbo->promote_greatest_term); return; } @@ -725,7 +749,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) * The limbo was empty on the instance issuing the request. * This means this instance must empty its limbo as well. */ - assert(lsn == 0 && req->type == IPROTO_PROMOTE); + assert(lsn == 0 && iproto_type_is_promote_request(req->type)); } else if (req->replica_id != limbo->owner_id) { /* * Ignore CONFIRM/ROLLBACK messages for a foreign master. @@ -733,7 +757,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) * data from an old leader, who has just started and written * confirm right on synchronous transaction recovery. */ - if (req->type != IPROTO_PROMOTE) + if (!iproto_type_is_promote_request(req->type)) return; /* * Promote has a bigger term, and tries to steal the limbo. It @@ -753,6 +777,9 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) case IPROTO_PROMOTE: txn_limbo_read_promote(limbo, req->origin_id, lsn); break; + case IPROTO_DEMOTE: + txn_limbo_read_demote(limbo, lsn); + break; default: unreachable(); } diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index 7151843f4..53e52f676 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -329,6 +329,13 @@ txn_limbo_checkpoint(const struct txn_limbo *limbo, void txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term); +/** + * Write a DEMOTE request. + * It has the same effect as PROMOTE and additionally clears limbo ownership. + */ +void +txn_limbo_write_demote(struct txn_limbo *limbo, int64_t lsn, uint64_t term); + /** * Update qsync parameters dynamically. 
*/ diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result index b64028c60..c0323a042 100644 --- a/test/replication/election_basic.result +++ b/test/replication/election_basic.result @@ -114,6 +114,9 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... -- -- See if bootstrap with election enabled works. diff --git a/test/replication/election_basic.test.lua b/test/replication/election_basic.test.lua index 77fdf6340..085a499d5 100644 --- a/test/replication/election_basic.test.lua +++ b/test/replication/election_basic.test.lua @@ -43,6 +43,7 @@ box.cfg{ election_mode = 'off', \ election_timeout = old_election_timeout \ } +box.ctl.demote() -- -- See if bootstrap with election enabled works. diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result index c06400b38..2402c8578 100644 --- a/test/replication/election_qsync.result +++ b/test/replication/election_qsync.result @@ -165,6 +165,9 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... box.schema.user.revoke('guest', 'super') | --- | ... diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua index ea6fc4a61..e1aca8351 100644 --- a/test/replication/election_qsync.test.lua +++ b/test/replication/election_qsync.test.lua @@ -84,4 +84,5 @@ box.cfg{ replication = old_replication, \ replication_synchro_timeout = old_replication_synchro_timeout, \ } +box.ctl.demote() box.schema.user.revoke('guest', 'super') diff --git a/test/replication/gh-5140-qsync-casc-rollback.result b/test/replication/gh-5140-qsync-casc-rollback.result index da77631dd..d3208e1a4 100644 --- a/test/replication/gh-5140-qsync-casc-rollback.result +++ b/test/replication/gh-5140-qsync-casc-rollback.result @@ -73,6 +73,9 @@ _ = box.schema.space.create('async', {is_sync=false, engine = engine}) _ = _:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Write something to flush the master state to replica. 
box.space.sync:replace{1} | --- @@ -222,3 +225,6 @@ test_run:cmd('delete server replica') box.schema.user.revoke('guest', 'super') | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5140-qsync-casc-rollback.test.lua b/test/replication/gh-5140-qsync-casc-rollback.test.lua index 69fc9ad02..96ddfd260 100644 --- a/test/replication/gh-5140-qsync-casc-rollback.test.lua +++ b/test/replication/gh-5140-qsync-casc-rollback.test.lua @@ -48,6 +48,7 @@ _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = _:create_index('pk') _ = box.schema.space.create('async', {is_sync=false, engine = engine}) _ = _:create_index('pk') +box.ctl.promote() -- Write something to flush the master state to replica. box.space.sync:replace{1} @@ -103,3 +104,4 @@ test_run:cmd('stop server replica') test_run:cmd('delete server replica') box.schema.user.revoke('guest', 'super') +box.ctl.demote() diff --git a/test/replication/gh-5144-qsync-dup-confirm.result b/test/replication/gh-5144-qsync-dup-confirm.result index 9d265d9ff..217e44412 100644 --- a/test/replication/gh-5144-qsync-dup-confirm.result +++ b/test/replication/gh-5144-qsync-dup-confirm.result @@ -46,6 +46,9 @@ _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = _:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Remember the current LSN. In the end, when the following synchronous -- transaction is committed, result LSN should be this value +2: for the @@ -148,6 +151,9 @@ test_run:cmd('delete server replica2') | - true | ... +box.ctl.demote() + | --- + | ... box.schema.user.revoke('guest', 'super') | --- | ... 
diff --git a/test/replication/gh-5144-qsync-dup-confirm.test.lua b/test/replication/gh-5144-qsync-dup-confirm.test.lua index 01a8351e0..1d6af2c62 100644 --- a/test/replication/gh-5144-qsync-dup-confirm.test.lua +++ b/test/replication/gh-5144-qsync-dup-confirm.test.lua @@ -19,6 +19,7 @@ box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000} _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = _:create_index('pk') +box.ctl.promote() -- Remember the current LSN. In the end, when the following synchronous -- transaction is committed, result LSN should be this value +2: for the @@ -69,4 +70,5 @@ test_run:cmd('delete server replica1') test_run:cmd('stop server replica2') test_run:cmd('delete server replica2') +box.ctl.demote() box.schema.user.revoke('guest', 'super') diff --git a/test/replication/gh-5163-qsync-restart-crash.result b/test/replication/gh-5163-qsync-restart-crash.result index e57bc76d1..1b4d3d9b5 100644 --- a/test/replication/gh-5163-qsync-restart-crash.result +++ b/test/replication/gh-5163-qsync-restart-crash.result @@ -16,6 +16,9 @@ _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... box.space.sync:replace{1} | --- @@ -30,3 +33,6 @@ box.space.sync:select{} box.space.sync:drop() | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/gh-5163-qsync-restart-crash.test.lua b/test/replication/gh-5163-qsync-restart-crash.test.lua index d5aca4749..c8d54aad2 100644 --- a/test/replication/gh-5163-qsync-restart-crash.test.lua +++ b/test/replication/gh-5163-qsync-restart-crash.test.lua @@ -7,8 +7,10 @@ engine = test_run:get_cfg('engine') -- _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() box.space.sync:replace{1} test_run:cmd('restart server default') box.space.sync:select{} box.space.sync:drop() +box.ctl.demote() diff --git a/test/replication/gh-5167-qsync-rollback-snap.result b/test/replication/gh-5167-qsync-rollback-snap.result index 06f58526c..13166720f 100644 --- a/test/replication/gh-5167-qsync-rollback-snap.result +++ b/test/replication/gh-5167-qsync-rollback-snap.result @@ -41,6 +41,9 @@ _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Write something to flush the current master's state to replica. _ = box.space.sync:insert{1} | --- @@ -163,3 +166,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5167-qsync-rollback-snap.test.lua b/test/replication/gh-5167-qsync-rollback-snap.test.lua index 475727e61..1a2a31b7c 100644 --- a/test/replication/gh-5167-qsync-rollback-snap.test.lua +++ b/test/replication/gh-5167-qsync-rollback-snap.test.lua @@ -16,6 +16,7 @@ fiber = require('fiber') box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000} _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() -- Write something to flush the current master's state to replica. 
_ = box.space.sync:insert{1} _ = box.space.sync:delete{1} @@ -65,3 +66,4 @@ box.cfg{ replication_synchro_quorum = orig_synchro_quorum, \ replication_synchro_timeout = orig_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/gh-5195-qsync-replica-write.result b/test/replication/gh-5195-qsync-replica-write.result index 85e00e6ed..bc73bb599 100644 --- a/test/replication/gh-5195-qsync-replica-write.result +++ b/test/replication/gh-5195-qsync-replica-write.result @@ -40,6 +40,9 @@ _ = box.schema.space.create('sync', {engine = engine, is_sync = true}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... box.cfg{replication_synchro_timeout = 1000, replication_synchro_quorum = 3} | --- @@ -71,12 +74,12 @@ test_run:wait_lsn('replica', 'default') | --- | ... -- Normal DML is blocked - the limbo is not empty and does not belong to the --- replica. But synchro queue cleanup also does a WAL write, and propagates LSN +-- replica. But promote also does a WAL write, and propagates LSN -- of the instance. box.cfg{replication_synchro_timeout = 0.001} | --- | ... -box.ctl.clear_synchro_queue() +box.ctl.promote() | --- | ... @@ -144,6 +147,9 @@ test_run:cmd('delete server replica') | - true | ... +box.ctl.demote() + | --- + | ... box.space.sync:drop() | --- | ... 
diff --git a/test/replication/gh-5195-qsync-replica-write.test.lua b/test/replication/gh-5195-qsync-replica-write.test.lua index 64c48be99..a59ec154e 100644 --- a/test/replication/gh-5195-qsync-replica-write.test.lua +++ b/test/replication/gh-5195-qsync-replica-write.test.lua @@ -17,6 +17,7 @@ test_run:cmd('start server replica with wait=True, wait_load=True') -- _ = box.schema.space.create('sync', {engine = engine, is_sync = true}) _ = box.space.sync:create_index('pk') +box.ctl.promote() box.cfg{replication_synchro_timeout = 1000, replication_synchro_quorum = 3} lsn = box.info.lsn @@ -30,10 +31,10 @@ test_run:wait_cond(function() return box.info.lsn == lsn end) test_run:switch('replica') test_run:wait_lsn('replica', 'default') -- Normal DML is blocked - the limbo is not empty and does not belong to the --- replica. But synchro queue cleanup also does a WAL write, and propagates LSN +-- replica. But promote also does a WAL write, and propagates LSN -- of the instance. box.cfg{replication_synchro_timeout = 0.001} -box.ctl.clear_synchro_queue() +box.ctl.promote() test_run:switch('default') -- Wait second ACK receipt. @@ -59,6 +60,7 @@ test_run:switch('default') test_run:cmd('stop server replica') test_run:cmd('delete server replica') +box.ctl.demote() box.space.sync:drop() box.schema.user.revoke('guest', 'super') diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result index bcb18b5c0..e788eec77 100644 --- a/test/replication/gh-5213-qsync-applier-order-3.result +++ b/test/replication/gh-5213-qsync-applier-order-3.result @@ -45,6 +45,9 @@ s = box.schema.space.create('test', {is_sync = true}) _ = s:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... test_run:cmd('create server replica1 with rpl_master=default,\ script="replication/replica1.lua"') @@ -179,6 +182,9 @@ box.cfg{ -- Replica2 takes the limbo ownership and sends the transaction to the replica1. 
-- Along with the CONFIRM from the default node, which is still not applied -- on the replica1. +box.ctl.promote() + | --- + | ... fiber = require('fiber') | --- | ... @@ -261,3 +267,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua index 37b569da7..304656de0 100644 --- a/test/replication/gh-5213-qsync-applier-order-3.test.lua +++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua @@ -30,6 +30,7 @@ box.schema.user.grant('guest', 'super') s = box.schema.space.create('test', {is_sync = true}) _ = s:create_index('pk') +box.ctl.promote() test_run:cmd('create server replica1 with rpl_master=default,\ script="replication/replica1.lua"') @@ -90,6 +91,7 @@ box.cfg{ -- Replica2 takes the limbo ownership and sends the transaction to the replica1. -- Along with the CONFIRM from the default node, which is still not applied -- on the replica1. +box.ctl.promote() fiber = require('fiber') f = fiber.new(function() box.space.test:replace{2} end) @@ -123,3 +125,4 @@ box.cfg{ replication_synchro_quorum = old_synchro_quorum, \ replication_synchro_timeout = old_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result index a8c24c289..ba6cdab06 100644 --- a/test/replication/gh-5213-qsync-applier-order.result +++ b/test/replication/gh-5213-qsync-applier-order.result @@ -29,6 +29,9 @@ s = box.schema.space.create('test', {is_sync = true}) _ = s:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... test_run:cmd('create server replica with rpl_master=default,\ script="replication/gh-5213-replica.lua"') @@ -300,3 +303,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua index f1eccfa84..39b1912e8 100644 --- a/test/replication/gh-5213-qsync-applier-order.test.lua +++ b/test/replication/gh-5213-qsync-applier-order.test.lua @@ -14,6 +14,7 @@ box.schema.user.grant('guest', 'super') s = box.schema.space.create('test', {is_sync = true}) _ = s:create_index('pk') +box.ctl.promote() test_run:cmd('create server replica with rpl_master=default,\ script="replication/gh-5213-replica.lua"') @@ -120,3 +121,4 @@ box.cfg{ replication_synchro_quorum = old_synchro_quorum, \ replication_synchro_timeout = old_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/gh-5288-qsync-recovery.result b/test/replication/gh-5288-qsync-recovery.result index dc0babef6..704b71d93 100644 --- a/test/replication/gh-5288-qsync-recovery.result +++ b/test/replication/gh-5288-qsync-recovery.result @@ -12,6 +12,9 @@ s = box.schema.space.create('sync', {is_sync = true}) _ = s:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... s:insert{1} | --- | - [1] @@ -25,3 +28,6 @@ test_run:cmd('restart server default') box.space.sync:drop() | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/gh-5288-qsync-recovery.test.lua b/test/replication/gh-5288-qsync-recovery.test.lua index 00bff7b87..2455f7278 100644 --- a/test/replication/gh-5288-qsync-recovery.test.lua +++ b/test/replication/gh-5288-qsync-recovery.test.lua @@ -5,7 +5,9 @@ test_run = require('test_run').new() -- s = box.schema.space.create('sync', {is_sync = true}) _ = s:create_index('pk') +box.ctl.promote() s:insert{1} box.snapshot() test_run:cmd('restart server default') box.space.sync:drop() +box.ctl.demote() diff --git a/test/replication/gh-5298-qsync-recovery-snap.result b/test/replication/gh-5298-qsync-recovery-snap.result index 922831552..0883fe5f5 100644 --- a/test/replication/gh-5298-qsync-recovery-snap.result +++ b/test/replication/gh-5298-qsync-recovery-snap.result @@ -17,6 +17,9 @@ _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... for i = 1, 10 do box.space.sync:replace{i} end | --- | ... @@ -98,3 +101,6 @@ box.space.sync:drop() box.space.loc:drop() | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5298-qsync-recovery-snap.test.lua b/test/replication/gh-5298-qsync-recovery-snap.test.lua index 187f60d75..084cde963 100644 --- a/test/replication/gh-5298-qsync-recovery-snap.test.lua +++ b/test/replication/gh-5298-qsync-recovery-snap.test.lua @@ -8,6 +8,7 @@ engine = test_run:get_cfg('engine') -- _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() for i = 1, 10 do box.space.sync:replace{i} end -- Local rows could affect this by increasing the signature. 
@@ -46,3 +47,4 @@ box.cfg{ } box.space.sync:drop() box.space.loc:drop() +box.ctl.demote() diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result index 7444ef7f2..2bdc17ec6 100644 --- a/test/replication/gh-5426-election-on-off.result +++ b/test/replication/gh-5426-election-on-off.result @@ -168,6 +168,9 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... box.schema.user.revoke('guest', 'super') | --- | ... diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua index bdf06903b..6277e9ef2 100644 --- a/test/replication/gh-5426-election-on-off.test.lua +++ b/test/replication/gh-5426-election-on-off.test.lua @@ -69,4 +69,5 @@ box.cfg{ election_mode = old_election_mode, \ replication_timeout = old_replication_timeout, \ } +box.ctl.demote() box.schema.user.revoke('guest', 'super') diff --git a/test/replication/gh-5433-election-restart-recovery.result b/test/replication/gh-5433-election-restart-recovery.result index f8f32416e..ed63ff409 100644 --- a/test/replication/gh-5433-election-restart-recovery.result +++ b/test/replication/gh-5433-election-restart-recovery.result @@ -169,6 +169,9 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... box.schema.user.revoke('guest', 'super') | --- | ... 
diff --git a/test/replication/gh-5433-election-restart-recovery.test.lua b/test/replication/gh-5433-election-restart-recovery.test.lua index 4aff000bf..ae1f42c4d 100644 --- a/test/replication/gh-5433-election-restart-recovery.test.lua +++ b/test/replication/gh-5433-election-restart-recovery.test.lua @@ -84,4 +84,5 @@ box.cfg{ election_mode = old_election_mode, \ replication_timeout = old_replication_timeout, \ } +box.ctl.demote() box.schema.user.revoke('guest', 'super') diff --git a/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.result b/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.result index 2699231e5..20fab4072 100644 --- a/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.result +++ b/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.result @@ -49,6 +49,9 @@ _ = box.schema.space.create('test', {is_sync=true}) _ = box.space.test:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Fill the limbo with pending entries. 3 mustn't receive them yet. test_run:cmd('stop server election_replica3') diff --git a/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua b/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua index 03705d96c..ec0f1d77e 100644 --- a/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua +++ b/test/replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua @@ -21,6 +21,7 @@ box.ctl.wait_rw() _ = box.schema.space.create('test', {is_sync=true}) _ = box.space.test:create_index('pk') +box.ctl.promote() -- Fill the limbo with pending entries. 3 mustn't receive them yet. test_run:cmd('stop server election_replica3') diff --git a/test/replication/gh-5438-election-state.result b/test/replication/gh-5438-election-state.result index 6985f026a..68b6bfad8 100644 --- a/test/replication/gh-5438-election-state.result +++ b/test/replication/gh-5438-election-state.result @@ -47,6 +47,9 @@ end) | ... -- Cleanup. 
+box.ctl.demote() + | --- + | ... box.cfg{election_mode = old_election_mode} | --- | ... diff --git a/test/replication/gh-5438-election-state.test.lua b/test/replication/gh-5438-election-state.test.lua index 60c3366c1..cf0f4ca23 100644 --- a/test/replication/gh-5438-election-state.test.lua +++ b/test/replication/gh-5438-election-state.test.lua @@ -22,6 +22,7 @@ test_run:wait_cond(function()\ end) -- Cleanup. +box.ctl.demote() box.cfg{election_mode = old_election_mode} test_run:cmd('stop server replica') test_run:cmd('delete server replica') diff --git a/test/replication/gh-5446-qsync-eval-quorum.result b/test/replication/gh-5446-qsync-eval-quorum.result index 5f83b248c..b3c217913 100644 --- a/test/replication/gh-5446-qsync-eval-quorum.result +++ b/test/replication/gh-5446-qsync-eval-quorum.result @@ -88,6 +88,9 @@ s = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = s:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Only one master node -> 1/2 + 1 = 1 s:insert{1} -- should pass @@ -343,3 +346,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/gh-5446-qsync-eval-quorum.test.lua b/test/replication/gh-5446-qsync-eval-quorum.test.lua index 6b9e324ed..c2901b845 100644 --- a/test/replication/gh-5446-qsync-eval-quorum.test.lua +++ b/test/replication/gh-5446-qsync-eval-quorum.test.lua @@ -37,6 +37,7 @@ end -- Create a sync space we will operate on s = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = s:create_index('pk') +box.ctl.promote() -- Only one master node -> 1/2 + 1 = 1 s:insert{1} -- should pass @@ -135,3 +136,4 @@ box.cfg{ replication_synchro_quorum = old_synchro_quorum, \ replication_synchro_timeout = old_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/gh-5506-election-on-off.result b/test/replication/gh-5506-election-on-off.result index b8abd7ecd..a7f2b6a9c 100644 --- a/test/replication/gh-5506-election-on-off.result +++ b/test/replication/gh-5506-election-on-off.result @@ -138,3 +138,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5506-election-on-off.test.lua b/test/replication/gh-5506-election-on-off.test.lua index 476b00ec0..f8915c333 100644 --- a/test/replication/gh-5506-election-on-off.test.lua +++ b/test/replication/gh-5506-election-on-off.test.lua @@ -66,3 +66,4 @@ box.cfg{ election_mode = old_election_mode, \ replication_timeout = old_replication_timeout, \ } +box.ctl.demote() diff --git a/test/replication/gh-5566-final-join-synchro.result b/test/replication/gh-5566-final-join-synchro.result index a09882ba6..c5ae2f283 100644 --- a/test/replication/gh-5566-final-join-synchro.result +++ b/test/replication/gh-5566-final-join-synchro.result @@ -12,6 +12,9 @@ _ = box.schema.space.create('sync', {is_sync=true}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... box.schema.user.grant('guest', 'replication') | --- @@ -137,3 +140,6 @@ test_run:cleanup_cluster() box.schema.user.revoke('guest', 'replication') | --- | ... 
+box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5566-final-join-synchro.test.lua b/test/replication/gh-5566-final-join-synchro.test.lua index 2db2c742f..25f411407 100644 --- a/test/replication/gh-5566-final-join-synchro.test.lua +++ b/test/replication/gh-5566-final-join-synchro.test.lua @@ -5,6 +5,7 @@ test_run = require('test_run').new() -- _ = box.schema.space.create('sync', {is_sync=true}) _ = box.space.sync:create_index('pk') +box.ctl.promote() box.schema.user.grant('guest', 'replication') box.schema.user.grant('guest', 'write', 'space', 'sync') @@ -59,3 +60,4 @@ box.cfg{\ box.space.sync:drop() test_run:cleanup_cluster() box.schema.user.revoke('guest', 'replication') +box.ctl.demote() diff --git a/test/replication/gh-5874-qsync-txn-recovery.result b/test/replication/gh-5874-qsync-txn-recovery.result index 73f903ca7..01328a9e3 100644 --- a/test/replication/gh-5874-qsync-txn-recovery.result +++ b/test/replication/gh-5874-qsync-txn-recovery.result @@ -31,6 +31,9 @@ sync = box.schema.create_space('sync', {is_sync = true, engine = engine}) _ = sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- The transaction fails, but is written to the log anyway. box.begin() async:insert{1} sync:insert{1} box.commit() @@ -160,3 +163,6 @@ sync:drop() loc:drop() | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-5874-qsync-txn-recovery.test.lua b/test/replication/gh-5874-qsync-txn-recovery.test.lua index f35eb68de..6ddf164ac 100644 --- a/test/replication/gh-5874-qsync-txn-recovery.test.lua +++ b/test/replication/gh-5874-qsync-txn-recovery.test.lua @@ -12,6 +12,7 @@ async = box.schema.create_space('async', {engine = engine}) _ = async:create_index('pk') sync = box.schema.create_space('sync', {is_sync = true, engine = engine}) _ = sync:create_index('pk') +box.ctl.promote() -- The transaction fails, but is written to the log anyway. 
box.begin() async:insert{1} sync:insert{1} box.commit() @@ -82,3 +83,4 @@ loc:select() async:drop() sync:drop() loc:drop() +box.ctl.demote() diff --git a/test/replication/gh-6032-promote-wal-write.result b/test/replication/gh-6032-promote-wal-write.result index 246c7974f..03112fb8d 100644 --- a/test/replication/gh-6032-promote-wal-write.result +++ b/test/replication/gh-6032-promote-wal-write.result @@ -67,3 +67,6 @@ box.cfg{\ box.space.sync:drop() | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-6032-promote-wal-write.test.lua b/test/replication/gh-6032-promote-wal-write.test.lua index 8c1859083..9a036a8b4 100644 --- a/test/replication/gh-6032-promote-wal-write.test.lua +++ b/test/replication/gh-6032-promote-wal-write.test.lua @@ -26,3 +26,4 @@ box.cfg{\ replication_synchro_timeout = replication_synchro_timeout,\ } box.space.sync:drop() +box.ctl.demote() diff --git a/test/replication/gh-6034-election-promote-bump-term.result b/test/replication/gh-6034-election-promote-bump-term.result index 8be4e8243..c64da0d5d 100644 --- a/test/replication/gh-6034-election-promote-bump-term.result +++ b/test/replication/gh-6034-election-promote-bump-term.result @@ -19,3 +19,8 @@ assert(box.info.election.term == term + 1) | --- | - true | ... + +-- Cleanup. +box.ctl.demote() + | --- + | ... diff --git a/test/replication/gh-6034-election-promote-bump-term.test.lua b/test/replication/gh-6034-election-promote-bump-term.test.lua index 1e814bf5d..dfb78f906 100644 --- a/test/replication/gh-6034-election-promote-bump-term.test.lua +++ b/test/replication/gh-6034-election-promote-bump-term.test.lua @@ -7,3 +7,6 @@ box.cfg{election_mode='off'} term = box.info.election.term box.ctl.promote() assert(box.info.election.term == term + 1) + +-- Cleanup. 
+box.ctl.demote() diff --git a/test/replication/gh-6034-qsync-limbo-ownership.result b/test/replication/gh-6034-qsync-limbo-ownership.result new file mode 100644 index 000000000..0da3e5c2d --- /dev/null +++ b/test/replication/gh-6034-qsync-limbo-ownership.result @@ -0,0 +1,186 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +-- +-- gh-6034: test that transactional limbo isn't accessible without a promotion. +-- +synchro_quorum = box.cfg.replication_synchro_quorum + | --- + | ... +election_mode = box.cfg.election_mode + | --- + | ... +box.cfg{replication_synchro_quorum = 1, election_mode='off'} + | --- + | ... + +_ = box.schema.space.create('async'):create_index('pk') + | --- + | ... +_ = box.schema.space.create('sync', {is_sync=true}):create_index('pk') + | --- + | ... + +-- Limbo is initially unclaimed, everyone is writeable. +assert(not box.info.ro) + | --- + | - true + | ... +assert(box.info.synchro.queue.owner == 0) + | --- + | - true + | ... +box.space.async:insert{1} -- success. + | --- + | - [1] + | ... +-- Synchro spaces aren't writeable +box.space.sync:insert{1} -- error. + | --- + | - error: The synchronous transaction queue doesn't belong to any instance + | ... + +box.ctl.promote() + | --- + | ... +assert(not box.info.ro) + | --- + | - true + | ... +assert(box.info.synchro.queue.owner == box.info.id) + | --- + | - true + | ... +box.space.sync:insert{1} -- success. + | --- + | - [1] + | ... + +-- Everyone but the limbo owner is read-only. +box.schema.user.grant('guest', 'replication') + | --- + | ... +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') + | --- + | - true + | ... +test_run:cmd('start server replica with wait=True, wait_load=True') + | --- + | - true + | ... +test_run:cmd('set variable rpl_listen to "replica.listen"') + | --- + | - true + | ... +orig_replication = box.cfg.replication + | --- + | ... +box.cfg{replication=rpl_listen} + | --- + | ... 
+ +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.synchro.queue.owner == test_run:get_server_id('default')) + | --- + | - true + | ... +box.space.async:insert{2} -- failure. + | --- + | - error: Can't modify data because this instance is in read-only mode. + | ... + +-- Promotion on the other node. Default should become ro. +box.ctl.promote() + | --- + | ... +assert(not box.info.ro) + | --- + | - true + | ... +assert(box.info.synchro.queue.owner == box.info.id) + | --- + | - true + | ... +box.space.sync:insert{2} -- success. + | --- + | - [2] + | ... + +test_run:switch('default') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.synchro.queue.owner == test_run:get_server_id('replica')) + | --- + | - true + | ... +box.space.sync:insert{3} -- failure. + | --- + | - error: Can't modify data because this instance is in read-only mode. + | ... + +box.ctl.promote() + | --- + | ... +box.ctl.demote() + | --- + | ... +assert(not box.info.ro) + | --- + | - true + | ... +box.space.sync:insert{3} -- still fails. + | --- + | - error: The synchronous transaction queue doesn't belong to any instance + | ... +assert(box.info.synchro.queue.owner == 0) + | --- + | - true + | ... +box.space.async:insert{3} -- success. + | --- + | - [3] + | ... + +-- Cleanup. +box.ctl.demote() + | --- + | ... +test_run:cmd('stop server replica') + | --- + | - true + | ... +test_run:cmd('delete server replica') + | --- + | - true + | ... +box.schema.user.revoke('guest', 'replication') + | --- + | ... +box.space.sync:drop() + | --- + | ... +box.space.async:drop() + | --- + | ... +box.cfg{\ + replication_synchro_quorum = synchro_quorum,\ + election_mode = election_mode,\ + replication = orig_replication,\ +} + | --- + | ... 
diff --git a/test/replication/gh-6034-qsync-limbo-ownership.test.lua b/test/replication/gh-6034-qsync-limbo-ownership.test.lua new file mode 100644 index 000000000..a0c49575a --- /dev/null +++ b/test/replication/gh-6034-qsync-limbo-ownership.test.lua @@ -0,0 +1,68 @@ +test_run = require('test_run').new() + +-- +-- gh-6034: test that transactional limbo isn't accessible without a promotion. +-- +synchro_quorum = box.cfg.replication_synchro_quorum +election_mode = box.cfg.election_mode +box.cfg{replication_synchro_quorum = 1, election_mode='off'} + +_ = box.schema.space.create('async'):create_index('pk') +_ = box.schema.space.create('sync', {is_sync=true}):create_index('pk') + +-- Limbo is initially unclaimed, everyone is writeable. +assert(not box.info.ro) +assert(box.info.synchro.queue.owner == 0) +box.space.async:insert{1} -- success. +-- Synchro spaces aren't writeable +box.space.sync:insert{1} -- error. + +box.ctl.promote() +assert(not box.info.ro) +assert(box.info.synchro.queue.owner == box.info.id) +box.space.sync:insert{1} -- success. + +-- Everyone but the limbo owner is read-only. +box.schema.user.grant('guest', 'replication') +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') +test_run:cmd('start server replica with wait=True, wait_load=True') +test_run:cmd('set variable rpl_listen to "replica.listen"') +orig_replication = box.cfg.replication +box.cfg{replication=rpl_listen} + +test_run:switch('replica') +assert(box.info.ro) +assert(box.info.synchro.queue.owner == test_run:get_server_id('default')) +box.space.async:insert{2} -- failure. + +-- Promotion on the other node. Default should become ro. +box.ctl.promote() +assert(not box.info.ro) +assert(box.info.synchro.queue.owner == box.info.id) +box.space.sync:insert{2} -- success. + +test_run:switch('default') +assert(box.info.ro) +assert(box.info.synchro.queue.owner == test_run:get_server_id('replica')) +box.space.sync:insert{3} -- failure. 
+ +box.ctl.promote() +box.ctl.demote() +assert(not box.info.ro) +box.space.sync:insert{3} -- still fails. +assert(box.info.synchro.queue.owner == 0) +box.space.async:insert{3} -- success. + +-- Cleanup. +box.ctl.demote() +test_run:cmd('stop server replica') +test_run:cmd('delete server replica') +box.schema.user.revoke('guest', 'replication') +box.space.sync:drop() +box.space.async:drop() +box.cfg{\ + replication_synchro_quorum = synchro_quorum,\ + election_mode = election_mode,\ + replication = orig_replication,\ +} diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.result b/test/replication/gh-6057-qsync-confirm-async-no-wal.result index 23c77729b..e7beefb2a 100644 --- a/test/replication/gh-6057-qsync-confirm-async-no-wal.result +++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.result @@ -40,6 +40,10 @@ _ = s2:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... + errinj = box.error.injection | --- | ... @@ -161,3 +165,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua index a11ddc042..bb459ea02 100644 --- a/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua +++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua @@ -21,6 +21,8 @@ _ = s:create_index('pk') s2 = box.schema.create_space('test2') _ = s2:create_index('pk') +box.ctl.promote() + errinj = box.error.injection function create_hanging_async_after_confirm(sync_key, async_key1, async_key2) \ @@ -86,3 +88,4 @@ box.cfg{ replication_synchro_quorum = old_synchro_quorum, \ replication_synchro_timeout = old_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/hang_on_synchro_fail.result b/test/replication/hang_on_synchro_fail.result index 9f6fac00b..dda15af20 100644 --- a/test/replication/hang_on_synchro_fail.result +++ b/test/replication/hang_on_synchro_fail.result @@ -19,6 +19,9 @@ _ = box.schema.space.create('sync', {is_sync=true}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... old_synchro_quorum = box.cfg.replication_synchro_quorum | --- @@ -127,4 +130,7 @@ box.space.sync:drop() box.schema.user.revoke('guest', 'replication') | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/hang_on_synchro_fail.test.lua b/test/replication/hang_on_synchro_fail.test.lua index 6c3b09fab..f0d494eae 100644 --- a/test/replication/hang_on_synchro_fail.test.lua +++ b/test/replication/hang_on_synchro_fail.test.lua @@ -8,6 +8,7 @@ box.schema.user.grant('guest', 'replication') _ = box.schema.space.create('sync', {is_sync=true}) _ = box.space.sync:create_index('pk') +box.ctl.promote() old_synchro_quorum = box.cfg.replication_synchro_quorum box.cfg{replication_synchro_quorum=3} @@ -54,4 +55,5 @@ box.cfg{replication_synchro_quorum=old_synchro_quorum,\ replication_synchro_timeout=old_synchro_timeout} box.space.sync:drop() box.schema.user.revoke('guest', 'replication') +box.ctl.demote() diff --git a/test/replication/qsync_advanced.result b/test/replication/qsync_advanced.result index 94b19b1f2..72ac0c326 100644 --- a/test/replication/qsync_advanced.result +++ b/test/replication/qsync_advanced.result @@ -72,6 +72,9 @@ _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Testcase body. box.space.sync:insert{1} -- success | --- @@ -468,6 +471,9 @@ box.space.sync:select{} -- 1 box.cfg{read_only=false} -- promote replica to master | --- | ... +box.ctl.promote() + | --- + | ... test_run:switch('default') | --- | - true @@ -508,6 +514,9 @@ test_run:switch('default') box.cfg{read_only=false} | --- | ... +box.ctl.promote() + | --- + | ... test_run:switch('replica') | --- | - true @@ -781,3 +790,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/qsync_advanced.test.lua b/test/replication/qsync_advanced.test.lua index 058ece602..37c285b8d 100644 --- a/test/replication/qsync_advanced.test.lua +++ b/test/replication/qsync_advanced.test.lua @@ -30,6 +30,7 @@ test_run:switch('default') box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000} _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() -- Testcase body. box.space.sync:insert{1} -- success test_run:cmd('switch replica') @@ -170,6 +171,7 @@ box.space.sync:select{} -- 1 test_run:switch('replica') box.space.sync:select{} -- 1 box.cfg{read_only=false} -- promote replica to master +box.ctl.promote() test_run:switch('default') box.cfg{read_only=true} -- demote master to replica test_run:switch('replica') @@ -181,6 +183,7 @@ box.space.sync:select{} -- 1, 2 -- Revert cluster configuration. test_run:switch('default') box.cfg{read_only=false} +box.ctl.promote() test_run:switch('replica') box.cfg{read_only=true} -- Testcase cleanup. @@ -279,3 +282,4 @@ box.cfg{ replication_synchro_quorum = orig_synchro_quorum, \ replication_synchro_timeout = orig_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result index 7e711ba13..bbdfc42fe 100644 --- a/test/replication/qsync_basic.result +++ b/test/replication/qsync_basic.result @@ -14,6 +14,9 @@ s1.is_sync pk = s1:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... box.begin() s1:insert({1}) s1:insert({2}) box.commit() | --- | ... @@ -645,19 +648,12 @@ test_run:switch('default') | --- | - true | ... -box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000} - | --- - | ... -f = fiber.create(function() box.space.sync:replace{1} end) +box.ctl.demote() | --- | ... -test_run:wait_lsn('replica', 'default') +box.space.sync:replace{1} | --- - | ... 
- -test_run:switch('replica') - | --- - | - true + | - error: The synchronous transaction queue doesn't belong to any instance | ... function skip_row() return nil end | --- @@ -674,26 +670,22 @@ box.space.sync:replace{2} box.space.sync:before_replace(nil, skip_row) | --- | ... -assert(box.space.sync:get{2} == nil) +assert(box.space.sync:get{1} == nil) | --- | - true | ... -assert(box.space.sync:get{1} ~= nil) +assert(box.space.sync:get{2} == nil) | --- | - true | ... - -test_run:switch('default') +assert(box.info.lsn == old_lsn + 1) | --- | - true | ... -box.cfg{replication_synchro_quorum = 2} +box.ctl.promote() | --- | ... -test_run:wait_cond(function() return f:status() == 'dead' end) - | --- - | - true - | ... + box.space.sync:truncate() | --- | ... @@ -758,3 +750,6 @@ box.space.sync:drop() box.schema.user.revoke('guest', 'replication') | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/qsync_basic.test.lua b/test/replication/qsync_basic.test.lua index 75c9b222b..eac465e25 100644 --- a/test/replication/qsync_basic.test.lua +++ b/test/replication/qsync_basic.test.lua @@ -6,6 +6,7 @@ s1 = box.schema.create_space('test1', {is_sync = true}) s1.is_sync pk = s1:create_index('pk') +box.ctl.promote() box.begin() s1:insert({1}) s1:insert({2}) box.commit() s1:select{} @@ -253,22 +254,18 @@ box.space.sync:count() -- instances, but also works for local rows. 
-- test_run:switch('default') -box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000} -f = fiber.create(function() box.space.sync:replace{1} end) -test_run:wait_lsn('replica', 'default') - -test_run:switch('replica') +box.ctl.demote() +box.space.sync:replace{1} function skip_row() return nil end old_lsn = box.info.lsn _ = box.space.sync:before_replace(skip_row) box.space.sync:replace{2} box.space.sync:before_replace(nil, skip_row) +assert(box.space.sync:get{1} == nil) assert(box.space.sync:get{2} == nil) -assert(box.space.sync:get{1} ~= nil) +assert(box.info.lsn == old_lsn + 1) +box.ctl.promote() -test_run:switch('default') -box.cfg{replication_synchro_quorum = 2} -test_run:wait_cond(function() return f:status() == 'dead' end) box.space.sync:truncate() -- @@ -301,3 +298,4 @@ test_run:cmd('delete server replica') box.space.test:drop() box.space.sync:drop() box.schema.user.revoke('guest', 'replication') +box.ctl.demote() diff --git a/test/replication/qsync_errinj.result b/test/replication/qsync_errinj.result index 635bcf939..cf1e30a90 100644 --- a/test/replication/qsync_errinj.result +++ b/test/replication/qsync_errinj.result @@ -35,6 +35,9 @@ _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- -- gh-5100: slow ACK sending shouldn't stun replica for the @@ -542,3 +545,6 @@ box.space.sync:drop() box.schema.user.revoke('guest', 'super') | --- | ... +box.ctl.demote() + | --- + | ... 
diff --git a/test/replication/qsync_errinj.test.lua b/test/replication/qsync_errinj.test.lua index 6a9fd3e1a..e7c85c58c 100644 --- a/test/replication/qsync_errinj.test.lua +++ b/test/replication/qsync_errinj.test.lua @@ -12,6 +12,7 @@ test_run:cmd('start server replica with wait=True, wait_load=True') _ = box.schema.space.create('sync', {is_sync = true, engine = engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() -- -- gh-5100: slow ACK sending shouldn't stun replica for the @@ -222,3 +223,4 @@ test_run:cmd('delete server replica') box.space.sync:drop() box.schema.user.revoke('guest', 'super') +box.ctl.demote() diff --git a/test/replication/qsync_snapshots.result b/test/replication/qsync_snapshots.result index cafdd63c8..ca418b168 100644 --- a/test/replication/qsync_snapshots.result +++ b/test/replication/qsync_snapshots.result @@ -57,6 +57,9 @@ _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Testcase body. box.space.sync:insert{1} | --- @@ -299,3 +302,6 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... diff --git a/test/replication/qsync_snapshots.test.lua b/test/replication/qsync_snapshots.test.lua index 590610974..82c2e3f7c 100644 --- a/test/replication/qsync_snapshots.test.lua +++ b/test/replication/qsync_snapshots.test.lua @@ -23,6 +23,7 @@ test_run:switch('default') box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000} _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() -- Testcase body. 
box.space.sync:insert{1} box.space.sync:select{} -- 1 @@ -130,3 +131,4 @@ box.cfg{ replication_synchro_quorum = orig_synchro_quorum, \ replication_synchro_timeout = orig_synchro_timeout, \ } +box.ctl.demote() diff --git a/test/replication/qsync_with_anon.result b/test/replication/qsync_with_anon.result index 6a2952a32..99c6fb902 100644 --- a/test/replication/qsync_with_anon.result +++ b/test/replication/qsync_with_anon.result @@ -57,6 +57,9 @@ _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') | --- | ... +box.ctl.promote() + | --- + | ... -- Testcase body. test_run:switch('default') | --- @@ -220,6 +223,9 @@ box.cfg{ } | --- | ... +box.ctl.demote() + | --- + | ... test_run:cleanup_cluster() | --- | ... diff --git a/test/replication/qsync_with_anon.test.lua b/test/replication/qsync_with_anon.test.lua index d7ecaa107..e73880ec7 100644 --- a/test/replication/qsync_with_anon.test.lua +++ b/test/replication/qsync_with_anon.test.lua @@ -22,6 +22,7 @@ test_run:switch('default') box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000} _ = box.schema.space.create('sync', {is_sync=true, engine=engine}) _ = box.space.sync:create_index('pk') +box.ctl.promote() -- Testcase body. 
test_run:switch('default') box.space.sync:insert{1} -- success @@ -81,4 +82,5 @@ box.cfg{ replication_synchro_quorum = orig_synchro_quorum, \ replication_synchro_timeout = orig_synchro_timeout, \ } +box.ctl.demote() test_run:cleanup_cluster() diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 7a6fd7052..9e284b3f2 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -48,6 +48,7 @@ "gh-5613-bootstrap-prefer-booted.test.lua": {}, "gh-6027-applier-error-show.test.lua": {}, "gh-6032-promote-wal-write.test.lua": {}, + "gh-6034-qsync-limbo-ownership.test.lua": {}, "gh-6034-election-promote-bump-term.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "gh-6094-rs-uuid-mismatch.test.lua": {}, -- 2.30.1 (Apple Git-130)
14.07.2021 21:25, Serge Petrenko пишет: > Part of #6034 > --- > src/box/box.cc | 15 +--- > .../gh-6034-election-candidate-promote.result | 84 +++++++++++++++++++ > ...h-6034-election-candidate-promote.test.lua | 42 ++++++++++ > test/replication/suite.cfg | 1 + > 4 files changed, 128 insertions(+), 14 deletions(-) > create mode 100644 test/replication/gh-6034-election-candidate-promote.result > create mode 100644 test/replication/gh-6034-election-candidate-promote.test.lua > Fixed an obvious typo: ========================= diff --git a/test/replication/gh-6034-election-candidate-promote.result b/test/replication/gh-6034-election-candidate-promote.result index 76a54d5a6..2b4bc0213 100644 --- a/test/replication/gh-6034-election-candidate-promote.result +++ b/test/replication/gh-6034-election-candidate-promote.result @@ -60,7 +60,7 @@ next_nr = leader_nr % 3 + 1 | ... -- Someone else may become a leader, thus promote may fail. But we're testing -- that it takes effect at all, so that's fine. -_ = pcall(test_run:eval('election_replica'..next_nr, 'box.ctl.promote()')) +_ = pcall(test_run.eval, test_run, 'election_replica'..next_nr, 'box.ctl.promote()') | --- | ... new_term = test_run:eval('election_replica'..next_nr,\ diff --git a/test/replication/gh-6034-election-candidate-promote.test.lua b/test/replication/gh-6034-election-candidate-promote.test.lua index 24c57f4cb..c25f9296d 100644 --- a/test/replication/gh-6034-election-candidate-promote.test.lua +++ b/test/replication/gh-6034-election-candidate-promote.test.lua @@ -32,7 +32,7 @@ term = test_run:eval('election_replica'..leader_nr,\ next_nr = leader_nr % 3 + 1 -- Someone else may become a leader, thus promote may fail. But we're testing -- that it takes effect at all, so that's fine. 
-_ = pcall(test_run:eval('election_replica'..next_nr, 'box.ctl.promote()')) +_ = pcall(test_run.eval, test_run, 'election_replica'..next_nr, 'box.ctl.promote()') new_term = test_run:eval('election_replica'..next_nr,\ 'return box.info.election.term')[1] assert(new_term > term) ========================= -- Serge Petrenko
14.07.2021 21:25, Serge Petrenko пишет: Another simple fix: ================================= diff --git a/test/replication/gh-6034-qsync-limbo-ownership.result b/test/replication/gh-6034-qsync-limbo-ownership.result index 0da3e5c2d..b4f53cd2a 100644 --- a/test/replication/gh-6034-qsync-limbo-ownership.result +++ b/test/replication/gh-6034-qsync-limbo-ownership.result @@ -120,6 +120,9 @@ test_run:switch('default') | --- | - true | ... +test_run:wait_lsn('default', 'replica') + | --- + | ... assert(box.info.ro) | --- | - true diff --git a/test/replication/gh-6034-qsync-limbo-ownership.test.lua b/test/replication/gh-6034-qsync-limbo-ownership.test.lua index a0c49575a..f9ef5ca41 100644 --- a/test/replication/gh-6034-qsync-limbo-ownership.test.lua +++ b/test/replication/gh-6034-qsync-limbo-ownership.test.lua @@ -43,6 +43,7 @@ assert(box.info.synchro.queue.owner == box.info.id) box.space.sync:insert{2} -- success. test_run:switch('default') +test_run:wait_lsn('default', 'replica') assert(box.info.ro) assert(box.info.synchro.queue.owner == test_run:get_server_id('replica')) box.space.sync:insert{3} -- failure. ================================= -- Serge Petrenko
Fix the test failing occasionally with the following result mismatch: [001] replication/election_qsync.test.lua memtx [ fail ] [001] [001] Test failed! Result content mismatch: [001] --- replication/election_qsync.result Thu Jul 15 17:15:48 2021 [001] +++ var/rejects/replication/election_qsync.reject Thu Jul 15 20:46:51 2021 [001] @@ -145,8 +145,7 @@ [001] | ... [001] box.space.test:select{} [001] | --- [001] - | - - [1] [001] - | - [2] [001] + | - - [2] [001] | ... [001] box.space.test:drop() [001] | --- [001] The issue happened because row [1] wasn't delivered to the 'default' instance from the 'replica' at all. The test does try to wait for [1] to be written to WAL and replicated, but sometimes it fails to wait until this event happens: box.ctl.promote() is issued asynchronously once the instance becomes the Raft leader. So issuing `box.ctl.wait_rw()` doesn't guarantee that the replica has already written the PROMOTE (the limbo is initially unclaimed so replica becomes writeable as soon as it becomes the Raft leader). Right after `wait_rw()` we wait for lsn propagation and for 'default' instance to reach replica's lsn. It may happen that lsn propagation happens due to PROMOTE being written to WAL, and not row [1]. When this is the case, the 'default' instance doesn't receive row [1] at all, resulting in the test error shown above. Fix the issue by waiting for the promotion to happen explicitly. Part of #5430 --- test/replication/election_qsync.result | 8 +++++++- test/replication/election_qsync.test.lua | 7 ++++++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result index 2402c8578..c6ec5e352 100644 --- a/test/replication/election_qsync.result +++ b/test/replication/election_qsync.result @@ -75,13 +75,19 @@ box.cfg{ | --- | ... -box.ctl.wait_rw() +-- Promote is written asynchronously to the instance becoming the leader, so +-- wait for it. 
As soon as it's written, the instance's definitely a leader. +test_run:wait_cond(function() \ + return box.info.synchro.queue.owner == box.info.id \ +end) | --- + | - true | ... assert(box.info.election.state == 'leader') | --- | - true | ... + lsn = box.info.lsn | --- | ... diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua index e1aca8351..f3c7c290b 100644 --- a/test/replication/election_qsync.test.lua +++ b/test/replication/election_qsync.test.lua @@ -39,8 +39,13 @@ box.cfg{ replication_timeout = 0.1, \ } -box.ctl.wait_rw() +-- Promote is written asynchronously to the instance becoming the leader, so +-- wait for it. As soon as it's written, the instance's definitely a leader. +test_run:wait_cond(function() \ + return box.info.synchro.queue.owner == box.info.id \ +end) assert(box.info.election.state == 'leader') + lsn = box.info.lsn _ = fiber.create(function() \ ok, err = pcall(box.space.test.replace, box.space.test, {1}) \ -- 2.30.1 (Apple Git-130)
Hi! Good job on the patch!
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 4ebe0fb06..4b102a777 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -442,7 +445,11 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
> row.type = IPROTO_JOIN_META;
> coio_write_xrow(&relay->io, &row);
>
> - /* Empty at the moment. */
> + char body[XROW_SYNCHRO_BODY_LEN_MAX];
> + xrow_encode_synchro(&row, body, &req);
> + row.replica_id = req.replica_id;
> + row.sync = sync;
Shouldn't we also attach sync to IPROTO_JOIN_SNAPSHOT and
IPROTO_JOIN_META? Although I don't see where it is used in
the applier. But I see it is attached to xrow_encode_vclock
a bit above.
Thanks for the patch!
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 4b102a777..70f1a045b 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -451,6 +453,10 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
> row.sync = sync;
> coio_write_xrow(&relay->io, &row);
>
> + xrow_encode_raft(&row, &fiber()->gc, &raft_req);
> + row.sync = sync;
> + coio_write_xrow(&relay->io, &row);
You might need to send it before the limbo checkpoint. Otherwise
the problem is exactly the same - the master could send the limbo
state, the replica applied it, then the master dies before sending the
raft state, then the replica has limbo's term > raft term.
I wish we could test it in some simple way.
Thanks for working on this! See 3 comments below. > diff --git a/src/box/box.cc b/src/box/box.cc > index 86370514a..445875f8f 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -1527,6 +1527,147 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum, <...> > + > +/** > + * Check whether the greatest promote term has changed since it was last read. > + * IOW check that a foreign PROMOTE arrived while we were sleeping. > + */ > +static int > +box_check_promote_term_changed(uint64_t promote_term) 1. Normally you call check functions using the pattern "check_something_correct". Here the correct behaviour is the term being intact. So I propose to rename it to box_check_promote_term_intact. > +{ > + if (txn_limbo.promote_greatest_term != promote_term) { > + diag_set(ClientError, ER_INTERFERING_PROMOTE, > + txn_limbo.owner_id); > + return -1; > + } > + return 0; > +} <...> > + > +/** > + * A helper to wait until all limbo entries are ready to be confirmed, i.e. > + * written to WAL and have gathered a quorum of ACKs from replicas. > + * Return lsn of the last limbo entry on success, -1 on error. > + */ > +static int64_t > +box_wait_limbo_acked(void) > +{ > + if (txn_limbo_is_empty(&txn_limbo)) > + return txn_limbo.confirmed_lsn; > + > + uint64_t promote_term = txn_limbo.promote_greatest_term; > + int quorum = replication_synchro_quorum; > + struct txn_limbo_entry *last_entry; > + last_entry = txn_limbo_last_synchro_entry(&txn_limbo); > + /* Wait for the last entries WAL write. */ > + if (last_entry->lsn < 0) { > + int64_t tid = last_entry->txn->id; > + > + if (wal_sync(NULL) < 0) > + return -1; > + > + if (box_check_promote_term_changed(promote_term) < 0) 2. Why < 0? It is not in the code guidelines, but don't we usually use '!= 0'? '< 0' normally assumes you can get > 0, 0, and < 0 meaning different things, like it is done in iproto occasionally.
> + return -1; > + if (txn_limbo_is_empty(&txn_limbo)) > + return txn_limbo.confirmed_lsn; > + if (tid != txn_limbo_last_synchro_entry(&txn_limbo)->txn->id) { > + diag_set(ClientError, ER_QUORUM_WAIT, quorum, > + "new synchronous transactions appeared"); > + return -1; > + } > + } <...> > + > +/** Write and process a PROMOTE request. */ > +static void > +box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) > +{ > + assert(box_raft()->volatile_term == box_raft()->term); > + assert(promote_lsn >= 0); > + txn_limbo_write_promote(&txn_limbo, promote_lsn, > + box_raft()->term); 3. Maybe cache box_raft() value in a variable? Its usage would look shorter then. The same in other places where it is used more than once. Up to you. > + struct synchro_request req = { > + .type = IPROTO_PROMOTE, > + .replica_id = prev_leader_id, > + .origin_id = instance_id, > + .lsn = promote_lsn, > + .term = box_raft()->term, > + }; > + txn_limbo_process(&txn_limbo, &req); > + assert(txn_limbo_is_empty(&txn_limbo)); > +}
Thanks for the patch! Could you please add a test? Not even necessary in its own file. You could use qsync_basic/election_basic.
I appreciate the work you did here! See 2 comments below. > diff --git a/test/replication/gh-3055-promote-wakeup-crash.result b/test/replication/gh-3055-promote-wakeup-crash.result > new file mode 100644 > index 000000000..e508611e5 > --- /dev/null > +++ b/test/replication/gh-3055-promote-wakeup-crash.result > @@ -0,0 +1,43 @@ > +-- test-run result file version 2 > +test_run = require('test_run').new() > + | --- > + | ... > +-- > +-- gh-3055 follow-up: box.ctl.promote() could crash on an assertion after a > +-- spurious wakeup. > +-- > +_ = box.space._cluster:insert{2, require('uuid').str()} > + | --- > + | ... > +box.cfg{election_mode='manual',\ > + replication_synchro_quorum=2,\ > + election_timeout=1000} 1. Shouldn't you save and restore the old option values in the end of the test? > + | --- > + | ... > + > +fiber = require('fiber') > + | --- > + | ... > +f = fiber.create(function() box.ctl.promote() end) > + | --- > + | ... > +f:set_joinable(true) 2. But you never use :join(). Perhaps you might want to call it after :cancel(). To ensure the fiber really ended. > + | --- > + | ... > +f:wakeup() > + | --- > + | ... > +fiber.yield() > + | --- > + | ... > + > +-- Cleanup. > +f:cancel() > + | --- > + | ... > +box.cfg{election_mode='off'} > + | --- > + | ... > +test_run:cleanup_cluster() > + | --- > + | ...
22.07.2021 01:24, Vladislav Shpilevoy пишет: > Hi! Good job on the patch! > >> diff --git a/src/box/relay.cc b/src/box/relay.cc >> index 4ebe0fb06..4b102a777 100644 >> --- a/src/box/relay.cc >> +++ b/src/box/relay.cc >> @@ -442,7 +445,11 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, >> row.type = IPROTO_JOIN_META; >> coio_write_xrow(&relay->io, &row); >> >> - /* Empty at the moment. */ >> + char body[XROW_SYNCHRO_BODY_LEN_MAX]; >> + xrow_encode_synchro(&row, body, &req); >> + row.replica_id = req.replica_id; >> + row.sync = sync; > Shouldn't we also attach sync to IPROTO_JOIN_SNAPSHOT and > IPROTO_JOIN_META? Although I don't see where it is used in > the applier. But I see it is attached to xrow_encode_vclock > a bit above. Yes, indeed. Will it be enough to set sync once only when sending the vclock? The field shouldn't be overwritten by anyone. =========================== diff --git a/src/box/relay.cc b/src/box/relay.cc index 4b102a777..db1d595e2 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -448,7 +448,6 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, char body[XROW_SYNCHRO_BODY_LEN_MAX]; xrow_encode_synchro(&row, body, &req); row.replica_id = req.replica_id; - row.sync = sync; coio_write_xrow(&relay->io, &row); /* Mark the end of the metadata stream. */ ===========================
22.07.2021 01:24, Vladislav Shpilevoy пишет: > Thanks for the patch! > >> diff --git a/src/box/relay.cc b/src/box/relay.cc >> index 4b102a777..70f1a045b 100644 >> --- a/src/box/relay.cc >> +++ b/src/box/relay.cc >> @@ -451,6 +453,10 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, >> row.sync = sync; >> coio_write_xrow(&relay->io, &row); >> >> + xrow_encode_raft(&row, &fiber()->gc, &raft_req); >> + row.sync = sync; >> + coio_write_xrow(&relay->io, &row); > You might need to send it before the limbo checkpoint. Otherwise > the problem is exactly the same - the master could send the limbo > state, the replica applied it, then the master dies before sending the > raft state, then the replica has limbo's term > raft term. Yep, thanks for noticing! Fixed. =============================== diff --git a/src/box/relay.cc b/src/box/relay.cc index b413b713c..805b5e7ff 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -447,15 +447,14 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock, row.type = IPROTO_JOIN_META; coio_write_xrow(&relay->io, &row); + xrow_encode_raft(&row, &fiber()->gc, &raft_req); + coio_write_xrow(&relay->io, &row); + char body[XROW_SYNCHRO_BODY_LEN_MAX]; xrow_encode_synchro(&row, body, &req); row.replica_id = req.replica_id; coio_write_xrow(&relay->io, &row); - xrow_encode_raft(&row, &fiber()->gc, &raft_req); - row.sync = sync; - coio_write_xrow(&relay->io, &row); - /* Mark the end of the metadata stream. */ row.type = IPROTO_JOIN_SNAPSHOT; coio_write_xrow(&relay->io, &row); =============================== > I wish we could test it somehow simple.
22.07.2021 01:26, Vladislav Shpilevoy пишет: > Thanks for working on this! > > See 3 comments below. Thanks for the review! >> diff --git a/src/box/box.cc b/src/box/box.cc >> index 86370514a..445875f8f 100644 >> --- a/src/box/box.cc >> +++ b/src/box/box.cc >> @@ -1527,6 +1527,147 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum, > <...> > >> + >> +/** >> + * Check whether the greatest promote term has changed since it was last read. >> + * IOW check that a foreign PROMOTE arrived while we were sleeping. >> + */ >> +static int >> +box_check_promote_term_changed(uint64_t promote_term) > 1. Normally you call check functions using the pattern > "check_something_correct". Here the correct behaviour is the term > being intact. So I propose to rename it to box_check_promote_term_intact. Ok, sure. >> +{ >> + if (txn_limbo.promote_greatest_term != promote_term) { >> + diag_set(ClientError, ER_INTERFERING_PROMOTE, >> + txn_limbo.owner_id); >> + return -1; >> + } >> + return 0; >> +} > <...> > >> + >> +/** >> + * A helper to wait until all limbo entries are ready to be confirmed, i.e. >> + * written to WAL and have gathered a quorum of ACKs from replicas. >> + * Return lsn of the last limbo entry on success, -1 on error. >> + */ >> +static int64_t >> +box_wait_limbo_acked(void) >> +{ >> + if (txn_limbo_is_empty(&txn_limbo)) >> + return txn_limbo.confirmed_lsn; >> + >> + uint64_t promote_term = txn_limbo.promote_greatest_term; >> + int quorum = replication_synchro_quorum; >> + struct txn_limbo_entry *last_entry; >> + last_entry = txn_limbo_last_synchro_entry(&txn_limbo); >> + /* Wait for the last entries WAL write. */ >> + if (last_entry->lsn < 0) { >> + int64_t tid = last_entry->txn->id; >> + >> + if (wal_sync(NULL) < 0) >> + return -1; >> + >> + if (box_check_promote_term_changed(promote_term) < 0) > 2. Why < 0? It is not a in the code guidelines, but don't we usually > use '!= 0'? 
'< 0' normally assumes you can get > 0, 0, and < 0 meaning > different things, like it is done in iproto occassionally. I've put '< 0' here without a second thought. I'm just used to if (smth() < 0) { err; }, I guess. AFAICS there are more places where we use if (rc != 0) { err;} more, so I'll change my code accordingly. >> + return -1; >> + if (txn_limbo_is_empty(&txn_limbo)) >> + return txn_limbo.confirmed_lsn; >> + if (tid != txn_limbo_last_synchro_entry(&txn_limbo)->txn->id) { >> + diag_set(ClientError, ER_QUORUM_WAIT, quorum, >> + "new synchronous transactions appeared"); >> + return -1; >> + } >> + } > <...> > >> + >> +/** Write and process a PROMOTE request. */ >> +static void >> +box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) >> +{ >> + assert(box_raft()->volatile_term == box_raft()->term); >> + assert(promote_lsn >= 0); >> + txn_limbo_write_promote(&txn_limbo, promote_lsn, >> + box_raft()->term); > 3. Maybe cache box_raft() value in a variable? Its usage would look shorter > then. The same in other places where it is used more than once. Up to > you. Done. Incremental diff: ========================= diff --git a/src/box/box.cc b/src/box/box.cc index 341857267..d83c30918 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1563,7 +1563,7 @@ box_run_elections(void) * IOW check that a foreign PROMOTE arrived while we were sleeping. 
*/ static int -box_check_promote_term_changed(uint64_t promote_term) +box_check_promote_term_intact(uint64_t promote_term) { if (txn_limbo.promote_greatest_term != promote_term) { diag_set(ClientError, ER_INTERFERING_PROMOTE, @@ -1579,7 +1579,7 @@ box_try_wait_confirm(double timeout) { uint64_t promote_term = txn_limbo.promote_greatest_term; txn_limbo_wait_empty(&txn_limbo, timeout); - return box_check_promote_term_changed(promote_term); + return box_check_promote_term_intact(promote_term); } /** @@ -1604,7 +1604,7 @@ box_wait_limbo_acked(void) if (wal_sync(NULL) < 0) return -1; - if (box_check_promote_term_changed(promote_term) < 0) + if (box_check_promote_term_intact(promote_term) != 0) return -1; if (txn_limbo_is_empty(&txn_limbo)) return txn_limbo.confirmed_lsn; @@ -1618,10 +1618,10 @@ box_wait_limbo_acked(void) int64_t wait_lsn = last_entry->lsn; if (box_wait_quorum(txn_limbo.owner_id, wait_lsn, quorum, - replication_synchro_timeout) < 0) + replication_synchro_timeout) != 0) return -1; - if (box_check_promote_term_changed(promote_term) < 0) + if (box_check_promote_term_intact(promote_term) != 0) return -1; if (txn_limbo_is_empty(&txn_limbo)) @@ -1722,10 +1722,10 @@ box_promote(void) int64_t wait_lsn = -1; - if (run_elections && box_run_elections() < 0) + if (run_elections && box_run_elections() != 0) return -1; if (try_wait && - box_try_wait_confirm(2 * replication_synchro_timeout) < 0) + box_try_wait_confirm(2 * replication_synchro_timeout) != 0) return -1; if ((wait_lsn = box_wait_limbo_acked()) < 0) return -1; ========================= >> + struct synchro_request req = { >> + .type = IPROTO_PROMOTE, >> + .replica_id = prev_leader_id, >> + .origin_id = instance_id, >> + .lsn = promote_lsn, >> + .term = box_raft()->term, >> + }; >> + txn_limbo_process(&txn_limbo, &req); >> + assert(txn_limbo_is_empty(&txn_limbo)); >> +}
22.07.2021 01:26, Vladislav Shpilevoy пишет: > Thanks for the patch! > > Could you please add a test? Not even necessary in its own file. You > could use qsync_basic/election_basic. Ok, sure. I think gh-6034-election-promote-bump-term suits our needs better. ============================== diff --git a/test/replication/gh-6034-election-promote-bump-term.result b/test/replication/gh-6034-election-promote-bump-term.result index 8be4e8243..3999cfcc9 100644 --- a/test/replication/gh-6034-election-promote-bump-term.result +++ b/test/replication/gh-6034-election-promote-bump-term.result @@ -19,3 +19,12 @@ assert(box.info.election.term == term + 1) | --- | - true | ... + +-- Consequent promotes are no-ops on the leader. +box.ctl.promote() + | --- + | ... +assert(box.info.election.term == term + 1) + | --- + | - true + | ... diff --git a/test/replication/gh-6034-election-promote-bump-term.test.lua b/test/replication/gh-6034-election-promote-bump-term.test.lua index 1e814bf5d..52429394b 100644 --- a/test/replication/gh-6034-election-promote-bump-term.test.lua +++ b/test/replication/gh-6034-election-promote-bump-term.test.lua @@ -7,3 +7,7 @@ box.cfg{election_mode='off'} term = box.info.election.term box.ctl.promote() assert(box.info.election.term == term + 1) + +-- Consequent promotes are no-ops on the leader. +box.ctl.promote() +assert(box.info.election.term == term + 1) ==============================
22.07.2021 01:29, Vladislav Shpilevoy пишет: > I appreciate the work you did here! > > See 2 comments below. > >> diff --git a/test/replication/gh-3055-promote-wakeup-crash.result b/test/replication/gh-3055-promote-wakeup-crash.result >> new file mode 100644 >> index 000000000..e508611e5 >> --- /dev/null >> +++ b/test/replication/gh-3055-promote-wakeup-crash.result >> @@ -0,0 +1,43 @@ >> +-- test-run result file version 2 >> +test_run = require('test_run').new() >> + | --- >> + | ... >> +-- >> +-- gh-3055 follow-up: box.ctl.promote() could crash on an assertion after a >> +-- spurious wakeup. >> +-- >> +_ = box.space._cluster:insert{2, require('uuid').str()} >> + | --- >> + | ... >> +box.cfg{election_mode='manual',\ >> + replication_synchro_quorum=2,\ >> + election_timeout=1000} > 1. Shouldn't you save and restore the old option values in the end > of the test? Not anymore. As far as I know the default instance is restarted between each test run now. So we only need to reset persistent things now, like grants, roles, promotes/demotes, spaces and so on. >> + | --- >> + | ... >> + >> +fiber = require('fiber') >> + | --- >> + | ... >> +f = fiber.create(function() box.ctl.promote() end) >> + | --- >> + | ... >> +f:set_joinable(true) > 2. But you never use :join(). Perhaps you might want to call it > after :cancel(). To ensure the fiber really ended. Yep, my bad. Thanks for noticing! ===================================== diff --git a/test/replication/gh-3055-promote-wakeup-crash.result b/test/replication/gh-3055-promote-wakeup-crash.result index e508611e5..f915bf359 100644 --- a/test/replication/gh-3055-promote-wakeup-crash.result +++ b/test/replication/gh-3055-promote-wakeup-crash.result @@ -35,6 +35,11 @@ fiber.yield() f:cancel() | --- | ... +f:join() + | --- + | - false + | - fiber is cancelled + | ... box.cfg{election_mode='off'} | --- | ... 
diff --git a/test/replication/gh-3055-promote-wakeup-crash.test.lua b/test/replication/gh-3055-promote-wakeup-crash.test.lua index 2ac901b08..acf0c05b0 100644 --- a/test/replication/gh-3055-promote-wakeup-crash.test.lua +++ b/test/replication/gh-3055-promote-wakeup-crash.test.lua @@ -16,5 +16,6 @@ fiber.yield() -- Cleanup. f:cancel() +f:join() box.cfg{election_mode='off'} test_run:cleanup_cluster() ===================================== >> + | --- >> + | ... >> +f:wakeup() >> + | --- >> + | ... >> +fiber.yield() >> + | --- >> + | ... >> + >> +-- Cleanup. >> +f:cancel() >> + | --- >> + | ... >> +box.cfg{election_mode='off'} >> + | --- >> + | ... >> +test_run:cleanup_cluster() >> + | --- >> + | ...
Thanks for the fixes! On the latest version of the branch I see a lot of CI failures. Something got broken.
On 23.07.2021 09:44, Sergey Petrenko wrote:
>
> 22.07.2021 01:24, Vladislav Shpilevoy пишет:
>> Hi! Good job on the patch!
>>
>>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>>> index 4ebe0fb06..4b102a777 100644
>>> --- a/src/box/relay.cc
>>> +++ b/src/box/relay.cc
>>> @@ -442,7 +445,11 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
>>> row.type = IPROTO_JOIN_META;
>>> coio_write_xrow(&relay->io, &row);
>>> - /* Empty at the moment. */
>>> + char body[XROW_SYNCHRO_BODY_LEN_MAX];
>>> + xrow_encode_synchro(&row, body, &req);
>>> + row.replica_id = req.replica_id;
>>> + row.sync = sync;
>> Shouldn't we also attach sync to IPROTO_JOIN_SNAPSHOT and
>> IPROTO_JOIN_META? Although I don't see where it is used in
>> the applier. But I see it is attached to xrow_encode_vclock
>> a bit above.
>
> Yes, indeed. Will it be enough to set sync once only when sending the vclock?
I don't even know why it is needed for vclock, tbh.
Applier does not keep track of these syncs anyway.
I suppose we can leave it just for vclock for now.
On 23.07.2021 09:44, Sergey Petrenko wrote:
>
> 22.07.2021 01:24, Vladislav Shpilevoy пишет:
>> Thanks for the patch!
>>
>>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>>> index 4b102a777..70f1a045b 100644
>>> --- a/src/box/relay.cc
>>> +++ b/src/box/relay.cc
>>> @@ -451,6 +453,10 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
>>> row.sync = sync;
>>> coio_write_xrow(&relay->io, &row);
>>> + xrow_encode_raft(&row, &fiber()->gc, &raft_req);
>>> + row.sync = sync;
>>> + coio_write_xrow(&relay->io, &row);
>> You might need to send it before the limbo checkpoint. Otherwise
>> the problem is exactly the same - the master could send the limbo
>> state, the replica applied it, then the master dies before sending the
>> raft state, then the replica has limbo's term > raft term.
>
>
> Yep, thanks for noticing! Fixed.
Hm, but is it really possible? If the initial join is not finished
(the master didn't send data) and only the limbo is delivered, then
the remote instance is going to fail its bootstrap anyway and won't
use the partially received state.
Thanks for the fixes!

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 85fac08bb..d83c30918 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1527,6 +1527,140 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,

<...>

> +
> +/**
> + * A helper to wait until all limbo entries are ready to be confirmed, i.e.
> + * written to WAL and have gathered a quorum of ACKs from replicas.
> + * Return lsn of the last limbo entry on success, -1 on error.
> + */
> +static int64_t
> +box_wait_limbo_acked(void)
> +{
> +	if (txn_limbo_is_empty(&txn_limbo))
> +		return txn_limbo.confirmed_lsn;
> +
> +	uint64_t promote_term = txn_limbo.promote_greatest_term;
> +	int quorum = replication_synchro_quorum;
> +	struct txn_limbo_entry *last_entry;
> +	last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
> +	/* Wait for the last entries WAL write. */
> +	if (last_entry->lsn < 0) {
> +		int64_t tid = last_entry->txn->id;
> +
> +		if (wal_sync(NULL) < 0)

Could you please make it `!= 0`? To be consistent with the
other places and to emphasize the result is binary - either an
error or success.
Thanks for the fixes!

> diff --git a/src/box/raft.c b/src/box/raft.c
> index eb62e9630..35c471f58 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -394,6 +394,46 @@ box_raft_wait_term_outcome(void)

<...>

> +
> +int
> +box_raft_wait_term_persisted(void)
> +{
> +	if (box_raft()->term == box_raft()->volatile_term)
> +		return 0;
> +	struct raft_wait_persisted_data data = {
> +		.waiter = fiber(),
> +		.term = box_raft()->volatile_term,
> +	};
> +	struct trigger trig;
> +	trigger_create(&trig, box_raft_wait_term_persisted_f, &data, NULL);
> +	raft_on_update(box_raft(), &trig);
> +
> +	do {
> +		fiber_yield();
> +	} while (box_raft()->term < data.term && !fiber_is_cancelled());

Could I ask you please to cache box_raft() into a variable?
It would be shorter and slightly easier to read.

> +
> +	trigger_clear(&trig);
> +	if (fiber_is_cancelled()) {
> +		diag_set(FiberIsCancelled);
> +		return -1;
> +	}
> +	return 0;
> +}
27.07.2021 02:43, Vladislav Shpilevoy wrote:
> On 23.07.2021 09:44, Sergey Petrenko wrote:
>> 22.07.2021 01:24, Vladislav Shpilevoy wrote:
>>> Thanks for the patch!
>>>
>>>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>>>> index 4b102a777..70f1a045b 100644
>>>> --- a/src/box/relay.cc
>>>> +++ b/src/box/relay.cc
>>>> @@ -451,6 +453,10 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
>>>> row.sync = sync;
>>>> coio_write_xrow(&relay->io, &row);
>>>> + xrow_encode_raft(&row, &fiber()->gc, &raft_req);
>>>> + row.sync = sync;
>>>> + coio_write_xrow(&relay->io, &row);
>>> You might need to send it before the limbo checkpoint. Otherwise
>>> the problem is exactly the same - the master could send the limbo
>>> state, the replica applied it, then the master dies before sending the
>>> raft state, then the replica has limbo's term > raft term.
>> Yep, thanks for noticing! Fixed.
> Hm, but is it really possible? If the initial join is not finished
> (the master didn't send data) and only the limbo is delivered, then
> the remote instance is going to fail its bootstrap anyway and won't
> use the partially received state.
That's true. Anyway, let's send the raft state and limbo state in the correct order,
just to keep things consistent. Otherwise we'll have a point in time when
limbo term is bigger than raft term. Let's better get rid of such cases
completely.
27.07.2021 02:44, Vladislav Shpilevoy wrote:
> Thanks for the fixes!
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 85fac08bb..d83c30918 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1527,6 +1527,140 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
> <...>
>
>> +
>> +/**
>> + * A helper to wait until all limbo entries are ready to be confirmed, i.e.
>> + * written to WAL and have gathered a quorum of ACKs from replicas.
>> + * Return lsn of the last limbo entry on success, -1 on error.
>> + */
>> +static int64_t
>> +box_wait_limbo_acked(void)
>> +{
>> + if (txn_limbo_is_empty(&txn_limbo))
>> + return txn_limbo.confirmed_lsn;
>> +
>> + uint64_t promote_term = txn_limbo.promote_greatest_term;
>> + int quorum = replication_synchro_quorum;
>> + struct txn_limbo_entry *last_entry;
>> + last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
>> + /* Wait for the last entries WAL write. */
>> + if (last_entry->lsn < 0) {
>> + int64_t tid = last_entry->txn->id;
>> +
>> + if (wal_sync(NULL) < 0)
> Could you please make it `!= 0`? To be consistent with the
> other places and to emphasize the result is binary - either an
> error or success.
Sure, fixed. I didn't touch it at first because it had "< 0"
even before my patchset.
27.07.2021 02:45, Vladislav Shpilevoy wrote:
> Thanks for the fixes!
>
>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index eb62e9630..35c471f58 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -394,6 +394,46 @@ box_raft_wait_term_outcome(void)
> <...>
>
>> +
>> +int
>> +box_raft_wait_term_persisted(void)
>> +{
>> +	if (box_raft()->term == box_raft()->volatile_term)
>> +		return 0;
>> +	struct raft_wait_persisted_data data = {
>> +		.waiter = fiber(),
>> +		.term = box_raft()->volatile_term,
>> +	};
>> +	struct trigger trig;
>> +	trigger_create(&trig, box_raft_wait_term_persisted_f, &data, NULL);
>> +	raft_on_update(box_raft(), &trig);
>> +
>> +	do {
>> +		fiber_yield();
>> +	} while (box_raft()->term < data.term && !fiber_is_cancelled());
> Could I ask you please to cache box_raft() into a variable?
> It would be shorter and slightly easier to read.

Sure. Done.

===========================
diff --git a/src/box/raft.c b/src/box/raft.c
index 35c471f58..bef9414c3 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -412,19 +412,20 @@ box_raft_wait_term_persisted_f(struct trigger *trig, void *event)
 int
 box_raft_wait_term_persisted(void)
 {
-	if (box_raft()->term == box_raft()->volatile_term)
+	struct raft *raft = box_raft();
+	if (raft->term == raft->volatile_term)
 		return 0;
 	struct raft_wait_persisted_data data = {
 		.waiter = fiber(),
-		.term = box_raft()->volatile_term,
+		.term = raft->volatile_term,
 	};
 	struct trigger trig;
 	trigger_create(&trig, box_raft_wait_term_persisted_f, &data, NULL);
-	raft_on_update(box_raft(), &trig);
+	raft_on_update(raft, &trig);
 
 	do {
 		fiber_yield();
-	} while (box_raft()->term < data.term && !fiber_is_cancelled());
+	} while (raft->term < data.term && !fiber_is_cancelled());
 
 	trigger_clear(&trig);
 	if (fiber_is_cancelled()) {
===========================

>> +
>> +	trigger_clear(&trig);
>> +	if (fiber_is_cancelled()) {
>> +		diag_set(FiberIsCancelled);
>> +		return -1;
>> +	}
>> +	return 0;
>> +}
27.07.2021 02:43, Vladislav Shpilevoy wrote:
> Thanks for the fixes!
>
> On the latest version of the branch I see a lot of CI failures.
> Something got broken.
Yep, I'm trying to figure out what's wrong now. vinyl/errinj.test.lua
now hangs when trying to join a replica.
I've updated the rest of the patch according to your comments, but haven't
fixed the test hang yet.