See the commit messages for the details. I did some basic tests for backward
compatibility since I added a new field to the ballot, and it works fine.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5613-boot-scores
Issue: https://github.com/tarantool/tarantool/issues/5613

Vladislav Shpilevoy (6):
  replication: refactor replicaset_leader()
  replication: ballot.is_ro -> is_ro_cfg
  replication: ballot.is_loading -> is_ro
  replication: introduce ballot.is_booted
  replication: use 'score' to find a join-master
  replication: prefer to join from booted replicas

 .../gh-5613-bootstrap-prefer-booted.md        |  6 ++
 src/box/box.cc                                | 17 ++--
 src/box/iproto_constants.h                    |  5 +-
 src/box/replication.cc                        | 82 ++++++++-----------
 src/box/replication.h                         |  5 +-
 src/box/xrow.c                                | 30 ++++---
 src/box/xrow.h                                | 13 +--
 .../gh-5613-bootstrap-prefer-booted.result    | 70 ++++++++++++++++
 .../gh-5613-bootstrap-prefer-booted.test.lua  | 21 +++++
 test/replication/gh-5613-master.lua           | 11 +++
 test/replication/gh-5613-replica1.lua         | 13 +++
 test/replication/gh-5613-replica2.lua         | 11 +++
 test/replication/suite.cfg                    |  1 +
 13 files changed, 207 insertions(+), 78 deletions(-)
 create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
 create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result
 create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua
 create mode 100644 test/replication/gh-5613-master.lua
 create mode 100644 test/replication/gh-5613-replica1.lua
 create mode 100644 test/replication/gh-5613-replica2.lua

-- 
2.24.3 (Apple Git-128)
Firstly, rename it to replicaset_find_join_master(). Now, when there is Raft with a concept of an actual leader, the function name becomes confusing. Secondly, do not access ballot member in struct applier in such a long way - save the ballot pointer on the stack. This is going to become useful when in one of the next patches the ballot will be used more. Part of #5613 --- src/box/box.cc | 5 ++--- src/box/replication.cc | 12 +++++++----- src/box/replication.h | 5 +++-- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 6dc991dc8..0e615e944 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1369,7 +1369,7 @@ box_set_replication_anon(void) * Wait until the master has registered this * instance. */ - struct replica *master = replicaset_leader(); + struct replica *master = replicaset_find_join_master(); if (master == NULL || master->applier == NULL || master->applier->state != APPLIER_CONNECTED) { tnt_raise(ClientError, ER_CANNOT_REGISTER); @@ -3114,8 +3114,7 @@ bootstrap(const struct tt_uuid *instance_uuid, */ box_sync_replication(true); - /* Use the first replica by URI as a bootstrap leader */ - struct replica *master = replicaset_leader(); + struct replica *master = replicaset_find_join_master(); assert(master == NULL || master->applier != NULL); if (master != NULL && !tt_uuid_is_equal(&master->uuid, &INSTANCE_UUID)) { diff --git a/src/box/replication.cc b/src/box/replication.cc index aefb812b3..a1c6e3c7c 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -968,6 +968,7 @@ replicaset_round(bool skip_ro) struct applier *applier = replica->applier; if (applier == NULL) continue; + const struct ballot *ballot = &applier->ballot; /** * While bootstrapping a new cluster, read-only * replicas shouldn't be considered as a leader. @@ -975,17 +976,18 @@ replicaset_round(bool skip_ro) * replicas since there is still a possibility * that all replicas exist in cluster table. 
*/ - if (skip_ro && applier->ballot.is_ro) + if (skip_ro && ballot->is_ro) continue; if (leader == NULL) { leader = replica; continue; } + const struct ballot *leader_ballot = &leader->applier->ballot; /* * Try to find a replica which has already left * orphan mode. */ - if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading) + if (ballot->is_loading && !leader_ballot->is_loading) continue; /* * Choose the replica with the most advanced @@ -993,8 +995,8 @@ replicaset_round(bool skip_ro) * with the same vclock, prefer the one with * the lowest uuid. */ - int cmp = vclock_compare_ignore0(&applier->ballot.vclock, - &leader->applier->ballot.vclock); + int cmp = vclock_compare_ignore0(&ballot->vclock, + &leader_ballot->vclock); if (cmp < 0) continue; if (cmp == 0 && tt_uuid_compare(&replica->uuid, @@ -1006,7 +1008,7 @@ replicaset_round(bool skip_ro) } struct replica * -replicaset_leader(void) +replicaset_find_join_master(void) { bool skip_ro = true; /** diff --git a/src/box/replication.h b/src/box/replication.h index 2ad1cbf66..5cc380373 100644 --- a/src/box/replication.h +++ b/src/box/replication.h @@ -356,10 +356,11 @@ struct replica * replica_by_id(uint32_t replica_id); /** - * Return the replica set leader. + * Find a node in the replicaset on which the instance can try to register to + * join the replicaset. */ struct replica * -replicaset_leader(void); +replicaset_find_join_master(void); struct replica * replicaset_first(void); -- 2.24.3 (Apple Git-128)
Rename the member to show its actual meaning. It is not the real RO state of the instance. Only how it is configured. It can happen that the instance is read_only = false, but still is in RO state due to other reasons. The patch is done in scope of #5613 where the ballot is going to be extended and used a bit differently in the join-master search algorithm. Part of #5613 --- src/box/box.cc | 2 +- src/box/iproto_constants.h | 2 +- src/box/replication.cc | 2 +- src/box/xrow.c | 12 ++++++------ src/box/xrow.h | 4 ++-- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 0e615e944..d35a339ad 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2860,7 +2860,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) void box_process_vote(struct ballot *ballot) { - ballot->is_ro = cfg_geti("read_only") != 0; + ballot->is_ro_cfg = cfg_geti("read_only") != 0; ballot->is_anon = replication_anon; /* * is_ro is true on initial load and is set to box.cfg.read_only diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index 7362ddaf1..d4ee9e090 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -162,7 +162,7 @@ enum iproto_metadata_key { }; enum iproto_ballot_key { - IPROTO_BALLOT_IS_RO = 0x01, + IPROTO_BALLOT_IS_RO_CFG = 0x01, IPROTO_BALLOT_VCLOCK = 0x02, IPROTO_BALLOT_GC_VCLOCK = 0x03, IPROTO_BALLOT_IS_LOADING = 0x04, diff --git a/src/box/replication.cc b/src/box/replication.cc index a1c6e3c7c..ce2b74065 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -976,7 +976,7 @@ replicaset_round(bool skip_ro) * replicas since there is still a possibility * that all replicas exist in cluster table. 
*/ - if (skip_ro && ballot->is_ro) + if (skip_ro && ballot->is_ro_cfg) continue; if (leader == NULL) { leader = replica; diff --git a/src/box/xrow.c b/src/box/xrow.c index 2e364cea5..6e2a87f8a 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -450,7 +450,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, { size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) + mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + mp_sizeof_bool(ballot->is_anon) + @@ -470,8 +470,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, data = mp_encode_map(data, 1); data = mp_encode_uint(data, IPROTO_BALLOT); data = mp_encode_map(data, 5); - data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); - data = mp_encode_bool(data, ballot->is_ro); + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); + data = mp_encode_bool(data, ballot->is_ro_cfg); data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING); data = mp_encode_bool(data, ballot->is_loading); data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); @@ -1357,7 +1357,7 @@ xrow_encode_vote(struct xrow_header *row) int xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) { - ballot->is_ro = false; + ballot->is_ro_cfg = false; ballot->is_loading = false; ballot->is_anon = false; vclock_create(&ballot->vclock); @@ -1399,10 +1399,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) } uint32_t key = mp_decode_uint(&data); switch (key) { - case IPROTO_BALLOT_IS_RO: + case IPROTO_BALLOT_IS_RO_CFG: if (mp_typeof(*data) != MP_BOOL) goto err; - ballot->is_ro = mp_decode_bool(&data); + ballot->is_ro_cfg = mp_decode_bool(&data); break; case IPROTO_BALLOT_IS_LOADING: if (mp_typeof(*data) != MP_BOOL) diff --git a/src/box/xrow.h b/src/box/xrow.h index 
b3c664be2..241a7af8e 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -366,8 +366,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len, /** Reply to IPROTO_VOTE request. */ struct ballot { - /** Set if the instance is running in read-only mode. */ - bool is_ro; + /** Set if the instance is configured in read-only mode. */ + bool is_ro_cfg; /** * A flag whether the instance is anonymous, not having an * ID, and not going to request it. -- 2.24.3 (Apple Git-128)
Is_loading in the ballot used to mean the following: "the instance did not finish its box.cfg() or has read_only = true". Which is quite a strange property. For instance, it was 'true' even if the instance is not really loading anymore but has read_only = true. The patch renames it to 'is_ro' (which existed here before, but also with a wrong meaning). Its behaviour is slightly changed to report the RO state of the instance. Not its read_only. This way it incorporates all the possible RO conditions. Such as not finished bootstrap, having read_only = true, being a Raft follower, and so on. The patch is done in scope of #5613 where the ballot is going to be extended and used a bit differently in the join-master search algorithm. Part of #5613 --- src/box/box.cc | 9 +-------- src/box/iproto_constants.h | 2 +- src/box/replication.cc | 2 +- src/box/xrow.c | 12 ++++++------ src/box/xrow.h | 7 ++++--- 5 files changed, 13 insertions(+), 19 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index d35a339ad..d56b44d33 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2862,14 +2862,7 @@ box_process_vote(struct ballot *ballot) { ballot->is_ro_cfg = cfg_geti("read_only") != 0; ballot->is_anon = replication_anon; - /* - * is_ro is true on initial load and is set to box.cfg.read_only - * after box_cfg() returns, during dynamic box.cfg parameters setting. - * We would like to prefer already bootstrapped instances to the ones - * still bootstrapping and the ones still bootstrapping, but writeable - * to the ones that have box.cfg.read_only = true. 
- */ - ballot->is_loading = is_ro; + ballot->is_ro = is_ro_summary; vclock_copy(&ballot->vclock, &replicaset.vclock); vclock_copy(&ballot->gc_vclock, &gc.vclock); } diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index d4ee9e090..0f84843d0 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -165,7 +165,7 @@ enum iproto_ballot_key { IPROTO_BALLOT_IS_RO_CFG = 0x01, IPROTO_BALLOT_VCLOCK = 0x02, IPROTO_BALLOT_GC_VCLOCK = 0x03, - IPROTO_BALLOT_IS_LOADING = 0x04, + IPROTO_BALLOT_IS_RO = 0x04, IPROTO_BALLOT_IS_ANON = 0x05, }; diff --git a/src/box/replication.cc b/src/box/replication.cc index ce2b74065..990f6239c 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -987,7 +987,7 @@ replicaset_round(bool skip_ro) * Try to find a replica which has already left * orphan mode. */ - if (ballot->is_loading && !leader_ballot->is_loading) + if (ballot->is_ro && !leader_ballot->is_ro) continue; /* * Choose the replica with the most advanced diff --git a/src/box/xrow.c b/src/box/xrow.c index 6e2a87f8a..115a25473 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -451,7 +451,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) + + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + mp_sizeof_bool(ballot->is_anon) + mp_sizeof_uint(UINT32_MAX) + @@ -472,8 +472,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, data = mp_encode_map(data, 5); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); data = mp_encode_bool(data, ballot->is_ro_cfg); - data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING); - data = mp_encode_bool(data, ballot->is_loading); + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); + data = 
mp_encode_bool(data, ballot->is_ro); data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); data = mp_encode_bool(data, ballot->is_anon); data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK); @@ -1358,7 +1358,7 @@ int xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) { ballot->is_ro_cfg = false; - ballot->is_loading = false; + ballot->is_ro = false; ballot->is_anon = false; vclock_create(&ballot->vclock); @@ -1404,10 +1404,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) goto err; ballot->is_ro_cfg = mp_decode_bool(&data); break; - case IPROTO_BALLOT_IS_LOADING: + case IPROTO_BALLOT_IS_RO: if (mp_typeof(*data) != MP_BOOL) goto err; - ballot->is_loading = mp_decode_bool(&data); + ballot->is_ro = mp_decode_bool(&data); break; case IPROTO_BALLOT_IS_ANON: if (mp_typeof(*data) != MP_BOOL) diff --git a/src/box/xrow.h b/src/box/xrow.h index 241a7af8e..1d00b2e43 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -374,10 +374,11 @@ struct ballot { */ bool is_anon; /** - * Set if the instance hasn't finished bootstrap or recovery, or - * is syncing with other replicas in the replicaset. + * Set if the instance is not writable due to any reason. Could be + * config read_only=true; being orphan; being a Raft follower; not + * finished recovery/bootstrap; or anything else. */ - bool is_loading; + bool is_ro; /** Current instance vclock. */ struct vclock vclock; /** Oldest vclock available on the instance. */ -- 2.24.3 (Apple Git-128)
The new field reports whether the instance has finished its bootstrap/recovery, or IOW has finished box.cfg(). The new field will help in fixing #5613 so as not to try to join to a replicaset via non-bootstrapped instances if there are others. The problem is that otherwise, if all nodes are booted but are read-only, new instances bootstrap their own independent replicaset. It would be better to just fail and terminate the process than do such a bizarre action. Part of #5613 --- src/box/box.cc | 1 + src/box/iproto_constants.h | 1 + src/box/xrow.c | 14 ++++++++++++-- src/box/xrow.h | 2 ++ 4 files changed, 16 insertions(+), 2 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index d56b44d33..6fca337bc 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2863,6 +2863,7 @@ box_process_vote(struct ballot *ballot) ballot->is_ro_cfg = cfg_geti("read_only") != 0; ballot->is_anon = replication_anon; ballot->is_ro = is_ro_summary; + ballot->is_booted = is_box_configured; vclock_copy(&ballot->vclock, &replicaset.vclock); vclock_copy(&ballot->gc_vclock, &gc.vclock); } diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index 0f84843d0..137bee9da 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -167,6 +167,7 @@ enum iproto_ballot_key { IPROTO_BALLOT_GC_VCLOCK = 0x03, IPROTO_BALLOT_IS_RO = 0x04, IPROTO_BALLOT_IS_ANON = 0x05, + IPROTO_BALLOT_IS_BOOTED = 0x06, }; #define bit(c) (1ULL<<IPROTO_##c) diff --git a/src/box/xrow.c b/src/box/xrow.c index 115a25473..0d0548fef 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -449,11 +449,13 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, uint64_t sync, uint32_t schema_version) { size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(6) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + 
mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + mp_sizeof_bool(ballot->is_anon) + + mp_sizeof_uint(IPROTO_BALLOT_IS_BOOTED) + + mp_sizeof_bool(ballot->is_booted) + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock_ignore0(&ballot->vclock) + mp_sizeof_uint(UINT32_MAX) + @@ -469,13 +471,15 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, char *data = buf + IPROTO_HEADER_LEN; data = mp_encode_map(data, 1); data = mp_encode_uint(data, IPROTO_BALLOT); - data = mp_encode_map(data, 5); + data = mp_encode_map(data, 6); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); data = mp_encode_bool(data, ballot->is_ro_cfg); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); data = mp_encode_bool(data, ballot->is_ro); data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); data = mp_encode_bool(data, ballot->is_anon); + data = mp_encode_uint(data, IPROTO_BALLOT_IS_BOOTED); + data = mp_encode_bool(data, ballot->is_booted); data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK); data = mp_encode_vclock_ignore0(data, &ballot->vclock); data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK); @@ -1360,6 +1364,7 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) ballot->is_ro_cfg = false; ballot->is_ro = false; ballot->is_anon = false; + ballot->is_booted = true; vclock_create(&ballot->vclock); const char *start = NULL; @@ -1424,6 +1429,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) &ballot->gc_vclock) != 0) goto err; break; + case IPROTO_BALLOT_IS_BOOTED: + if (mp_typeof(*data) != MP_BOOL) + goto err; + ballot->is_booted = mp_decode_bool(&data); + break; default: mp_next(&data); } diff --git a/src/box/xrow.h b/src/box/xrow.h index 1d00b2e43..055fd31e1 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -379,6 +379,8 @@ struct ballot { * finished recovery/bootstrap; or anything else. */ bool is_ro; + /** Set if the instance has finished its bootstrap/recovery. */ + bool is_booted; /** Current instance vclock. 
*/ struct vclock vclock; /** Oldest vclock available on the instance. */ -- 2.24.3 (Apple Git-128)
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.

The original code was relatively fine as long as it had only one
parameter to vary - whether it should skip
`box.cfg{read_only = true}` nodes. Although it was clearly "on the
edge" of acceptable complexity due to a second, non-configurable
parameter - whether a replica is in read-only state regardless of
its config.

It is going to get more complicated when the algorithm takes into
account a third parameter - whether an instance is bootstrapped.
Then it should make decisions like "among the bootstrapped nodes
prefer instances not having read_only = true and not being in
read-only state". The easiest way to do so is to use
scores/weights incremented according to how many of these "good
points" the instance's parameters match.

Part of #5613
---
 src/box/replication.cc | 62 ++++++++++++++++--------------------------
 1 file changed, 23 insertions(+), 39 deletions(-)

diff --git a/src/box/replication.cc b/src/box/replication.cc index 990f6239c..d33e70f28 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -960,71 +960,55 @@ replicaset_next(struct replica *replica) * replicas, choose a read-only replica with biggest vclock * as a leader, in hope it will become read-write soon. */ -static struct replica * -replicaset_round(bool skip_ro) +struct replica * +replicaset_find_join_master(void) { struct replica *leader = NULL; + int leader_score = -1; replicaset_foreach(replica) { struct applier *applier = replica->applier; if (applier == NULL) continue; const struct ballot *ballot = &applier->ballot; - /** - * While bootstrapping a new cluster, read-only - * replicas shouldn't be considered as a leader. - * The only exception if there is no read-write - * replicas since there is still a possibility - * that all replicas exist in cluster table.
- */ - if (skip_ro && ballot->is_ro_cfg) - continue; - if (leader == NULL) { - leader = replica; - continue; - } - const struct ballot *leader_ballot = &leader->applier->ballot; + int score = 0; /* - * Try to find a replica which has already left - * orphan mode. + * Prefer instances not configured as read-only via box.cfg, and + * not being in read-only state due to any other reason. The + * config is stronger because if it is configured as read-only, + * it is in read-only state for sure, until the config is + * changed. */ - if (ballot->is_ro && !leader_ballot->is_ro) + if (!ballot->is_ro_cfg) + score += 5; + if (!ballot->is_ro) + score += 1; + if (leader_score < score) + goto elect; + if (score < leader_score) continue; + const struct ballot *leader_ballot; + leader_ballot = &leader->applier->ballot; /* * Choose the replica with the most advanced * vclock. If there are two or more replicas * with the same vclock, prefer the one with * the lowest uuid. */ - int cmp = vclock_compare_ignore0(&ballot->vclock, - &leader_ballot->vclock); + int cmp; + cmp = vclock_compare_ignore0(&ballot->vclock, + &leader_ballot->vclock); if (cmp < 0) continue; if (cmp == 0 && tt_uuid_compare(&replica->uuid, &leader->uuid) > 0) continue; + elect: leader = replica; + leader_score = score; } return leader; } -struct replica * -replicaset_find_join_master(void) -{ - bool skip_ro = true; - /** - * Two loops, first prefers read-write replicas among others. - * Second for backward compatibility, if there is no such - * replicas at all. - */ - struct replica *leader = replicaset_round(skip_ro); - if (leader == NULL) { - skip_ro = false; - leader = replicaset_round(skip_ro); - } - - return leader; -} - struct replica * replica_by_uuid(const struct tt_uuid *uuid) { -- 2.24.3 (Apple Git-128)
The algorithm of looking for an instance to join the replicaset
from didn't take into account that some of the instances might be
not bootstrapped but still perfectly available. As a result, an
absurd situation could happen - an instance could connect to a
cluster with only read-only instances, while having
box.cfg{read_only = false} itself. Then instead of failing or
waiting it just booted a brand new cluster. And after that the
node just started complaining that the others have a different
replicaset UUID.

The patch makes a new instance always prefer a bootstrapped
join-source to a non-bootstrapped one, including itself. In the
situation above the new instance now terminates with an error. In
the future it should hopefully start a retry loop instead.

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching their names. They are changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
  the value is the same. Only the name has changed, and it should
  change in the doc too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  not finished `box.cfg()` or has `read_only = true`". The name
  was wrong therefore, because even if the instance finished
  loading, the flag still was true for `read_only = true` nodes.
  Also such a value is not very suitable for any sane usage. The
  name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed the
  same, and the value now is "the instance is not writable". The
  reason for being not writable can be any: the node is an orphan;
  or it has `read_only = true`; or it is a Raft follower; or
  anything else.

And there is a new field. `IPROTO_BALLOT_IS_BOOTED 0x06` means the
instance has finished its bootstrap or recovery.
--- .../gh-5613-bootstrap-prefer-booted.md | 6 ++ src/box/replication.cc | 20 +++--- .../gh-5613-bootstrap-prefer-booted.result | 70 +++++++++++++++++++ .../gh-5613-bootstrap-prefer-booted.test.lua | 21 ++++++ test/replication/gh-5613-master.lua | 11 +++ test/replication/gh-5613-replica1.lua | 13 ++++ test/replication/gh-5613-replica2.lua | 11 +++ test/replication/suite.cfg | 1 + 8 files changed, 144 insertions(+), 9 deletions(-) create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua create mode 100644 test/replication/gh-5613-master.lua create mode 100644 test/replication/gh-5613-replica1.lua create mode 100644 test/replication/gh-5613-replica2.lua diff --git a/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md new file mode 100644 index 000000000..c022ee012 --- /dev/null +++ b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md @@ -0,0 +1,6 @@ +## bugfix/replication + +* Fixed an error when a replica, at attempt to join a cluster with exclusively + read-only replicas available, instead of failing or retrying just decided to + boot its own replicaset. Now it fails with an error about the other nodes + being read-only so they can't register it (gh-5613). diff --git a/src/box/replication.cc b/src/box/replication.cc index d33e70f28..52086c65e 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -951,15 +951,6 @@ replicaset_next(struct replica *replica) return replica_hash_next(&replicaset.hash, replica); } -/** - * Compare vclock, read only mode and orphan status - * of all connected replicas and elect a leader. - * Initiallly, skip read-only replicas, since they - * can not properly act as bootstrap masters (register - * new nodes in _cluster table). 
If there are no read-write - * replicas, choose a read-only replica with biggest vclock - * as a leader, in hope it will become read-write soon. - */ struct replica * replicaset_find_join_master(void) { @@ -972,12 +963,23 @@ replicaset_find_join_master(void) const struct ballot *ballot = &applier->ballot; int score = 0; /* + * First of all try to ignore non-booted instances. Including + * self if not booted yet. For self it is even dangerous as the + * instance might decide to boot its own cluster if, for + * example, the other nodes are available, but read-only. It + * would be a mistake. + * + * For a new cluster it is ok to use a non-booted instance as it + * means the algorithm tries to find an initial "boot-master". + * * Prefer instances not configured as read-only via box.cfg, and * not being in read-only state due to any other reason. The * config is stronger because if it is configured as read-only, * it is in read-only state for sure, until the config is * changed. */ + if (ballot->is_booted) + score += 10; if (!ballot->is_ro_cfg) score += 5; if (!ballot->is_ro) diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.result b/test/replication/gh-5613-bootstrap-prefer-booted.result new file mode 100644 index 000000000..e8e7fb792 --- /dev/null +++ b/test/replication/gh-5613-bootstrap-prefer-booted.result @@ -0,0 +1,70 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +test_run:cmd('create server master with script="replication/gh-5613-master.lua"') + | --- + | - true + | ... +test_run:cmd('start server master with wait=False') + | --- + | - true + | ... +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"') + | --- + | - true + | ... +test_run:cmd('start server replica1') + | --- + | - true + | ... +test_run:switch('master') + | --- + | - true + | ... +box.cfg{read_only = true} + | --- + | ... +test_run:switch('default') + | --- + | - true + | ... 
+ +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"') + | --- + | - true + | ... +-- It returns false, but it is expected. +test_run:cmd('start server replica2 with crash_expected=True') + | --- + | - false + | ... +opts = {filename = 'gh-5613-replica2.log'} + | --- + | ... +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil) + | --- + | - true + | ... + +test_run:cmd('delete server replica2') + | --- + | - true + | ... +test_run:cmd('stop server replica1') + | --- + | - true + | ... +test_run:cmd('delete server replica1') + | --- + | - true + | ... +test_run:cmd('stop server master') + | --- + | - true + | ... +test_run:cmd('delete server master') + | --- + | - true + | ... diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua new file mode 100644 index 000000000..d3c1c1189 --- /dev/null +++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua @@ -0,0 +1,21 @@ +test_run = require('test_run').new() + +test_run:cmd('create server master with script="replication/gh-5613-master.lua"') +test_run:cmd('start server master with wait=False') +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"') +test_run:cmd('start server replica1') +test_run:switch('master') +box.cfg{read_only = true} +test_run:switch('default') + +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"') +-- It returns false, but it is expected. 
+test_run:cmd('start server replica2 with crash_expected=True') +opts = {filename = 'gh-5613-replica2.log'} +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil) + +test_run:cmd('delete server replica2') +test_run:cmd('stop server replica1') +test_run:cmd('delete server replica1') +test_run:cmd('stop server master') +test_run:cmd('delete server master') diff --git a/test/replication/gh-5613-master.lua b/test/replication/gh-5613-master.lua new file mode 100644 index 000000000..408427315 --- /dev/null +++ b/test/replication/gh-5613-master.lua @@ -0,0 +1,11 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + listen = 'unix/:./gh-5613-master.sock', + replication = { + 'unix/:./gh-5613-master.sock', + 'unix/:./gh-5613-replica1.sock', + }, +}) +box.schema.user.grant('guest', 'super') diff --git a/test/replication/gh-5613-replica1.lua b/test/replication/gh-5613-replica1.lua new file mode 100644 index 000000000..d0d6e3372 --- /dev/null +++ b/test/replication/gh-5613-replica1.lua @@ -0,0 +1,13 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + listen = 'unix/:./gh-5613-replica1.sock', + replication = { + 'unix/:./gh-5613-master.sock', + 'unix/:./gh-5613-replica1.sock', + }, + -- Set to read_only initially so as the bootstrap-master would be + -- known in advance. 
+ read_only = true, +}) diff --git a/test/replication/gh-5613-replica2.lua b/test/replication/gh-5613-replica2.lua new file mode 100644 index 000000000..8cbd45b61 --- /dev/null +++ b/test/replication/gh-5613-replica2.lua @@ -0,0 +1,11 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + listen = 'unix/:./gh-5613-replica2.sock', + replication = { + 'unix/:./gh-5613-master.sock', + 'unix/:./gh-5613-replica1.sock', + 'unix/:./gh-5613-replica2.sock', + }, +}) diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 27eab20c2..f9d5ce1cc 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -44,6 +44,7 @@ "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {}, "gh-5536-wal-limit.test.lua": {}, "gh-5566-final-join-synchro.test.lua": {}, + "gh-5613-bootstrap-prefer-booted.test.lua": {}, "gh-6032-promote-wal-write.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "gh-6094-rs-uuid-mismatch.test.lua": {}, -- 2.24.3 (Apple Git-128)
There was a bug that a new replica, at join to a Raft cluster,
sometimes tried to register on a non-leader node which couldn't
write to _cluster, so the join failed with an ER_READONLY error.

Now, in the scope of #5613, the algorithm of join-master selection
is changed. A new node looks for writable members of the cluster
to use as a join-master. It will not choose a follower if there is
a leader.

Closes #6127
---
 .../unreleased/gh-6127-raft-join-new.md       |   4 +
 test/replication/gh-6127-master1.lua          |  15 +++
 test/replication/gh-6127-master2.lua          |  13 +++
 test/replication/gh-6127-raft-join-new.result | 105 ++++++++++++++++++
 .../gh-6127-raft-join-new.test.lua            |  41 +++++++
 test/replication/gh-6127-replica.lua          |   9 ++
 6 files changed, 187 insertions(+)
 create mode 100644 changelogs/unreleased/gh-6127-raft-join-new.md
 create mode 100644 test/replication/gh-6127-master1.lua
 create mode 100644 test/replication/gh-6127-master2.lua
 create mode 100644 test/replication/gh-6127-raft-join-new.result
 create mode 100644 test/replication/gh-6127-raft-join-new.test.lua
 create mode 100644 test/replication/gh-6127-replica.lua

diff --git a/changelogs/unreleased/gh-6127-raft-join-new.md b/changelogs/unreleased/gh-6127-raft-join-new.md
new file mode 100644
index 000000000..a2d898df0
--- /dev/null
+++ b/changelogs/unreleased/gh-6127-raft-join-new.md
@@ -0,0 +1,4 @@
+## bugfix/raft
+
+* Fixed an error when a new replica in a Raft cluster could try to join from a
+  follower instead of a leader and failed with an error `ER_READONLY` (gh-6127).
diff --git a/test/replication/gh-6127-master1.lua b/test/replication/gh-6127-master1.lua new file mode 100644 index 000000000..708574322 --- /dev/null +++ b/test/replication/gh-6127-master1.lua @@ -0,0 +1,15 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + listen = 'unix/:./master1.sock', + replication = { + 'unix/:./master1.sock', + 'unix/:./master2.sock' + }, + election_mode = 'candidate', + election_timeout = 0.1, + instance_uuid = '10f9828d-b5d5-46a9-b698-ddac7cce5e27', +}) +box.ctl.wait_rw() +box.schema.user.grant('guest', 'super') diff --git a/test/replication/gh-6127-master2.lua b/test/replication/gh-6127-master2.lua new file mode 100644 index 000000000..1851070c7 --- /dev/null +++ b/test/replication/gh-6127-master2.lua @@ -0,0 +1,13 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + listen = 'unix/:./master2.sock', + replication = { + 'unix/:./master1.sock', + 'unix/:./master2.sock' + }, + election_mode = 'voter', + election_timeout = 0.1, + instance_uuid = '20f9828d-b5d5-46a9-b698-ddac7cce5e27', +}) diff --git a/test/replication/gh-6127-raft-join-new.result b/test/replication/gh-6127-raft-join-new.result new file mode 100644 index 000000000..be6f8489b --- /dev/null +++ b/test/replication/gh-6127-raft-join-new.result @@ -0,0 +1,105 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... + +-- +-- gh-6127: the algorithm selecting a node from which to join to a replicaset +-- should take into account who is the leader (is writable and can write to +-- _cluster) and who is a follower/candidate. +-- +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"') + | --- + | - true + | ... +test_run:cmd('start server master1 with wait=False') + | --- + | - true + | ... +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"') + | --- + | - true + | ... 
+test_run:cmd('start server master2') + | --- + | - true + | ... + +test_run:switch('master1') + | --- + | - true + | ... +box.cfg{election_mode = 'voter'} + | --- + | ... +test_run:switch('master2') + | --- + | - true + | ... +-- Perform manual election because it is faster - the automatic one still tries +-- to wait for 'death timeout' first which is several seconds. +box.cfg{ \ + election_mode = 'manual', \ + election_timeout = 0.1, \ +} + | --- + | ... +box.ctl.promote() + | --- + | ... +box.ctl.wait_rw() + | --- + | ... +-- Make sure the other node received the promotion row. Vclocks now should be +-- equal so the new node would select only using read-only state and min UUID. +test_run:wait_lsn('master1', 'master2') + | --- + | ... + +-- Min UUID is master1, but it is not writable. Therefore must join from +-- master2. +test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"') + | --- + | - true + | ... +test_run:cmd('start server replica') + | --- + | - true + | ... +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.leader ~= 0) + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +test_run:cmd('stop server replica') + | --- + | - true + | ... +test_run:cmd('delete server replica') + | --- + | - true + | ... +test_run:cmd('stop server master2') + | --- + | - true + | ... +test_run:cmd('delete server master2') + | --- + | - true + | ... +test_run:cmd('stop server master1') + | --- + | - true + | ... +test_run:cmd('delete server master1') + | --- + | - true + | ... 
diff --git a/test/replication/gh-6127-raft-join-new.test.lua b/test/replication/gh-6127-raft-join-new.test.lua new file mode 100644 index 000000000..3e0e9f226 --- /dev/null +++ b/test/replication/gh-6127-raft-join-new.test.lua @@ -0,0 +1,41 @@ +test_run = require('test_run').new() + +-- +-- gh-6127: the algorithm selecting a node from which to join to a replicaset +-- should take into account who is the leader (is writable and can write to +-- _cluster) and who is a follower/candidate. +-- +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"') +test_run:cmd('start server master1 with wait=False') +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"') +test_run:cmd('start server master2') + +test_run:switch('master1') +box.cfg{election_mode = 'voter'} +test_run:switch('master2') +-- Perform manual election because it is faster - the automatic one still tries +-- to wait for 'death timeout' first which is several seconds. +box.cfg{ \ + election_mode = 'manual', \ + election_timeout = 0.1, \ +} +box.ctl.promote() +box.ctl.wait_rw() +-- Make sure the other node received the promotion row. Vclocks now should be +-- equal so the new node would select only using read-only state and min UUID. +test_run:wait_lsn('master1', 'master2') + +-- Min UUID is master1, but it is not writable. Therefore must join from +-- master2. 
+test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"') +test_run:cmd('start server replica') +test_run:switch('replica') +assert(box.info.leader ~= 0) + +test_run:switch('default') +test_run:cmd('stop server replica') +test_run:cmd('delete server replica') +test_run:cmd('stop server master2') +test_run:cmd('delete server master2') +test_run:cmd('stop server master1') +test_run:cmd('delete server master1') diff --git a/test/replication/gh-6127-replica.lua b/test/replication/gh-6127-replica.lua new file mode 100644 index 000000000..9f4c35ecd --- /dev/null +++ b/test/replication/gh-6127-replica.lua @@ -0,0 +1,9 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) +box.cfg({ + replication = { + 'unix/:./master1.sock', + 'unix/:./master2.sock' + }, +}) -- 2.24.3 (Apple Git-128)
I added the issue description to the test: diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua index d3c1c1189..4dfc39175 100644 --- a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua +++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua @@ -1,5 +1,13 @@ test_run = require('test_run').new() +-- +-- gh-5613: when a new replica is joined to a cluster, it must prefer +-- bootstrapped join sources over non-bootstrapped ones, including self. Even +-- if all the bootstrapped ones are read-only. Otherwise the node might have +-- decided to bootstrap a new cluster on its own and won't be able to join the +-- existing one forever. +--
I excluded the test from being run with both engines: ==================== diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index f9d5ce1cc..3be64e133 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -48,6 +48,7 @@ "gh-6032-promote-wal-write.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "gh-6094-rs-uuid-mismatch.test.lua": {}, + "gh-6127-raft-join-new.test.lua": {}, "*": { "memtx": {"engine": "memtx"}, "vinyl": {"engine": "vinyl"}
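For context, here is a rough model of how such a suite.cfg entry changes the run matrix. This is an assumption-based sketch of test-run's behavior with a hypothetical `configs_for()` helper, not the actual test-run code: a test listed explicitly with `{}` appears to get a single default run, while tests matched only by the `"*"` wildcard are parametrized per engine.

```python
# Hypothetical model of test-run's suite.cfg handling (assumption, not
# the real implementation): explicit entries override the "*" wildcard.
suite_cfg = {
    "gh-6127-raft-join-new.test.lua": {},
    "*": {"memtx": {"engine": "memtx"}, "vinyl": {"engine": "vinyl"}},
}

def configs_for(test):
    """Return the list of configuration names a test runs under."""
    cfg = suite_cfg.get(test, suite_cfg["*"])
    # An empty dict means "one run with the default config"; otherwise
    # each named entry becomes a separate run.
    return sorted(cfg) if cfg else ["default"]

# The explicitly listed test runs once; everything else runs per engine.
assert configs_for("gh-6127-raft-join-new.test.lua") == ["default"]
assert configs_for("gh-5536-wal-limit.test.lua") == ["memtx", "vinyl"]
```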
05.06.2021 02:37, Vladislav Shpilevoy wrote: > Firstly, rename it to replicaset_find_join_master(). Now, when > there is Raft with a concept of an actual leader, the function > name becomes confusing. > > Secondly, do not access ballot member in struct applier in such a > long way - save the ballot pointer on the stack. This is going to > become useful when in one of the next patches the ballot will be > used more. > > Part of #5613 Hi! Thanks for the patch! LGTM. > --- > src/box/box.cc | 5 ++--- > src/box/replication.cc | 12 +++++++----- > src/box/replication.h | 5 +++-- > 3 files changed, 12 insertions(+), 10 deletions(-) > > diff --git a/src/box/box.cc b/src/box/box.cc > index 6dc991dc8..0e615e944 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -1369,7 +1369,7 @@ box_set_replication_anon(void) > * Wait until the master has registered this > * instance. > */ > - struct replica *master = replicaset_leader(); > + struct replica *master = replicaset_find_join_master(); > if (master == NULL || master->applier == NULL || > master->applier->state != APPLIER_CONNECTED) { > tnt_raise(ClientError, ER_CANNOT_REGISTER); > @@ -3114,8 +3114,7 @@ bootstrap(const struct tt_uuid *instance_uuid, > */ > box_sync_replication(true); > > - /* Use the first replica by URI as a bootstrap leader */ > - struct replica *master = replicaset_leader(); > + struct replica *master = replicaset_find_join_master(); > assert(master == NULL || master->applier != NULL); > > if (master != NULL && !tt_uuid_is_equal(&master->uuid, &INSTANCE_UUID)) { > diff --git a/src/box/replication.cc b/src/box/replication.cc > index aefb812b3..a1c6e3c7c 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -968,6 +968,7 @@ replicaset_round(bool skip_ro) > struct applier *applier = replica->applier; > if (applier == NULL) > continue; > + const struct ballot *ballot = &applier->ballot; > /** > * While bootstrapping a new cluster, read-only > * replicas shouldn't be considered as a 
leader. > @@ -975,17 +976,18 @@ replicaset_round(bool skip_ro) > * replicas since there is still a possibility > * that all replicas exist in cluster table. > */ > - if (skip_ro && applier->ballot.is_ro) > + if (skip_ro && ballot->is_ro) > continue; > if (leader == NULL) { > leader = replica; > continue; > } > + const struct ballot *leader_ballot = &leader->applier->ballot; > /* > * Try to find a replica which has already left > * orphan mode. > */ > - if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading) > + if (ballot->is_loading && !leader_ballot->is_loading) > continue; > /* > * Choose the replica with the most advanced > @@ -993,8 +995,8 @@ replicaset_round(bool skip_ro) > * with the same vclock, prefer the one with > * the lowest uuid. > */ > - int cmp = vclock_compare_ignore0(&applier->ballot.vclock, > - &leader->applier->ballot.vclock); > + int cmp = vclock_compare_ignore0(&ballot->vclock, > + &leader_ballot->vclock); > if (cmp < 0) > continue; > if (cmp == 0 && tt_uuid_compare(&replica->uuid, > @@ -1006,7 +1008,7 @@ replicaset_round(bool skip_ro) > } > > struct replica * > -replicaset_leader(void) > +replicaset_find_join_master(void) > { > bool skip_ro = true; > /** > diff --git a/src/box/replication.h b/src/box/replication.h > index 2ad1cbf66..5cc380373 100644 > --- a/src/box/replication.h > +++ b/src/box/replication.h > @@ -356,10 +356,11 @@ struct replica * > replica_by_id(uint32_t replica_id); > > /** > - * Return the replica set leader. > + * Find a node in the replicaset on which the instance can try to register to > + * join the replicaset. > */ > struct replica * > -replicaset_leader(void); > +replicaset_find_join_master(void); > > struct replica * > replicaset_first(void); -- Serge Petrenko
05.06.2021 02:37, Vladislav Shpilevoy wrote: > Rename the member to show its actual meaning. It is not the > real RO state of the instance. Only how it is configured. > > It can happen that the instance is read_only = false, but still is > in RO state due to other reasons. > > The patch is done in scope of #5613 where the ballot is going to > be extended and used a bit differently in the join-master search > algorithm. > > Part of #5613 LGTM. > --- > src/box/box.cc | 2 +- > src/box/iproto_constants.h | 2 +- > src/box/replication.cc | 2 +- > src/box/xrow.c | 12 ++++++------ > src/box/xrow.h | 4 ++-- > 5 files changed, 11 insertions(+), 11 deletions(-) > > diff --git a/src/box/box.cc b/src/box/box.cc > index 0e615e944..d35a339ad 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -2860,7 +2860,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) > void > box_process_vote(struct ballot *ballot) > { > - ballot->is_ro = cfg_geti("read_only") != 0; > + ballot->is_ro_cfg = cfg_geti("read_only") != 0; > ballot->is_anon = replication_anon; > /* > * is_ro is true on initial load and is set to box.cfg.read_only > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h > index 7362ddaf1..d4ee9e090 100644 > --- a/src/box/iproto_constants.h > +++ b/src/box/iproto_constants.h > @@ -162,7 +162,7 @@ enum iproto_metadata_key { > }; > > enum iproto_ballot_key { > - IPROTO_BALLOT_IS_RO = 0x01, > + IPROTO_BALLOT_IS_RO_CFG = 0x01, > IPROTO_BALLOT_VCLOCK = 0x02, > IPROTO_BALLOT_GC_VCLOCK = 0x03, > IPROTO_BALLOT_IS_LOADING = 0x04, > diff --git a/src/box/replication.cc b/src/box/replication.cc > index a1c6e3c7c..ce2b74065 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -976,7 +976,7 @@ replicaset_round(bool skip_ro) > * replicas since there is still a possibility > * that all replicas exist in cluster table. 
> */ > - if (skip_ro && ballot->is_ro) > + if (skip_ro && ballot->is_ro_cfg) > continue; > if (leader == NULL) { > leader = replica; > diff --git a/src/box/xrow.c b/src/box/xrow.c > index 2e364cea5..6e2a87f8a 100644 > --- a/src/box/xrow.c > +++ b/src/box/xrow.c > @@ -450,7 +450,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > { > size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + > - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + > + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) + > mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + > mp_sizeof_bool(ballot->is_anon) + > @@ -470,8 +470,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > data = mp_encode_map(data, 1); > data = mp_encode_uint(data, IPROTO_BALLOT); > data = mp_encode_map(data, 5); > - data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); > - data = mp_encode_bool(data, ballot->is_ro); > + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); > + data = mp_encode_bool(data, ballot->is_ro_cfg); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING); > data = mp_encode_bool(data, ballot->is_loading); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); > @@ -1357,7 +1357,7 @@ xrow_encode_vote(struct xrow_header *row) > int > xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > { > - ballot->is_ro = false; > + ballot->is_ro_cfg = false; > ballot->is_loading = false; > ballot->is_anon = false; > vclock_create(&ballot->vclock); > @@ -1399,10 +1399,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > } > uint32_t key = mp_decode_uint(&data); > switch (key) { > - case IPROTO_BALLOT_IS_RO: > + case IPROTO_BALLOT_IS_RO_CFG: > if (mp_typeof(*data) != MP_BOOL) > goto err; > - ballot->is_ro = mp_decode_bool(&data); > + ballot->is_ro_cfg = mp_decode_bool(&data); > break; > case 
IPROTO_BALLOT_IS_LOADING: > if (mp_typeof(*data) != MP_BOOL) > diff --git a/src/box/xrow.h b/src/box/xrow.h > index b3c664be2..241a7af8e 100644 > --- a/src/box/xrow.h > +++ b/src/box/xrow.h > @@ -366,8 +366,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len, > > /** Reply to IPROTO_VOTE request. */ > struct ballot { > - /** Set if the instance is running in read-only mode. */ > - bool is_ro; > + /** Set if the instance is configured in read-only mode. */ > + bool is_ro_cfg; > /** > * A flag whether the instance is anonymous, not having an > * ID, and not going to request it. -- Serge Petrenko
05.06.2021 02:37, Vladislav Shpilevoy wrote: > Is_loading in the ballot used to mean the following: "the instance > did not finish its box.cfg() or has read_only = true". Which is > quite a strange property. > > For instance, it was 'true' even if the instance is not really > loading anymore but has read_only = true. > > The patch renames it to 'is_ro' (which existed here before, but > also with a wrong meaning). > > Its behaviour is slightly changed to report the RO state of the > instance. Not its read_only. This way it incorporates all the > possible RO conditions. Such as not finished bootstrap, having > read_only = true, being a Raft follower, and so on. > > The patch is done in scope of #5613 where the ballot is going to > be extended and used a bit differently in the join-master search > algorithm. > > Part of #5613 LGTM. > --- > src/box/box.cc | 9 +-------- > src/box/iproto_constants.h | 2 +- > src/box/replication.cc | 2 +- > src/box/xrow.c | 12 ++++++------ > src/box/xrow.h | 7 ++++--- > 5 files changed, 13 insertions(+), 19 deletions(-) > > diff --git a/src/box/box.cc b/src/box/box.cc > index d35a339ad..d56b44d33 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -2862,14 +2862,7 @@ box_process_vote(struct ballot *ballot) > { > ballot->is_ro_cfg = cfg_geti("read_only") != 0; > ballot->is_anon = replication_anon; > - /* > - * is_ro is true on initial load and is set to box.cfg.read_only > - * after box_cfg() returns, during dynamic box.cfg parameters setting. > - * We would like to prefer already bootstrapped instances to the ones > - * still bootstrapping and the ones still bootstrapping, but writeable > - * to the ones that have box.cfg.read_only = true. 
> - */ > - ballot->is_loading = is_ro; > + ballot->is_ro = is_ro_summary; > vclock_copy(&ballot->vclock, &replicaset.vclock); > vclock_copy(&ballot->gc_vclock, &gc.vclock); > } > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h > index d4ee9e090..0f84843d0 100644 > --- a/src/box/iproto_constants.h > +++ b/src/box/iproto_constants.h > @@ -165,7 +165,7 @@ enum iproto_ballot_key { > IPROTO_BALLOT_IS_RO_CFG = 0x01, > IPROTO_BALLOT_VCLOCK = 0x02, > IPROTO_BALLOT_GC_VCLOCK = 0x03, > - IPROTO_BALLOT_IS_LOADING = 0x04, > + IPROTO_BALLOT_IS_RO = 0x04, > IPROTO_BALLOT_IS_ANON = 0x05, > }; > > diff --git a/src/box/replication.cc b/src/box/replication.cc > index ce2b74065..990f6239c 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -987,7 +987,7 @@ replicaset_round(bool skip_ro) > * Try to find a replica which has already left > * orphan mode. > */ > - if (ballot->is_loading && !leader_ballot->is_loading) > + if (ballot->is_ro && !leader_ballot->is_ro) > continue; > /* > * Choose the replica with the most advanced > diff --git a/src/box/xrow.c b/src/box/xrow.c > index 6e2a87f8a..115a25473 100644 > --- a/src/box/xrow.c > +++ b/src/box/xrow.c > @@ -451,7 +451,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + > - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) + > + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + > mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + > mp_sizeof_bool(ballot->is_anon) + > mp_sizeof_uint(UINT32_MAX) + > @@ -472,8 +472,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > data = mp_encode_map(data, 5); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); > data = mp_encode_bool(data, ballot->is_ro_cfg); > - data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING); > - data = 
mp_encode_bool(data, ballot->is_loading); > + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); > + data = mp_encode_bool(data, ballot->is_ro); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); > data = mp_encode_bool(data, ballot->is_anon); > data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK); > @@ -1358,7 +1358,7 @@ int > xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > { > ballot->is_ro_cfg = false; > - ballot->is_loading = false; > + ballot->is_ro = false; > ballot->is_anon = false; > vclock_create(&ballot->vclock); > > @@ -1404,10 +1404,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > goto err; > ballot->is_ro_cfg = mp_decode_bool(&data); > break; > - case IPROTO_BALLOT_IS_LOADING: > + case IPROTO_BALLOT_IS_RO: > if (mp_typeof(*data) != MP_BOOL) > goto err; > - ballot->is_loading = mp_decode_bool(&data); > + ballot->is_ro = mp_decode_bool(&data); > break; > case IPROTO_BALLOT_IS_ANON: > if (mp_typeof(*data) != MP_BOOL) > diff --git a/src/box/xrow.h b/src/box/xrow.h > index 241a7af8e..1d00b2e43 100644 > --- a/src/box/xrow.h > +++ b/src/box/xrow.h > @@ -374,10 +374,11 @@ struct ballot { > */ > bool is_anon; > /** > - * Set if the instance hasn't finished bootstrap or recovery, or > - * is syncing with other replicas in the replicaset. > + * Set if the instance is not writable due to any reason. Could be > + * config read_only=true; being orphan; being a Raft follower; not > + * finished recovery/bootstrap; or anything else. > */ > - bool is_loading; > + bool is_ro; > /** Current instance vclock. */ > struct vclock vclock; > /** Oldest vclock available on the instance. */ -- Serge Petrenko
05.06.2021 02:37, Vladislav Shpilevoy wrote: > The new field reports whether the instance has finished its > bootstrap/recovery, or IOW has finished box.cfg(). > > The new field will help in fixing #5613 so as not to try to join > to a replicaset via non-bootstrapped instances if there are > others. > > The problem is that otherwise, if all nodes are booted but > are read-only, new instances bootstrap their own independent > replicaset. It would be better to just fail and terminate the > process than do such a bizarre action. > > Part of #5613 Thanks! LGTM. > --- > src/box/box.cc | 1 + > src/box/iproto_constants.h | 1 + > src/box/xrow.c | 14 ++++++++++++-- > src/box/xrow.h | 2 ++ > 4 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/src/box/box.cc b/src/box/box.cc > index d56b44d33..6fca337bc 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -2863,6 +2863,7 @@ box_process_vote(struct ballot *ballot) > ballot->is_ro_cfg = cfg_geti("read_only") != 0; > ballot->is_anon = replication_anon; > ballot->is_ro = is_ro_summary; > + ballot->is_booted = is_box_configured; > vclock_copy(&ballot->vclock, &replicaset.vclock); > vclock_copy(&ballot->gc_vclock, &gc.vclock); > } > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h > index 0f84843d0..137bee9da 100644 > --- a/src/box/iproto_constants.h > +++ b/src/box/iproto_constants.h > @@ -167,6 +167,7 @@ enum iproto_ballot_key { > IPROTO_BALLOT_GC_VCLOCK = 0x03, > IPROTO_BALLOT_IS_RO = 0x04, > IPROTO_BALLOT_IS_ANON = 0x05, > + IPROTO_BALLOT_IS_BOOTED = 0x06, > }; > > #define bit(c) (1ULL<<IPROTO_##c) > diff --git a/src/box/xrow.c b/src/box/xrow.c > index 115a25473..0d0548fef 100644 > --- a/src/box/xrow.c > +++ b/src/box/xrow.c > @@ -449,11 +449,13 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > uint64_t sync, uint32_t schema_version) > { > size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) + > - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) + > 
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(6) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) + > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) + > mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) + > mp_sizeof_bool(ballot->is_anon) + > + mp_sizeof_uint(IPROTO_BALLOT_IS_BOOTED) + > + mp_sizeof_bool(ballot->is_booted) + > mp_sizeof_uint(UINT32_MAX) + > mp_sizeof_vclock_ignore0(&ballot->vclock) + > mp_sizeof_uint(UINT32_MAX) + > @@ -469,13 +471,15 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > char *data = buf + IPROTO_HEADER_LEN; > data = mp_encode_map(data, 1); > data = mp_encode_uint(data, IPROTO_BALLOT); > - data = mp_encode_map(data, 5); > + data = mp_encode_map(data, 6); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); > data = mp_encode_bool(data, ballot->is_ro_cfg); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); > data = mp_encode_bool(data, ballot->is_ro); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON); > data = mp_encode_bool(data, ballot->is_anon); > + data = mp_encode_uint(data, IPROTO_BALLOT_IS_BOOTED); > + data = mp_encode_bool(data, ballot->is_booted); > data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK); > data = mp_encode_vclock_ignore0(data, &ballot->vclock); > data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK); > @@ -1360,6 +1364,7 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > ballot->is_ro_cfg = false; > ballot->is_ro = false; > ballot->is_anon = false; > + ballot->is_booted = true; > vclock_create(&ballot->vclock); > > const char *start = NULL; > @@ -1424,6 +1429,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > &ballot->gc_vclock) != 0) > goto err; > break; > + case IPROTO_BALLOT_IS_BOOTED: > + if (mp_typeof(*data) != MP_BOOL) > + goto err; > + ballot->is_booted = mp_decode_bool(&data); > + break; > default: > mp_next(&data); > } > diff --git a/src/box/xrow.h b/src/box/xrow.h > index 1d00b2e43..055fd31e1 100644 > --- a/src/box/xrow.h 
> +++ b/src/box/xrow.h > @@ -379,6 +379,8 @@ struct ballot { > * finished recovery/bootstrap; or anything else. > */ > bool is_ro; > + /** Set if the instance has finished its bootstrap/recovery. */ > + bool is_booted; > /** Current instance vclock. */ > struct vclock vclock; > /** Oldest vclock available on the instance. */ -- Serge Petrenko
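The backward compatibility mentioned in the cover letter hinges on the decoder's defaults: `xrow_decode_ballot()` pre-fills `is_booted = true` and overwrites it only when the key arrives, so an old peer that never sends `IPROTO_BALLOT_IS_BOOTED` keeps the pre-patch behavior. A minimal Python model of that decoding logic (an illustration of the idea, not the actual C code):

```python
# Ballot keys as in src/box/iproto_constants.h.
IPROTO_BALLOT_IS_RO_CFG = 0x01
IPROTO_BALLOT_IS_RO = 0x04
IPROTO_BALLOT_IS_BOOTED = 0x06

def decode_ballot(msg):
    """Decode a ballot map (modeled as a plain dict instead of MsgPack)."""
    # Defaults mirror xrow_decode_ballot(): a sender that predates the
    # is_booted field is assumed to be booted, preserving old behavior.
    ballot = {'is_ro_cfg': False, 'is_ro': False, 'is_booted': True}
    for key, value in msg.items():
        if key == IPROTO_BALLOT_IS_RO_CFG:
            ballot['is_ro_cfg'] = value
        elif key == IPROTO_BALLOT_IS_RO:
            ballot['is_ro'] = value
        elif key == IPROTO_BALLOT_IS_BOOTED:
            ballot['is_booted'] = value
        # Unknown keys are skipped, like the mp_next() default in xrow.c.
    return ballot

# Old instance: no 0x06 key -> treated as booted.
assert decode_ballot({IPROTO_BALLOT_IS_RO: True})['is_booted'] is True
# New instance still recovering reports the flag explicitly.
assert decode_ballot({IPROTO_BALLOT_IS_BOOTED: False})['is_booted'] is False
```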
05.06.2021 02:37, Vladislav Shpilevoy wrote: > The patch refactors the algorithm of finding a join-master (in > replicaset_find_join_master()) to use scores instead of multiple > iterations with different criteria. > > The original code was relatively fine as long as it had only > one parameter to change - whether should it skip > `box.cfg{read_only = true}` nodes. > > Although it was clear that it was "on the edge" of acceptable > complexity due to a second non-configurable parameter whether a > replica is in read-only state regardless of its config. > > It is going to get more complicated when the algorithm will take > into account the third parameter whether an instance is > bootstrapped. > > Then it should make decisions like "among bootstrapped nodes try > to prefer instances not having read_only=true, and not being in > read-only state". The easiest way to do so is to use > scores/weights incremented according to the instance's parameters > matching certain "good points". > > Part of #5613 LGTM. > --- > src/box/replication.cc | 62 ++++++++++++++++-------------------------- > 1 file changed, 23 insertions(+), 39 deletions(-) > > diff --git a/src/box/replication.cc b/src/box/replication.cc > index 990f6239c..d33e70f28 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -960,71 +960,55 @@ replicaset_next(struct replica *replica) > * replicas, choose a read-only replica with biggest vclock > * as a leader, in hope it will become read-write soon. > */ > -static struct replica * > -replicaset_round(bool skip_ro) > +struct replica * > +replicaset_find_join_master(void) > { > struct replica *leader = NULL; > + int leader_score = -1; > replicaset_foreach(replica) { > struct applier *applier = replica->applier; > if (applier == NULL) > continue; > const struct ballot *ballot = &applier->ballot; > - /** > - * While bootstrapping a new cluster, read-only > - * replicas shouldn't be considered as a leader. 
> - * The only exception if there is no read-write > - * replicas since there is still a possibility > - * that all replicas exist in cluster table. > - */ > - if (skip_ro && ballot->is_ro_cfg) > - continue; > - if (leader == NULL) { > - leader = replica; > - continue; > - } > - const struct ballot *leader_ballot = &leader->applier->ballot; > + int score = 0; > /* > - * Try to find a replica which has already left > - * orphan mode. > + * Prefer instances not configured as read-only via box.cfg, and > + * not being in read-only state due to any other reason. The > + * config is stronger because if it is configured as read-only, > + * it is in read-only state for sure, until the config is > + * changed. > */ > - if (ballot->is_ro && !leader_ballot->is_ro) > + if (!ballot->is_ro_cfg) > + score += 5; > + if (!ballot->is_ro) > + score += 1; > + if (leader_score < score) > + goto elect; > + if (score < leader_score) > continue; > + const struct ballot *leader_ballot; > + leader_ballot = &leader->applier->ballot; > /* > * Choose the replica with the most advanced > * vclock. If there are two or more replicas > * with the same vclock, prefer the one with > * the lowest uuid. > */ > - int cmp = vclock_compare_ignore0(&ballot->vclock, > - &leader_ballot->vclock); > + int cmp; > + cmp = vclock_compare_ignore0(&ballot->vclock, > + &leader_ballot->vclock); > if (cmp < 0) > continue; > if (cmp == 0 && tt_uuid_compare(&replica->uuid, > &leader->uuid) > 0) > continue; > + elect: > leader = replica; > + leader_score = score; > } > return leader; > } > > -struct replica * > -replicaset_find_join_master(void) > -{ > - bool skip_ro = true; > - /** > - * Two loops, first prefers read-write replicas among others. > - * Second for backward compatibility, if there is no such > - * replicas at all. 
> - */ > - struct replica *leader = replicaset_round(skip_ro); > - if (leader == NULL) { > - skip_ro = false; > - leader = replicaset_round(skip_ro); > - } > - > - return leader; > -} > - > struct replica * > replica_by_uuid(const struct tt_uuid *uuid) > { -- Serge Petrenko
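The weighting in the patch above (config read-only being a stronger signal than the momentary read-only state) can be modeled compactly. The following is a hypothetical standalone Python sketch of the selection loop in `replicaset_find_join_master()`, with vclocks simplified to plain integers (the real code uses `vclock_compare_ignore0()`); the next patch in the series adds `is_booted` on top with an even larger weight:

```python
def find_join_master(candidates):
    """Pick the best join source; each candidate is a ballot-like dict."""
    leader, leader_score = None, -1
    for c in candidates:
        score = 0
        if not c['is_ro_cfg']:   # not configured read-only: strongest hint
            score += 5
        if not c['is_ro']:       # actually writable right now
            score += 1
        if score > leader_score:
            leader, leader_score = c, score
            continue
        if score < leader_score:
            continue
        # Same score: prefer the most advanced vclock, then the lowest uuid.
        if c['vclock'] < leader['vclock']:
            continue
        if c['vclock'] == leader['vclock'] and c['uuid'] > leader['uuid']:
            continue
        leader = c
    return leader

nodes = [
    {'uuid': 'b', 'is_ro_cfg': True,  'is_ro': True,  'vclock': 100},
    {'uuid': 'a', 'is_ro_cfg': False, 'is_ro': False, 'vclock': 10},
]
# The writable node wins despite the smaller vclock: vclock and uuid
# only break ties between candidates with equal scores.
assert find_join_master(nodes)['uuid'] == 'a'
```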
05.06.2021 02:38, Vladislav Shpilevoy wrote: > The algorithm of looking for an instance to join the replicaset > from didn't take into account that some of the instances might be > not bootstrapped but still perfectly available. > > As a result, a ridiculous situation could happen - an instance > could connect to a cluster with just read-only instances, but it > could have itself with box.cfg{read_only = false}. Then instead of > failing or waiting it just booted a brand new cluster. And after > that the node just started complaining about the others having a > different replicaset UUID. > > The patch makes so a new instance always prefers a bootstrapped > join-source to a non-bootstrapped one, including self. In the > situation above the new instance now terminates with an error. > > In future hopefully it should start a retry-loop instead. > > Closes #5613 Thanks! LGTM. > > @TarantoolBot document > Title: IPROTO_BALLOT rework and a new field > > A couple of fields in `IPROTO_BALLOT 0x29` used to have values not > matching with their names. They are changed. > > * `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has > `box.cfg{read_only = true}`". It was renamed in the source code > to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and > the value is the same. Only the name has changed, and in the doc > should be too. > > * `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has > finished `box.cfg()` and it has `read_only = true`". The name > was wrong therefore, because even if the instance finished > loading, the flag still was false for `read_only = true` nodes. > Also such a value is not very suitable for any sane usage. > The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed > the same, and the value now is "the instance is not writable". > The reason for being not writable can be any: the node is an > orphan; or it has `read_only = true`; or it is a Raft follower; > or anything else. > > And there is a new field. 
> > `IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its > bootstrap or recovery. > --- > .../gh-5613-bootstrap-prefer-booted.md | 6 ++ > src/box/replication.cc | 20 +++--- > .../gh-5613-bootstrap-prefer-booted.result | 70 +++++++++++++++++++ > .../gh-5613-bootstrap-prefer-booted.test.lua | 21 ++++++ > test/replication/gh-5613-master.lua | 11 +++ > test/replication/gh-5613-replica1.lua | 13 ++++ > test/replication/gh-5613-replica2.lua | 11 +++ > test/replication/suite.cfg | 1 + > 8 files changed, 144 insertions(+), 9 deletions(-) > create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md > create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result > create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua > create mode 100644 test/replication/gh-5613-master.lua > create mode 100644 test/replication/gh-5613-replica1.lua > create mode 100644 test/replication/gh-5613-replica2.lua > > diff --git a/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md > new file mode 100644 > index 000000000..c022ee012 > --- /dev/null > +++ b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md > @@ -0,0 +1,6 @@ > +## bugfix/replication > + > +* Fixed an error when a replica, at attempt to join a cluster with exclusively > + read-only replicas available, instead of failing or retrying just decided to > + boot its own replicaset. Now it fails with an error about the other nodes > + being read-only so they can't register it (gh-5613). > diff --git a/src/box/replication.cc b/src/box/replication.cc > index d33e70f28..52086c65e 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -951,15 +951,6 @@ replicaset_next(struct replica *replica) > return replica_hash_next(&replicaset.hash, replica); > } > > -/** > - * Compare vclock, read only mode and orphan status > - * of all connected replicas and elect a leader. 
> - * Initiallly, skip read-only replicas, since they > - * can not properly act as bootstrap masters (register > - * new nodes in _cluster table). If there are no read-write > - * replicas, choose a read-only replica with biggest vclock > - * as a leader, in hope it will become read-write soon. > - */ > struct replica * > replicaset_find_join_master(void) > { > @@ -972,12 +963,23 @@ replicaset_find_join_master(void) > const struct ballot *ballot = &applier->ballot; > int score = 0; > /* > + * First of all try to ignore non-booted instances. Including > + * self if not booted yet. For self it is even dangerous as the > + * instance might decide to boot its own cluster if, for > + * example, the other nodes are available, but read-only. It > + * would be a mistake. > + * > + * For a new cluster it is ok to use a non-booted instance as it > + * means the algorithm tries to find an initial "boot-master". > + * > * Prefer instances not configured as read-only via box.cfg, and > * not being in read-only state due to any other reason. The > * config is stronger because if it is configured as read-only, > * it is in read-only state for sure, until the config is > * changed. > */ > + if (ballot->is_booted) > + score += 10; > if (!ballot->is_ro_cfg) > score += 5; > if (!ballot->is_ro) > diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.result b/test/replication/gh-5613-bootstrap-prefer-booted.result > new file mode 100644 > index 000000000..e8e7fb792 > --- /dev/null > +++ b/test/replication/gh-5613-bootstrap-prefer-booted.result > @@ -0,0 +1,70 @@ > +-- test-run result file version 2 > +test_run = require('test_run').new() > + | --- > + | ... > + > +test_run:cmd('create server master with script="replication/gh-5613-master.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server master with wait=False') > + | --- > + | - true > + | ... 
> +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica1') > + | --- > + | - true > + | ... > +test_run:switch('master') > + | --- > + | - true > + | ... > +box.cfg{read_only = true} > + | --- > + | ... > +test_run:switch('default') > + | --- > + | - true > + | ... > + > +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"') > + | --- > + | - true > + | ... > +-- It returns false, but it is expected. > +test_run:cmd('start server replica2 with crash_expected=True') > + | --- > + | - false > + | ... > +opts = {filename = 'gh-5613-replica2.log'} > + | --- > + | ... > +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil) > + | --- > + | - true > + | ... > + > +test_run:cmd('delete server replica2') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('stop server master') > + | --- > + | - true > + | ... > +test_run:cmd('delete server master') > + | --- > + | - true > + | ... 
> diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua > new file mode 100644 > index 000000000..d3c1c1189 > --- /dev/null > +++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua > @@ -0,0 +1,21 @@ > +test_run = require('test_run').new() > + > +test_run:cmd('create server master with script="replication/gh-5613-master.lua"') > +test_run:cmd('start server master with wait=False') > +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"') > +test_run:cmd('start server replica1') > +test_run:switch('master') > +box.cfg{read_only = true} > +test_run:switch('default') > + > +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"') > +-- It returns false, but it is expected. > +test_run:cmd('start server replica2 with crash_expected=True') > +opts = {filename = 'gh-5613-replica2.log'} > +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil) > + > +test_run:cmd('delete server replica2') > +test_run:cmd('stop server replica1') > +test_run:cmd('delete server replica1') > +test_run:cmd('stop server master') > +test_run:cmd('delete server master') > diff --git a/test/replication/gh-5613-master.lua b/test/replication/gh-5613-master.lua > new file mode 100644 > index 000000000..408427315 > --- /dev/null > +++ b/test/replication/gh-5613-master.lua > @@ -0,0 +1,11 @@ > +#!/usr/bin/env tarantool > + > +require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + listen = 'unix/:./gh-5613-master.sock', > + replication = { > + 'unix/:./gh-5613-master.sock', > + 'unix/:./gh-5613-replica1.sock', > + }, > +}) > +box.schema.user.grant('guest', 'super') > diff --git a/test/replication/gh-5613-replica1.lua b/test/replication/gh-5613-replica1.lua > new file mode 100644 > index 000000000..d0d6e3372 > --- /dev/null > +++ b/test/replication/gh-5613-replica1.lua > @@ -0,0 +1,13 @@ > +#!/usr/bin/env tarantool > + > 
+require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + listen = 'unix/:./gh-5613-replica1.sock', > + replication = { > + 'unix/:./gh-5613-master.sock', > + 'unix/:./gh-5613-replica1.sock', > + }, > + -- Set to read_only initially so as the bootstrap-master would be > + -- known in advance. > + read_only = true, > +}) > diff --git a/test/replication/gh-5613-replica2.lua b/test/replication/gh-5613-replica2.lua > new file mode 100644 > index 000000000..8cbd45b61 > --- /dev/null > +++ b/test/replication/gh-5613-replica2.lua > @@ -0,0 +1,11 @@ > +#!/usr/bin/env tarantool > + > +require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + listen = 'unix/:./gh-5613-replica2.sock', > + replication = { > + 'unix/:./gh-5613-master.sock', > + 'unix/:./gh-5613-replica1.sock', > + 'unix/:./gh-5613-replica2.sock', > + }, > +}) > diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg > index 27eab20c2..f9d5ce1cc 100644 > --- a/test/replication/suite.cfg > +++ b/test/replication/suite.cfg > @@ -44,6 +44,7 @@ > "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {}, > "gh-5536-wal-limit.test.lua": {}, > "gh-5566-final-join-synchro.test.lua": {}, > + "gh-5613-bootstrap-prefer-booted.test.lua": {}, > "gh-6032-promote-wal-write.test.lua": {}, > "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, > "gh-6094-rs-uuid-mismatch.test.lua": {}, -- Serge Petrenko
06.06.2021 20:03, Vladislav Shpilevoy writes:
> There was a bug that a new replica at join to a Raft cluster
> sometimes tried to register on a non-leader node which couldn't
> write to _cluster, so the join failed with ER_READONLY error.
>
> Now in scope of #5613 the algorithm of join-master selection is
> changed. A new node looks for writable members of the cluster to
> use as a join-master. It will not choose a follower if there is a
> leader.
>
> Closes #6127

Thanks for working on this! LGTM.

> ---
>  .../unreleased/gh-6127-raft-join-new.md       |   4 +
>  test/replication/gh-6127-master1.lua          |  15 +++
>  test/replication/gh-6127-master2.lua          |  13 +++
>  test/replication/gh-6127-raft-join-new.result | 105 ++++++++++++++++++
>  .../gh-6127-raft-join-new.test.lua            |  41 +++++++
>  test/replication/gh-6127-replica.lua          |   9 ++
>  6 files changed, 187 insertions(+)
>  create mode 100644 changelogs/unreleased/gh-6127-raft-join-new.md
>  create mode 100644 test/replication/gh-6127-master1.lua
>  create mode 100644 test/replication/gh-6127-master2.lua
>  create mode 100644 test/replication/gh-6127-raft-join-new.result
>  create mode 100644 test/replication/gh-6127-raft-join-new.test.lua
>  create mode 100644 test/replication/gh-6127-replica.lua
>
> diff --git a/changelogs/unreleased/gh-6127-raft-join-new.md b/changelogs/unreleased/gh-6127-raft-join-new.md
> new file mode 100644
> index 000000000..a2d898df0
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6127-raft-join-new.md
> @@ -0,0 +1,4 @@
> +## bugfix/raft
> +
> +* Fixed an error when a new replica in a Raft cluster could try to join from a
> +  follower instead of a leader and failed with an error `ER_READONLY` (gh-6127).
> diff --git a/test/replication/gh-6127-master1.lua b/test/replication/gh-6127-master1.lua > new file mode 100644 > index 000000000..708574322 > --- /dev/null > +++ b/test/replication/gh-6127-master1.lua > @@ -0,0 +1,15 @@ > +#!/usr/bin/env tarantool > + > +require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + listen = 'unix/:./master1.sock', > + replication = { > + 'unix/:./master1.sock', > + 'unix/:./master2.sock' > + }, > + election_mode = 'candidate', > + election_timeout = 0.1, > + instance_uuid = '10f9828d-b5d5-46a9-b698-ddac7cce5e27', > +}) > +box.ctl.wait_rw() > +box.schema.user.grant('guest', 'super') > diff --git a/test/replication/gh-6127-master2.lua b/test/replication/gh-6127-master2.lua > new file mode 100644 > index 000000000..1851070c7 > --- /dev/null > +++ b/test/replication/gh-6127-master2.lua > @@ -0,0 +1,13 @@ > +#!/usr/bin/env tarantool > + > +require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + listen = 'unix/:./master2.sock', > + replication = { > + 'unix/:./master1.sock', > + 'unix/:./master2.sock' > + }, > + election_mode = 'voter', > + election_timeout = 0.1, > + instance_uuid = '20f9828d-b5d5-46a9-b698-ddac7cce5e27', > +}) > diff --git a/test/replication/gh-6127-raft-join-new.result b/test/replication/gh-6127-raft-join-new.result > new file mode 100644 > index 000000000..be6f8489b > --- /dev/null > +++ b/test/replication/gh-6127-raft-join-new.result > @@ -0,0 +1,105 @@ > +-- test-run result file version 2 > +test_run = require('test_run').new() > + | --- > + | ... > + > +-- > +-- gh-6127: the algorithm selecting a node from which to join to a replicaset > +-- should take into account who is the leader (is writable and can write to > +-- _cluster) and who is a follower/candidate. > +-- > +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server master1 with wait=False') > + | --- > + | - true > + | ... 
> +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server master2') > + | --- > + | - true > + | ... > + > +test_run:switch('master1') > + | --- > + | - true > + | ... > +box.cfg{election_mode = 'voter'} > + | --- > + | ... > +test_run:switch('master2') > + | --- > + | - true > + | ... > +-- Perform manual election because it is faster - the automatic one still tries > +-- to wait for 'death timeout' first which is several seconds. > +box.cfg{ \ > + election_mode = 'manual', \ > + election_timeout = 0.1, \ > +} > + | --- > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.ctl.wait_rw() > + | --- > + | ... > +-- Make sure the other node received the promotion row. Vclocks now should be > +-- equal so the new node would select only using read-only state and min UUID. > +test_run:wait_lsn('master1', 'master2') > + | --- > + | ... > + > +-- Min UUID is master1, but it is not writable. Therefore must join from > +-- master2. > +test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica') > + | --- > + | - true > + | ... > +test_run:switch('replica') > + | --- > + | - true > + | ... > +assert(box.info.leader ~= 0) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica') > + | --- > + | - true > + | ... > +test_run:cmd('stop server master2') > + | --- > + | - true > + | ... > +test_run:cmd('delete server master2') > + | --- > + | - true > + | ... > +test_run:cmd('stop server master1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server master1') > + | --- > + | - true > + | ... 
> diff --git a/test/replication/gh-6127-raft-join-new.test.lua b/test/replication/gh-6127-raft-join-new.test.lua > new file mode 100644 > index 000000000..3e0e9f226 > --- /dev/null > +++ b/test/replication/gh-6127-raft-join-new.test.lua > @@ -0,0 +1,41 @@ > +test_run = require('test_run').new() > + > +-- > +-- gh-6127: the algorithm selecting a node from which to join to a replicaset > +-- should take into account who is the leader (is writable and can write to > +-- _cluster) and who is a follower/candidate. > +-- > +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"') > +test_run:cmd('start server master1 with wait=False') > +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"') > +test_run:cmd('start server master2') > + > +test_run:switch('master1') > +box.cfg{election_mode = 'voter'} > +test_run:switch('master2') > +-- Perform manual election because it is faster - the automatic one still tries > +-- to wait for 'death timeout' first which is several seconds. > +box.cfg{ \ > + election_mode = 'manual', \ > + election_timeout = 0.1, \ > +} > +box.ctl.promote() > +box.ctl.wait_rw() > +-- Make sure the other node received the promotion row. Vclocks now should be > +-- equal so the new node would select only using read-only state and min UUID. > +test_run:wait_lsn('master1', 'master2') > + > +-- Min UUID is master1, but it is not writable. Therefore must join from > +-- master2. 
> +test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"') > +test_run:cmd('start server replica') > +test_run:switch('replica') > +assert(box.info.leader ~= 0) > + > +test_run:switch('default') > +test_run:cmd('stop server replica') > +test_run:cmd('delete server replica') > +test_run:cmd('stop server master2') > +test_run:cmd('delete server master2') > +test_run:cmd('stop server master1') > +test_run:cmd('delete server master1') > diff --git a/test/replication/gh-6127-replica.lua b/test/replication/gh-6127-replica.lua > new file mode 100644 > index 000000000..9f4c35ecd > --- /dev/null > +++ b/test/replication/gh-6127-replica.lua > @@ -0,0 +1,9 @@ > +#!/usr/bin/env tarantool > + > +require('console').listen(os.getenv('ADMIN')) > +box.cfg({ > + replication = { > + 'unix/:./master1.sock', > + 'unix/:./master2.sock' > + }, > +}) -- Serge Petrenko
On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
> +		int score = 0;
> 		/*
> -		 * Try to find a replica which has already left
> -		 * orphan mode.
> +		 * Prefer instances not configured as read-only via box.cfg, and
> +		 * not being in read-only state due to any other reason. The
> +		 * config is stronger because if it is configured as read-only,
> +		 * it is in read-only state for sure, until the config is
> +		 * changed.
> 		 */
> -		if (ballot->is_ro && !leader_ballot->is_ro)
> +		if (!ballot->is_ro_cfg)
> +			score += 5;
> +		if (!ballot->is_ro)
> +			score += 1;
> +		if (leader_score < score)
> +			goto elect;
> +		if (score < leader_score)
> 			continue;

Vlad, if only I'm not missing something obvious, we can do it simpler
without branching at all, say

	score = (ballot->is_booted << 3) |
		(ballot->is_ro_cfg << 2) |
		(ballot->is_ro << 1);

up to you, just an idea.

...
> 		if (cmp == 0 && tt_uuid_compare(&replica->uuid,
> 						&leader->uuid) > 0)
> 			continue;
> +	elect:

seems extra tab here?

> 		leader = replica;
> +		leader_score = score;
> 	}
> 	return leader;
> }
On Sat, Jun 05, 2021 at 01:37:54AM +0200, Vladislav Shpilevoy wrote:
> See the commit messages for the details.
>
> I did some basic tests for backward compatibility since I added a new field to
> the ballot, and it works fine.
ack
Hi! Thanks for the review!

On 10.06.2021 17:02, Cyrill Gorcunov wrote:
> On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
>> +		int score = 0;
>> 		/*
>> -		 * Try to find a replica which has already left
>> -		 * orphan mode.
>> +		 * Prefer instances not configured as read-only via box.cfg, and
>> +		 * not being in read-only state due to any other reason. The
>> +		 * config is stronger because if it is configured as read-only,
>> +		 * it is in read-only state for sure, until the config is
>> +		 * changed.
>> 		 */
>> -		if (ballot->is_ro && !leader_ballot->is_ro)
>> +		if (!ballot->is_ro_cfg)
>> +			score += 5;
>> +		if (!ballot->is_ro)
>> +			score += 1;
>> +		if (leader_score < score)
>> +			goto elect;
>> +		if (score < leader_score)
>> 			continue;
>
> Vlad, if only I'm not missing something obvious, we can do it simpler
> without branching at all, say
>
> 	score = (ballot->is_booted << 3) |
> 		(ballot->is_ro_cfg << 2) |
> 		(ballot->is_ro << 1);
>
> up to you, just an idea.

This would work, technically. But TBH I don't see how it is simpler.

>> 		if (cmp == 0 && tt_uuid_compare(&replica->uuid,
>> 						&leader->uuid) > 0)
>> 			continue;
>> +	elect:
>
> seems extra tab here?

Yes, fixed on the branch.
On Thu, Jun 10, 2021 at 10:09:24PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 10.06.2021 17:02, Cyrill Gorcunov wrote:
> > On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
> >> + int score = 0;
> >> /*
> >> - * Try to find a replica which has already left
> >> - * orphan mode.
> >> + * Prefer instances not configured as read-only via box.cfg, and
> >> + * not being in read-only state due to any other reason. The
> >> + * config is stronger because if it is configured as read-only,
> >> + * it is in read-only state for sure, until the config is
> >> + * changed.
> >> */
> >> - if (ballot->is_ro && !leader_ballot->is_ro)
> >> + if (!ballot->is_ro_cfg)
> >> + score += 5;
> >> + if (!ballot->is_ro)
> >> + score += 1;
> >> + if (leader_score < score)
> >> + goto elect;
> >> + if (score < leader_score)
> >> continue;
> >
> > Vlad, if only I'm not missing something obvious, we can do it simpler
> > without branching at all, say
> >
> > score = (ballot->is_booted << 3) |
> > (ballot->is_ro_cfg << 2) |
> > (ballot->is_ro << 1);
> >
> > up to you, just an idea.
>
> This would work, technically. But TBH I don't see how it is simpler.
Due to the properties of 2^N polynomials: each member is used once
and we use additions only.
First, we drop the branching, which saves BPU cycles, but since
this is not a hot path it doesn't matter. What matters is the way
we choose the weight triplets. The score additions (in the form
they are chosen now) generate a strict order, such as 1 < 5 < 10,
so if someone accidentally changes 5 to 9 this will no longer work
properly, I think. And if someone has to add a fourth score, he has
to modify the numbers carefully so that the order remains.
In contrast, the 2^N polynomial already has all these properties:
each new power can't be equal to a sum of the previous powers, so
if a fourth score is needed we simply add (x << 4) to the formula
above.
Anyway, I'm fine with the current code, just wanted to explain why
I consider using powers here simpler and more flexible. Since we
have only 3 scores for now, plain natural triplets should not
be a problem.
Pushed to master, 2.8, 2.7. In the last commit I changed all 'raft' words to 'election', including the test file names, to conform with the existing terminology.