* [Tarantool-patches] [PATCH 0/2] Bootstrap voter @ 2021-07-15 23:49 Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-15 23:49 UTC (permalink / raw) To: tarantool-patches, gorcunov, sergepetrenko In the patch I don't like this start/stop_candidate() calls. They look bulky. I wanted to replace them with a function like raft_try_candidate() which would make the instance a candidate for one current term. Also I don't like box_raft_wait_leader_found() in bootstrap. Even though it should always succeed in a couple of WAL writes. The only problem with having a one-term try_candidate() is the split vote in box_promote(). There between start_candidate() and stop_candidate() several terms can pass if they end without a leader elected. I want to keep this working. But I couldn't find a simple way to make raft_try_candidate() work like that. Nor I could find a nice way to retry the candidate attempts in box_promote(). In the end, if box_raft_wait_leader_found() fails, you need to restore the candidate state. And this leads to having some kind of start/stop_candidate anyway. If you have ideas how to make it look better, I am looking forward to hearing it. Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6018-boot-voter Issue: https://github.com/tarantool/tarantool/issues/6018 Vladislav Shpilevoy (2): replication: introduce ballot.can_be_leader election: during bootstrap prefer candidates .../unreleased/gh-6018-election-boot-voter.md | 4 + src/box/box.cc | 28 ++++- src/box/iproto_constants.h | 1 + src/box/replication.cc | 11 +- src/box/xrow.c | 14 ++- src/box/xrow.h | 2 + .../gh-6018-election-boot-voter.result | 116 ++++++++++++++++++ .../gh-6018-election-boot-voter.test.lua | 59 +++++++++ test/replication/gh-6018-master.lua | 17 +++ test/replication/gh-6018-replica.lua | 15 +++ test/replication/suite.cfg | 1 + 11 files changed, 263 insertions(+), 5 deletions(-) create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md create mode 100644 test/replication/gh-6018-election-boot-voter.result create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua create mode 100644 test/replication/gh-6018-master.lua create mode 100644 test/replication/gh-6018-replica.lua -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 23+ messages in thread
* [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-15 23:49 [Tarantool-patches] [PATCH 0/2] Bootstrap voter Vladislav Shpilevoy via Tarantool-patches @ 2021-07-15 23:49 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-16 10:59 ` Serge Petrenko via Tarantool-patches 2021-07-16 14:29 ` Konstantin Osipov via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates Vladislav Shpilevoy via Tarantool-patches 2021-07-16 14:27 ` [Tarantool-patches] [PATCH 0/2] Bootstrap voter Konstantin Osipov via Tarantool-patches 2 siblings, 2 replies; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-15 23:49 UTC (permalink / raw) To: tarantool-patches, gorcunov, sergepetrenko The flag tells whether the sender has election mode 'candidate' or 'manual'. The new field during bootstrap will help to avoid selecting a 'voter' as a master. Voters can't write, they are unable to boot themselves nor register others. @TarantoolBot document Title: New field - IPROTO_BALLOT_CAN_BE_LEADER It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a boolean flag which is true if the sender has `election_mode` in its config `'manual'` or `'candidate'`. During bootstrap the nodes able to become a leader are preferred over the nodes configured as `'voter'`. The field key is `0x07`. --- src/box/box.cc | 3 +++ src/box/iproto_constants.h | 1 + src/box/xrow.c | 14 ++++++++++++-- src/box/xrow.h | 2 ++ 4 files changed, 18 insertions(+), 2 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index eeb57b04e..ef3efe3e0 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2881,6 +2881,9 @@ void box_process_vote(struct ballot *ballot) { ballot->is_ro_cfg = cfg_geti("read_only") != 0; + enum election_mode mode = box_check_election_mode(); + ballot->can_be_leader = mode == ELECTION_MODE_CANDIDATE || + mode == ELECTION_MODE_MANUAL; ballot->is_anon = replication_anon; ballot->is_ro = is_ro_summary; ballot->is_booted = is_box_configured; diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index 137bee9da..f0f967008 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -168,6 +168,7 @@ enum iproto_ballot_key { IPROTO_BALLOT_IS_RO = 0x04, IPROTO_BALLOT_IS_ANON = 0x05, IPROTO_BALLOT_IS_BOOTED = 0x06, + IPROTO_BALLOT_CAN_BE_LEADER = 0x07, }; #define bit(c) (1ULL<<IPROTO_##c) diff --git a/src/box/xrow.c b/src/box/xrow.c index 16cb2484c..a6b19ce89 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -459,7 +459,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock_ignore0(&ballot->vclock) + mp_sizeof_uint(UINT32_MAX) + - mp_sizeof_vclock_ignore0(&ballot->gc_vclock); + mp_sizeof_vclock_ignore0(&ballot->gc_vclock) + + mp_sizeof_uint(IPROTO_BALLOT_CAN_BE_LEADER) + + mp_sizeof_bool(ballot->can_be_leader); char *buf = obuf_reserve(out, max_size); if (buf == NULL) { @@ -471,7 +473,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, char *data = buf + IPROTO_HEADER_LEN; data = mp_encode_map(data, 1); data = mp_encode_uint(data, IPROTO_BALLOT); - data = mp_encode_map(data, 6); + data = mp_encode_map(data, 7); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); data = mp_encode_bool(data, ballot->is_ro_cfg); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); @@ -484,6 +486,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, data = mp_encode_vclock_ignore0(data, &ballot->vclock); data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK); data = mp_encode_vclock_ignore0(data, &ballot->gc_vclock); + data = mp_encode_uint(data, IPROTO_BALLOT_CAN_BE_LEADER); + data = mp_encode_bool(data, ballot->can_be_leader); size_t size = data - buf; assert(size <= max_size); @@ -1362,6 +1366,7 @@ int xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) { ballot->is_ro_cfg = false; + ballot->can_be_leader = false; ballot->is_ro = false; ballot->is_anon = false; ballot->is_booted = true; @@ -1434,6 +1439,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) goto err; ballot->is_booted = mp_decode_bool(&data); break; + case IPROTO_BALLOT_CAN_BE_LEADER: + if (mp_typeof(*data) != MP_BOOL) + goto err; + ballot->can_be_leader = mp_decode_bool(&data); + break; default: mp_next(&data); } diff --git a/src/box/xrow.h b/src/box/xrow.h index 9f61bfd47..f77fa5a9b 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -368,6 +368,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len, struct ballot { /** Set if the instance is configured in read-only mode. */ bool is_ro_cfg; + /** Set if the instance is an election candidate. */ + bool can_be_leader; /** * A flag whether the instance is anonymous, not having an * ID, and not going to request it. -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches @ 2021-07-16 10:59 ` Serge Petrenko via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-16 14:29 ` Konstantin Osipov via Tarantool-patches 1 sibling, 1 reply; 23+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2021-07-16 10:59 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches, gorcunov 16.07.2021 02:49, Vladislav Shpilevoy пишет: > The flag tells whether the sender has election mode 'candidate' or > 'manual'. > > The new field during bootstrap will help to avoid selecting a > 'voter' as a master. Voters can't write, they are unable to boot > themselves nor register others. > > @TarantoolBot document > Title: New field - IPROTO_BALLOT_CAN_BE_LEADER > It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a > boolean flag which is true if the sender has `election_mode` in > its config `'manual'` or `'candidate'`. > > During bootstrap the nodes able to become a leader are preferred > over the nodes configured as `'voter'`. > > The field key is `0x07`. > --- > src/box/box.cc | 3 +++ > src/box/iproto_constants.h | 1 + > src/box/xrow.c | 14 ++++++++++++-- > src/box/xrow.h | 2 ++ > 4 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/src/box/box.cc b/src/box/box.cc > index eeb57b04e..ef3efe3e0 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc Hi and thanks for the patch! It's all good except the field name. It's rather long IMO, and may be confused with bootsrap leader, which we try to find using ballots. Wouldn't "can be candidate", or "is_election_candidate" be better? I do not insist, though. > @@ -2881,6 +2881,9 @@ void > box_process_vote(struct ballot *ballot) > { > ballot->is_ro_cfg = cfg_geti("read_only") != 0; > + enum election_mode mode = box_check_election_mode(); > + ballot->can_be_leader = mode == ELECTION_MODE_CANDIDATE || > + mode == ELECTION_MODE_MANUAL; > ballot->is_anon = replication_anon; > ballot->is_ro = is_ro_summary; > ballot->is_booted = is_box_configured; > diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h > index 137bee9da..f0f967008 100644 > --- a/src/box/iproto_constants.h > +++ b/src/box/iproto_constants.h > @@ -168,6 +168,7 @@ enum iproto_ballot_key { > IPROTO_BALLOT_IS_RO = 0x04, > IPROTO_BALLOT_IS_ANON = 0x05, > IPROTO_BALLOT_IS_BOOTED = 0x06, > + IPROTO_BALLOT_CAN_BE_LEADER = 0x07, > }; > > #define bit(c) (1ULL<<IPROTO_##c) > diff --git a/src/box/xrow.c b/src/box/xrow.c > index 16cb2484c..a6b19ce89 100644 > --- a/src/box/xrow.c > +++ b/src/box/xrow.c > @@ -459,7 +459,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > mp_sizeof_uint(UINT32_MAX) + > mp_sizeof_vclock_ignore0(&ballot->vclock) + > mp_sizeof_uint(UINT32_MAX) + > - mp_sizeof_vclock_ignore0(&ballot->gc_vclock); > + mp_sizeof_vclock_ignore0(&ballot->gc_vclock) + > + mp_sizeof_uint(IPROTO_BALLOT_CAN_BE_LEADER) + > + mp_sizeof_bool(ballot->can_be_leader); > > char *buf = obuf_reserve(out, max_size); > if (buf == NULL) { > @@ -471,7 +473,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > char *data = buf + IPROTO_HEADER_LEN; > data = mp_encode_map(data, 1); > data = mp_encode_uint(data, IPROTO_BALLOT); > - data = mp_encode_map(data, 6); > + data = mp_encode_map(data, 7); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); > data = mp_encode_bool(data, ballot->is_ro_cfg); > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); > @@ -484,6 +486,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, > data = mp_encode_vclock_ignore0(data, &ballot->vclock); > data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK); > data = mp_encode_vclock_ignore0(data, &ballot->gc_vclock); > + data = mp_encode_uint(data, IPROTO_BALLOT_CAN_BE_LEADER); > + data = mp_encode_bool(data, ballot->can_be_leader); > size_t size = data - buf; > assert(size <= max_size); > > @@ -1362,6 +1366,7 @@ int > xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > { > ballot->is_ro_cfg = false; > + ballot->can_be_leader = false; > ballot->is_ro = false; > ballot->is_anon = false; > ballot->is_booted = true; > @@ -1434,6 +1439,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) > goto err; > ballot->is_booted = mp_decode_bool(&data); > break; > + case IPROTO_BALLOT_CAN_BE_LEADER: > + if (mp_typeof(*data) != MP_BOOL) > + goto err; > + ballot->can_be_leader = mp_decode_bool(&data); > + break; > default: > mp_next(&data); > } > diff --git a/src/box/xrow.h b/src/box/xrow.h > index 9f61bfd47..f77fa5a9b 100644 > --- a/src/box/xrow.h > +++ b/src/box/xrow.h > @@ -368,6 +368,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len, > struct ballot { > /** Set if the instance is configured in read-only mode. */ > bool is_ro_cfg; > + /** Set if the instance is an election candidate. */ > + bool can_be_leader; > /** > * A flag whether the instance is anonymous, not having an > * ID, and not going to request it. -- Serge Petrenko ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-16 10:59 ` Serge Petrenko via Tarantool-patches @ 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:11 ` Sergey Petrenko via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-18 17:00 UTC (permalink / raw) To: Serge Petrenko, tarantool-patches, gorcunov Hi! Thanks for the review! > Hi and thanks for the patch! > > It's all good except the field name. It's rather long IMO, and may be > confused with bootsrap leader, which we try to find using ballots. > > Wouldn't "can be candidate", or "is_election_candidate" be better? `is_election_candidate` is even longer. I thought about `can_be_candidate` firstly, but then I realized that it is the same as `can_be_leader`, because the purpose of a candidate is to become a leader anyway. And we are looking for a leader, not a candidate. The latter is just an intermediate state. I shortened the name to `can_lead`. Ok? As for inability to boot a cluster we might in the future add a flag `can_boot` if necessary. See new patch in v2. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-19 9:11 ` Sergey Petrenko via Tarantool-patches 0 siblings, 0 replies; 23+ messages in thread From: Sergey Petrenko via Tarantool-patches @ 2021-07-19 9:11 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches, gorcunov 18.07.2021 19:00, Vladislav Shpilevoy пишет: > Hi! Thanks for the review! > >> Hi and thanks for the patch! >> >> It's all good except the field name. It's rather long IMO, and may be >> confused with bootsrap leader, which we try to find using ballots. >> >> Wouldn't "can be candidate", or "is_election_candidate" be better? > `is_election_candidate` is even longer. I thought about `can_be_candidate` > firstly, but then I realized that it is the same as `can_be_leader`, because > the purpose of a candidate is to become a leader anyway. And we are looking > for a leader, not a candidate. The latter is just an intermediate state. > > I shortened the name to `can_lead`. Ok? Sounds good, thanks! > > As for inability to boot a cluster we might in the future add a flag `can_boot` > if necessary. See new patch in v2. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches 2021-07-16 10:59 ` Serge Petrenko via Tarantool-patches @ 2021-07-16 14:29 ` Konstantin Osipov via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 1 sibling, 1 reply; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-16 14:29 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/07/16 11:23]: > The flag tells whether the sender has election mode 'candidate' or > 'manual'. > > The new field during bootstrap will help to avoid selecting a > 'voter' as a master. Voters can't write, they are unable to boot > themselves nor register others. > > @TarantoolBot document > Title: New field - IPROTO_BALLOT_CAN_BE_LEADER > It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a > boolean flag which is true if the sender has `election_mode` in > its config `'manual'` or `'candidate'`. > > During bootstrap the nodes able to become a leader are preferred > over the nodes configured as `'voter'`. Curious why did you add this feature in the first place, I mean "eligibility"? Each voter has to be able to become a leader, otherwise raft liveness guarantees are violated. Raft has learners, but learners neither vote nor can become leaders. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-16 14:29 ` Konstantin Osipov via Tarantool-patches @ 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:12 ` Konstantin Osipov via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-18 17:00 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko On 16.07.2021 16:29, Konstantin Osipov via Tarantool-patches wrote: > * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/07/16 11:23]: >> The flag tells whether the sender has election mode 'candidate' or >> 'manual'. >> >> The new field during bootstrap will help to avoid selecting a >> 'voter' as a master. Voters can't write, they are unable to boot >> themselves nor register others. >> >> @TarantoolBot document >> Title: New field - IPROTO_BALLOT_CAN_BE_LEADER >> It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a >> boolean flag which is true if the sender has `election_mode` in >> its config `'manual'` or `'candidate'`. >> >> During bootstrap the nodes able to become a leader are preferred >> over the nodes configured as `'voter'`. > > Curious why did you add this feature in the first place, I mean > "eligibility"? Each voter has to be able to become a leader, > otherwise raft liveness guarantees are violated. Raft has > learners, but learners neither vote nor can become leaders. Voters are nodes which an admin does not want to be a leader. For instance, they are too far away physically. As voters, they might help to elect a leader, for example, if there are just 3 nodes one of which is a voter. Another application is when you specifically start 1 node as a voter and 2 candidates. The voter might skip all the replication data and work on a slow small machine. It can help to form a majority. We are planning to make this feature even easier to use by adding dataless nodes just for voting. As for Raft, it should not bring any problems. In Raft you can say that all nodes are candidates, but some of them are so slow, that they can never vote for themselves in time. Raft still works, and you essentially have 'voters'. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-19 9:12 ` Konstantin Osipov via Tarantool-patches 2021-07-19 22:06 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-20 21:16 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 2 replies; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-19 9:12 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/18 20:03]: > >> The new field during bootstrap will help to avoid selecting a > >> 'voter' as a master. Voters can't write, they are unable to boot > >> themselves nor register others. > >> > >> @TarantoolBot document > >> Title: New field - IPROTO_BALLOT_CAN_BE_LEADER > >> It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a > >> boolean flag which is true if the sender has `election_mode` in > >> its config `'manual'` or `'candidate'`. > >> > >> During bootstrap the nodes able to become a leader are preferred > >> over the nodes configured as `'voter'`. > > > > Curious why did you add this feature in the first place, I mean > > "eligibility"? Each voter has to be able to become a leader, > > otherwise raft liveness guarantees are violated. Raft has > > learners, but learners neither vote nor can become leaders. > > Voters are nodes which an admin does not want to be a leader. For > instance, they are too far away physically. As voters, they might > help to elect a leader, for example, if there are just 3 nodes one > of which is a voter. > > Another application is when you specifically start 1 node as a > voter and 2 candidates. The voter might skip all the replication > data and work on a slow small machine. > > It can help to form a majority. We are planning to make this > feature even easier to use by adding dataless nodes just for > voting. > > As for Raft, it should not bring any problems. In Raft you can > say that all nodes are candidates, but some of them are so slow, > that they can never vote for themselves in time. Raft still works, > and you essentially have 'voters'. Imagine there are nodes A, B, C, D, E. A is a leader, E is a voter which can not become a leader. Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. The majority's log index is 4, so entry 4 is committed. A dies, B is partitioned away. The cluster is stuck, because neither C nor B can get a quorum (3 votes). Worse yet, if E's (voter) commit index is low, not high, it can vote for a node which doesn't have a committed entry. In that case you can lose a committed entry. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-19 9:12 ` Konstantin Osipov via Tarantool-patches @ 2021-07-19 22:06 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-20 8:49 ` Konstantin Osipov via Tarantool-patches 2021-07-20 21:16 ` Cyrill Gorcunov via Tarantool-patches 1 sibling, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-19 22:06 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko On 19.07.2021 11:12, Konstantin Osipov wrote: > * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/18 20:03]: >>>> The new field during bootstrap will help to avoid selecting a >>>> 'voter' as a master. Voters can't write, they are unable to boot >>>> themselves nor register others. >>>> >>>> @TarantoolBot document >>>> Title: New field - IPROTO_BALLOT_CAN_BE_LEADER >>>> It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a >>>> boolean flag which is true if the sender has `election_mode` in >>>> its config `'manual'` or `'candidate'`. >>>> >>>> During bootstrap the nodes able to become a leader are preferred >>>> over the nodes configured as `'voter'`. >>> >>> Curious why did you add this feature in the first place, I mean >>> "eligibility"? Each voter has to be able to become a leader, >>> otherwise raft liveness guarantees are violated. Raft has >>> learners, but learners neither vote nor can become leaders. >> >> Voters are nodes which an admin does not want to be a leader. For >> instance, they are too far away physically. As voters, they might >> help to elect a leader, for example, if there are just 3 nodes one >> of which is a voter. >> >> Another application is when you specifically start 1 node as a >> voter and 2 candidates. The voter might skip all the replication >> data and work on a slow small machine. >> >> It can help to form a majority. We are planning to make this >> feature even easier to use by adding dataless nodes just for >> voting. >> >> As for Raft, it should not bring any problems. In Raft you can >> say that all nodes are candidates, but some of them are so slow, >> that they can never vote for themselves in time. Raft still works, >> and you essentially have 'voters'. > > Imagine there are nodes A, B, C, D, E. > A is a leader, E is a voter which can not become a leader. > > Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. > > The majority's log index is 4, so entry 4 is committed. A dies, B > is partitioned away. The cluster is stuck, because neither C nor B > can get a quorum (3 votes). But how is it different from the real Raft? In normal Raft I can say E simply is too slow to make any actions. It is just stuck or died. The cluster will be stuck then, yes. Not much you can do here. You can think of a voter as of almost a permanently broken node which sometimes manages to vote but never manages to become a candidate in time. I suppose Raft can withstand that behaviour. > Worse yet, if E's (voter) commit index is low, not high, it can vote for a > node which doesn't have a committed entry. In that case you can > lose a committed entry. Could you provide an example? Because I still do not see how is it different from the classic Raft in which one node either is always too late to become a candidate or is turned off when there are no better candidates. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-19 22:06 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-20 8:49 ` Konstantin Osipov via Tarantool-patches 2021-07-20 20:02 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-20 8:49 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/20 01:09]: > >>> Curious why did you add this feature in the first place, I mean > >>> "eligibility"? Each voter has to be able to become a leader, > >>> otherwise raft liveness guarantees are violated. Raft has > >>> learners, but learners neither vote nor can become leaders. > >> > >> Voters are nodes which an admin does not want to be a leader. For > >> instance, they are too far away physically. As voters, they might > >> help to elect a leader, for example, if there are just 3 nodes one > >> of which is a voter. > >> > >> Another application is when you specifically start 1 node as a > >> voter and 2 candidates. The voter might skip all the replication > >> data and work on a slow small machine. > >> > >> It can help to form a majority. We are planning to make this > >> feature even easier to use by adding dataless nodes just for > >> voting. > >> > >> As for Raft, it should not bring any problems. In Raft you can > >> say that all nodes are candidates, but some of them are so slow, > >> that they can never vote for themselves in time. Raft still works, > >> and you essentially have 'voters'. > > > > Imagine there are nodes A, B, C, D, E. > > A is a leader, E is a voter which can not become a leader. > > > > Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. > > > > The majority's log index is 4, so entry 4 is committed. A dies, B > > is partitioned away. The cluster is stuck, because neither C nor B > > can get a quorum (3 votes). > > But how is it different from the real Raft? In normal Raft I can say > E simply is too slow to make any actions. It is just stuck or died. > The cluster will be stuck then, yes. Not much you can do here. In a real raft: - liveness is guaranteed if quorum is present; this guarantee here is not held - you never sacrifice safety for liveness; you never lose committed entries if quorum is present; and you never lose it unnoticed! here you can lose a committed entry and not notice it. > > You can think of a voter as of almost a permanently broken node which > sometimes manages to vote but never manages to become a candidate in > time. I suppose Raft can withstand that behaviour. > > > Worse yet, if E's (voter) commit index is low, not high, it can vote for a > > node which doesn't have a committed entry. In that case you can > > lose a committed entry. > > Could you provide an example? Because I still do not see how is it > different from the classic Raft in which one node either is always too > late to become a candidate or is turned off when there are no better > candidates. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-20 8:49 ` Konstantin Osipov via Tarantool-patches @ 2021-07-20 20:02 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-20 20:18 ` Konstantin Osipov via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-20 20:02 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko On 20.07.2021 10:49, Konstantin Osipov wrote: > * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/20 01:09]: >>>>> Curious why did you add this feature in the first place, I mean >>>>> "eligibility"? Each voter has to be able to become a leader, >>>>> otherwise raft liveness guarantees are violated. Raft has >>>>> learners, but learners neither vote nor can become leaders. >>>> >>>> Voters are nodes which an admin does not want to be a leader. For >>>> instance, they are too far away physically. As voters, they might >>>> help to elect a leader, for example, if there are just 3 nodes one >>>> of which is a voter. >>>> >>>> Another application is when you specifically start 1 node as a >>>> voter and 2 candidates. The voter might skip all the replication >>>> data and work on a slow small machine. >>>> >>>> It can help to form a majority. We are planning to make this >>>> feature even easier to use by adding dataless nodes just for >>>> voting. >>>> >>>> As for Raft, it should not bring any problems. In Raft you can >>>> say that all nodes are candidates, but some of them are so slow, >>>> that they can never vote for themselves in time. Raft still works, >>>> and you essentially have 'voters'. >>> >>> Imagine there are nodes A, B, C, D, E. >>> A is a leader, E is a voter which can not become a leader. >>> >>> Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. >>> >>> The majority's log index is 4, so entry 4 is committed. A dies, B >>> is partitioned away. The cluster is stuck, because neither C nor B >>> can get a quorum (3 votes). >> >> But how is it different from the real Raft? In normal Raft I can say >> E simply is too slow to make any actions. It is just stuck or died. >> The cluster will be stuck then, yes. Not much you can do here. > > In a real raft: > - liveness is guaranteed if quorum is present; this guarantee here > is not held > - you never sacrifice safety for liveness; you never lose > committed entries if quorum is present; and you never lose it > unnoticed! here you can lose a committed entry and not notice > it. Please, show me an example. The example above only shows that the election might stop if there are issues with the quorum. And this will happen regardless of whether I have voter role or not. In normal Raft you can kill 3/5 nodes and nothing will work too. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-20 20:02 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-20 20:18 ` Konstantin Osipov via Tarantool-patches 0 siblings, 0 replies; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-20 20:18 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/20 23:06]: > >>> > >>> Imagine there are nodes A, B, C, D, E. ^^^ > >>> A is a leader, E is a voter which can not become a leader. > >>> > >>> Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. > >>> > >>> The majority's log index is 4, so entry 4 is committed. A dies, B > >>> is partitioned away. The cluster is stuck, because neither C nor B > >>> can get a quorum (3 votes). > >> > >> But how is it different from the real Raft? In normal Raft I can say > >> E simply is too slow to make any actions. It is just stuck or died. > >> The cluster will be stuck then, yes. Not much you can do here. > > > > In a real raft: > > - liveness is guaranteed if quorum is present; this guarantee here > > is not held > > - you never sacrifice safety for liveness; you never lose > > committed entries if quorum is present; and you never lose it > > unnoticed! here you can lose a committed entry and not notice > > it. > > Please, show me an example. The example above only shows that the > election might stop if there are issues with the quorum. And this > will happen regardless of whether I have voter role or not. In normal > Raft you can kill 3/5 nodes and nothing will work too. What is not clear with the example above? In it, I kill 2 out of 5, not 3 out of 5. And no, in Raft "nothing will work" is not correct. The system will not be live, but will be safe. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-19 9:12 ` Konstantin Osipov via Tarantool-patches 2021-07-19 22:06 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-20 21:16 ` Cyrill Gorcunov via Tarantool-patches 2021-07-20 23:20 ` Konstantin Osipov via Tarantool-patches 1 sibling, 1 reply; 23+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-20 21:16 UTC (permalink / raw) To: Konstantin Osipov; +Cc: tarantool-patches, Vladislav Shpilevoy On Mon, Jul 19, 2021 at 12:12:48PM +0300, Konstantin Osipov wrote: > > Imagine there are nodes A, B, C, D, E. > A is a leader, E is a voter which can not become a leader. > > Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. > > The majority's log index is 4, so entry 4 is committed. A dies, B > is partitioned away. The cluster is stuck, because neither C nor B > can get a quorum (3 votes). > > Worse yet, if E's (voter) commit index is low, not high, it can vote for a > node which doesn't have a committed entry. In that case you can > lose a committed entry. Wait, Kostya, here is a set A B C D E {5, 4, 3, 2, 5} * * * L F F F V where L - leader, F - follower, V - voter, LCI is 4 (least common index), Q(uorum) = 3, then A B C D E {-, -, 3, 2, 5} F F V The Voter E won't be able to choose C or D because its log is bigger and the cluster get stuck (this is guaranteed by vclock comparision). Lets assume the E's index is low, say 3 A B C D E {5, 4, 4, 3, 3} * * L F F F V in this config the leader won't commit record 5 until one of {C,D,E} write the new record(s) since otherwise the quorum won't be reached. Assume A and B get out of the set without record 4 written to C A B C D E {-, -, 4, 3, 3} F F V Now the node E can vote for C and D because its index is LE. And since C's index is bigger than others it will be elected next as far as I understand, no? The E won't be leader but will help C to gather the majority. So the cluster should be safe I only I'm not missing something obvious. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-20 21:16 ` Cyrill Gorcunov via Tarantool-patches @ 2021-07-20 23:20 ` Konstantin Osipov via Tarantool-patches 2021-07-21 18:51 ` Cyrill Gorcunov via Tarantool-patches 2021-07-21 21:43 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 2 replies; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-20 23:20 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tarantool-patches, Vladislav Shpilevoy * Cyrill Gorcunov <gorcunov@gmail.com> [21/07/21 00:17]: > > Imagine there are nodes A, B, C, D, E. > > A is a leader, E is a voter which can not become a leader. > > > > Imagine A's log index is 5, B = 4, C = 3, D = 2, E = 5. > > > > The majority's log index is 4, so entry 4 is committed. A dies, B > > is partitioned away. The cluster is stuck, because neither C nor B > > can get a quorum (3 votes). > > > > Worse yet, if E's (voter) commit index is low, not high, it can vote for a > > node which doesn't have a committed entry. In that case you can > > lose a committed entry. > > Wait, Kostya, here is a set > > A B C D E > {5, 4, 3, 2, 5} > * * * > L F F F V > > where L - leader, F - follower, V - voter, LCI is 4 (least common index), > Q(uorum) = 3, then > > A B C D E > {-, -, 3, 2, 5} > F F V > > The Voter E won't be able to choose C or D because its log > is bigger and the cluster get stuck (this is guaranteed by > vclock comparision). Right. That's what I am saying - the cluster is stuck even though the quorum (3 nodes) is present. And this is not something consistent, such clusters will get stuck simply based on the state of the voter - sometimes they will, sometimes they won't. > Lets assume the E's index is low, say 3 > > A B C D E > {5, 4, 4, 3, 3} > * * > L F F F V > > in this config the leader won't commit record 5 until one > of {C,D,E} write the new record(s) since otherwise the quorum > won't be reached. Assume A and B get out of the set without > record 4 written to C > > A B C D E > {-, -, 4, 3, 3} > F F V > > Now the node E can vote for C and D because its index is LE. > And since C's index is bigger than others it will be elected > next as far as I understand, no? You're right, assuming the voter never casts a vote for a candidate with a shorter log the safety is not violated. I wasn't sure it's the case, and presumed that the voter may have no log of its own. But still there are issues with liveness. Raft PHD has learners, so why not use them instead. > The E won't be leader but will > help C to gather the majority. So the cluster should be safe > I only I'm not missing something obvious. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-20 23:20 ` Konstantin Osipov via Tarantool-patches @ 2021-07-21 18:51 ` Cyrill Gorcunov via Tarantool-patches 2021-07-21 21:43 ` Vladislav Shpilevoy via Tarantool-patches 1 sibling, 0 replies; 23+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-21 18:51 UTC (permalink / raw) To: Konstantin Osipov; +Cc: tarantool-patches, Vladislav Shpilevoy On Wed, Jul 21, 2021 at 02:20:57AM +0300, Konstantin Osipov wrote: > > Right. That's what I am saying - the cluster is stuck even though > the quorum (3 nodes) is present. And this is not something > consistent, such clusters will get stuck simply based on the state > of the voter - sometimes they will, sometimes they won't. Which is exactly the same as it would be if node E is a follower but at the moment when it gonna be selected as a leader the network environment get broken (or enter a flip/flop cycle). In such case even traditional Raft nodes (ie those which doen't have "voters" at all) will behave the same way -- sometimes quorum will be achieved and sometimes not. Though I see what you mean, thanks! > > Lets assume the E's index is low, say 3 > > > > A B C D E > > {5, 4, 4, 3, 3} > > * * > > L F F F V > > > > in this config the leader won't commit record 5 until one > > of {C,D,E} write the new record(s) since otherwise the quorum > > won't be reached. Assume A and B get out of the set without > > record 4 written to C > > > > A B C D E > > {-, -, 4, 3, 3} > > F F V > > > > Now the node E can vote for C and D because its index is LE. > > And since C's index is bigger than others it will be elected > > next as far as I understand, no? > > You're right, assuming the voter never casts a vote for a > candidate with a shorter log the safety is not violated. I wasn't > sure it's the case, and presumed that the voter may have no log of > its own. But still there are issues with liveness. Raft PHD has > learners, so why not use them instead. Yeah, voters are to carry logs as well and it is their critical feature but idea about Raft learners might be even better, I think Vlad has some thoughts about learners. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader 2021-07-20 23:20 ` Konstantin Osipov via Tarantool-patches 2021-07-21 18:51 ` Cyrill Gorcunov via Tarantool-patches @ 2021-07-21 21:43 ` Vladislav Shpilevoy via Tarantool-patches 1 sibling, 0 replies; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-21 21:43 UTC (permalink / raw) To: Konstantin Osipov, Cyrill Gorcunov, tarantool-patches, sergepetrenko >> A B C D E >> {-, -, 4, 3, 3} >> F F V >> >> Now the node E can vote for C and D because its index is LE. >> And since C's index is bigger than others it will be elected >> next as far as I understand, no? > > You're right, assuming the voter never casts a vote for a > candidate with a shorter log the safety is not violated. I wasn't > sure it's the case, and presumed that the voter may have no log of > its own. But still there are issues with liveness. Raft PHD has > learners, so why not use them instead. Because I didn't think of that before. It seems we need learners. Otherwise when a cluster is started with a single node, an attempt to add a second node will make the first one not able to commit anything, because it won't be able to gather a quorum for its sync transactions, nor won't even be elected as a leader again until the second node finishes recovery. We probably could make all new nodes learners automatically at least until they finish both join stages. Or make it manual, don't know yet. Anyway, the main problem I see right away is that the learner role means we can't just get `_cluster:count() / 2 + 1` as a quorum value. Because not all registered nodes participate in quorum. Later I will create a ticket for that learner feature. ^ permalink raw reply [flat|nested] 23+ messages in thread
* [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates 2021-07-15 23:49 [Tarantool-patches] [PATCH 0/2] Bootstrap voter Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches @ 2021-07-15 23:49 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-16 11:30 ` Serge Petrenko via Tarantool-patches 2021-07-16 14:27 ` [Tarantool-patches] [PATCH 0/2] Bootstrap voter Konstantin Osipov via Tarantool-patches 2 siblings, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-15 23:49 UTC (permalink / raw) To: tarantool-patches, gorcunov, sergepetrenko During cluster bootstrap the boot master election algorithm didn't take into account election modes of the instances. It could be that all nodes have box.cfg.read_only = false, none is booted, all are read-only now. Then the node with the smallest UUID was chosen even if it was box.cfg.election_mode='voter' node. It could neither boot nor register other nodes and the cluster couldn't start. The patch makes the boot master election prefer the instances which can become a Raft leader. If all the other parameters didn't help. Closes #6018 --- .../unreleased/gh-6018-election-boot-voter.md | 4 + src/box/box.cc | 25 +++- src/box/replication.cc | 11 +- .../gh-6018-election-boot-voter.result | 116 ++++++++++++++++++ .../gh-6018-election-boot-voter.test.lua | 59 +++++++++ test/replication/gh-6018-master.lua | 17 +++ test/replication/gh-6018-replica.lua | 15 +++ test/replication/suite.cfg | 1 + 8 files changed, 245 insertions(+), 3 deletions(-) create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md create mode 100644 test/replication/gh-6018-election-boot-voter.result create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua create mode 100644 test/replication/gh-6018-master.lua create mode 100644 test/replication/gh-6018-replica.lua diff --git a/changelogs/unreleased/gh-6018-election-boot-voter.md b/changelogs/unreleased/gh-6018-election-boot-voter.md new file mode 100644 index 000000000..080484bbe --- /dev/null +++ b/changelogs/unreleased/gh-6018-election-boot-voter.md @@ -0,0 +1,4 @@ +## bugfix/replication + +* Fixed a cluster sometimes being unable to bootstrap if it contains nodes with + `election_mode` `manual` or `voter` (gh-6018). diff --git a/src/box/box.cc b/src/box/box.cc index ef3efe3e0..3105b04b6 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -3519,7 +3519,30 @@ box_cfg_xc(void) * should take the control over the situation and start a new * term immediately. */ - raft_new_term(box_raft()); + struct raft *raft = box_raft(); + if (box_election_mode == ELECTION_MODE_MANUAL) { + raft_start_candidate(raft); + raft_new_term(raft); + int rc = box_raft_wait_leader_found(); + /* + * No need to check if the mode is still manual - it + * couldn't change because box.cfg is protected with a + * fiber lock. + */ + assert(box_election_mode == ELECTION_MODE_MANUAL); + raft_stop_candidate(raft, false); + /* + * It should not fail, because on bootstrap the node is + * a single registered instance. It can't not win the + * elections while being a lone participant. But still + * check the result so as not to a ignore potential + * problems. + */ + if (rc != 0) + diag_raise(); + } else { + raft_new_term(raft); + } } /* box.cfg.read_only is not read yet. */ diff --git a/src/box/replication.cc b/src/box/replication.cc index a0b3e0186..622d12f74 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -978,12 +978,19 @@ replicaset_find_join_master(void) * config is stronger because if it is configured as read-only, * it is in read-only state for sure, until the config is * changed. + * + * In a cluster with leader election enabled all instances might + * look equal by the scores above. Then must prefer the ones + * which can be elected as a leader, because only they would be + * able to boot themselves and register the others. */ if (ballot->is_booted) - score += 10; + score += 1000; if (!ballot->is_ro_cfg) - score += 5; + score += 100; if (!ballot->is_ro) + score += 10; + if (ballot->can_be_leader) score += 1; if (leader_score < score) goto elect; diff --git a/test/replication/gh-6018-election-boot-voter.result b/test/replication/gh-6018-election-boot-voter.result new file mode 100644 index 000000000..c960aa4bd --- /dev/null +++ b/test/replication/gh-6018-election-boot-voter.result @@ -0,0 +1,116 @@ +-- test-run result file version 2 +-- +-- gh-6018: in a auto-election cluster nodes with voter state could be selected +-- as bootstrap leaders. They should not, because a voter can't be ever writable +-- and it can neither boot itself nor register other nodes. +-- +-- Similar situation was with the manual election. All instances might have +-- manual election mode. Such a cluster wouldn't be able to boot if their +-- bootstrap master wouldn't become an elected leader automatically at least +-- once. +-- +test_run = require('test_run').new() + | --- + | ... + +function boot_with_master_election_mode(mode) \ + test_run:cmd('create server master with '.. \ + 'script="replication/gh-6018-master.lua"') \ + test_run:cmd('start server master with wait=False, args="'..mode..'"') \ + test_run:cmd('create server replica with '.. \ + 'script="replication/gh-6018-replica.lua"') \ + test_run:cmd('start server replica') \ +end + | --- + | ... + +function stop_cluster() \ + test_run:cmd('stop server replica') \ + test_run:cmd('stop server master') \ + test_run:cmd('delete server replica') \ + test_run:cmd('delete server master') \ +end + | --- + | ... + +-- +-- Candidate leader. +-- +boot_with_master_election_mode('candidate') + | --- + | ... + +test_run:switch('master') + | --- + | - true + | ... +test_run:wait_cond(function() return not box.info.ro end) + | --- + | - true + | ... +assert(box.info.election.state == 'leader') + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.election.state == 'follower') + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +stop_cluster() + | --- + | ... + +-- +-- Manual leader. +-- +boot_with_master_election_mode('manual') + | --- + | ... + +test_run:switch('master') + | --- + | - true + | ... +test_run:wait_cond(function() return not box.info.ro end) + | --- + | - true + | ... +assert(box.info.election.state == 'leader') + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.election.state == 'follower') + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +stop_cluster() + | --- + | ... diff --git a/test/replication/gh-6018-election-boot-voter.test.lua b/test/replication/gh-6018-election-boot-voter.test.lua new file mode 100644 index 000000000..800e20c8f --- /dev/null +++ b/test/replication/gh-6018-election-boot-voter.test.lua @@ -0,0 +1,59 @@ +-- +-- gh-6018: in a auto-election cluster nodes with voter state could be selected +-- as bootstrap leaders. They should not, because a voter can't be ever writable +-- and it can neither boot itself nor register other nodes. +-- +-- Similar situation was with the manual election. All instances might have +-- manual election mode. Such a cluster wouldn't be able to boot if their +-- bootstrap master wouldn't become an elected leader automatically at least +-- once. +-- +test_run = require('test_run').new() + +function boot_with_master_election_mode(mode) \ + test_run:cmd('create server master with '.. \ + 'script="replication/gh-6018-master.lua"') \ + test_run:cmd('start server master with wait=False, args="'..mode..'"') \ + test_run:cmd('create server replica with '.. \ + 'script="replication/gh-6018-replica.lua"') \ + test_run:cmd('start server replica') \ +end + +function stop_cluster() \ + test_run:cmd('stop server replica') \ + test_run:cmd('stop server master') \ + test_run:cmd('delete server replica') \ + test_run:cmd('delete server master') \ +end + +-- +-- Candidate leader. +-- +boot_with_master_election_mode('candidate') + +test_run:switch('master') +test_run:wait_cond(function() return not box.info.ro end) +assert(box.info.election.state == 'leader') + +test_run:switch('replica') +assert(box.info.ro) +assert(box.info.election.state == 'follower') + +test_run:switch('default') +stop_cluster() + +-- +-- Manual leader. +-- +boot_with_master_election_mode('manual') + +test_run:switch('master') +test_run:wait_cond(function() return not box.info.ro end) +assert(box.info.election.state == 'leader') + +test_run:switch('replica') +assert(box.info.ro) +assert(box.info.election.state == 'follower') + +test_run:switch('default') +stop_cluster() diff --git a/test/replication/gh-6018-master.lua b/test/replication/gh-6018-master.lua new file mode 100644 index 000000000..1192204ff --- /dev/null +++ b/test/replication/gh-6018-master.lua @@ -0,0 +1,17 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) + +box.cfg({ + listen = 'unix/:./gh-6018-master.sock', + replication = { + 'unix/:./gh-6018-master.sock', + 'unix/:./gh-6018-replica.sock', + }, + election_mode = arg[1], + instance_uuid = 'cbf06940-0790-498b-948d-042b62cf3d29', + replication_timeout = 0.1, +}) + +box.ctl.wait_rw() +box.schema.user.grant('guest', 'super') diff --git a/test/replication/gh-6018-replica.lua b/test/replication/gh-6018-replica.lua new file mode 100644 index 000000000..71e669141 --- /dev/null +++ b/test/replication/gh-6018-replica.lua @@ -0,0 +1,15 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) + +box.cfg({ + listen = 'unix/:./gh-6018-replica.sock', + replication = { + 'unix/:./gh-6018-master.sock', + 'unix/:./gh-6018-replica.sock', + }, + election_mode = 'voter', + -- Smaller than master UUID. + instance_uuid = 'cbf06940-0790-498b-948d-042b62cf3d28', + replication_timeout = 0.1, +}) diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 69f2f3511..2bfc3b845 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -45,6 +45,7 @@ "gh-5536-wal-limit.test.lua": {}, "gh-5566-final-join-synchro.test.lua": {}, "gh-5613-bootstrap-prefer-booted.test.lua": {}, + "gh-6018-election-boot-voter.test.lua": {}, "gh-6027-applier-error-show.test.lua": {}, "gh-6032-promote-wal-write.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates Vladislav Shpilevoy via Tarantool-patches @ 2021-07-16 11:30 ` Serge Petrenko via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2021-07-16 11:30 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches, gorcunov 16.07.2021 02:49, Vladislav Shpilevoy пишет: > During cluster bootstrap the boot master election algorithm didn't > take into account election modes of the instances. It could be > that all nodes have box.cfg.read_only = false, none is booted, > all are read-only now. Then the node with the smallest UUID was > chosen even if it was box.cfg.election_mode='voter' node. > > It could neither boot nor register other nodes and the cluster > couldn't start. > > The patch makes the boot master election prefer the instances > which can become a Raft leader. If all the other parameters didn't > help. > > Closes #6018 > --- Hi! Thanks for the patch! > .../unreleased/gh-6018-election-boot-voter.md | 4 + > src/box/box.cc | 25 +++- > src/box/replication.cc | 11 +- > .../gh-6018-election-boot-voter.result | 116 ++++++++++++++++++ > .../gh-6018-election-boot-voter.test.lua | 59 +++++++++ > test/replication/gh-6018-master.lua | 17 +++ > test/replication/gh-6018-replica.lua | 15 +++ > test/replication/suite.cfg | 1 + > 8 files changed, 245 insertions(+), 3 deletions(-) > create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md > create mode 100644 test/replication/gh-6018-election-boot-voter.result > create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua > create mode 100644 test/replication/gh-6018-master.lua > create mode 100644 test/replication/gh-6018-replica.lua > > diff --git a/changelogs/unreleased/gh-6018-election-boot-voter.md b/changelogs/unreleased/gh-6018-election-boot-voter.md > new file mode 100644 > index 000000000..080484bbe > --- /dev/null > +++ b/changelogs/unreleased/gh-6018-election-boot-voter.md > @@ -0,0 +1,4 @@ > +## bugfix/replication > + > +* Fixed a cluster sometimes being unable to bootstrap if it contains nodes with > + `election_mode` `manual` or `voter` (gh-6018). > diff --git a/src/box/box.cc b/src/box/box.cc > index ef3efe3e0..3105b04b6 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -3519,7 +3519,30 @@ box_cfg_xc(void) > * should take the control over the situation and start a new > * term immediately. > */ > - raft_new_term(box_raft()); > + struct raft *raft = box_raft(); > + if (box_election_mode == ELECTION_MODE_MANUAL) { > + raft_start_candidate(raft); > + raft_new_term(raft); > + int rc = box_raft_wait_leader_found(); > + /* > + * No need to check if the mode is still manual - it > + * couldn't change because box.cfg is protected with a > + * fiber lock. > + */ > + assert(box_election_mode == ELECTION_MODE_MANUAL); > + raft_stop_candidate(raft, false); > + /* > + * It should not fail, because on bootstrap the node is > + * a single registered instance. It can't not win the > + * elections while being a lone participant. But still > + * check the result so as not to a ignore potential > + * problems. > + */ > + if (rc != 0) > + diag_raise(); > + } else { > + raft_new_term(raft); > + } Could you please extract this fix into a separate commit? Speaking of your problems with raft_try_candidate. I also can't think of a good enough alternative. For promote it would be nice to do: do { raft_try_candidate_for_1_term(); } while (leader is not known); and simply raft_try_candidate_for_1_term(); for bootstrap. But raft_try_candidate_for_1_term() looks hard to implement. -- Serge Petrenko ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates 2021-07-16 11:30 ` Serge Petrenko via Tarantool-patches @ 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-18 17:00 UTC (permalink / raw) To: Serge Petrenko, tarantool-patches, gorcunov Thanks for the review! >> diff --git a/src/box/box.cc b/src/box/box.cc >> index ef3efe3e0..3105b04b6 100644 >> --- a/src/box/box.cc >> +++ b/src/box/box.cc >> @@ -3519,7 +3519,30 @@ box_cfg_xc(void) >> * should take the control over the situation and start a new >> * term immediately. >> */ >> - raft_new_term(box_raft()); >> + struct raft *raft = box_raft(); >> + if (box_election_mode == ELECTION_MODE_MANUAL) { >> + raft_start_candidate(raft); >> + raft_new_term(raft); >> + int rc = box_raft_wait_leader_found(); >> + /* >> + * No need to check if the mode is still manual - it >> + * couldn't change because box.cfg is protected with a >> + * fiber lock. >> + */ >> + assert(box_election_mode == ELECTION_MODE_MANUAL); >> + raft_stop_candidate(raft, false); >> + /* >> + * It should not fail, because on bootstrap the node is >> + * a single registered instance. It can't not win the >> + * elections while being a lone participant. But still >> + * check the result so as not to a ignore potential >> + * problems. >> + */ >> + if (rc != 0) >> + diag_raise(); >> + } else { >> + raft_new_term(raft); >> + } > > Could you please extract this fix into a separate commit? Done. See v2. > Speaking of your problems with raft_try_candidate. I also can't think > of a good enough alternative. > > For promote it would be nice to do: > > do { > raft_try_candidate_for_1_term(); > } while (leader is not known); > > and simply > raft_try_candidate_for_1_term(); > for bootstrap. > > But raft_try_candidate_for_1_term() looks hard to implement. Yes, I had something like that in mind too. I tried to implement it in v2. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/2] Bootstrap voter 2021-07-15 23:49 [Tarantool-patches] [PATCH 0/2] Bootstrap voter Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates Vladislav Shpilevoy via Tarantool-patches @ 2021-07-16 14:27 ` Konstantin Osipov via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2 siblings, 1 reply; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-16 14:27 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/07/16 11:23]: > In the patch I don't like this start/stop_candidate() calls. They look bulky. > I wanted to replace them with a function like raft_try_candidate() which would > make the instance a candidate for one current term. > > Also I don't like box_raft_wait_leader_found() in bootstrap. Even though it > should always succeed in a couple of WAL writes. > > The only problem with having a one-term try_candidate() is the split vote in > box_promote(). There between start_candidate() and stop_candidate() several > terms can pass if they end without a leader elected. I want to keep this > working. But I couldn't find a simple way to make raft_try_candidate() work like > that. > > Nor I could find a nice way to retry the candidate attempts in box_promote(). > In the end, if box_raft_wait_leader_found() fails, you need to restore the > candidate state. And this leads to having some kind of start/stop_candidate > anyway. > > If you have ideas how to make it look better, I am looking forward to hearing > it. > I'd be happy to help but I don't understand the motivation from this commit comment. > Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6018-boot-voter > Issue: https://github.com/tarantool/tarantool/issues/6018 > -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/2] Bootstrap voter 2021-07-16 14:27 ` [Tarantool-patches] [PATCH 0/2] Bootstrap voter Konstantin Osipov via Tarantool-patches @ 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:13 ` Konstantin Osipov via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-18 17:00 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko Thanks for the help proposal! On 16.07.2021 16:27, Konstantin Osipov wrote: > * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/07/16 11:23]: >> In the patch I don't like this start/stop_candidate() calls. They look bulky. >> I wanted to replace them with a function like raft_try_candidate() which would >> make the instance a candidate for one current term. >> >> Also I don't like box_raft_wait_leader_found() in bootstrap. Even though it >> should always succeed in a couple of WAL writes. >> >> The only problem with having a one-term try_candidate() is the split vote in >> box_promote(). There between start_candidate() and stop_candidate() several >> terms can pass if they end without a leader elected. I want to keep this >> working. But I couldn't find a simple way to make raft_try_candidate() work like >> that. >> >> Nor I could find a nice way to retry the candidate attempts in box_promote(). >> In the end, if box_raft_wait_leader_found() fails, you need to restore the >> candidate state. And this leads to having some kind of start/stop_candidate >> anyway. >> >> If you have ideas how to make it look better, I am looking forward to hearing >> it. >> > > I'd be happy to help but I don't understand the motivation from > this commit comment. I asked for ideas how to get rid of raft_start/stop_candidate(). The motivation is to keep the code clearer. In v2 I think I managed to achieve it. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/2] Bootstrap voter 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-07-19 9:13 ` Konstantin Osipov via Tarantool-patches 2021-07-19 22:04 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 23+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2021-07-19 9:13 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/18 20:03]: > > I'd be happy to help but I don't understand the motivation from > > this commit comment. > > I asked for ideas how to get rid of raft_start/stop_candidate(). The > motivation is to keep the code clearer. In v2 I think I managed to > achieve it. I am trying to say that from the commit comment it's a purely internal change so motivation for it (why to get rid of it in the first place) is not clear. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/2] Bootstrap voter 2021-07-19 9:13 ` Konstantin Osipov via Tarantool-patches @ 2021-07-19 22:04 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 23+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-19 22:04 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko On 19.07.2021 11:13, Konstantin Osipov wrote: > * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [21/07/18 20:03]: >>> I'd be happy to help but I don't understand the motivation from >>> this commit comment. >> >> I asked for ideas how to get rid of raft_start/stop_candidate(). The >> motivation is to keep the code clearer. In v2 I think I managed to >> achieve it. > > I am trying to say that from the commit comment it's a purely > internal change so motivation for it (why to get rid of it in the > first place) is not clear. Yeah, sorry, the request for ideas assumed that the reader is familiar with the code. Basically, it was just for Sergey P :D ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2021-07-21 21:43 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-07-15 23:49 [Tarantool-patches] [PATCH 0/2] Bootstrap voter Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 1/2] replication: introduce ballot.can_be_leader Vladislav Shpilevoy via Tarantool-patches 2021-07-16 10:59 ` Serge Petrenko via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:11 ` Sergey Petrenko via Tarantool-patches 2021-07-16 14:29 ` Konstantin Osipov via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:12 ` Konstantin Osipov via Tarantool-patches 2021-07-19 22:06 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-20 8:49 ` Konstantin Osipov via Tarantool-patches 2021-07-20 20:02 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-20 20:18 ` Konstantin Osipov via Tarantool-patches 2021-07-20 21:16 ` Cyrill Gorcunov via Tarantool-patches 2021-07-20 23:20 ` Konstantin Osipov via Tarantool-patches 2021-07-21 18:51 ` Cyrill Gorcunov via Tarantool-patches 2021-07-21 21:43 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-15 23:49 ` [Tarantool-patches] [PATCH 2/2] election: during bootstrap prefer candidates Vladislav Shpilevoy via Tarantool-patches 2021-07-16 11:30 ` Serge Petrenko via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-16 14:27 ` [Tarantool-patches] [PATCH 0/2] Bootstrap voter Konstantin Osipov via Tarantool-patches 2021-07-18 17:00 ` Vladislav Shpilevoy via Tarantool-patches 2021-07-19 9:13 ` Konstantin Osipov via Tarantool-patches 2021-07-19 22:04 ` Vladislav Shpilevoy via Tarantool-patches
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox