* [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances
@ 2021-06-04 23:37 Vladislav Shpilevoy via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader() Vladislav Shpilevoy via Tarantool-patches
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
See the commit messages for the details.
Since I added a new field to the ballot, I did some basic
backward-compatibility tests, and it works fine.
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5613-boot-scores
Issue: https://github.com/tarantool/tarantool/issues/5613
Vladislav Shpilevoy (6):
replication: refactor replicaset_leader()
replication: ballot.is_ro -> is_ro_cfg
replication: ballot.is_loading -> is_ro
replication: introduce ballot.is_booted
replication: use 'score' to find a join-master
replication: prefer to join from booted replicas
.../gh-5613-bootstrap-prefer-booted.md | 6 ++
src/box/box.cc | 17 ++--
src/box/iproto_constants.h | 5 +-
src/box/replication.cc | 82 ++++++++-----------
src/box/replication.h | 5 +-
src/box/xrow.c | 30 ++++---
src/box/xrow.h | 13 +--
.../gh-5613-bootstrap-prefer-booted.result | 70 ++++++++++++++++
.../gh-5613-bootstrap-prefer-booted.test.lua | 21 +++++
test/replication/gh-5613-master.lua | 11 +++
test/replication/gh-5613-replica1.lua | 13 +++
test/replication/gh-5613-replica2.lua | 11 +++
test/replication/suite.cfg | 1 +
13 files changed, 207 insertions(+), 78 deletions(-)
create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result
create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua
create mode 100644 test/replication/gh-5613-master.lua
create mode 100644 test/replication/gh-5613-replica1.lua
create mode 100644 test/replication/gh-5613-replica2.lua
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader()
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:54 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg Vladislav Shpilevoy via Tarantool-patches
` (7 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
Firstly, rename it to replicaset_find_join_master(). Now that Raft
has introduced the concept of an actual leader, the old function
name is confusing.
Secondly, do not access the ballot member of struct applier through
such a long chain of dereferences; save the ballot pointer in a
local variable instead. This will become useful in one of the next
patches, where the ballot is used more heavily.
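The second change can be sketched as follows; the struct names here are simplified stand-ins for the real applier/ballot types, not the actual Tarantool definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for struct applier and struct ballot. */
struct ballot_sketch {
	bool is_ro;
	bool is_loading;
};

struct applier_sketch {
	struct ballot_sketch ballot;
};

/* Before: every ballot field is reached through the full member chain. */
static bool
is_candidate_before(const struct applier_sketch *applier)
{
	return !applier->ballot.is_ro && !applier->ballot.is_loading;
}

/* After: the ballot pointer is saved once and reused for all reads. */
static bool
is_candidate_after(const struct applier_sketch *applier)
{
	const struct ballot_sketch *ballot = &applier->ballot;
	return !ballot->is_ro && !ballot->is_loading;
}
```

Both forms are equivalent; the local pointer just shortens every subsequent access, which pays off once more ballot fields are consulted.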
Part of #5613
---
src/box/box.cc | 5 ++---
src/box/replication.cc | 12 +++++++-----
src/box/replication.h | 5 +++--
3 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 6dc991dc8..0e615e944 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1369,7 +1369,7 @@ box_set_replication_anon(void)
* Wait until the master has registered this
* instance.
*/
- struct replica *master = replicaset_leader();
+ struct replica *master = replicaset_find_join_master();
if (master == NULL || master->applier == NULL ||
master->applier->state != APPLIER_CONNECTED) {
tnt_raise(ClientError, ER_CANNOT_REGISTER);
@@ -3114,8 +3114,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
*/
box_sync_replication(true);
- /* Use the first replica by URI as a bootstrap leader */
- struct replica *master = replicaset_leader();
+ struct replica *master = replicaset_find_join_master();
assert(master == NULL || master->applier != NULL);
if (master != NULL && !tt_uuid_is_equal(&master->uuid, &INSTANCE_UUID)) {
diff --git a/src/box/replication.cc b/src/box/replication.cc
index aefb812b3..a1c6e3c7c 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -968,6 +968,7 @@ replicaset_round(bool skip_ro)
struct applier *applier = replica->applier;
if (applier == NULL)
continue;
+ const struct ballot *ballot = &applier->ballot;
/**
* While bootstrapping a new cluster, read-only
* replicas shouldn't be considered as a leader.
@@ -975,17 +976,18 @@ replicaset_round(bool skip_ro)
* replicas since there is still a possibility
* that all replicas exist in cluster table.
*/
- if (skip_ro && applier->ballot.is_ro)
+ if (skip_ro && ballot->is_ro)
continue;
if (leader == NULL) {
leader = replica;
continue;
}
+ const struct ballot *leader_ballot = &leader->applier->ballot;
/*
* Try to find a replica which has already left
* orphan mode.
*/
- if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading)
+ if (ballot->is_loading && !leader_ballot->is_loading)
continue;
/*
* Choose the replica with the most advanced
@@ -993,8 +995,8 @@ replicaset_round(bool skip_ro)
* with the same vclock, prefer the one with
* the lowest uuid.
*/
- int cmp = vclock_compare_ignore0(&applier->ballot.vclock,
- &leader->applier->ballot.vclock);
+ int cmp = vclock_compare_ignore0(&ballot->vclock,
+ &leader_ballot->vclock);
if (cmp < 0)
continue;
if (cmp == 0 && tt_uuid_compare(&replica->uuid,
@@ -1006,7 +1008,7 @@ replicaset_round(bool skip_ro)
}
struct replica *
-replicaset_leader(void)
+replicaset_find_join_master(void)
{
bool skip_ro = true;
/**
diff --git a/src/box/replication.h b/src/box/replication.h
index 2ad1cbf66..5cc380373 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -356,10 +356,11 @@ struct replica *
replica_by_id(uint32_t replica_id);
/**
- * Return the replica set leader.
+ * Find a node in the replicaset on which the instance can try to register to
+ * join the replicaset.
*/
struct replica *
-replicaset_leader(void);
+replicaset_find_join_master(void);
struct replica *
replicaset_first(void);
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader() Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:56 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro Vladislav Shpilevoy via Tarantool-patches
` (6 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
Rename the member to reflect its actual meaning: it is not the real
RO state of the instance, only how it is configured. An instance
can have read_only = false and still be in RO state for other
reasons.
The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.
Part of #5613
---
src/box/box.cc | 2 +-
src/box/iproto_constants.h | 2 +-
src/box/replication.cc | 2 +-
src/box/xrow.c | 12 ++++++------
src/box/xrow.h | 4 ++--
5 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 0e615e944..d35a339ad 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2860,7 +2860,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
void
box_process_vote(struct ballot *ballot)
{
- ballot->is_ro = cfg_geti("read_only") != 0;
+ ballot->is_ro_cfg = cfg_geti("read_only") != 0;
ballot->is_anon = replication_anon;
/*
* is_ro is true on initial load and is set to box.cfg.read_only
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index 7362ddaf1..d4ee9e090 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -162,7 +162,7 @@ enum iproto_metadata_key {
};
enum iproto_ballot_key {
- IPROTO_BALLOT_IS_RO = 0x01,
+ IPROTO_BALLOT_IS_RO_CFG = 0x01,
IPROTO_BALLOT_VCLOCK = 0x02,
IPROTO_BALLOT_GC_VCLOCK = 0x03,
IPROTO_BALLOT_IS_LOADING = 0x04,
diff --git a/src/box/replication.cc b/src/box/replication.cc
index a1c6e3c7c..ce2b74065 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -976,7 +976,7 @@ replicaset_round(bool skip_ro)
* replicas since there is still a possibility
* that all replicas exist in cluster table.
*/
- if (skip_ro && ballot->is_ro)
+ if (skip_ro && ballot->is_ro_cfg)
continue;
if (leader == NULL) {
leader = replica;
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 2e364cea5..6e2a87f8a 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -450,7 +450,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
{
size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
- mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
+ mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
mp_sizeof_bool(ballot->is_anon) +
@@ -470,8 +470,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
data = mp_encode_map(data, 1);
data = mp_encode_uint(data, IPROTO_BALLOT);
data = mp_encode_map(data, 5);
- data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
- data = mp_encode_bool(data, ballot->is_ro);
+ data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
+ data = mp_encode_bool(data, ballot->is_ro_cfg);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
data = mp_encode_bool(data, ballot->is_loading);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
@@ -1357,7 +1357,7 @@ xrow_encode_vote(struct xrow_header *row)
int
xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
{
- ballot->is_ro = false;
+ ballot->is_ro_cfg = false;
ballot->is_loading = false;
ballot->is_anon = false;
vclock_create(&ballot->vclock);
@@ -1399,10 +1399,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
}
uint32_t key = mp_decode_uint(&data);
switch (key) {
- case IPROTO_BALLOT_IS_RO:
+ case IPROTO_BALLOT_IS_RO_CFG:
if (mp_typeof(*data) != MP_BOOL)
goto err;
- ballot->is_ro = mp_decode_bool(&data);
+ ballot->is_ro_cfg = mp_decode_bool(&data);
break;
case IPROTO_BALLOT_IS_LOADING:
if (mp_typeof(*data) != MP_BOOL)
diff --git a/src/box/xrow.h b/src/box/xrow.h
index b3c664be2..241a7af8e 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -366,8 +366,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len,
/** Reply to IPROTO_VOTE request. */
struct ballot {
- /** Set if the instance is running in read-only mode. */
- bool is_ro;
+ /** Set if the instance is configured in read-only mode. */
+ bool is_ro_cfg;
/**
* A flag whether the instance is anonymous, not having an
* ID, and not going to request it.
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader() Vladislav Shpilevoy via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:58 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted Vladislav Shpilevoy via Tarantool-patches
` (5 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
Is_loading in the ballot used to mean the following: "the instance
did not finish its box.cfg() or has read_only = true". That is
quite a strange property: the flag stayed 'true' even when the
instance was not really loading anymore but had read_only = true.
The patch renames it to 'is_ro' (which existed here before, but
also with a wrong meaning).
Its behaviour is slightly changed to report the actual RO state of
the instance, not its read_only setting. This way it incorporates
all the possible RO conditions, such as an unfinished bootstrap,
read_only = true, being a Raft follower, and so on.
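A sketch of the semantic change; the struct and the exact set of conditions here are illustrative assumptions, the real summary is computed inside box.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag set; the real conditions are internal to box. */
struct instance_sketch {
	bool loading;       /* box.cfg() has not finished yet */
	bool ro_cfg;        /* box.cfg{read_only = true} */
	bool is_orphan;     /* not synced with enough replicas */
	bool raft_follower; /* not the Raft leader */
};

/* The old ballot.is_loading: "still loading, or configured read-only". */
static bool
old_is_loading(const struct instance_sketch *s)
{
	return s->loading || s->ro_cfg;
}

/* The new ballot.is_ro: any reason for being non-writable counts. */
static bool
new_is_ro(const struct instance_sketch *s)
{
	return s->loading || s->ro_cfg || s->is_orphan || s->raft_follower;
}
```

An orphan node with read_only = false shows the difference: the old flag reports it as fine, the new one reports it as read-only.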
The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.
Part of #5613
---
src/box/box.cc | 9 +--------
src/box/iproto_constants.h | 2 +-
src/box/replication.cc | 2 +-
src/box/xrow.c | 12 ++++++------
src/box/xrow.h | 7 ++++---
5 files changed, 13 insertions(+), 19 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index d35a339ad..d56b44d33 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2862,14 +2862,7 @@ box_process_vote(struct ballot *ballot)
{
ballot->is_ro_cfg = cfg_geti("read_only") != 0;
ballot->is_anon = replication_anon;
- /*
- * is_ro is true on initial load and is set to box.cfg.read_only
- * after box_cfg() returns, during dynamic box.cfg parameters setting.
- * We would like to prefer already bootstrapped instances to the ones
- * still bootstrapping and the ones still bootstrapping, but writeable
- * to the ones that have box.cfg.read_only = true.
- */
- ballot->is_loading = is_ro;
+ ballot->is_ro = is_ro_summary;
vclock_copy(&ballot->vclock, &replicaset.vclock);
vclock_copy(&ballot->gc_vclock, &gc.vclock);
}
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index d4ee9e090..0f84843d0 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -165,7 +165,7 @@ enum iproto_ballot_key {
IPROTO_BALLOT_IS_RO_CFG = 0x01,
IPROTO_BALLOT_VCLOCK = 0x02,
IPROTO_BALLOT_GC_VCLOCK = 0x03,
- IPROTO_BALLOT_IS_LOADING = 0x04,
+ IPROTO_BALLOT_IS_RO = 0x04,
IPROTO_BALLOT_IS_ANON = 0x05,
};
diff --git a/src/box/replication.cc b/src/box/replication.cc
index ce2b74065..990f6239c 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -987,7 +987,7 @@ replicaset_round(bool skip_ro)
* Try to find a replica which has already left
* orphan mode.
*/
- if (ballot->is_loading && !leader_ballot->is_loading)
+ if (ballot->is_ro && !leader_ballot->is_ro)
continue;
/*
* Choose the replica with the most advanced
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 6e2a87f8a..115a25473 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -451,7 +451,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
- mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
+ mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
mp_sizeof_bool(ballot->is_anon) +
mp_sizeof_uint(UINT32_MAX) +
@@ -472,8 +472,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
data = mp_encode_map(data, 5);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
data = mp_encode_bool(data, ballot->is_ro_cfg);
- data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
- data = mp_encode_bool(data, ballot->is_loading);
+ data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
+ data = mp_encode_bool(data, ballot->is_ro);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
data = mp_encode_bool(data, ballot->is_anon);
data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
@@ -1358,7 +1358,7 @@ int
xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
{
ballot->is_ro_cfg = false;
- ballot->is_loading = false;
+ ballot->is_ro = false;
ballot->is_anon = false;
vclock_create(&ballot->vclock);
@@ -1404,10 +1404,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
goto err;
ballot->is_ro_cfg = mp_decode_bool(&data);
break;
- case IPROTO_BALLOT_IS_LOADING:
+ case IPROTO_BALLOT_IS_RO:
if (mp_typeof(*data) != MP_BOOL)
goto err;
- ballot->is_loading = mp_decode_bool(&data);
+ ballot->is_ro = mp_decode_bool(&data);
break;
case IPROTO_BALLOT_IS_ANON:
if (mp_typeof(*data) != MP_BOOL)
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 241a7af8e..1d00b2e43 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -374,10 +374,11 @@ struct ballot {
*/
bool is_anon;
/**
- * Set if the instance hasn't finished bootstrap or recovery, or
- * is syncing with other replicas in the replicaset.
+ * Set if the instance is not writable due to any reason. Could be
+ * config read_only=true; being orphan; being a Raft follower; not
+ * finished recovery/bootstrap; or anything else.
*/
- bool is_loading;
+ bool is_ro;
/** Current instance vclock. */
struct vclock vclock;
/** Oldest vclock available on the instance. */
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (2 preceding siblings ...)
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:02 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master Vladislav Shpilevoy via Tarantool-patches
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
The new field reports whether the instance has finished its
bootstrap/recovery, or IOW has finished box.cfg().
The new field will help fix #5613: a new instance should not try to
join a replicaset via non-bootstrapped instances when bootstrapped
ones exist.
The problem is that otherwise, if all existing nodes are booted but
read-only, a new instance bootstraps its own independent
replicaset. It would be better to just fail and terminate the
process than perform such a bizarre action.
Part of #5613
---
src/box/box.cc | 1 +
src/box/iproto_constants.h | 1 +
src/box/xrow.c | 14 ++++++++++++--
src/box/xrow.h | 2 ++
4 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index d56b44d33..6fca337bc 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2863,6 +2863,7 @@ box_process_vote(struct ballot *ballot)
ballot->is_ro_cfg = cfg_geti("read_only") != 0;
ballot->is_anon = replication_anon;
ballot->is_ro = is_ro_summary;
+ ballot->is_booted = is_box_configured;
vclock_copy(&ballot->vclock, &replicaset.vclock);
vclock_copy(&ballot->gc_vclock, &gc.vclock);
}
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index 0f84843d0..137bee9da 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -167,6 +167,7 @@ enum iproto_ballot_key {
IPROTO_BALLOT_GC_VCLOCK = 0x03,
IPROTO_BALLOT_IS_RO = 0x04,
IPROTO_BALLOT_IS_ANON = 0x05,
+ IPROTO_BALLOT_IS_BOOTED = 0x06,
};
#define bit(c) (1ULL<<IPROTO_##c)
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 115a25473..0d0548fef 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -449,11 +449,13 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
uint64_t sync, uint32_t schema_version)
{
size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
- mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
+ mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(6) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
mp_sizeof_bool(ballot->is_anon) +
+ mp_sizeof_uint(IPROTO_BALLOT_IS_BOOTED) +
+ mp_sizeof_bool(ballot->is_booted) +
mp_sizeof_uint(UINT32_MAX) +
mp_sizeof_vclock_ignore0(&ballot->vclock) +
mp_sizeof_uint(UINT32_MAX) +
@@ -469,13 +471,15 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
char *data = buf + IPROTO_HEADER_LEN;
data = mp_encode_map(data, 1);
data = mp_encode_uint(data, IPROTO_BALLOT);
- data = mp_encode_map(data, 5);
+ data = mp_encode_map(data, 6);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
data = mp_encode_bool(data, ballot->is_ro_cfg);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
data = mp_encode_bool(data, ballot->is_ro);
data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
data = mp_encode_bool(data, ballot->is_anon);
+ data = mp_encode_uint(data, IPROTO_BALLOT_IS_BOOTED);
+ data = mp_encode_bool(data, ballot->is_booted);
data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
data = mp_encode_vclock_ignore0(data, &ballot->vclock);
data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK);
@@ -1360,6 +1364,7 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
ballot->is_ro_cfg = false;
ballot->is_ro = false;
ballot->is_anon = false;
+ ballot->is_booted = true;
vclock_create(&ballot->vclock);
const char *start = NULL;
@@ -1424,6 +1429,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
&ballot->gc_vclock) != 0)
goto err;
break;
+ case IPROTO_BALLOT_IS_BOOTED:
+ if (mp_typeof(*data) != MP_BOOL)
+ goto err;
+ ballot->is_booted = mp_decode_bool(&data);
+ break;
default:
mp_next(&data);
}
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 1d00b2e43..055fd31e1 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -379,6 +379,8 @@ struct ballot {
* finished recovery/bootstrap; or anything else.
*/
bool is_ro;
+ /** Set if the instance has finished its bootstrap/recovery. */
+ bool is_booted;
/** Current instance vclock. */
struct vclock vclock;
/** Oldest vclock available on the instance. */
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (3 preceding siblings ...)
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:06 ` Serge Petrenko via Tarantool-patches
2021-06-10 15:02 ` Cyrill Gorcunov via Tarantool-patches
2021-06-04 23:38 ` [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas Vladislav Shpilevoy via Tarantool-patches
` (3 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:37 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.
The original code was relatively fine as long as it had only one
parameter to vary: whether it should skip
`box.cfg{read_only = true}` nodes.
Still, it was already on the edge of acceptable complexity because
of a second, non-configurable parameter: whether a replica is in
read-only state regardless of its config.
It is going to get more complicated when the algorithm takes into
account a third parameter: whether an instance is bootstrapped.
Then it should make decisions like "among bootstrapped nodes,
prefer instances not having read_only = true and not being in
read-only state". The easiest way to do that is to use
scores/weights, incremented whenever an instance's parameters
match certain "good" criteria.
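The scoring approach can be sketched as follows; the weights match the diff below, while the surrounding types are simplified.

```c
#include <assert.h>
#include <stdbool.h>

struct ballot_sketch {
	bool is_ro_cfg; /* box.cfg{read_only = true} on the remote node */
	bool is_ro;     /* the node is non-writable for any reason */
};

/*
 * Weights are picked so that a stronger property always wins against
 * any combination of weaker ones: a node without read_only in its
 * config (+5) beats a node that merely happens to be writable now (+1).
 */
static int
join_master_score(const struct ballot_sketch *b)
{
	int score = 0;
	if (!b->is_ro_cfg)
		score += 5;
	if (!b->is_ro)
		score += 1;
	return score;
}
```

Ties on the score are then broken by vclock and uuid comparison, as in the two-loop version being replaced.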
Part of #5613
---
src/box/replication.cc | 62 ++++++++++++++++--------------------------
1 file changed, 23 insertions(+), 39 deletions(-)
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 990f6239c..d33e70f28 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -960,71 +960,55 @@ replicaset_next(struct replica *replica)
* replicas, choose a read-only replica with biggest vclock
* as a leader, in hope it will become read-write soon.
*/
-static struct replica *
-replicaset_round(bool skip_ro)
+struct replica *
+replicaset_find_join_master(void)
{
struct replica *leader = NULL;
+ int leader_score = -1;
replicaset_foreach(replica) {
struct applier *applier = replica->applier;
if (applier == NULL)
continue;
const struct ballot *ballot = &applier->ballot;
- /**
- * While bootstrapping a new cluster, read-only
- * replicas shouldn't be considered as a leader.
- * The only exception if there is no read-write
- * replicas since there is still a possibility
- * that all replicas exist in cluster table.
- */
- if (skip_ro && ballot->is_ro_cfg)
- continue;
- if (leader == NULL) {
- leader = replica;
- continue;
- }
- const struct ballot *leader_ballot = &leader->applier->ballot;
+ int score = 0;
/*
- * Try to find a replica which has already left
- * orphan mode.
+ * Prefer instances not configured as read-only via box.cfg, and
+ * not being in read-only state due to any other reason. The
+ * config is stronger because if it is configured as read-only,
+ * it is in read-only state for sure, until the config is
+ * changed.
*/
- if (ballot->is_ro && !leader_ballot->is_ro)
+ if (!ballot->is_ro_cfg)
+ score += 5;
+ if (!ballot->is_ro)
+ score += 1;
+ if (leader_score < score)
+ goto elect;
+ if (score < leader_score)
continue;
+ const struct ballot *leader_ballot;
+ leader_ballot = &leader->applier->ballot;
/*
* Choose the replica with the most advanced
* vclock. If there are two or more replicas
* with the same vclock, prefer the one with
* the lowest uuid.
*/
- int cmp = vclock_compare_ignore0(&ballot->vclock,
- &leader_ballot->vclock);
+ int cmp;
+ cmp = vclock_compare_ignore0(&ballot->vclock,
+ &leader_ballot->vclock);
if (cmp < 0)
continue;
if (cmp == 0 && tt_uuid_compare(&replica->uuid,
&leader->uuid) > 0)
continue;
+ elect:
leader = replica;
+ leader_score = score;
}
return leader;
}
-struct replica *
-replicaset_find_join_master(void)
-{
- bool skip_ro = true;
- /**
- * Two loops, first prefers read-write replicas among others.
- * Second for backward compatibility, if there is no such
- * replicas at all.
- */
- struct replica *leader = replicaset_round(skip_ro);
- if (leader == NULL) {
- skip_ro = false;
- leader = replicaset_round(skip_ro);
- }
-
- return leader;
-}
-
struct replica *
replica_by_uuid(const struct tt_uuid *uuid)
{
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (4 preceding siblings ...)
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-04 23:38 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-06 17:06 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:14 ` Serge Petrenko via Tarantool-patches
2021-06-06 17:03 ` [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster Vladislav Shpilevoy via Tarantool-patches
` (2 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-04 23:38 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
The algorithm of looking for an instance to join the replicaset
from didn't take into account that some of the instances might not
be bootstrapped yet, while still being perfectly available.
As a result, a ridiculous situation could happen: an instance could
connect to a cluster consisting only of read-only instances while
itself having box.cfg{read_only = false}. Then, instead of failing
or waiting, it just booted a brand new cluster. After that the node
started complaining about the others having a different replicaset
UUID.
The patch makes a new instance always prefer a bootstrapped
join-source to a non-bootstrapped one, including itself. In the
situation above the new instance now terminates with an error. In
the future it should hopefully retry in a loop instead.
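With the new weight on top of the scores from the previous patch, a booted but fully read-only node outranks a non-booted writable one, e.g. the joining instance itself. A sketch with simplified types, weights as in the diff below:

```c
#include <assert.h>
#include <stdbool.h>

struct ballot_sketch {
	bool is_booted;
	bool is_ro_cfg;
	bool is_ro;
};

/* is_booted (+10) outweighs !is_ro_cfg (+5) and !is_ro (+1) combined. */
static int
join_master_score(const struct ballot_sketch *b)
{
	int score = 0;
	if (b->is_booted)
		score += 10;
	if (!b->is_ro_cfg)
		score += 5;
	if (!b->is_ro)
		score += 1;
	return score;
}
```

So a cluster of booted read-only nodes always wins the selection against the non-booted self, which is exactly what prevents the accidental second replicaset.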
Closes #5613
@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field
A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching with their names. They are changed.
* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
`box.cfg{read_only = true}`". It was renamed in the source code
to `IPROTO_BALLOT_IS_RO_CFG`. The code `0x01` and the value are
unchanged; only the name has changed, and it should change in
the docs too.
* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
not finished `box.cfg()` or has `read_only = true`". The name was
wrong, because the flag stayed set for `read_only = true` nodes
even after the instance finished loading. Such a value is also
not very suitable for any sane usage.
The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
the same, and the value now is "the instance is not writable".
The reason for being not writable can be any: the node is an
orphan; or it has `read_only = true`; or it is a Raft follower;
or anything else.
And there is a new field.
`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.
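Summarizing the key codes after the rework; values are taken from the changes to src/box/iproto_constants.h in the patches above, while the enum and constant names here are shortened sketches.

```c
#include <assert.h>

/* Ballot map keys inside IPROTO_BALLOT 0x29 after this patch set. */
enum iproto_ballot_key_sketch {
	BALLOT_IS_RO_CFG = 0x01, /* formerly IPROTO_BALLOT_IS_RO */
	BALLOT_VCLOCK    = 0x02,
	BALLOT_GC_VCLOCK = 0x03,
	BALLOT_IS_RO     = 0x04, /* formerly IPROTO_BALLOT_IS_LOADING */
	BALLOT_IS_ANON   = 0x05,
	BALLOT_IS_BOOTED = 0x06, /* new */
};
```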
---
.../gh-5613-bootstrap-prefer-booted.md | 6 ++
src/box/replication.cc | 20 +++---
.../gh-5613-bootstrap-prefer-booted.result | 70 +++++++++++++++++++
.../gh-5613-bootstrap-prefer-booted.test.lua | 21 ++++++
test/replication/gh-5613-master.lua | 11 +++
test/replication/gh-5613-replica1.lua | 13 ++++
test/replication/gh-5613-replica2.lua | 11 +++
test/replication/suite.cfg | 1 +
8 files changed, 144 insertions(+), 9 deletions(-)
create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result
create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua
create mode 100644 test/replication/gh-5613-master.lua
create mode 100644 test/replication/gh-5613-replica1.lua
create mode 100644 test/replication/gh-5613-replica2.lua
diff --git a/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
new file mode 100644
index 000000000..c022ee012
--- /dev/null
+++ b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fixed a bug where a replica, on an attempt to join a cluster with only
+  read-only replicas available, decided to boot its own replicaset instead
+  of failing or retrying. Now it fails with an error saying the other nodes
+  are read-only and cannot register it (gh-5613).
diff --git a/src/box/replication.cc b/src/box/replication.cc
index d33e70f28..52086c65e 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -951,15 +951,6 @@ replicaset_next(struct replica *replica)
return replica_hash_next(&replicaset.hash, replica);
}
-/**
- * Compare vclock, read only mode and orphan status
- * of all connected replicas and elect a leader.
- * Initiallly, skip read-only replicas, since they
- * can not properly act as bootstrap masters (register
- * new nodes in _cluster table). If there are no read-write
- * replicas, choose a read-only replica with biggest vclock
- * as a leader, in hope it will become read-write soon.
- */
struct replica *
replicaset_find_join_master(void)
{
@@ -972,12 +963,23 @@ replicaset_find_join_master(void)
const struct ballot *ballot = &applier->ballot;
int score = 0;
/*
+ * First of all, try to ignore non-booted instances, including self
+ * if it is not booted yet. For self this is even dangerous: the
+ * instance might decide to bootstrap its own cluster if, for
+ * example, the other nodes are reachable but read-only. That would
+ * be a mistake.
+ *
+ * For a new cluster it is fine to use a non-booted instance, as it
+ * means the algorithm is looking for an initial "boot-master".
+ *
* Prefer instances not configured as read-only via box.cfg, and
* not being in read-only state due to any other reason. The
* config is stronger because if it is configured as read-only,
* it is in read-only state for sure, until the config is
* changed.
*/
+ if (ballot->is_booted)
+ score += 10;
if (!ballot->is_ro_cfg)
score += 5;
if (!ballot->is_ro)
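For illustration, the scoring above can be condensed into a standalone sketch. This is not the Tarantool source: the +10 and +5 weights come from the hunk above, while the weight for `!is_ro` is assumed to be +1, since that line is truncated in the quote. Any weights work as long as a booted instance outscores every non-booted one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant ballot fields. */
struct ballot_sketch {
	bool is_booted;  /* finished box.cfg() */
	bool is_ro_cfg;  /* configured read-only via box.cfg */
	bool is_ro;      /* currently read-only for any reason */
};

/* Weighted score: being booted dominates, a read-write config comes
 * next, and a transient read-write state breaks the remaining ties. */
static int
ballot_score(const struct ballot_sketch *b)
{
	int score = 0;
	if (b->is_booted)
		score += 10;
	if (!b->is_ro_cfg)
		score += 5;
	if (!b->is_ro)
		score += 1;
	return score;
}
```

With these weights a booted but fully read-only node (score 10) still beats a writable node that has not booted yet (score at most 6), which is the gh-5613 behavior change.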
diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.result b/test/replication/gh-5613-bootstrap-prefer-booted.result
new file mode 100644
index 000000000..e8e7fb792
--- /dev/null
+++ b/test/replication/gh-5613-bootstrap-prefer-booted.result
@@ -0,0 +1,70 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with wait=False')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1')
+ | ---
+ | - true
+ | ...
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg{read_only = true}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
+ | ---
+ | - true
+ | ...
+-- It returns false, but that is expected.
+test_run:cmd('start server replica2 with crash_expected=True')
+ | ---
+ | - false
+ | ...
+opts = {filename = 'gh-5613-replica2.log'}
+ | ---
+ | ...
+assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
new file mode 100644
index 000000000..d3c1c1189
--- /dev/null
+++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
@@ -0,0 +1,21 @@
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
+test_run:cmd('start server master with wait=False')
+test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
+test_run:cmd('start server replica1')
+test_run:switch('master')
+box.cfg{read_only = true}
+test_run:switch('default')
+
+test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
+-- It returns false, but that is expected.
+test_run:cmd('start server replica2 with crash_expected=True')
+opts = {filename = 'gh-5613-replica2.log'}
+assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
+
+test_run:cmd('delete server replica2')
+test_run:cmd('stop server replica1')
+test_run:cmd('delete server replica1')
+test_run:cmd('stop server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/gh-5613-master.lua b/test/replication/gh-5613-master.lua
new file mode 100644
index 000000000..408427315
--- /dev/null
+++ b/test/replication/gh-5613-master.lua
@@ -0,0 +1,11 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ listen = 'unix/:./gh-5613-master.sock',
+ replication = {
+ 'unix/:./gh-5613-master.sock',
+ 'unix/:./gh-5613-replica1.sock',
+ },
+})
+box.schema.user.grant('guest', 'super')
diff --git a/test/replication/gh-5613-replica1.lua b/test/replication/gh-5613-replica1.lua
new file mode 100644
index 000000000..d0d6e3372
--- /dev/null
+++ b/test/replication/gh-5613-replica1.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ listen = 'unix/:./gh-5613-replica1.sock',
+ replication = {
+ 'unix/:./gh-5613-master.sock',
+ 'unix/:./gh-5613-replica1.sock',
+ },
+ -- Set to read_only initially so that the bootstrap master is known
+ -- in advance.
+ read_only = true,
+})
diff --git a/test/replication/gh-5613-replica2.lua b/test/replication/gh-5613-replica2.lua
new file mode 100644
index 000000000..8cbd45b61
--- /dev/null
+++ b/test/replication/gh-5613-replica2.lua
@@ -0,0 +1,11 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ listen = 'unix/:./gh-5613-replica2.sock',
+ replication = {
+ 'unix/:./gh-5613-master.sock',
+ 'unix/:./gh-5613-replica1.sock',
+ 'unix/:./gh-5613-replica2.sock',
+ },
+})
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 27eab20c2..f9d5ce1cc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -44,6 +44,7 @@
"gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
"gh-5536-wal-limit.test.lua": {},
"gh-5566-final-join-synchro.test.lua": {},
+ "gh-5613-bootstrap-prefer-booted.test.lua": {},
"gh-6032-promote-wal-write.test.lua": {},
"gh-6057-qsync-confirm-async-no-wal.test.lua": {},
"gh-6094-rs-uuid-mismatch.test.lua": {},
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (5 preceding siblings ...)
2021-06-04 23:38 ` [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-06 17:03 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-06 23:01 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:17 ` Serge Petrenko via Tarantool-patches
2021-06-10 15:03 ` [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Cyrill Gorcunov via Tarantool-patches
2021-06-11 20:56 ` Vladislav Shpilevoy via Tarantool-patches
8 siblings, 2 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-06 17:03 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
There was a bug where a new replica joining a Raft cluster sometimes
tried to register on a non-leader node, which couldn't write to
_cluster, so the join failed with an ER_READONLY error.
In the scope of #5613 the join-master selection algorithm was changed:
a new node now looks for writable members of the cluster to use as a
join-master, and will not choose a follower if there is a leader.
Closes #6127
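The selection policy this message describes — prefer a writable node, then the biggest vclock, then the lowest UUID — can be sketched as a comparison function. This is a toy model, not the actual code: a single LSN stands in for the vclock, an integer stands in for the UUID, and the scores are hypothetical values from the gh-5613 scoring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical candidate for the join-master choice. */
struct candidate {
	int score;     /* gh-5613 ballot score */
	long lsn;      /* single LSN standing in for the vclock */
	unsigned uuid; /* integer standing in for the UUID */
};

/* Return true if a is a better join-master than b. */
static bool
candidate_better(const struct candidate *a, const struct candidate *b)
{
	if (a->score != b->score)
		return a->score > b->score; /* a writable leader wins */
	if (a->lsn != b->lsn)
		return a->lsn > b->lsn;     /* then the biggest vclock */
	return a->uuid < b->uuid;           /* then the lowest UUID */
}
```

This mirrors the test below: master1 has the minimal UUID and an equal vclock, but master2 wins because it is the writable leader and therefore has the higher score.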
---
.../unreleased/gh-6127-raft-join-new.md | 4 +
test/replication/gh-6127-master1.lua | 15 +++
test/replication/gh-6127-master2.lua | 13 +++
test/replication/gh-6127-raft-join-new.result | 105 ++++++++++++++++++
.../gh-6127-raft-join-new.test.lua | 41 +++++++
test/replication/gh-6127-replica.lua | 9 ++
6 files changed, 187 insertions(+)
create mode 100644 changelogs/unreleased/gh-6127-raft-join-new.md
create mode 100644 test/replication/gh-6127-master1.lua
create mode 100644 test/replication/gh-6127-master2.lua
create mode 100644 test/replication/gh-6127-raft-join-new.result
create mode 100644 test/replication/gh-6127-raft-join-new.test.lua
create mode 100644 test/replication/gh-6127-replica.lua
diff --git a/changelogs/unreleased/gh-6127-raft-join-new.md b/changelogs/unreleased/gh-6127-raft-join-new.md
new file mode 100644
index 000000000..a2d898df0
--- /dev/null
+++ b/changelogs/unreleased/gh-6127-raft-join-new.md
@@ -0,0 +1,4 @@
+## bugfix/raft
+
+* Fixed an error where a new replica in a Raft cluster could try to join from a
+  follower instead of the leader and fail with an `ER_READONLY` error (gh-6127).
diff --git a/test/replication/gh-6127-master1.lua b/test/replication/gh-6127-master1.lua
new file mode 100644
index 000000000..708574322
--- /dev/null
+++ b/test/replication/gh-6127-master1.lua
@@ -0,0 +1,15 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ listen = 'unix/:./master1.sock',
+ replication = {
+ 'unix/:./master1.sock',
+ 'unix/:./master2.sock'
+ },
+ election_mode = 'candidate',
+ election_timeout = 0.1,
+ instance_uuid = '10f9828d-b5d5-46a9-b698-ddac7cce5e27',
+})
+box.ctl.wait_rw()
+box.schema.user.grant('guest', 'super')
diff --git a/test/replication/gh-6127-master2.lua b/test/replication/gh-6127-master2.lua
new file mode 100644
index 000000000..1851070c7
--- /dev/null
+++ b/test/replication/gh-6127-master2.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ listen = 'unix/:./master2.sock',
+ replication = {
+ 'unix/:./master1.sock',
+ 'unix/:./master2.sock'
+ },
+ election_mode = 'voter',
+ election_timeout = 0.1,
+ instance_uuid = '20f9828d-b5d5-46a9-b698-ddac7cce5e27',
+})
diff --git a/test/replication/gh-6127-raft-join-new.result b/test/replication/gh-6127-raft-join-new.result
new file mode 100644
index 000000000..be6f8489b
--- /dev/null
+++ b/test/replication/gh-6127-raft-join-new.result
@@ -0,0 +1,105 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-6127: the algorithm selecting a node from which to join a replicaset
+-- should take into account who is the leader (writable and able to write to
+-- _cluster) and who is a follower/candidate.
+--
+test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master1 with wait=False')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master2')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master1')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'voter'}
+ | ---
+ | ...
+test_run:switch('master2')
+ | ---
+ | - true
+ | ...
+-- Perform manual election because it is faster - the automatic one still tries
+-- to wait for 'death timeout' first which is several seconds.
+box.cfg{ \
+ election_mode = 'manual', \
+ election_timeout = 0.1, \
+}
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+box.ctl.wait_rw()
+ | ---
+ | ...
+-- Make sure the other node received the promotion row. Vclocks should now be
+-- equal, so the new node will choose only by read-only state and min UUID.
+test_run:wait_lsn('master1', 'master2')
+ | ---
+ | ...
+
+-- Min UUID is master1's, but it is not writable. Therefore the replica must
+-- join from master2.
+test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+assert(box.info.leader ~= 0)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master1')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6127-raft-join-new.test.lua b/test/replication/gh-6127-raft-join-new.test.lua
new file mode 100644
index 000000000..3e0e9f226
--- /dev/null
+++ b/test/replication/gh-6127-raft-join-new.test.lua
@@ -0,0 +1,41 @@
+test_run = require('test_run').new()
+
+--
+-- gh-6127: the algorithm selecting a node from which to join a replicaset
+-- should take into account who is the leader (writable and able to write to
+-- _cluster) and who is a follower/candidate.
+--
+test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"')
+test_run:cmd('start server master1 with wait=False')
+test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"')
+test_run:cmd('start server master2')
+
+test_run:switch('master1')
+box.cfg{election_mode = 'voter'}
+test_run:switch('master2')
+-- Perform manual election because it is faster - the automatic one still tries
+-- to wait for 'death timeout' first which is several seconds.
+box.cfg{ \
+ election_mode = 'manual', \
+ election_timeout = 0.1, \
+}
+box.ctl.promote()
+box.ctl.wait_rw()
+-- Make sure the other node received the promotion row. Vclocks should now be
+-- equal, so the new node will choose only by read-only state and min UUID.
+test_run:wait_lsn('master1', 'master2')
+
+-- Min UUID is master1's, but it is not writable. Therefore the replica must
+-- join from master2.
+test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"')
+test_run:cmd('start server replica')
+test_run:switch('replica')
+assert(box.info.leader ~= 0)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+test_run:cmd('stop server master2')
+test_run:cmd('delete server master2')
+test_run:cmd('stop server master1')
+test_run:cmd('delete server master1')
diff --git a/test/replication/gh-6127-replica.lua b/test/replication/gh-6127-replica.lua
new file mode 100644
index 000000000..9f4c35ecd
--- /dev/null
+++ b/test/replication/gh-6127-replica.lua
@@ -0,0 +1,9 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+ replication = {
+ 'unix/:./master1.sock',
+ 'unix/:./master2.sock'
+ },
+})
--
2.24.3 (Apple Git-128)
* Re: [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas
2021-06-04 23:38 ` [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-06 17:06 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:14 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-06 17:06 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
I added the issue description to the test:
diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
index d3c1c1189..4dfc39175 100644
--- a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
+++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
@@ -1,5 +1,13 @@
test_run = require('test_run').new()
+--
+-- gh-5613: when a new replica joins a cluster, it must prefer bootstrapped
+-- join sources over non-bootstrapped ones, including itself, even if all the
+-- bootstrapped ones are read-only. Otherwise the node might decide to
+-- bootstrap a new cluster on its own and would never be able to join the
+-- existing one.
+--
* Re: [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster
2021-06-06 17:03 ` [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-06 23:01 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:17 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-06 23:01 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
I made the test not run with 2 engines:
====================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index f9d5ce1cc..3be64e133 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -48,6 +48,7 @@
"gh-6032-promote-wal-write.test.lua": {},
"gh-6057-qsync-confirm-async-no-wal.test.lua": {},
"gh-6094-rs-uuid-mismatch.test.lua": {},
+ "gh-6127-raft-join-new.test.lua": {},
"*": {
"memtx": {"engine": "memtx"},
"vinyl": {"engine": "vinyl"}
* Re: [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader()
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader() Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 13:54 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 13:54 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
05.06.2021 02:37, Vladislav Shpilevoy wrote:
> Firstly, rename it to replicaset_find_join_master(). Now, when
> there is Raft with a concept of an actual leader, the function
> name becomes confusing.
>
> Secondly, do not access ballot member in struct applier in such a
> long way - save the ballot pointer on the stack. This is going to
> become useful when in one of the next patches the ballot will be
> used more.
>
> Part of #5613
Hi! Thanks for the patch!
LGTM.
> ---
> src/box/box.cc | 5 ++---
> src/box/replication.cc | 12 +++++++-----
> src/box/replication.h | 5 +++--
> 3 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 6dc991dc8..0e615e944 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1369,7 +1369,7 @@ box_set_replication_anon(void)
> * Wait until the master has registered this
> * instance.
> */
> - struct replica *master = replicaset_leader();
> + struct replica *master = replicaset_find_join_master();
> if (master == NULL || master->applier == NULL ||
> master->applier->state != APPLIER_CONNECTED) {
> tnt_raise(ClientError, ER_CANNOT_REGISTER);
> @@ -3114,8 +3114,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
> */
> box_sync_replication(true);
>
> - /* Use the first replica by URI as a bootstrap leader */
> - struct replica *master = replicaset_leader();
> + struct replica *master = replicaset_find_join_master();
> assert(master == NULL || master->applier != NULL);
>
> if (master != NULL && !tt_uuid_is_equal(&master->uuid, &INSTANCE_UUID)) {
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index aefb812b3..a1c6e3c7c 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -968,6 +968,7 @@ replicaset_round(bool skip_ro)
> struct applier *applier = replica->applier;
> if (applier == NULL)
> continue;
> + const struct ballot *ballot = &applier->ballot;
> /**
> * While bootstrapping a new cluster, read-only
> * replicas shouldn't be considered as a leader.
> @@ -975,17 +976,18 @@ replicaset_round(bool skip_ro)
> * replicas since there is still a possibility
> * that all replicas exist in cluster table.
> */
> - if (skip_ro && applier->ballot.is_ro)
> + if (skip_ro && ballot->is_ro)
> continue;
> if (leader == NULL) {
> leader = replica;
> continue;
> }
> + const struct ballot *leader_ballot = &leader->applier->ballot;
> /*
> * Try to find a replica which has already left
> * orphan mode.
> */
> - if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading)
> + if (ballot->is_loading && !leader_ballot->is_loading)
> continue;
> /*
> * Choose the replica with the most advanced
> @@ -993,8 +995,8 @@ replicaset_round(bool skip_ro)
> * with the same vclock, prefer the one with
> * the lowest uuid.
> */
> - int cmp = vclock_compare_ignore0(&applier->ballot.vclock,
> - &leader->applier->ballot.vclock);
> + int cmp = vclock_compare_ignore0(&ballot->vclock,
> + &leader_ballot->vclock);
> if (cmp < 0)
> continue;
> if (cmp == 0 && tt_uuid_compare(&replica->uuid,
> @@ -1006,7 +1008,7 @@ replicaset_round(bool skip_ro)
> }
>
> struct replica *
> -replicaset_leader(void)
> +replicaset_find_join_master(void)
> {
> bool skip_ro = true;
> /**
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 2ad1cbf66..5cc380373 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -356,10 +356,11 @@ struct replica *
> replica_by_id(uint32_t replica_id);
>
> /**
> - * Return the replica set leader.
> + * Find a node in the replicaset on which the instance can try to register to
> + * join the replicaset.
> */
> struct replica *
> -replicaset_leader(void);
> +replicaset_find_join_master(void);
>
> struct replica *
> replicaset_first(void);
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 13:56 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 13:56 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
05.06.2021 02:37, Vladislav Shpilevoy wrote:
> Rename the member to show its actual meaning: it is not the real
> RO state of the instance, only how it is configured.
>
> It can happen that the instance is read_only = false, but still is
> in RO state due to other reasons.
>
> The patch is done in scope of #5613 where the ballot is going to
> be extended and used a bit differently in the join-master search
> algorithm.
>
> Part of #5613
LGTM.
> ---
> src/box/box.cc | 2 +-
> src/box/iproto_constants.h | 2 +-
> src/box/replication.cc | 2 +-
> src/box/xrow.c | 12 ++++++------
> src/box/xrow.h | 4 ++--
> 5 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 0e615e944..d35a339ad 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2860,7 +2860,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
> void
> box_process_vote(struct ballot *ballot)
> {
> - ballot->is_ro = cfg_geti("read_only") != 0;
> + ballot->is_ro_cfg = cfg_geti("read_only") != 0;
> ballot->is_anon = replication_anon;
> /*
> * is_ro is true on initial load and is set to box.cfg.read_only
> diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> index 7362ddaf1..d4ee9e090 100644
> --- a/src/box/iproto_constants.h
> +++ b/src/box/iproto_constants.h
> @@ -162,7 +162,7 @@ enum iproto_metadata_key {
> };
>
> enum iproto_ballot_key {
> - IPROTO_BALLOT_IS_RO = 0x01,
> + IPROTO_BALLOT_IS_RO_CFG = 0x01,
> IPROTO_BALLOT_VCLOCK = 0x02,
> IPROTO_BALLOT_GC_VCLOCK = 0x03,
> IPROTO_BALLOT_IS_LOADING = 0x04,
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index a1c6e3c7c..ce2b74065 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -976,7 +976,7 @@ replicaset_round(bool skip_ro)
> * replicas since there is still a possibility
> * that all replicas exist in cluster table.
> */
> - if (skip_ro && ballot->is_ro)
> + if (skip_ro && ballot->is_ro_cfg)
> continue;
> if (leader == NULL) {
> leader = replica;
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index 2e364cea5..6e2a87f8a 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -450,7 +450,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> {
> size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
> - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
> + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
> mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
> mp_sizeof_bool(ballot->is_anon) +
> @@ -470,8 +470,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> data = mp_encode_map(data, 1);
> data = mp_encode_uint(data, IPROTO_BALLOT);
> data = mp_encode_map(data, 5);
> - data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
> - data = mp_encode_bool(data, ballot->is_ro);
> + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
> + data = mp_encode_bool(data, ballot->is_ro_cfg);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
> data = mp_encode_bool(data, ballot->is_loading);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
> @@ -1357,7 +1357,7 @@ xrow_encode_vote(struct xrow_header *row)
> int
> xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> {
> - ballot->is_ro = false;
> + ballot->is_ro_cfg = false;
> ballot->is_loading = false;
> ballot->is_anon = false;
> vclock_create(&ballot->vclock);
> @@ -1399,10 +1399,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> }
> uint32_t key = mp_decode_uint(&data);
> switch (key) {
> - case IPROTO_BALLOT_IS_RO:
> + case IPROTO_BALLOT_IS_RO_CFG:
> if (mp_typeof(*data) != MP_BOOL)
> goto err;
> - ballot->is_ro = mp_decode_bool(&data);
> + ballot->is_ro_cfg = mp_decode_bool(&data);
> break;
> case IPROTO_BALLOT_IS_LOADING:
> if (mp_typeof(*data) != MP_BOOL)
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index b3c664be2..241a7af8e 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -366,8 +366,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len,
>
> /** Reply to IPROTO_VOTE request. */
> struct ballot {
> - /** Set if the instance is running in read-only mode. */
> - bool is_ro;
> + /** Set if the instance is configured in read-only mode. */
> + bool is_ro_cfg;
> /**
> * A flag whether the instance is anonymous, not having an
> * ID, and not going to request it.
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 13:58 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 13:58 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
05.06.2021 02:37, Vladislav Shpilevoy wrote:
> Is_loading in the ballot used to mean the following: "the instance
> did not finish its box.cfg() or has read_only = true", which is
> quite a strange property.
>
> For instance, it was 'true' even if the instance is not really
> loading anymore but has read_only = true.
>
> The patch renames it to 'is_ro' (which existed here before, but
> also with a wrong meaning).
>
> Its behaviour is slightly changed to report the RO state of the
> instance. Not its read_only. This way it incorporates all the
> possible RO conditions. Such as not finished bootstrap, having
> read_only = true, being a Raft follower, and so on.
>
> The patch is done in scope of #5613 where the ballot is going to
> be extended and used a bit differently in the join-master search
> algorithm.
>
> Part of #5613
LGTM.
> ---
> src/box/box.cc | 9 +--------
> src/box/iproto_constants.h | 2 +-
> src/box/replication.cc | 2 +-
> src/box/xrow.c | 12 ++++++------
> src/box/xrow.h | 7 ++++---
> 5 files changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index d35a339ad..d56b44d33 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2862,14 +2862,7 @@ box_process_vote(struct ballot *ballot)
> {
> ballot->is_ro_cfg = cfg_geti("read_only") != 0;
> ballot->is_anon = replication_anon;
> - /*
> - * is_ro is true on initial load and is set to box.cfg.read_only
> - * after box_cfg() returns, during dynamic box.cfg parameters setting.
> - * We would like to prefer already bootstrapped instances to the ones
> - * still bootstrapping and the ones still bootstrapping, but writeable
> - * to the ones that have box.cfg.read_only = true.
> - */
> - ballot->is_loading = is_ro;
> + ballot->is_ro = is_ro_summary;
> vclock_copy(&ballot->vclock, &replicaset.vclock);
> vclock_copy(&ballot->gc_vclock, &gc.vclock);
> }
> diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> index d4ee9e090..0f84843d0 100644
> --- a/src/box/iproto_constants.h
> +++ b/src/box/iproto_constants.h
> @@ -165,7 +165,7 @@ enum iproto_ballot_key {
> IPROTO_BALLOT_IS_RO_CFG = 0x01,
> IPROTO_BALLOT_VCLOCK = 0x02,
> IPROTO_BALLOT_GC_VCLOCK = 0x03,
> - IPROTO_BALLOT_IS_LOADING = 0x04,
> + IPROTO_BALLOT_IS_RO = 0x04,
> IPROTO_BALLOT_IS_ANON = 0x05,
> };
>
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index ce2b74065..990f6239c 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -987,7 +987,7 @@ replicaset_round(bool skip_ro)
> * Try to find a replica which has already left
> * orphan mode.
> */
> - if (ballot->is_loading && !leader_ballot->is_loading)
> + if (ballot->is_ro && !leader_ballot->is_ro)
> continue;
> /*
> * Choose the replica with the most advanced
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index 6e2a87f8a..115a25473 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -451,7 +451,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
> - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
> + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
> mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
> mp_sizeof_bool(ballot->is_anon) +
> mp_sizeof_uint(UINT32_MAX) +
> @@ -472,8 +472,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> data = mp_encode_map(data, 5);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
> data = mp_encode_bool(data, ballot->is_ro_cfg);
> - data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
> - data = mp_encode_bool(data, ballot->is_loading);
> + data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
> + data = mp_encode_bool(data, ballot->is_ro);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
> data = mp_encode_bool(data, ballot->is_anon);
> data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
> @@ -1358,7 +1358,7 @@ int
> xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> {
> ballot->is_ro_cfg = false;
> - ballot->is_loading = false;
> + ballot->is_ro = false;
> ballot->is_anon = false;
> vclock_create(&ballot->vclock);
>
> @@ -1404,10 +1404,10 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> goto err;
> ballot->is_ro_cfg = mp_decode_bool(&data);
> break;
> - case IPROTO_BALLOT_IS_LOADING:
> + case IPROTO_BALLOT_IS_RO:
> if (mp_typeof(*data) != MP_BOOL)
> goto err;
> - ballot->is_loading = mp_decode_bool(&data);
> + ballot->is_ro = mp_decode_bool(&data);
> break;
> case IPROTO_BALLOT_IS_ANON:
> if (mp_typeof(*data) != MP_BOOL)
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index 241a7af8e..1d00b2e43 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -374,10 +374,11 @@ struct ballot {
> */
> bool is_anon;
> /**
> - * Set if the instance hasn't finished bootstrap or recovery, or
> - * is syncing with other replicas in the replicaset.
> + * Set if the instance is not writable due to any reason. Could be
> + * config read_only=true; being orphan; being a Raft follower; not
> + * finished recovery/bootstrap; or anything else.
> */
> - bool is_loading;
> + bool is_ro;
> /** Current instance vclock. */
> struct vclock vclock;
> /** Oldest vclock available on the instance. */
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 14:02 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 14:02 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
05.06.2021 02:37, Vladislav Shpilevoy wrote:
> The new field reports whether the instance has finished its
> bootstrap/recovery, or IOW has finished box.cfg().
>
> The new field will help in fixing #5613 by not trying to join a
> replicaset via non-bootstrapped instances when others are
> available.
>
> The problem is that otherwise, if all nodes are booted but
> are read-only, new instances bootstrap their own independent
> replicaset. It would be better to just fail and terminate the
> process than do such a bizarre action.
>
> Part of #5613
Thanks!
LGTM.
> ---
> src/box/box.cc | 1 +
> src/box/iproto_constants.h | 1 +
> src/box/xrow.c | 14 ++++++++++++--
> src/box/xrow.h | 2 ++
> 4 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index d56b44d33..6fca337bc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2863,6 +2863,7 @@ box_process_vote(struct ballot *ballot)
> ballot->is_ro_cfg = cfg_geti("read_only") != 0;
> ballot->is_anon = replication_anon;
> ballot->is_ro = is_ro_summary;
> + ballot->is_booted = is_box_configured;
> vclock_copy(&ballot->vclock, &replicaset.vclock);
> vclock_copy(&ballot->gc_vclock, &gc.vclock);
> }
> diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> index 0f84843d0..137bee9da 100644
> --- a/src/box/iproto_constants.h
> +++ b/src/box/iproto_constants.h
> @@ -167,6 +167,7 @@ enum iproto_ballot_key {
> IPROTO_BALLOT_GC_VCLOCK = 0x03,
> IPROTO_BALLOT_IS_RO = 0x04,
> IPROTO_BALLOT_IS_ANON = 0x05,
> + IPROTO_BALLOT_IS_BOOTED = 0x06,
> };
>
> #define bit(c) (1ULL<<IPROTO_##c)
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index 115a25473..0d0548fef 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -449,11 +449,13 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> uint64_t sync, uint32_t schema_version)
> {
> size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
> - mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(5) +
> + mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(6) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro_cfg) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
> mp_sizeof_uint(IPROTO_BALLOT_IS_ANON) +
> mp_sizeof_bool(ballot->is_anon) +
> + mp_sizeof_uint(IPROTO_BALLOT_IS_BOOTED) +
> + mp_sizeof_bool(ballot->is_booted) +
> mp_sizeof_uint(UINT32_MAX) +
> mp_sizeof_vclock_ignore0(&ballot->vclock) +
> mp_sizeof_uint(UINT32_MAX) +
> @@ -469,13 +471,15 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> char *data = buf + IPROTO_HEADER_LEN;
> data = mp_encode_map(data, 1);
> data = mp_encode_uint(data, IPROTO_BALLOT);
> - data = mp_encode_map(data, 5);
> + data = mp_encode_map(data, 6);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG);
> data = mp_encode_bool(data, ballot->is_ro_cfg);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
> data = mp_encode_bool(data, ballot->is_ro);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_ANON);
> data = mp_encode_bool(data, ballot->is_anon);
> + data = mp_encode_uint(data, IPROTO_BALLOT_IS_BOOTED);
> + data = mp_encode_bool(data, ballot->is_booted);
> data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
> data = mp_encode_vclock_ignore0(data, &ballot->vclock);
> data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK);
> @@ -1360,6 +1364,7 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> ballot->is_ro_cfg = false;
> ballot->is_ro = false;
> ballot->is_anon = false;
> + ballot->is_booted = true;
> vclock_create(&ballot->vclock);
>
> const char *start = NULL;
> @@ -1424,6 +1429,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> &ballot->gc_vclock) != 0)
> goto err;
> break;
> + case IPROTO_BALLOT_IS_BOOTED:
> + if (mp_typeof(*data) != MP_BOOL)
> + goto err;
> + ballot->is_booted = mp_decode_bool(&data);
> + break;
> default:
> mp_next(&data);
> }
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index 1d00b2e43..055fd31e1 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -379,6 +379,8 @@ struct ballot {
> * finished recovery/bootstrap; or anything else.
> */
> bool is_ro;
> + /** Set if the instance has finished its bootstrap/recovery. */
> + bool is_booted;
> /** Current instance vclock. */
> struct vclock vclock;
> /** Oldest vclock available on the instance. */
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 14:06 ` Serge Petrenko via Tarantool-patches
2021-06-10 15:02 ` Cyrill Gorcunov via Tarantool-patches
1 sibling, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 14:06 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
On 05.06.2021 02:37, Vladislav Shpilevoy wrote:
> The patch refactors the algorithm of finding a join-master (in
> replicaset_find_join_master()) to use scores instead of multiple
> iterations with different criteria.
>
> The original code was relatively fine as long as it had only
> one parameter to change - whether it should skip
> `box.cfg{read_only = true}` nodes.
>
> It was clear, though, that the code was "on the edge" of
> acceptable complexity due to a second, non-configurable
> parameter: whether a replica is in read-only state regardless of
> its config.
>
> It is going to get more complicated when the algorithm takes
> into account a third parameter: whether an instance is
> bootstrapped.
>
> Then it should make decisions like "among bootstrapped nodes try
> to prefer instances not having read_only=true, and not being in
> read-only state". The easiest way to do so is to use
> scores/weights incremented according to the instance's parameters
> matching certain "good points".
>
> Part of #5613
LGTM.
> ---
> src/box/replication.cc | 62 ++++++++++++++++--------------------------
> 1 file changed, 23 insertions(+), 39 deletions(-)
>
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 990f6239c..d33e70f28 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -960,71 +960,55 @@ replicaset_next(struct replica *replica)
> * replicas, choose a read-only replica with biggest vclock
> * as a leader, in hope it will become read-write soon.
> */
> -static struct replica *
> -replicaset_round(bool skip_ro)
> +struct replica *
> +replicaset_find_join_master(void)
> {
> struct replica *leader = NULL;
> + int leader_score = -1;
> replicaset_foreach(replica) {
> struct applier *applier = replica->applier;
> if (applier == NULL)
> continue;
> const struct ballot *ballot = &applier->ballot;
> - /**
> - * While bootstrapping a new cluster, read-only
> - * replicas shouldn't be considered as a leader.
> - * The only exception if there is no read-write
> - * replicas since there is still a possibility
> - * that all replicas exist in cluster table.
> - */
> - if (skip_ro && ballot->is_ro_cfg)
> - continue;
> - if (leader == NULL) {
> - leader = replica;
> - continue;
> - }
> - const struct ballot *leader_ballot = &leader->applier->ballot;
> + int score = 0;
> /*
> - * Try to find a replica which has already left
> - * orphan mode.
> + * Prefer instances not configured as read-only via box.cfg, and
> + * not being in read-only state due to any other reason. The
> + * config is stronger because if it is configured as read-only,
> + * it is in read-only state for sure, until the config is
> + * changed.
> */
> - if (ballot->is_ro && !leader_ballot->is_ro)
> + if (!ballot->is_ro_cfg)
> + score += 5;
> + if (!ballot->is_ro)
> + score += 1;
> + if (leader_score < score)
> + goto elect;
> + if (score < leader_score)
> continue;
> + const struct ballot *leader_ballot;
> + leader_ballot = &leader->applier->ballot;
> /*
> * Choose the replica with the most advanced
> * vclock. If there are two or more replicas
> * with the same vclock, prefer the one with
> * the lowest uuid.
> */
> - int cmp = vclock_compare_ignore0(&ballot->vclock,
> - &leader_ballot->vclock);
> + int cmp;
> + cmp = vclock_compare_ignore0(&ballot->vclock,
> + &leader_ballot->vclock);
> if (cmp < 0)
> continue;
> if (cmp == 0 && tt_uuid_compare(&replica->uuid,
> &leader->uuid) > 0)
> continue;
> + elect:
> leader = replica;
> + leader_score = score;
> }
> return leader;
> }
>
> -struct replica *
> -replicaset_find_join_master(void)
> -{
> - bool skip_ro = true;
> - /**
> - * Two loops, first prefers read-write replicas among others.
> - * Second for backward compatibility, if there is no such
> - * replicas at all.
> - */
> - struct replica *leader = replicaset_round(skip_ro);
> - if (leader == NULL) {
> - skip_ro = false;
> - leader = replicaset_round(skip_ro);
> - }
> -
> - return leader;
> -}
> -
> struct replica *
> replica_by_uuid(const struct tt_uuid *uuid)
> {
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas
2021-06-04 23:38 ` [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas Vladislav Shpilevoy via Tarantool-patches
2021-06-06 17:06 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 14:14 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 14:14 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
On 05.06.2021 02:38, Vladislav Shpilevoy wrote:
> The algorithm of looking for an instance to join the replicaset
> from didn't take into account that some of the instances might
> not be bootstrapped but still perfectly available.
>
> As a result, a ridiculous situation could happen - an instance
> could connect to a cluster with only read-only instances while it
> itself had box.cfg{read_only = false}. Then instead of
> failing or waiting it just booted a brand new cluster. And after
> that the node just started complaining about the others having a
> different replicaset UUID.
>
> The patch makes it so a new instance always prefers a bootstrapped
> join-source to a non-bootstrapped one, including self. In the
> situation above the new instance now terminates with an error.
>
> In the future it will hopefully start a retry-loop instead.
>
> Closes #5613
Thanks! LGTM.
>
> @TarantoolBot document
> Title: IPROTO_BALLOT rework and a new field
>
> A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
> matching with their names. They are changed.
>
> * `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
> `box.cfg{read_only = true}`". It was renamed in the source code
> to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
> the value is the same. Only the name has changed, and in the doc
> should be too.
>
> * `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
> finished `box.cfg()` and it has `read_only = true`". The name
> was therefore wrong, because even after the instance finished
> loading, the flag was still true for `read_only = true` nodes.
> Also such a value is not very suitable for any sane usage.
> The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
> the same, and the value now is "the instance is not writable".
> The reason for being not writable can be any: the node is an
> orphan; or it has `read_only = true`; or it is a Raft follower;
> or anything else.
>
> And there is a new field.
>
> `IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
> bootstrap or recovery.
> ---
> .../gh-5613-bootstrap-prefer-booted.md | 6 ++
> src/box/replication.cc | 20 +++---
> .../gh-5613-bootstrap-prefer-booted.result | 70 +++++++++++++++++++
> .../gh-5613-bootstrap-prefer-booted.test.lua | 21 ++++++
> test/replication/gh-5613-master.lua | 11 +++
> test/replication/gh-5613-replica1.lua | 13 ++++
> test/replication/gh-5613-replica2.lua | 11 +++
> test/replication/suite.cfg | 1 +
> 8 files changed, 144 insertions(+), 9 deletions(-)
> create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
> create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result
> create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua
> create mode 100644 test/replication/gh-5613-master.lua
> create mode 100644 test/replication/gh-5613-replica1.lua
> create mode 100644 test/replication/gh-5613-replica2.lua
>
> diff --git a/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
> new file mode 100644
> index 000000000..c022ee012
> --- /dev/null
> +++ b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
> @@ -0,0 +1,6 @@
> +## bugfix/replication
> +
> +* Fixed an error when a replica, at attempt to join a cluster with exclusively
> + read-only replicas available, instead of failing or retrying just decided to
> + boot its own replicaset. Now it fails with an error about the other nodes
> + being read-only so they can't register it (gh-5613).
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index d33e70f28..52086c65e 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -951,15 +951,6 @@ replicaset_next(struct replica *replica)
> return replica_hash_next(&replicaset.hash, replica);
> }
>
> -/**
> - * Compare vclock, read only mode and orphan status
> - * of all connected replicas and elect a leader.
> - * Initiallly, skip read-only replicas, since they
> - * can not properly act as bootstrap masters (register
> - * new nodes in _cluster table). If there are no read-write
> - * replicas, choose a read-only replica with biggest vclock
> - * as a leader, in hope it will become read-write soon.
> - */
> struct replica *
> replicaset_find_join_master(void)
> {
> @@ -972,12 +963,23 @@ replicaset_find_join_master(void)
> const struct ballot *ballot = &applier->ballot;
> int score = 0;
> /*
> + * First of all try to ignore non-booted instances. Including
> + * self if not booted yet. For self it is even dangerous as the
> + * instance might decide to boot its own cluster if, for
> + * example, the other nodes are available, but read-only. It
> + * would be a mistake.
> + *
> + * For a new cluster it is ok to use a non-booted instance as it
> + * means the algorithm tries to find an initial "boot-master".
> + *
> * Prefer instances not configured as read-only via box.cfg, and
> * not being in read-only state due to any other reason. The
> * config is stronger because if it is configured as read-only,
> * it is in read-only state for sure, until the config is
> * changed.
> */
> + if (ballot->is_booted)
> + score += 10;
> if (!ballot->is_ro_cfg)
> score += 5;
> if (!ballot->is_ro)
> diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.result b/test/replication/gh-5613-bootstrap-prefer-booted.result
> new file mode 100644
> index 000000000..e8e7fb792
> --- /dev/null
> +++ b/test/replication/gh-5613-bootstrap-prefer-booted.result
> @@ -0,0 +1,70 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +box.cfg{read_only = true}
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
> + | ---
> + | - true
> + | ...
> +-- It returns false, but it is expected.
> +test_run:cmd('start server replica2 with crash_expected=True')
> + | ---
> + | - false
> + | ...
> +opts = {filename = 'gh-5613-replica2.log'}
> + | ---
> + | ...
> +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('delete server replica2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server master')
> + | ---
> + | - true
> + | ...
> diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
> new file mode 100644
> index 000000000..d3c1c1189
> --- /dev/null
> +++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
> @@ -0,0 +1,21 @@
> +test_run = require('test_run').new()
> +
> +test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
> +test_run:cmd('start server master with wait=False')
> +test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
> +test_run:cmd('start server replica1')
> +test_run:switch('master')
> +box.cfg{read_only = true}
> +test_run:switch('default')
> +
> +test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
> +-- It returns false, but it is expected.
> +test_run:cmd('start server replica2 with crash_expected=True')
> +opts = {filename = 'gh-5613-replica2.log'}
> +assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
> +
> +test_run:cmd('delete server replica2')
> +test_run:cmd('stop server replica1')
> +test_run:cmd('delete server replica1')
> +test_run:cmd('stop server master')
> +test_run:cmd('delete server master')
> diff --git a/test/replication/gh-5613-master.lua b/test/replication/gh-5613-master.lua
> new file mode 100644
> index 000000000..408427315
> --- /dev/null
> +++ b/test/replication/gh-5613-master.lua
> @@ -0,0 +1,11 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + listen = 'unix/:./gh-5613-master.sock',
> + replication = {
> + 'unix/:./gh-5613-master.sock',
> + 'unix/:./gh-5613-replica1.sock',
> + },
> +})
> +box.schema.user.grant('guest', 'super')
> diff --git a/test/replication/gh-5613-replica1.lua b/test/replication/gh-5613-replica1.lua
> new file mode 100644
> index 000000000..d0d6e3372
> --- /dev/null
> +++ b/test/replication/gh-5613-replica1.lua
> @@ -0,0 +1,13 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + listen = 'unix/:./gh-5613-replica1.sock',
> + replication = {
> + 'unix/:./gh-5613-master.sock',
> + 'unix/:./gh-5613-replica1.sock',
> + },
> + -- Set to read_only initially so as the bootstrap-master would be
> + -- known in advance.
> + read_only = true,
> +})
> diff --git a/test/replication/gh-5613-replica2.lua b/test/replication/gh-5613-replica2.lua
> new file mode 100644
> index 000000000..8cbd45b61
> --- /dev/null
> +++ b/test/replication/gh-5613-replica2.lua
> @@ -0,0 +1,11 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + listen = 'unix/:./gh-5613-replica2.sock',
> + replication = {
> + 'unix/:./gh-5613-master.sock',
> + 'unix/:./gh-5613-replica1.sock',
> + 'unix/:./gh-5613-replica2.sock',
> + },
> +})
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index 27eab20c2..f9d5ce1cc 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -44,6 +44,7 @@
> "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
> "gh-5536-wal-limit.test.lua": {},
> "gh-5566-final-join-synchro.test.lua": {},
> + "gh-5613-bootstrap-prefer-booted.test.lua": {},
> "gh-6032-promote-wal-write.test.lua": {},
> "gh-6057-qsync-confirm-async-no-wal.test.lua": {},
> "gh-6094-rs-uuid-mismatch.test.lua": {},
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster
2021-06-06 17:03 ` [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster Vladislav Shpilevoy via Tarantool-patches
2021-06-06 23:01 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 14:17 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 14:17 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
On 06.06.2021 20:03, Vladislav Shpilevoy wrote:
> There was a bug that a new replica at join to a Raft cluster
> sometimes tried to register on a non-leader node which couldn't
> write to _cluster, so the join failed with ER_READONLY error.
>
> Now in scope of #5613 the algorithm of join-master selection is
> changed. A new node looks for writable members of the cluster to
> use a join-master. It will not choose a follower if there is a
> leader.
>
> Closes #6127
Thanks for working on this! LGTM.
> ---
> .../unreleased/gh-6127-raft-join-new.md | 4 +
> test/replication/gh-6127-master1.lua | 15 +++
> test/replication/gh-6127-master2.lua | 13 +++
> test/replication/gh-6127-raft-join-new.result | 105 ++++++++++++++++++
> .../gh-6127-raft-join-new.test.lua | 41 +++++++
> test/replication/gh-6127-replica.lua | 9 ++
> 6 files changed, 187 insertions(+)
> create mode 100644 changelogs/unreleased/gh-6127-raft-join-new.md
> create mode 100644 test/replication/gh-6127-master1.lua
> create mode 100644 test/replication/gh-6127-master2.lua
> create mode 100644 test/replication/gh-6127-raft-join-new.result
> create mode 100644 test/replication/gh-6127-raft-join-new.test.lua
> create mode 100644 test/replication/gh-6127-replica.lua
>
> diff --git a/changelogs/unreleased/gh-6127-raft-join-new.md b/changelogs/unreleased/gh-6127-raft-join-new.md
> new file mode 100644
> index 000000000..a2d898df0
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6127-raft-join-new.md
> @@ -0,0 +1,4 @@
> +## bugfix/raft
> +
> +* Fixed an error when a new replica in a Raft cluster could try to join from a
> + follower instead of a leader and failed with an error `ER_READONLY` (gh-6127).
> diff --git a/test/replication/gh-6127-master1.lua b/test/replication/gh-6127-master1.lua
> new file mode 100644
> index 000000000..708574322
> --- /dev/null
> +++ b/test/replication/gh-6127-master1.lua
> @@ -0,0 +1,15 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + listen = 'unix/:./master1.sock',
> + replication = {
> + 'unix/:./master1.sock',
> + 'unix/:./master2.sock'
> + },
> + election_mode = 'candidate',
> + election_timeout = 0.1,
> + instance_uuid = '10f9828d-b5d5-46a9-b698-ddac7cce5e27',
> +})
> +box.ctl.wait_rw()
> +box.schema.user.grant('guest', 'super')
> diff --git a/test/replication/gh-6127-master2.lua b/test/replication/gh-6127-master2.lua
> new file mode 100644
> index 000000000..1851070c7
> --- /dev/null
> +++ b/test/replication/gh-6127-master2.lua
> @@ -0,0 +1,13 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + listen = 'unix/:./master2.sock',
> + replication = {
> + 'unix/:./master1.sock',
> + 'unix/:./master2.sock'
> + },
> + election_mode = 'voter',
> + election_timeout = 0.1,
> + instance_uuid = '20f9828d-b5d5-46a9-b698-ddac7cce5e27',
> +})
> diff --git a/test/replication/gh-6127-raft-join-new.result b/test/replication/gh-6127-raft-join-new.result
> new file mode 100644
> index 000000000..be6f8489b
> --- /dev/null
> +++ b/test/replication/gh-6127-raft-join-new.result
> @@ -0,0 +1,105 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +--
> +-- gh-6127: the algorithm selecting a node from which to join to a replicaset
> +-- should take into account who is the leader (is writable and can write to
> +-- _cluster) and who is a follower/candidate.
> +--
> +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master1 with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master2')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('master1')
> + | ---
> + | - true
> + | ...
> +box.cfg{election_mode = 'voter'}
> + | ---
> + | ...
> +test_run:switch('master2')
> + | ---
> + | - true
> + | ...
> +-- Perform manual election because it is faster - the automatic one still tries
> +-- to wait for 'death timeout' first which is several seconds.
> +box.cfg{ \
> + election_mode = 'manual', \
> + election_timeout = 0.1, \
> +}
> + | ---
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +box.ctl.wait_rw()
> + | ---
> + | ...
> +-- Make sure the other node received the promotion row. Vclocks now should be
> +-- equal so the new node would select only using read-only state and min UUID.
> +test_run:wait_lsn('master1', 'master2')
> + | ---
> + | ...
> +
> +-- Min UUID is master1, but it is not writable. Therefore must join from
> +-- master2.
> +test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +assert(box.info.leader ~= 0)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server master2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server master1')
> + | ---
> + | - true
> + | ...
> diff --git a/test/replication/gh-6127-raft-join-new.test.lua b/test/replication/gh-6127-raft-join-new.test.lua
> new file mode 100644
> index 000000000..3e0e9f226
> --- /dev/null
> +++ b/test/replication/gh-6127-raft-join-new.test.lua
> @@ -0,0 +1,41 @@
> +test_run = require('test_run').new()
> +
> +--
> +-- gh-6127: the algorithm selecting a node from which to join to a replicaset
> +-- should take into account who is the leader (is writable and can write to
> +-- _cluster) and who is a follower/candidate.
> +--
> +test_run:cmd('create server master1 with script="replication/gh-6127-master1.lua"')
> +test_run:cmd('start server master1 with wait=False')
> +test_run:cmd('create server master2 with script="replication/gh-6127-master2.lua"')
> +test_run:cmd('start server master2')
> +
> +test_run:switch('master1')
> +box.cfg{election_mode = 'voter'}
> +test_run:switch('master2')
> +-- Perform manual election because it is faster - the automatic one still tries
> +-- to wait for 'death timeout' first which is several seconds.
> +box.cfg{ \
> + election_mode = 'manual', \
> + election_timeout = 0.1, \
> +}
> +box.ctl.promote()
> +box.ctl.wait_rw()
> +-- Make sure the other node received the promotion row. Vclocks now should be
> +-- equal so the new node would select only using read-only state and min UUID.
> +test_run:wait_lsn('master1', 'master2')
> +
> +-- Min UUID is master1, but it is not writable. Therefore must join from
> +-- master2.
> +test_run:cmd('create server replica with script="replication/gh-6127-replica.lua"')
> +test_run:cmd('start server replica')
> +test_run:switch('replica')
> +assert(box.info.leader ~= 0)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +test_run:cmd('stop server master2')
> +test_run:cmd('delete server master2')
> +test_run:cmd('stop server master1')
> +test_run:cmd('delete server master1')
> diff --git a/test/replication/gh-6127-replica.lua b/test/replication/gh-6127-replica.lua
> new file mode 100644
> index 000000000..9f4c35ecd
> --- /dev/null
> +++ b/test/replication/gh-6127-replica.lua
> @@ -0,0 +1,9 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +box.cfg({
> + replication = {
> + 'unix/:./master1.sock',
> + 'unix/:./master2.sock'
> + },
> +})
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:06 ` Serge Petrenko via Tarantool-patches
@ 2021-06-10 15:02 ` Cyrill Gorcunov via Tarantool-patches
2021-06-10 20:09 ` Vladislav Shpilevoy via Tarantool-patches
1 sibling, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-10 15:02 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
> + int score = 0;
> /*
> - * Try to find a replica which has already left
> - * orphan mode.
> + * Prefer instances not configured as read-only via box.cfg, and
> + * not being in read-only state due to any other reason. The
> + * config is stronger because if it is configured as read-only,
> + * it is in read-only state for sure, until the config is
> + * changed.
> */
> - if (ballot->is_ro && !leader_ballot->is_ro)
> + if (!ballot->is_ro_cfg)
> + score += 5;
> + if (!ballot->is_ro)
> + score += 1;
> + if (leader_score < score)
> + goto elect;
> + if (score < leader_score)
> continue;
Vlad, unless I'm missing something obvious, we can do this simpler
without branching at all, say
score = (ballot->is_booted << 3) |
(ballot->is_ro_cfg << 2) |
(ballot->is_ro << 1);
up to you, just an idea.
...
> if (cmp == 0 && tt_uuid_compare(&replica->uuid,
> &leader->uuid) > 0)
> continue;
> + elect:
seems extra tab here?
> leader = replica;
> + leader_score = score;
> }
> return leader;
> }
* Re: [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (6 preceding siblings ...)
2021-06-06 17:03 ` [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 15:03 ` Cyrill Gorcunov via Tarantool-patches
2021-06-11 20:56 ` Vladislav Shpilevoy via Tarantool-patches
8 siblings, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-10 15:03 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
On Sat, Jun 05, 2021 at 01:37:54AM +0200, Vladislav Shpilevoy wrote:
> See the commit messages for the details.
>
> I did some basic tests for backward compatibility since I added a new field to
> the ballot, and it works fine.
ack
* Re: [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master
2021-06-10 15:02 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-10 20:09 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 21:28 ` Cyrill Gorcunov via Tarantool-patches
0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-10 20:09 UTC (permalink / raw)
To: Cyrill Gorcunov; +Cc: tarantool-patches
Hi! Thanks for the review!
On 10.06.2021 17:02, Cyrill Gorcunov wrote:
> On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
>> + int score = 0;
>> /*
>> - * Try to find a replica which has already left
>> - * orphan mode.
>> + * Prefer instances not configured as read-only via box.cfg, and
>> + * not being in read-only state due to any other reason. The
>> + * config is stronger because if it is configured as read-only,
>> + * it is in read-only state for sure, until the config is
>> + * changed.
>> */
>> - if (ballot->is_ro && !leader_ballot->is_ro)
>> + if (!ballot->is_ro_cfg)
>> + score += 5;
>> + if (!ballot->is_ro)
>> + score += 1;
>> + if (leader_score < score)
>> + goto elect;
>> + if (score < leader_score)
>> continue;
>
> Vlad, unless I'm missing something obvious, we can do this simpler
> without branching at all, say
>
> score = (ballot->is_booted << 3) |
> (ballot->is_ro_cfg << 2) |
> (ballot->is_ro << 1);
>
> up to you, just an idea.
This would work, technically. But TBH I don't see how it is simpler.
>> if (cmp == 0 && tt_uuid_compare(&replica->uuid,
>> &leader->uuid) > 0)
>> continue;
>> + elect:
>
> seems extra tab here?
Yes, fixed on the branch.
* Re: [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master
2021-06-10 20:09 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 21:28 ` Cyrill Gorcunov via Tarantool-patches
0 siblings, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-10 21:28 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
On Thu, Jun 10, 2021 at 10:09:24PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 10.06.2021 17:02, Cyrill Gorcunov wrote:
> > On Sat, Jun 05, 2021 at 01:37:59AM +0200, Vladislav Shpilevoy wrote:
> >> + int score = 0;
> >> /*
> >> - * Try to find a replica which has already left
> >> - * orphan mode.
> >> + * Prefer instances not configured as read-only via box.cfg, and
> >> + * not being in read-only state due to any other reason. The
> >> + * config is stronger because if it is configured as read-only,
> >> + * it is in read-only state for sure, until the config is
> >> + * changed.
> >> */
> >> - if (ballot->is_ro && !leader_ballot->is_ro)
> >> + if (!ballot->is_ro_cfg)
> >> + score += 5;
> >> + if (!ballot->is_ro)
> >> + score += 1;
> >> + if (leader_score < score)
> >> + goto elect;
> >> + if (score < leader_score)
> >> continue;
> >
> > Vlad, unless I'm missing something obvious, we can do this simpler
> > without branching at all, say
> >
> > score = (ballot->is_booted << 3) |
> > (ballot->is_ro_cfg << 2) |
> > (ballot->is_ro << 1);
> >
> > up to you, just an idea.
>
> This would work, technically. But TBH I don't see how it is simpler.
Due to the properties of a 2^N polynomial, when each member is used
once and we use additions only.

First, we drop the branching, which allows us to save BPU cycles, but
since this is not a hot path it doesn't matter. What does matter is
the way we choose the weight triplets. Each score addition (in the
form they are chosen now) generates a strict partial order, such
as 1 < 5 < 10, so if someone accidentally changes 5 to 9 this
will no longer work properly, I think. If someone has to
add a fourth score, they must modify the numbers carefully so
that the order remains.

In contrast, the 2^N polynomial already has all these properties: each
new power is never equal to a sum of the previous powers, so if a
fourth score is needed we simply add (x << 4) to the formula
above.

Anyway, I'm fine with the current code, I just wanted to explain why
I consider using powers of two here simpler and more flexible. Since we
have only 3 scores for now, plain natural triplets should not
be a problem.
* Re: [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
` (7 preceding siblings ...)
2021-06-10 15:03 ` [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Cyrill Gorcunov via Tarantool-patches
@ 2021-06-11 20:56 ` Vladislav Shpilevoy via Tarantool-patches
8 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 20:56 UTC (permalink / raw)
To: tarantool-patches, gorcunov, sergepetrenko
Pushed to master, 2.8, 2.7.
In the last commit I changed all 'raft' words to
'election', including the test file names, to
conform with the existing terminology.
end of thread [~2021-06-11 20:56 UTC | newest]
Thread overview: 22+ messages
2021-06-04 23:37 [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Vladislav Shpilevoy via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 1/6] replication: refactor replicaset_leader() Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:54 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 2/6] replication: ballot.is_ro -> is_ro_cfg Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:56 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 3/6] replication: ballot.is_loading -> is_ro Vladislav Shpilevoy via Tarantool-patches
2021-06-10 13:58 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 4/6] replication: introduce ballot.is_booted Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:02 ` Serge Petrenko via Tarantool-patches
2021-06-04 23:37 ` [Tarantool-patches] [PATCH 5/6] replication: use 'score' to find a join-master Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:06 ` Serge Petrenko via Tarantool-patches
2021-06-10 15:02 ` Cyrill Gorcunov via Tarantool-patches
2021-06-10 20:09 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 21:28 ` Cyrill Gorcunov via Tarantool-patches
2021-06-04 23:38 ` [Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas Vladislav Shpilevoy via Tarantool-patches
2021-06-06 17:06 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:14 ` Serge Petrenko via Tarantool-patches
2021-06-06 17:03 ` [Tarantool-patches] [PATCH 7/6] raft: test join to a raft cluster Vladislav Shpilevoy via Tarantool-patches
2021-06-06 23:01 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:17 ` Serge Petrenko via Tarantool-patches
2021-06-10 15:03 ` [Tarantool-patches] [PATCH 0/6] Instance join should prefer booted instances Cyrill Gorcunov via Tarantool-patches
2021-06-11 20:56 ` Vladislav Shpilevoy via Tarantool-patches