[tarantool-patches] Re: [tarantool-patches] [PATCH v2] replication: add is_orphan field to ballot

Sergey Petrenko sergepetrenko at tarantool.org
Tue Oct 8 12:53:39 MSK 2019




>Понедельник,  7 октября 2019, 18:50 +03:00 от Serge Petrenko <sergepetrenko at tarantool.org>:
>
>A successfully fetched remote instance ballot isn't updated during
>bootstrap procedure. This leads to a case when different instances
>choose different masters as their bootstrap leaders.
>
>Imagine such a situation.
>You start instance A without replication set up. Instance A successfully
>bootstraps.
>You also have instances B and C both with replication set up to {A, B,
>C} and replication_connect_quorum set to 3
>You first start instance B. It doesn't proceed to choosing a leader
>until one of the events happens: either the replication_connect_timeout
>runs out, or instance C is up and starts listening on its port.
>B has established connection to A and fetched its ballot, with some
>vclock, say, {1: 1}.
>B retries connection to C every replication_timeout seconds.
>Then you start instance C. Instance C succeeds in connecting to A and B
>right away and bootstraps from instance A. Instance A registers C in its
>_cluster table. This registration is replicated to instance C.
>Meanwhile, instance C is trying to sync with quorum instances (which is
>3), and stays in orphan mode.
>Now replication_timeout on instance B finally runs out. It retries a
>previously unsuccessful connection to C and succeeds. C sends its ballot
>to B with vclock = {1: 2, 2:0} (in our example), since it has already
>incremented it after _cluster registration.
>B sees that C has a greater vclock than A, and chooses to bootstrap from
>C instead of A. C is orphan and rejects B's attempt to join. B dies.
>
>To fix such ungentlemanlike behaviour of C, we should at least include
>loading status in ballot and prefer fully bootstrapped instances to the
>ones still syncing with other replicas.
>We also need to use a separate flag instead of ballot's already existent
>is_ro, since we still want to prefer loading instances over the ones
>explicitly configured to be read-only.
>
>Closes #4527
>---
>https://github.com/tarantool/tarantool/issues/4527
>https://github.com/tarantool/tarantool/tree/sp/gh-4527-ballot-vclock
>
>Changes in v2:
>  - Change is_orphan flag to is_loading
>    Only pass is_ro flag to is_loading,
>    since it is true at all times when
>    is_orphan is true (during bootstrap).
>
> src/box/box.cc                             |  8 +++
> src/box/iproto_constants.h                 |  1 +
> src/box/replication.cc                     | 10 +++-
> src/box/xrow.c                             | 13 ++++-
> src/box/xrow.h                             |  5 ++
> test/replication/bootstrap_leader.result   | 61 ++++++++++++++++++++++
> test/replication/bootstrap_leader.test.lua | 27 ++++++++++
> test/replication/replica_uuid_rw.lua       | 26 +++++++++
> test/replication/replica_uuid_rw1.lua      |  1 +
> test/replication/replica_uuid_rw2.lua      |  1 +
> test/replication/replica_uuid_rw3.lua      |  1 +
> 11 files changed, 150 insertions(+), 4 deletions(-)
> create mode 100644 test/replication/bootstrap_leader.result
> create mode 100644 test/replication/bootstrap_leader.test.lua
> create mode 100644 test/replication/replica_uuid_rw.lua
> create mode 120000 test/replication/replica_uuid_rw1.lua
> create mode 120000 test/replication/replica_uuid_rw2.lua
> create mode 120000 test/replication/replica_uuid_rw3.lua
>
>diff --git a/src/box/box.cc b/src/box/box.cc
>index 40920e649..baf029a09 100644
>--- a/src/box/box.cc
>+++ b/src/box/box.cc
>@@ -1630,6 +1630,14 @@ void
> box_process_vote(struct ballot *ballot)
> {
> ballot->is_ro = cfg_geti("read_only") != 0;
>+/*
>+ * is_ro is true on initial load and is set to box.cfg.read_only
>+ * after box_cfg() returns, during dynamic box.cfg parameters setting.
>+ * We would like to prefer already bootstrapped instances to the ones
>+ * still bootstrapping and the ones still bootstrapping, but writeable
>+ * to the ones that have box.cfg.read_only = true.
>+ */
>+ballot->is_loading = is_ro;
> vclock_copy(&ballot->vclock, &replicaset.vclock);
> vclock_copy(&ballot->gc_vclock, &gc.vclock);
> }
>diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
>index 724cce535..5e8a7d483 100644
>--- a/src/box/iproto_constants.h
>+++ b/src/box/iproto_constants.h
>@@ -137,6 +137,7 @@ enum iproto_ballot_key {
> IPROTO_BALLOT_IS_RO = 0x01,
> IPROTO_BALLOT_VCLOCK = 0x02,
> IPROTO_BALLOT_GC_VCLOCK = 0x03,
>+IPROTO_BALLOT_IS_LOADING = 0x04,
> };
> 
> #define bit(c) (1ULL<<IPROTO_##c)
>diff --git a/src/box/replication.cc b/src/box/replication.cc
>index d691ce487..6fcc56fe3 100644
>--- a/src/box/replication.cc
>+++ b/src/box/replication.cc
>@@ -863,8 +863,8 @@ replicaset_next(struct replica *replica)
> }
> 
> /**
>- * Compare vclock and read only mode of all connected
>- * replicas and elect a leader.
>+ * Compare vclock, read only mode and orphan status
>+ * of all connected replicas and elect a leader.
>  * Initiallly, skip read-only replicas, since they
>  * can not properly act as bootstrap masters (register
>  * new nodes in _cluster table). If there are no read-write
>@@ -892,6 +892,12 @@ replicaset_round(bool skip_ro)
> leader = replica;
> continue;
> }
>+/*
>+ * Try to find a replica which has already left
>+ * orphan mode.
>+ */
>+if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading)
>+continue;
> /*
>  * Choose the replica with the most advanced
>  * vclock. If there are two or more replicas
>diff --git a/src/box/xrow.c b/src/box/xrow.c
>index 0ae5271c1..18bf08971 100644
>--- a/src/box/xrow.c
>+++ b/src/box/xrow.c
>@@ -441,8 +441,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
>   uint64_t sync, uint32_t schema_version)
> {
> size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
>-mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(3) +
>+mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(4) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
>+mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->vclock) +
> mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->gc_vclock);
> 
>@@ -456,9 +457,11 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
> char *data = buf + IPROTO_HEADER_LEN;
> data = mp_encode_map(data, 1);
> data = mp_encode_uint(data, IPROTO_BALLOT);
>-data = mp_encode_map(data, 3);
>+data = mp_encode_map(data, 4);
> data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
> data = mp_encode_bool(data, ballot->is_ro);
>+data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
>+data = mp_encode_bool(data, ballot->is_loading);
> data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
> data = mp_encode_vclock(data, &ballot->vclock);
> data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK);
>@@ -1077,6 +1080,7 @@ int
> xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> {
> ballot->is_ro = false;
>+ballot->is_loading = false;
> vclock_create(&ballot->vclock);
> 
> const char *start = NULL;
>@@ -1121,6 +1125,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> goto err;
> ballot->is_ro = mp_decode_bool(&data);
> break;
>+case IPROTO_BALLOT_IS_LOADING:
>+if (mp_typeof(*data) != MP_BOOL)
>+goto err;
>+ballot->is_loading = mp_decode_bool(&data);
>+break;
> case IPROTO_BALLOT_VCLOCK:
> if (mp_decode_vclock(&data, &ballot->vclock) != 0)
> goto err;
>diff --git a/src/box/xrow.h b/src/box/xrow.h
>index 35ec06dc0..60def2d3c 100644
>--- a/src/box/xrow.h
>+++ b/src/box/xrow.h
>@@ -275,6 +275,11 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len,
> struct ballot {
> /** Set if the instance is running in read-only mode. */
> bool is_ro;
>+/**
>+ * Set if the instance hasn't finished bootstrap or recovery, or
>+ * is syncing with other replicas in the replicaset.
>+ */
>+bool is_loading;
> /** Current instance vclock. */
> struct vclock vclock;
> /** Oldest vclock available on the instance. */
>diff --git a/test/replication/bootstrap_leader.result b/test/replication/bootstrap_leader.result
>new file mode 100644
>index 000000000..ef4265cd3
>--- /dev/null
>+++ b/test/replication/bootstrap_leader.result
>@@ -0,0 +1,61 @@
>+-- test-run result file version 2
>+test_run = require('test_run').new()
>+ | ---
>+ | ...
>+
>+create_server_cmd = "create server replica%d with script='replication/replica_uuid_rw%d.lua'"
>+ | ---
>+ | ...
>+
>+test_run:cmd("setopt delimiter ';'")
>+ | ---
>+ | - true
>+ | ...
>+
>+for i = 1,3 do
>+    test_run:cmd(string.format(create_server_cmd, i, i))
>+end;
>+ | ---
>+ | ...
>+
>+test_run:cmd("start server replica1 with wait_load=False, wait=False");
>+ | ---
>+ | - true
>+ | ...
>+test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1', wait_load=False, wait=False");
>+ | ---
>+ | - true
>+ | ...
>+test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False");
>+ | ---
>+ | - true
>+ | ...
>+
>+test_run:cmd("switch replica3");
>+ | ---
>+ | - true
>+ | ...
>+fiber = require('fiber');
>+ | ---
>+ | ...
>+while (#box.info.replication < 3) do
>+    fiber.sleep(0.05)
>+end;
>+ | ---
>+ | ...
>+test_run:cmd("switch default");
>+ | ---
>+ | - true
>+ | ...
>+
>+for i = 1,3 do
>+    test_run:cmd("stop server replica"..i.." with cleanup=1")
>+    test_run:cmd("delete server replica"..i)
>+end;
>+ | ---
>+ | ...
>+
>+test_run:cmd("setopt delimiter ''");
>+ | ---
>+ | - true
>+ | ...
>diff --git a/test/replication/bootstrap_leader.test.lua b/test/replication/bootstrap_leader.test.lua
>new file mode 100644
>index 000000000..eb9aa293f
>--- /dev/null
>+++ b/test/replication/bootstrap_leader.test.lua
>@@ -0,0 +1,27 @@
>+test_run = require('test_run').new()
>+
>+create_server_cmd = "create server replica%d with script='replication/replica_uuid_rw%d.lua'"
>+
>+test_run:cmd("setopt delimiter ';'")
>+
>+for i = 1,3 do
>+    test_run:cmd(string.format(create_server_cmd, i, i))
>+end;
>+
>+test_run:cmd("start server replica1 with wait_load=False, wait=False");
>+test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1', wait_load=False, wait=False");
>+test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False");
>+
>+test_run:cmd("switch replica3");
>+fiber = require('fiber');
>+while (#box.info.replication < 3) do
>+    fiber.sleep(0.05)
>+end;
>+test_run:cmd("switch default");
>+
>+for i = 1,3 do
>+    test_run:cmd("stop server replica"..i.." with cleanup=1")
>+    test_run:cmd("delete server replica"..i)
>+end;
>+
>+test_run:cmd("setopt delimiter ''");
>diff --git a/test/replication/replica_uuid_rw.lua b/test/replication/replica_uuid_rw.lua
>new file mode 100644
>index 000000000..60f2d6362
>--- /dev/null
>+++ b/test/replication/replica_uuid_rw.lua
>@@ -0,0 +1,26 @@
>+#!/usr/bin/env tarantool
>+
>+local INSTANCE_ID = string.match(arg[0], "%d")
>+local SOCKET_DIR = require('fio').cwd()
>+
>+local function instance_uri(instance_id)
>+    return SOCKET_DIR..'/replica_uuid_rw'..instance_id..'.sock'
>+end
>+
>+local repl_tbl = {}
>+for num in string.gmatch(arg[1] or "", "%d") do
>+    table.insert(repl_tbl, instance_uri(num))
>+end
>+
>+require('console').listen(os.getenv('ADMIN'))
>+
>+box.cfg({
>+    instance_uuid       = "aaaaaaaa-aaaa-0000-0000-00000000000"..INSTANCE_ID,
>+    listen              = instance_uri(INSTANCE_ID),
>+    replication         = repl_tbl,
>+    replication_timeout = arg[2] and tonumber(arg[2]) or 0.1,
>+    replication_connect_timeout = arg[3] and tonumber(arg[3]) or 0.5,
>+    replication_sync_timeout = arg[4] and tonumber(arg[4]) or 1.0,
>+})
>+
>+box.once("bootstrap", function() box.schema.user.grant('guest', 'replication') end)
>diff --git a/test/replication/replica_uuid_rw1.lua b/test/replication/replica_uuid_rw1.lua
>new file mode 120000
>index 000000000..328386f50
>--- /dev/null
>+++ b/test/replication/replica_uuid_rw1.lua
>@@ -0,0 +1 @@
>+replica_uuid_rw.lua
>\ No newline at end of file
>diff --git a/test/replication/replica_uuid_rw2.lua b/test/replication/replica_uuid_rw2.lua
>new file mode 120000
>index 000000000..328386f50
>--- /dev/null
>+++ b/test/replication/replica_uuid_rw2.lua
>@@ -0,0 +1 @@
>+replica_uuid_rw.lua
>\ No newline at end of file
>diff --git a/test/replication/replica_uuid_rw3.lua b/test/replication/replica_uuid_rw3.lua
>new file mode 120000
>index 000000000..328386f50
>--- /dev/null
>+++ b/test/replication/replica_uuid_rw3.lua
>@@ -0,0 +1 @@
>+replica_uuid_rw.lua
>\ No newline at end of file
>-- 
>2.21.0 (Apple Git-122)
>
>

Hi! A minor fix for the test.
The diff is below
The new branch is sp/gh-4527-ballot-vclock-full-ci

diff --git a/test/replication/bootstrap_leader.result b/test/replication/bootstrap_leader.result
index ef4265cd3..7d1a33d8e 100644
--- a/test/replication/bootstrap_leader.result
+++ b/test/replication/bootstrap_leader.result
@@ -18,7 +18,7 @@ end;
  | ---
  | ...
 
-test_run:cmd("start server replica1 with wait_load=False, wait=False");
+test_run:cmd("start server replica1 with wait_load=True, wait=True");
  | ---
  | - true
  | ...
@@ -26,7 +26,7 @@ test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1', wait_load=
  | ---
  | - true
  | ...
-test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False");
+test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=True, wait=True");
  | ---
  | - true
  | ...
diff --git a/test/replication/bootstrap_leader.test.lua b/test/replication/bootstrap_leader.test.lua
index eb9aa293f..984a82b8e 100644
--- a/test/replication/bootstrap_leader.test.lua
+++ b/test/replication/bootstrap_leader.test.lua
@@ -8,9 +8,9 @@ for i = 1,3 do
     test_run:cmd(string.format(create_server_cmd, i, i))
 end;
 
-test_run:cmd("start server replica1 with wait_load=False, wait=False");
+test_run:cmd("start server replica1 with wait_load=True, wait=True");
 test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1', wait_load=False, wait=False");
-test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False");
+test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500', wait_load=True, wait=True");
 
 test_run:cmd("switch replica3");
 fiber = require('fiber');
-- 
Sergey Petrenko


More information about the Tarantool-patches mailing list