[tarantool-patches] Re: [PATCH v2] replication: add is_orphan field to ballot

Georgy Kirichenko georgy at tarantool.org
Tue Oct 8 15:05:12 MSK 2019


Thank you for the patch.
LGTM

On Tuesday, October 8, 2019 12:53:39 PM MSK Sergey Petrenko wrote:
> >Понедельник,  7 октября 2019, 18:50 +03:00 от Serge Petrenko
> ><sergepetrenko at tarantool.org>:
> >
> >A successfully fetched remote instance ballot isn't updated during
> >bootstrap procedure. This leads to a case when different instances
> >choose different masters as their bootstrap leaders.
> >
> >Imagine such a situation.
> >You start instance A without replication set up. Instance A successfully
> >bootstraps.
> >You also have instances B and C both with replication set up to {A, B,
> >C} and replication_connect_quorum set to 3
> >You first start instance B. It doesn't proceed to choosing a leader
> >until one of the events happens: either the replication_connect_timeout
> >runs out, or instance C is up and starts listening on its port.
> >B has established connection to A and fetched its ballot, with some
> >vclock, say, {1: 1}.
> >B retries connection to C every replication_timeout seconds.
> >Then you start instance C. Instance C succeeds in connecting to A and B
> >right away and bootstraps from instance A. Instance A registers C in its
> >_cluster table. This registration is replicated to instance C.
> >Meanwhile, instance C is trying to sync with quorum instances (which is
> >3), and stays in orphan mode.
> >Now replication_timeout on instance B finally runs out. It retries a
> >previously unsuccessful connection to C and succeeds. C sends its ballot
> >to B with vclock = {1: 2, 2:0} (in our example), since it has already
> >incremented it after _cluster registration.
> >B sees that C has a greater vclock than A, and chooses to bootstrap from
> >C instead of A. C is orphan and rejects B's attempt to join. B dies.
> >
> >To fix such ungentlemanlike behaviour of C, we should at least include
> >loading status in ballot and prefer fully bootstrapped instances to the
> >ones still syncing with other replicas.
> >We also need to use a separate flag instead of ballot's already existent
> >is_ro, since we still want to prefer loading instances over the ones
> >explicitly configured to be read-only.
> >
> >Closes #4527
> >---
> >https://github.com/tarantool/tarantool/issues/4527
> >https://github.com/tarantool/tarantool/tree/sp/gh-4527-ballot-vclock
> >
> >Changes in v2:
> >  - Change is_orphan flag to is_loading
> >    Only pass is_ro flag to is_loading,
> >    since it is true at all times when
> >    is_orphan is true (during bootstrap).
> >
> > src/box/box.cc                             |  8 +++
> > src/box/iproto_constants.h                 |  1 +
> > src/box/replication.cc                     | 10 +++-
> > src/box/xrow.c                             | 13 ++++-
> > src/box/xrow.h                             |  5 ++
> > test/replication/bootstrap_leader.result   | 61 ++++++++++++++++++++++
> > test/replication/bootstrap_leader.test.lua | 27 ++++++++++
> > test/replication/replica_uuid_rw.lua       | 26 +++++++++
> > test/replication/replica_uuid_rw1.lua      |  1 +
> > test/replication/replica_uuid_rw2.lua      |  1 +
> > test/replication/replica_uuid_rw3.lua      |  1 +
> > 11 files changed, 150 insertions(+), 4 deletions(-)
> > create mode 100644 test/replication/bootstrap_leader.result
> > create mode 100644 test/replication/bootstrap_leader.test.lua
> > create mode 100644 test/replication/replica_uuid_rw.lua
> > create mode 120000 test/replication/replica_uuid_rw1.lua
> > create mode 120000 test/replication/replica_uuid_rw2.lua
> > create mode 120000 test/replication/replica_uuid_rw3.lua
> >
> >diff --git a/src/box/box.cc b/src/box/box.cc
> >index 40920e649..baf029a09 100644
> >--- a/src/box/box.cc
> >+++ b/src/box/box.cc
> >@@ -1630,6 +1630,14 @@ void
> > box_process_vote(struct ballot *ballot)
> > {
> > ballot->is_ro = cfg_geti("read_only") != 0;
> >+/*
> >+ * is_ro is true on initial load and is set to box.cfg.read_only
> >+ * after box_cfg() returns, during dynamic box.cfg parameters setting.
> >+ * We would like to prefer already bootstrapped instances to the ones
> >+ * still bootstrapping and the ones still bootstrapping, but writeable
> >+ * to the ones that have box.cfg.read_only = true.
> >+ */
> >+ballot->is_loading = is_ro;
> > vclock_copy(&ballot->vclock, &replicaset.vclock);
> > vclock_copy(&ballot->gc_vclock, &gc.vclock);
> > }
> >diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> >index 724cce535..5e8a7d483 100644
> >--- a/src/box/iproto_constants.h
> >+++ b/src/box/iproto_constants.h
> >@@ -137,6 +137,7 @@ enum iproto_ballot_key {
> > IPROTO_BALLOT_IS_RO = 0x01,
> > IPROTO_BALLOT_VCLOCK = 0x02,
> > IPROTO_BALLOT_GC_VCLOCK = 0x03,
> >+IPROTO_BALLOT_IS_LOADING = 0x04,
> > };
> > 
> > #define bit(c) (1ULL<<IPROTO_##c)
> >diff --git a/src/box/replication.cc b/src/box/replication.cc
> >index d691ce487..6fcc56fe3 100644
> >--- a/src/box/replication.cc
> >+++ b/src/box/replication.cc
> >@@ -863,8 +863,8 @@ replicaset_next(struct replica *replica)
> > }
> > 
> > /**
> >- * Compare vclock and read only mode of all connected
> >- * replicas and elect a leader.
> >+ * Compare vclock, read only mode and orphan status
> >+ * of all connected replicas and elect a leader.
> >  * Initiallly, skip read-only replicas, since they
> >  * can not properly act as bootstrap masters (register
> >  * new nodes in _cluster table). If there are no read-write
> >@@ -892,6 +892,12 @@ replicaset_round(bool skip_ro)
> > leader = replica;
> > continue;
> > }
> >+/*
> >+ * Try to find a replica which has already left
> >+ * orphan mode.
> >+ */
> >+if (applier->ballot.is_loading && ! leader->applier->ballot.is_loading)
> >+continue;
> > /*
> >  * Choose the replica with the most advanced
> >  * vclock. If there are two or more replicas
> >diff --git a/src/box/xrow.c b/src/box/xrow.c
> >index 0ae5271c1..18bf08971 100644
> >--- a/src/box/xrow.c
> >+++ b/src/box/xrow.c
> >@@ -441,8 +441,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot
> >*ballot, uint64_t sync, uint32_t schema_version)
> > {
> > size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
> >-mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(3) +
> >+mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(4) +
> > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
> >+mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
> > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->vclock) +
> > mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->gc_vclock);
> > 
> >@@ -456,9 +457,11 @@ iproto_reply_vote(struct obuf *out, const struct
> >ballot *ballot, char *data = buf + IPROTO_HEADER_LEN;
> > data = mp_encode_map(data, 1);
> > data = mp_encode_uint(data, IPROTO_BALLOT);
> >-data = mp_encode_map(data, 3);
> >+data = mp_encode_map(data, 4);
> > data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO);
> > data = mp_encode_bool(data, ballot->is_ro);
> >+data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
> >+data = mp_encode_bool(data, ballot->is_loading);
> > data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
> > data = mp_encode_vclock(data, &ballot->vclock);
> > data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK);
> >@@ -1077,6 +1080,7 @@ int
> > xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> > {
> > ballot->is_ro = false;
> >+ballot->is_loading = false;
> > vclock_create(&ballot->vclock);
> > 
> > const char *start = NULL;
> >@@ -1121,6 +1125,11 @@ xrow_decode_ballot(struct xrow_header *row, struct
> >ballot *ballot) goto err;
> > ballot->is_ro = mp_decode_bool(&data);
> > break;
> >+case IPROTO_BALLOT_IS_LOADING:
> >+if (mp_typeof(*data) != MP_BOOL)
> >+goto err;
> >+ballot->is_loading = mp_decode_bool(&data);
> >+break;
> > case IPROTO_BALLOT_VCLOCK:
> > if (mp_decode_vclock(&data, &ballot->vclock) != 0)
> > goto err;
> >diff --git a/src/box/xrow.h b/src/box/xrow.h
> >index 35ec06dc0..60def2d3c 100644
> >--- a/src/box/xrow.h
> >+++ b/src/box/xrow.h
> >@@ -275,6 +275,11 @@ xrow_encode_auth(struct xrow_header *row, const char
> >*salt, size_t salt_len, struct ballot {
> > /** Set if the instance is running in read-only mode. */
> > bool is_ro;
> >+/**
> >+ * Set if the instance hasn't finished bootstrap or recovery, or
> >+ * is syncing with other replicas in the replicaset.
> >+ */
> >+bool is_loading;
> > /** Current instance vclock. */
> > struct vclock vclock;
> > /** Oldest vclock available on the instance. */
> >diff --git a/test/replication/bootstrap_leader.result
> >b/test/replication/bootstrap_leader.result new file mode 100644
> >index 000000000..ef4265cd3
> >--- /dev/null
> >+++ b/test/replication/bootstrap_leader.result
> >@@ -0,0 +1,61 @@
> >+-- test-run result file version 2
> >+test_run = require('test_run').new()
> >+ | ---
> >+ | ...
> >+
> >+create_server_cmd = "create server replica%d with
> >script='replication/replica_uuid_rw%d.lua'" + | ---
> >+ | ...
> >+
> >+test_run:cmd("setopt delimiter ';'")
> >+ | ---
> >+ | - true
> >+ | ...
> >+
> >+for i = 1,3 do
> >+    test_run:cmd(string.format(create_server_cmd, i, i))
> >+end;
> >+ | ---
> >+ | ...
> >+
> >+test_run:cmd("start server replica1 with wait_load=False, wait=False");
> >+ | ---
> >+ | - true
> >+ | ...
> >+test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1',
> >wait_load=False, wait=False"); + | ---
> >+ | - true
> >+ | ...
> >+test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500',
> >wait_load=False, wait=False"); + | ---
> >+ | - true
> >+ | ...
> >+
> >+test_run:cmd("switch replica3");
> >+ | ---
> >+ | - true
> >+ | ...
> >+fiber = require('fiber');
> >+ | ---
> >+ | ...
> >+while (#box.info.replication < 3) do
> >+    fiber.sleep(0.05)
> >+end;
> >+ | ---
> >+ | ...
> >+test_run:cmd("switch default");
> >+ | ---
> >+ | - true
> >+ | ...
> >+
> >+for i = 1,3 do
> >+    test_run:cmd("stop server replica"..i.." with cleanup=1")
> >+    test_run:cmd("delete server replica"..i)
> >+end;
> >+ | ---
> >+ | ...
> >+
> >+test_run:cmd("setopt delimiter ''");
> >+ | ---
> >+ | - true
> >+ | ...
> >diff --git a/test/replication/bootstrap_leader.test.lua
> >b/test/replication/bootstrap_leader.test.lua new file mode 100644
> >index 000000000..eb9aa293f
> >--- /dev/null
> >+++ b/test/replication/bootstrap_leader.test.lua
> >@@ -0,0 +1,27 @@
> >+test_run = require('test_run').new()
> >+
> >+create_server_cmd = "create server replica%d with
> >script='replication/replica_uuid_rw%d.lua'" +
> >+test_run:cmd("setopt delimiter ';'")
> >+
> >+for i = 1,3 do
> >+    test_run:cmd(string.format(create_server_cmd, i, i))
> >+end;
> >+
> >+test_run:cmd("start server replica1 with wait_load=False, wait=False");
> >+test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1',
> >wait_load=False, wait=False"); +test_run:cmd("start server replica3 with
> >args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False"); +
> >+test_run:cmd("switch replica3");
> >+fiber = require('fiber');
> >+while (#box.info.replication < 3) do
> >+    fiber.sleep(0.05)
> >+end;
> >+test_run:cmd("switch default");
> >+
> >+for i = 1,3 do
> >+    test_run:cmd("stop server replica"..i.." with cleanup=1")
> >+    test_run:cmd("delete server replica"..i)
> >+end;
> >+
> >+test_run:cmd("setopt delimiter ''");
> >diff --git a/test/replication/replica_uuid_rw.lua
> >b/test/replication/replica_uuid_rw.lua new file mode 100644
> >index 000000000..60f2d6362
> >--- /dev/null
> >+++ b/test/replication/replica_uuid_rw.lua
> >@@ -0,0 +1,26 @@
> >+#!/usr/bin/env tarantool
> >+
> >+local INSTANCE_ID = string.match(arg[0], "%d")
> >+local SOCKET_DIR = require('fio').cwd()
> >+
> >+local function instance_uri(instance_id)
> >+    return SOCKET_DIR..'/replica_uuid_rw'..instance_id..'.sock'
> >+end
> >+
> >+local repl_tbl = {}
> >+for num in string.gmatch(arg[1] or "", "%d") do
> >+    table.insert(repl_tbl, instance_uri(num))
> >+end
> >+
> >+require('console').listen(os.getenv('ADMIN'))
> >+
> >+box.cfg({
> >+    instance_uuid       =
> >"aaaaaaaa-aaaa-0000-0000-00000000000"..INSTANCE_ID, +    listen           
> >  = instance_uri(INSTANCE_ID),
> >+    replication         = repl_tbl,
> >+    replication_timeout = arg[2] and tonumber(arg[2]) or 0.1,
> >+    replication_connect_timeout = arg[3] and tonumber(arg[3]) or 0.5,
> >+    replication_sync_timeout = arg[4] and tonumber(arg[4]) or 1.0,
> >+})
> >+
> >+box.once("bootstrap", function() box.schema.user.grant('guest',
> >'replication') end) diff --git a/test/replication/replica_uuid_rw1.lua
> >b/test/replication/replica_uuid_rw1.lua new file mode 120000
> >index 000000000..328386f50
> >--- /dev/null
> >+++ b/test/replication/replica_uuid_rw1.lua
> >@@ -0,0 +1 @@
> >+replica_uuid_rw.lua
> >\ No newline at end of file
> >diff --git a/test/replication/replica_uuid_rw2.lua
> >b/test/replication/replica_uuid_rw2.lua new file mode 120000
> >index 000000000..328386f50
> >--- /dev/null
> >+++ b/test/replication/replica_uuid_rw2.lua
> >@@ -0,0 +1 @@
> >+replica_uuid_rw.lua
> >\ No newline at end of file
> >diff --git a/test/replication/replica_uuid_rw3.lua
> >b/test/replication/replica_uuid_rw3.lua new file mode 120000
> >index 000000000..328386f50
> >--- /dev/null
> >+++ b/test/replication/replica_uuid_rw3.lua
> >@@ -0,0 +1 @@
> >+replica_uuid_rw.lua
> >\ No newline at end of file
> 
> Hi! A minor fix for the test.
> The diff is below
> The new branch is sp/gh-4527-ballot-vclock-full-ci
> 
> diff --git a/test/replication/bootstrap_leader.result
> b/test/replication/bootstrap_leader.result index ef4265cd3..7d1a33d8e
> 100644
> --- a/test/replication/bootstrap_leader.result
> +++ b/test/replication/bootstrap_leader.result
> @@ -18,7 +18,7 @@ end;
>   | ---
>   | ...
>  
> -test_run:cmd("start server replica1 with wait_load=False, wait=False");
> +test_run:cmd("start server replica1 with wait_load=True, wait=True");
>   | ---
>   | - true
>   | ...
> @@ -26,7 +26,7 @@ test_run:cmd("start server replica2 with args='1,2,3 1.0
> 100500 0.1', wait_load= | ---
>   | - true
>   | ...
> -test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500',
> wait_load=False, wait=False"); +test_run:cmd("start server replica3 with
> args='1,2,3 0.1 0.5 100500', wait_load=True, wait=True"); | ---
>   | - true
>   | ...
> diff --git a/test/replication/bootstrap_leader.test.lua
> b/test/replication/bootstrap_leader.test.lua index eb9aa293f..984a82b8e
> 100644
> --- a/test/replication/bootstrap_leader.test.lua
> +++ b/test/replication/bootstrap_leader.test.lua
> @@ -8,9 +8,9 @@ for i = 1,3 do
>      test_run:cmd(string.format(create_server_cmd, i, i))
>  end;
>  
> -test_run:cmd("start server replica1 with wait_load=False, wait=False");
> +test_run:cmd("start server replica1 with wait_load=True, wait=True");
>  test_run:cmd("start server replica2 with args='1,2,3 1.0 100500 0.1',
> wait_load=False, wait=False"); -test_run:cmd("start server replica3 with
> args='1,2,3 0.1 0.5 100500', wait_load=False, wait=False");
> +test_run:cmd("start server replica3 with args='1,2,3 0.1 0.5 100500',
> wait_load=True, wait=True"); 
>  test_run:cmd("switch replica3");
>  fiber = require('fiber');

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191008/9e98d53e/attachment.sig>


More information about the Tarantool-patches mailing list