[Tarantool-patches] [PATCH 6/6] replication: prefer to join from booted replicas

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sat Jun 5 02:38:00 MSK 2021


The algorithm for choosing an instance to join the replicaset
from didn't take into account that some of the instances might
not be bootstrapped yet but still be perfectly available.

As a result, an absurd situation could happen: an instance
configured with box.cfg{read_only = false} could connect to a
cluster consisting only of read-only instances and, instead of
failing or waiting, simply boot a brand new cluster. After that
the node would start complaining about the others having a
different replicaset UUID.

The patch makes a new instance always prefer a bootstrapped
join-source to a non-bootstrapped one, including itself. In the
situation above the new instance now terminates with an error.

Hopefully, in the future it will start a retry-loop instead.
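
Schematically, the selection now works like the sketch below. The
+10 and +5 increments are the ones visible in the diff; the +1
bonus for a currently writable instance is an assumption here, the
point being only that a booted candidate always outweighs any
combination of the read-only bonuses of a non-booted one:

    /* A simplified sketch, not the actual replication.cc code. */
    #include <stdbool.h>

    static int
    join_master_score(bool is_booted, bool is_ro_cfg, bool is_ro)
    {
        int score = 0;
        if (is_booted)
            score += 10; /* booted wins over everything else */
        if (!is_ro_cfg)
            score += 5;  /* not configured as read-only */
        if (!is_ro)
            score += 1;  /* writable right now (assumed weight) */
        return score;
    }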

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching their names. They have been changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. The code `0x01` and the value stay
  the same; only the name has changed, and it should be changed in
  the documentation too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  finished `box.cfg()` and it has `read_only = true`". The name
  was therefore wrong, because even if the instance had finished
  loading, the flag was still true for `read_only = true` nodes.
  Such a value is also hardly suitable for any sane usage.
  The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
  the same, and the value now means "the instance is not
  writable". The reason for not being writable can be anything:
  the node is an orphan, it has `read_only = true`, it is a Raft
  follower, or something else.

And there is a new field.

`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.
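
For reference, a minimal sketch of the ballot keys as described
above (the exact declarations in the Tarantool sources may differ):

    enum iproto_ballot_key {
        IPROTO_BALLOT_IS_RO_CFG = 0x01, /* box.cfg{read_only = true} */
        IPROTO_BALLOT_IS_RO     = 0x04, /* not writable, any reason */
        IPROTO_BALLOT_IS_BOOTED = 0x06, /* bootstrap/recovery finished */
    };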
---
 .../gh-5613-bootstrap-prefer-booted.md        |  6 ++
 src/box/replication.cc                        | 20 +++---
 .../gh-5613-bootstrap-prefer-booted.result    | 70 +++++++++++++++++++
 .../gh-5613-bootstrap-prefer-booted.test.lua  | 21 ++++++
 test/replication/gh-5613-master.lua           | 11 +++
 test/replication/gh-5613-replica1.lua         | 13 ++++
 test/replication/gh-5613-replica2.lua         | 11 +++
 test/replication/suite.cfg                    |  1 +
 8 files changed, 144 insertions(+), 9 deletions(-)
 create mode 100644 changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
 create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.result
 create mode 100644 test/replication/gh-5613-bootstrap-prefer-booted.test.lua
 create mode 100644 test/replication/gh-5613-master.lua
 create mode 100644 test/replication/gh-5613-replica1.lua
 create mode 100644 test/replication/gh-5613-replica2.lua

diff --git a/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
new file mode 100644
index 000000000..c022ee012
--- /dev/null
+++ b/changelogs/unreleased/gh-5613-bootstrap-prefer-booted.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fixed a bug due to which a replica, when trying to join a cluster where only
+  read-only replicas were available, booted its own replicaset instead of
+  failing or retrying. Now it fails with an error saying the other nodes are
+  read-only and therefore cannot register it (gh-5613).
diff --git a/src/box/replication.cc b/src/box/replication.cc
index d33e70f28..52086c65e 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -951,15 +951,6 @@ replicaset_next(struct replica *replica)
 	return replica_hash_next(&replicaset.hash, replica);
 }
 
-/**
- * Compare vclock, read only mode and orphan status
- * of all connected replicas and elect a leader.
- * Initiallly, skip read-only replicas, since they
- * can not properly act as bootstrap masters (register
- * new nodes in _cluster table). If there are no read-write
- * replicas, choose a read-only replica with biggest vclock
- * as a leader, in hope it will become read-write soon.
- */
 struct replica *
 replicaset_find_join_master(void)
 {
@@ -972,12 +963,23 @@ replicaset_find_join_master(void)
 		const struct ballot *ballot = &applier->ballot;
 		int score = 0;
 		/*
+		 * First of all try to ignore non-booted instances. Including
+		 * self if not booted yet. For self it is even dangerous as the
+		 * instance might decide to boot its own cluster if, for
+		 * example, the other nodes are available, but read-only. It
+		 * would be a mistake.
+		 *
+		 * For a new cluster it is ok to use a non-booted instance as it
+		 * means the algorithm tries to find an initial "boot-master".
+		 *
 		 * Prefer instances not configured as read-only via box.cfg, and
 		 * not being in read-only state due to any other reason. The
 		 * config is stronger because if it is configured as read-only,
 		 * it is in read-only state for sure, until the config is
 		 * changed.
 		 */
+		if (ballot->is_booted)
+			score += 10;
 		if (!ballot->is_ro_cfg)
 			score += 5;
 		if (!ballot->is_ro)
diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.result b/test/replication/gh-5613-bootstrap-prefer-booted.result
new file mode 100644
index 000000000..e8e7fb792
--- /dev/null
+++ b/test/replication/gh-5613-bootstrap-prefer-booted.result
@@ -0,0 +1,70 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with wait=False')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1')
+ | ---
+ | - true
+ | ...
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg{read_only = true}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
+ | ---
+ | - true
+ | ...
+-- It returns false, but that is expected.
+test_run:cmd('start server replica2 with crash_expected=True')
+ | ---
+ | - false
+ | ...
+opts = {filename = 'gh-5613-replica2.log'}
+ | ---
+ | ...
+assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5613-bootstrap-prefer-booted.test.lua b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
new file mode 100644
index 000000000..d3c1c1189
--- /dev/null
+++ b/test/replication/gh-5613-bootstrap-prefer-booted.test.lua
@@ -0,0 +1,21 @@
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/gh-5613-master.lua"')
+test_run:cmd('start server master with wait=False')
+test_run:cmd('create server replica1 with script="replication/gh-5613-replica1.lua"')
+test_run:cmd('start server replica1')
+test_run:switch('master')
+box.cfg{read_only = true}
+test_run:switch('default')
+
+test_run:cmd('create server replica2 with script="replication/gh-5613-replica2.lua"')
+-- It returns false, but that is expected.
+test_run:cmd('start server replica2 with crash_expected=True')
+opts = {filename = 'gh-5613-replica2.log'}
+assert(test_run:grep_log(nil, 'ER_READONLY', nil, opts) ~= nil)
+
+test_run:cmd('delete server replica2')
+test_run:cmd('stop server replica1')
+test_run:cmd('delete server replica1')
+test_run:cmd('stop server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/gh-5613-master.lua b/test/replication/gh-5613-master.lua
new file mode 100644
index 000000000..408427315
--- /dev/null
+++ b/test/replication/gh-5613-master.lua
@@ -0,0 +1,11 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+	listen = 'unix/:./gh-5613-master.sock',
+	replication = {
+		'unix/:./gh-5613-master.sock',
+		'unix/:./gh-5613-replica1.sock',
+	},
+})
+box.schema.user.grant('guest', 'super')
diff --git a/test/replication/gh-5613-replica1.lua b/test/replication/gh-5613-replica1.lua
new file mode 100644
index 000000000..d0d6e3372
--- /dev/null
+++ b/test/replication/gh-5613-replica1.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+	listen = 'unix/:./gh-5613-replica1.sock',
+	replication = {
+		'unix/:./gh-5613-master.sock',
+		'unix/:./gh-5613-replica1.sock',
+	},
+	-- Set read_only initially so that the bootstrap master is known
+	-- in advance.
+	read_only = true,
+})
diff --git a/test/replication/gh-5613-replica2.lua b/test/replication/gh-5613-replica2.lua
new file mode 100644
index 000000000..8cbd45b61
--- /dev/null
+++ b/test/replication/gh-5613-replica2.lua
@@ -0,0 +1,11 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+box.cfg({
+	listen = 'unix/:./gh-5613-replica2.sock',
+	replication = {
+		'unix/:./gh-5613-master.sock',
+		'unix/:./gh-5613-replica1.sock',
+		'unix/:./gh-5613-replica2.sock',
+	},
+})
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 27eab20c2..f9d5ce1cc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -44,6 +44,7 @@
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "gh-5536-wal-limit.test.lua": {},
     "gh-5566-final-join-synchro.test.lua": {},
+    "gh-5613-bootstrap-prefer-booted.test.lua": {},
     "gh-6032-promote-wal-write.test.lua": {},
     "gh-6057-qsync-confirm-async-no-wal.test.lua": {},
     "gh-6094-rs-uuid-mismatch.test.lua": {},
-- 
2.24.3 (Apple Git-128)


