Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
@ 2020-12-22 11:14 Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 1/5] cfg: add cfg_isnumber helper Cyrill Gorcunov
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Take a look please once time permit.

v2 (by Serge):
 - keep replication_synchro_quorum been skipped at bootstrap in load_cfg.lua
 - eliminate redundant say_info calls
 - call quorum update routine from replica_set_id/replica_clear_id
 - use replicaset.registered_count directly when evaluating the formula
 - make quorum evaluation procedure always return value in allowed range,
   the only error which may happen here is some syntax error or Lua evaluation
   errors
 - a test has been added

v3 (by Serge, Mons, Vlad):
 - use replica.lua in tests
 - use N symbol in formula
 - use lua_pcall when evaluating a formula
 - make formula more safe itself, provide various math helpers
 - use box_update_replication_synchro_quorum name as a general
   updater from replication code
 - do not forget to update raft election quorum inside
   box_update_replication_synchro_quorum
 - print warns inside functions evaluator if value get out of bounds

v4 (by Vlad):
 - when testing a formula we walk over all amount of replicas,
   thus we are sure that later when real evaluation takes place
   we won't get quorum out of bounds
 - improve test to make sure that when no quorum aquired the
   transaction doesn't pass

v5 (by Vlad):
  - make box_check_replication_synchro_quorum to return 0|1
  - in test use `s` and a shorthand for the space

v6 (by Vlad):
  - use wait_log in test 'cause log flushing doesn't go immediately
  - fix potential numbers overflow when setting up quorum

Vlad, I resend the whole series for one more general review.
Take a look once time permit. The interdiff for v5 is below.

issue https://github.com/tarantool/tarantool/issues/5446
branch gorcunov/gh-5446-eval-quorum-6

Cyrill Gorcunov (5):
  cfg: add cfg_isnumber helper
  cfg: rework box_check_replication_synchro_quorum
  cfg: support symbolic evaluation of replication_synchro_quorum
  cfg: more precise check for replication_synchro_quorum value
  test: add replication/gh-5446-qsync-eval-quorum.test.lua

 src/box/box.cc                                | 148 +++++++++-
 src/box/box.h                                 |   1 +
 src/box/lua/load_cfg.lua                      |   2 +-
 src/box/replication.cc                        |   4 +-
 src/cfg.c                                     |   9 +
 src/cfg.h                                     |   6 +
 .../gh-5446-qsync-eval-quorum.result          | 277 ++++++++++++++++++
 .../gh-5446-qsync-eval-quorum.test.lua        | 110 +++++++
 test/replication/replica-quorum-1.lua         |   1 +
 test/replication/replica-quorum-2.lua         |   1 +
 test/replication/replica-quorum-3.lua         |   1 +
 test/replication/replica-quorum-4.lua         |   1 +
 test/replication/replica-quorum-5.lua         |   1 +
 test/replication/replica-quorum-6.lua         |   1 +
 14 files changed, 552 insertions(+), 11 deletions(-)
 create mode 100644 test/replication/gh-5446-qsync-eval-quorum.result
 create mode 100644 test/replication/gh-5446-qsync-eval-quorum.test.lua
 create mode 120000 test/replication/replica-quorum-1.lua
 create mode 120000 test/replication/replica-quorum-2.lua
 create mode 120000 test/replication/replica-quorum-3.lua
 create mode 120000 test/replication/replica-quorum-4.lua
 create mode 120000 test/replication/replica-quorum-5.lua
 create mode 120000 test/replication/replica-quorum-6.lua


base-commit: 28f3b2f1e845aff49048d92f9062a4dfa365bf57
-- 
diff --git a/src/box/box.cc b/src/box/box.cc
index d3ec1faf3..b3cc45358 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -609,9 +609,9 @@ box_eval_replication_synchro_quorum(int nr_replicas)
 		return -1;
 	}
 
-	int quorum = -1;
+	int64_t quorum = -1;
 	if (lua_isnumber(tarantool_L, -1))
-		quorum = (int)lua_tonumber(tarantool_L, -1);
+		quorum = luaL_toint64(tarantool_L, -1);
 	lua_pop(tarantool_L, 1);
 
 	/*
@@ -657,7 +657,7 @@ box_check_replication_synchro_quorum(void)
 		return 0;
 	}
 
-	int quorum = cfg_geti("replication_synchro_quorum");
+	int64_t quorum = cfg_geti64("replication_synchro_quorum");
 	if (quorum <= 0 || quorum >= VCLOCK_MAX) {
 		diag_set(ClientError, ER_CFG, "replication_synchro_quorum",
 			 "the value must be greater than zero and less than "
@@ -1026,13 +1026,20 @@ box_update_replication_synchro_quorum(void)
 		 */
 		int value = MAX(1, replicaset.registered_count);
 		quorum = box_eval_replication_synchro_quorum(value);
-		if (quorum <= 0 || quorum >= VCLOCK_MAX)
-			panic("failed to eval replication_synchro_quorum");
 		say_info("update replication_synchro_quorum = %d", quorum);
 	} else {
 		quorum = cfg_geti("replication_synchro_quorum");
 	}
 
+	/*
+	 * This should never happen because the values were
+	 * validated already but just to prevent from
+	 * unexpected changes and because the value is too
+	 * important for qsync, lets re-check (this is cheap).
+	 */
+	if (quorum <= 0 || quorum >= VCLOCK_MAX)
+		panic("failed to eval/fetch replication_synchro_quorum");
+
 	replication_synchro_quorum = quorum;
 	txn_limbo_on_parameters_change(&txn_limbo);
 	box_raft_update_election_quorum();
diff --git a/test/replication/gh-5446-qsync-eval-quorum.result b/test/replication/gh-5446-qsync-eval-quorum.result
index 6fd31d1bc..6c3c63577 100644
--- a/test/replication/gh-5446-qsync-eval-quorum.result
+++ b/test/replication/gh-5446-qsync-eval-quorum.result
@@ -29,6 +29,18 @@ box.cfg{replication_synchro_quorum = "N-1"}
  |     evaluated to the quorum 0 for replica number 1, which is out of range [1;31]'
  | ...
 
+-- Test big number value
+box.cfg{replication_synchro_quorum = '4294967297'}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
+ |     be greater than zero and less than maximal number of replicas'
+ | ...
+box.cfg{replication_synchro_quorum = 4294967297}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
+ |     be greater than zero and less than maximal number of replicas'
+ | ...
+
 -- Use canonical majority formula
 box.cfg { replication_synchro_quorum = "N/2+1", replication_synchro_timeout = 1000 }
  | ---
@@ -36,9 +48,9 @@ box.cfg { replication_synchro_quorum = "N/2+1", replication_synchro_timeout = 10
 match = 'set \'replication_synchro_quorum\' configuration option to \"N\\/2%+1'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - set 'replication_synchro_quorum' configuration option to "N\/2+1
  | ...
 
 -- Create a sync space we will operate on
@@ -70,9 +82,9 @@ test_run:cmd('start server replica1 with wait=True, wait_load=True')
 match = 'update replication_synchro_quorum = 2'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - update replication_synchro_quorum = 2
  | ...
 
 test_run:cmd('create server replica2 with rpl_master=default,\
@@ -89,9 +101,9 @@ test_run:cmd('start server replica2 with wait=True, wait_load=True')
 match = 'update replication_synchro_quorum = 2'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - update replication_synchro_quorum = 2
  | ...
 
 test_run:cmd('create server replica3 with rpl_master=default,\
@@ -108,9 +120,9 @@ test_run:cmd('start server replica3 with wait=True, wait_load=True')
 match = 'update replication_synchro_quorum = 3'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - update replication_synchro_quorum = 3
  | ...
 
 test_run:cmd('create server replica4 with rpl_master=default,\
@@ -127,9 +139,9 @@ test_run:cmd('start server replica4 with wait=True, wait_load=True')
 match = 'update replication_synchro_quorum = 3'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - update replication_synchro_quorum = 3
  | ...
 
 test_run:cmd('create server replica5 with rpl_master=default,\
@@ -156,9 +168,9 @@ test_run:cmd('start server replica6 with wait=True, wait_load=True')
 match = 'update replication_synchro_quorum = 4'
  | ---
  | ...
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
  | ---
- | - true
+ | - update replication_synchro_quorum = 4
  | ...
 
 -- 5 replicas left, the commit should pass
diff --git a/test/replication/gh-5446-qsync-eval-quorum.test.lua b/test/replication/gh-5446-qsync-eval-quorum.test.lua
index 56269fc43..b7ce9a9ca 100644
--- a/test/replication/gh-5446-qsync-eval-quorum.test.lua
+++ b/test/replication/gh-5446-qsync-eval-quorum.test.lua
@@ -10,10 +10,14 @@ box.cfg{replication_synchro_quorum = "aaa"}
 box.cfg{replication_synchro_quorum = "N+1"}
 box.cfg{replication_synchro_quorum = "N-1"}
 
+-- Test big number value
+box.cfg{replication_synchro_quorum = '4294967297'}
+box.cfg{replication_synchro_quorum = 4294967297}
+
 -- Use canonical majority formula
 box.cfg { replication_synchro_quorum = "N/2+1", replication_synchro_timeout = 1000 }
 match = 'set \'replication_synchro_quorum\' configuration option to \"N\\/2%+1'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 -- Create a sync space we will operate on
 _ = box.schema.space.create('sync', {is_sync = true, engine = engine})
@@ -27,7 +31,7 @@ test_run:cmd('start server replica1 with wait=True, wait_load=True')
 
 -- 1 replica -> replication_synchro_quorum = 2/2 + 1 = 2
 match = 'update replication_synchro_quorum = 2'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 test_run:cmd('create server replica2 with rpl_master=default,\
               script="replication/replica-quorum-2.lua"')
@@ -35,7 +39,7 @@ test_run:cmd('start server replica2 with wait=True, wait_load=True')
 
 -- 2 replicas -> replication_synchro_quorum = 3/2 + 1 = 2
 match = 'update replication_synchro_quorum = 2'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 test_run:cmd('create server replica3 with rpl_master=default,\
               script="replication/replica-quorum-3.lua"')
@@ -43,7 +47,7 @@ test_run:cmd('start server replica3 with wait=True, wait_load=True')
 
 -- 3 replicas -> replication_synchro_quorum = 4/2 + 1 = 3
 match = 'update replication_synchro_quorum = 3'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 test_run:cmd('create server replica4 with rpl_master=default,\
               script="replication/replica-quorum-4.lua"')
@@ -51,7 +55,7 @@ test_run:cmd('start server replica4 with wait=True, wait_load=True')
 
 -- 4 replicas -> replication_synchro_quorum = 5/2 + 1 = 3
 match = 'update replication_synchro_quorum = 3'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 test_run:cmd('create server replica5 with rpl_master=default,\
               script="replication/replica-quorum-5.lua"')
@@ -63,7 +67,7 @@ test_run:cmd('start server replica6 with wait=True, wait_load=True')
 
 -- 6 replicas -> replication_synchro_quorum = 7/2 + 1 = 4
 match = 'update replication_synchro_quorum = 4'
-test_run:grep_log("default", match) ~= nil
+test_run:wait_log('default', match, nil, 5)
 
 -- 5 replicas left, the commit should pass
 test_run:cmd('stop server replica1')

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 1/5] cfg: add cfg_isnumber helper
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
@ 2020-12-22 11:14 ` Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 2/5] cfg: rework box_check_replication_synchro_quorum Cyrill Gorcunov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

We will need it to figure out if parameter
is a numeric value when doing configuration
check.

Part-of #5446

Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/cfg.c | 9 +++++++++
 src/cfg.h | 6 ++++++
 2 files changed, 15 insertions(+)

diff --git a/src/cfg.c b/src/cfg.c
index 46cff1999..cf07d5817 100644
--- a/src/cfg.c
+++ b/src/cfg.c
@@ -57,6 +57,15 @@ cfg_geti(const char *param)
 	return val;
 }
 
+bool
+cfg_isnumber(const char *param)
+{
+	cfg_get(param);
+	bool ret = !!lua_isnumber(tarantool_L, -1);
+	lua_pop(tarantool_L, 1);
+	return ret;
+}
+
 int
 cfg_getb(const char *param)
 {
diff --git a/src/cfg.h b/src/cfg.h
index 140bdddb8..e2955e6b2 100644
--- a/src/cfg.h
+++ b/src/cfg.h
@@ -40,6 +40,12 @@ extern "C" {
 int
 cfg_geti(const char *param);
 
+/**
+ * Test if cfg parameter is a number.
+ */
+bool
+cfg_isnumber(const char *param);
+
 /**
  * Gets boolean parameter of cfg.
  * Returns -1 in case of nil
-- 
2.26.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 2/5] cfg: rework box_check_replication_synchro_quorum
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 1/5] cfg: add cfg_isnumber helper Cyrill Gorcunov
@ 2020-12-22 11:14 ` Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 3/5] cfg: support symbolic evaluation of replication_synchro_quorum Cyrill Gorcunov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Currently the box_check_replication_synchro_quorum helper
test for "replication_synchro_quorum" value being valid
and returns the value itself to use later in code.

This is fine for regular numbers but since we're gonna
support formula evaluation the real value to use will
be dynamic and returning a number "to use" won't be
convenient.

Thus lets change the context: make
box_check_replication_synchro_quorum() to return 0|-1
for success|failure and when the real value is needed
we will fetch it explicitly via cfg_geti call.

To make this more explicit the real update of the
appropriate variable is done via
box_update_replication_synchro_quorum() helper.

Part-of #5446

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/box.cc | 21 ++++++++++++++-------
 src/box/box.h  |  1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index a8bc3471d..630f579df 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -564,7 +564,7 @@ box_check_replication_synchro_quorum(void)
 			 "maximal number of replicas");
 		return -1;
 	}
-	return quorum;
+	return 0;
 }
 
 static double
@@ -765,7 +765,7 @@ box_check_config(void)
 	box_check_replication_connect_timeout();
 	box_check_replication_connect_quorum();
 	box_check_replication_sync_lag();
-	if (box_check_replication_synchro_quorum() < 0)
+	if (box_check_replication_synchro_quorum() != 0)
 		diag_raise();
 	if (box_check_replication_synchro_timeout() < 0)
 		diag_raise();
@@ -910,15 +910,22 @@ box_set_replication_sync_lag(void)
 	replication_sync_lag = box_check_replication_sync_lag();
 }
 
+void
+box_update_replication_synchro_quorum(void)
+{
+	int quorum = cfg_geti("replication_synchro_quorum");
+
+	replication_synchro_quorum = quorum;
+	txn_limbo_on_parameters_change(&txn_limbo);
+	box_raft_update_election_quorum();
+}
+
 int
 box_set_replication_synchro_quorum(void)
 {
-	int value = box_check_replication_synchro_quorum();
-	if (value < 0)
+	if (box_check_replication_synchro_quorum() != 0)
 		return -1;
-	replication_synchro_quorum = value;
-	txn_limbo_on_parameters_change(&txn_limbo);
-	box_raft_update_election_quorum();
+	box_update_replication_synchro_quorum();
 	return 0;
 }
 
diff --git a/src/box/box.h b/src/box/box.h
index b47a220b7..8a7cda194 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -251,6 +251,7 @@ void box_set_replication_timeout(void);
 void box_set_replication_connect_timeout(void);
 void box_set_replication_connect_quorum(void);
 void box_set_replication_sync_lag(void);
+void box_update_replication_synchro_quorum(void);
 int box_set_replication_synchro_quorum(void);
 int box_set_replication_synchro_timeout(void);
 void box_set_replication_sync_timeout(void);
-- 
2.26.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 3/5] cfg: support symbolic evaluation of replication_synchro_quorum
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 1/5] cfg: add cfg_isnumber helper Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 2/5] cfg: rework box_check_replication_synchro_quorum Cyrill Gorcunov
@ 2020-12-22 11:14 ` Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 4/5] cfg: more precise check for replication_synchro_quorum value Cyrill Gorcunov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

When synchronous replication is used we prefer a user to specify
a quorum number, ie the number of replicas where data must be
replicated before the master node continue accepting new
transactions.

This is not very convenient since a user may not know initially
how many replicas will be used. Moreover the number of replicas
may vary dynamically. For this sake we allow to specify the
number of quorum in a symbolic way.

For example

box.cfg {
	replication_synchro_quorum = "N/2+1",
}

where `N` is a number of registered replicas in a cluster.
Once new replica attached or old one detached the number
is renewed and propagated.

Internally on each replica_set_id() and replica_clear_id(),
ie at moment when replica get registered or unregistered,
we call box_update_replication_synchro_quorum() helper which
finds out if evaluation of replication_synchro_quorum is
needed and if so we calculate new replication_synchro_quorum
value based on number of currently registered replicas. Then
we notify dependent systems such as qsync and raft to update
their guts.

Note: we do *not* change the default settings for this option,
it remains 1 by default for now. Change the default option should
be done as a separate commit once we make sure that everything is
fine.

Closes #5446

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Support dynamic evaluation of synchronous replication quorum

Setting `replication_synchro_quorum` option to an explicit integer
value was introduced rather for simplicity sake mostly. For example
if the cluster's size is not a constant value and new replicas are
connected in dynamically then an administrator might need to increase
the option by hands or by some other external tool.

Instead one can use a dynamic evaluation of a quorum value via formal
representation using symbol `N` as a current number of registered replicas
in a cluster.

For example the canonical definition for a quorum (ie majority
of members in a set) of `N` replicas is `N/2+1`. For such configuration
define

```
box.cfg {replication_synchro_quorum = "N/2+1"}
```

The formal statement allows to provide a flexible configuration but keep
in mind that only canonical quorum (and bigger values, say `N` for all
replicas) guarantees data reliability and various weird forms such as
`N/3+1` while allowed may lead to unexpected results.
---
 src/box/box.cc           | 127 ++++++++++++++++++++++++++++++++++++++-
 src/box/lua/load_cfg.lua |   2 +-
 src/box/replication.cc   |   4 +-
 3 files changed, 129 insertions(+), 4 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 630f579df..68579c254 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -554,9 +554,109 @@ box_check_replication_sync_lag(void)
 	return lag;
 }
 
+/**
+ * Evaluate replication syncro quorum number from a formula.
+ */
+static int
+box_eval_replication_synchro_quorum(int nr_replicas)
+{
+	assert(nr_replicas > 0 && nr_replicas < VCLOCK_MAX);
+
+	const char fmt[] =
+		"local expr = [[%s]]\n"
+		"local f, err = loadstring('return ('..expr..')')\n"
+		"if not f then "
+			"error(string.format('Failed to load \%\%s:"
+			"\%\%s', expr, err)) "
+		"end\n"
+		"setfenv(f, {N = %d, math = {"
+			"ceil = math.ceil,"
+			"floor = math.floor,"
+			"abs = math.abs,"
+			"random = math.random,"
+			"min = math.min,"
+			"max = math.max,"
+			"sqrt = math.sqrt,"
+			"fmod = math.fmod,"
+		"}})\n"
+		"local res = f()\n"
+		"if type(res) ~= 'number' then\n"
+			"error('Expression should return a number')\n"
+		"end\n"
+		"return math.floor(res)\n";
+	const char *expr = cfg_gets("replication_synchro_quorum");
+
+	/*
+	 * cfg_gets uses static buffer as well so we need a local
+	 * one, 1K should be enough to carry arbitrary but sane
+	 * formula.
+	 */
+	char buf[1024];
+	int len = snprintf(buf, sizeof(buf), fmt, expr,
+			   nr_replicas);
+	if (len >= (int)sizeof(buf)) {
+		diag_set(ClientError, ER_CFG,
+			 "replication_synchro_quorum",
+			 "the formula is too big");
+		return -1;
+	}
+
+	luaL_loadstring(tarantool_L, buf);
+	if (lua_pcall(tarantool_L, 0, 1, 0) != 0) {
+		diag_set(ClientError, ER_CFG,
+			 "replication_synchro_quorum",
+			 lua_tostring(tarantool_L, -1));
+		return -1;
+	}
+
+	int quorum = -1;
+	if (lua_isnumber(tarantool_L, -1))
+		quorum = (int)lua_tonumber(tarantool_L, -1);
+	lua_pop(tarantool_L, 1);
+
+	/*
+	 * At least we should have 1 node to sync, the weird
+	 * formulas such as N-2 do not guarantee quorums thus
+	 * return an error.
+	 */
+	if (quorum <= 0 || quorum >= VCLOCK_MAX) {
+		const char *msg =
+			tt_sprintf("the formula is evaluated "
+				   "to the quorum %d for replica "
+				   "number %d, which is out of range "
+				   "[%d;%d]",
+				   quorum, nr_replicas, 1, VCLOCK_MAX - 1);
+		diag_set(ClientError, ER_CFG,
+			 "replication_synchro_quorum", msg);
+		return -1;
+	}
+
+	return quorum;
+}
+
 static int
 box_check_replication_synchro_quorum(void)
 {
+	if (!cfg_isnumber("replication_synchro_quorum")) {
+		/*
+		 * The formula uses symbolic name 'N' as
+		 * a number of currently registered replicas.
+		 *
+		 * When we're in "checking" mode we should walk
+		 * over all possible number of replicas to make
+		 * sure the formula is correct.
+		 *
+		 * Note that currently VCLOCK_MAX is pretty small
+		 * value but if we gonna increase this limit make
+		 * sure that the cycle won't take too much time.
+		 */
+		for (int i = 1; i < VCLOCK_MAX; i++) {
+			if (box_eval_replication_synchro_quorum(i) < 0)
+				return -1;
+		}
+		return 0;
+	}
+
 	int quorum = cfg_geti("replication_synchro_quorum");
 	if (quorum <= 0 || quorum >= VCLOCK_MAX) {
 		diag_set(ClientError, ER_CFG, "replication_synchro_quorum",
@@ -913,7 +1013,32 @@ box_set_replication_sync_lag(void)
 void
 box_update_replication_synchro_quorum(void)
 {
-	int quorum = cfg_geti("replication_synchro_quorum");
+	int quorum = -1;
+
+	if (!cfg_isnumber("replication_synchro_quorum")) {
+		/*
+		 * The formula has been verified already. For bootstrap
+		 * stage pass 1 as a number of replicas to sync because
+		 * we're at early stage and registering a new replica.
+		 *
+		 * This should cover the valid case where formula is plain
+		 * "N", ie all replicas are to be synchro mode.
+		 */
+		int value = MAX(1, replicaset.registered_count);
+		quorum = box_eval_replication_synchro_quorum(value);
+		say_info("update replication_synchro_quorum = %d", quorum);
+	} else {
+		quorum = cfg_geti("replication_synchro_quorum");
+	}
+
+	/*
+	 * This should never happen because the values were
+	 * validated already but just to prevent from
+	 * unexpected changes and because the value is too
+	 * important for qsync, lets re-check (this is cheap).
+	 */
+	if (quorum <= 0 || quorum >= VCLOCK_MAX)
+		panic("failed to eval/fetch replication_synchro_quorum");
 
 	replication_synchro_quorum = quorum;
 	txn_limbo_on_parameters_change(&txn_limbo);
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index 770442052..2355dbcd2 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -172,7 +172,7 @@ local template_cfg = {
     replication_timeout = 'number',
     replication_sync_lag = 'number',
     replication_sync_timeout = 'number',
-    replication_synchro_quorum = 'number',
+    replication_synchro_quorum = 'string, number',
     replication_synchro_timeout = 'number',
     replication_connect_timeout = 'number',
     replication_connect_quorum = 'number',
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 931c73a37..3126d86ac 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -251,7 +251,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
 	replica->anon = false;
-	box_raft_update_election_quorum();
+	box_update_replication_synchro_quorum();
 }
 
 void
@@ -300,7 +300,7 @@ replica_clear_id(struct replica *replica)
 		assert(!replica->anon);
 		replica_delete(replica);
 	}
-	box_raft_update_election_quorum();
+	box_update_replication_synchro_quorum();
 }
 
 void
-- 
2.26.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 4/5] cfg: more precise check for replication_synchro_quorum value
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 3/5] cfg: support symbolic evaluation of replication_synchro_quorum Cyrill Gorcunov
@ 2020-12-22 11:14 ` Cyrill Gorcunov
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 5/5] test: add replication/gh-5446-qsync-eval-quorum.test.lua Cyrill Gorcunov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

When we fetch replication_synchro_quorum value (either as
a plain integer or via formula evaluation) we trim the
number down to integer, which silently hides potential
overflow errors.

For example

 | box.cfg{replication_synchro_quorum='4294967297'}

which is 1 in terms of machine words. Lets use 8 bytes
values and trigger an error instead.

Part-of #5446

Reported-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/box.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 68579c254..b3cc45358 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -609,9 +609,9 @@ box_eval_replication_synchro_quorum(int nr_replicas)
 		return -1;
 	}
 
-	int quorum = -1;
+	int64_t quorum = -1;
 	if (lua_isnumber(tarantool_L, -1))
-		quorum = (int)lua_tonumber(tarantool_L, -1);
+		quorum = luaL_toint64(tarantool_L, -1);
 	lua_pop(tarantool_L, 1);
 
 	/*
@@ -657,7 +657,7 @@ box_check_replication_synchro_quorum(void)
 		return 0;
 	}
 
-	int quorum = cfg_geti("replication_synchro_quorum");
+	int64_t quorum = cfg_geti64("replication_synchro_quorum");
 	if (quorum <= 0 || quorum >= VCLOCK_MAX) {
 		diag_set(ClientError, ER_CFG, "replication_synchro_quorum",
 			 "the value must be greater than zero and less than "
-- 
2.26.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 5/5] test: add replication/gh-5446-qsync-eval-quorum.test.lua
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
                   ` (3 preceding siblings ...)
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 4/5] cfg: more precise check for replication_synchro_quorum value Cyrill Gorcunov
@ 2020-12-22 11:14 ` Cyrill Gorcunov
  2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
  2020-12-23 17:52 ` Vladislav Shpilevoy
  6 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 11:14 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Part-of #5446

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 .../gh-5446-qsync-eval-quorum.result          | 277 ++++++++++++++++++
 .../gh-5446-qsync-eval-quorum.test.lua        | 110 +++++++
 test/replication/replica-quorum-1.lua         |   1 +
 test/replication/replica-quorum-2.lua         |   1 +
 test/replication/replica-quorum-3.lua         |   1 +
 test/replication/replica-quorum-4.lua         |   1 +
 test/replication/replica-quorum-5.lua         |   1 +
 test/replication/replica-quorum-6.lua         |   1 +
 8 files changed, 393 insertions(+)
 create mode 100644 test/replication/gh-5446-qsync-eval-quorum.result
 create mode 100644 test/replication/gh-5446-qsync-eval-quorum.test.lua
 create mode 120000 test/replication/replica-quorum-1.lua
 create mode 120000 test/replication/replica-quorum-2.lua
 create mode 120000 test/replication/replica-quorum-3.lua
 create mode 120000 test/replication/replica-quorum-4.lua
 create mode 120000 test/replication/replica-quorum-5.lua
 create mode 120000 test/replication/replica-quorum-6.lua

diff --git a/test/replication/gh-5446-qsync-eval-quorum.result b/test/replication/gh-5446-qsync-eval-quorum.result
new file mode 100644
index 000000000..6c3c63577
--- /dev/null
+++ b/test/replication/gh-5446-qsync-eval-quorum.result
@@ -0,0 +1,277 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+-- Test syntax error
+box.cfg{replication_synchro_quorum = "aaa"}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': [string "local
+ |     expr = [[aaa]]..."]:7: Expression should return a number'
+ | ...
+
+-- Test out of bounds values
+box.cfg{replication_synchro_quorum = "N+1"}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the formula is
+ |     evaluated to the quorum 32 for replica number 31, which is out of range [1;31]'
+ | ...
+box.cfg{replication_synchro_quorum = "N-1"}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the formula is
+ |     evaluated to the quorum 0 for replica number 1, which is out of range [1;31]'
+ | ...
+
+-- Test big number value
+box.cfg{replication_synchro_quorum = '4294967297'}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
+ |     be greater than zero and less than maximal number of replicas'
+ | ...
+box.cfg{replication_synchro_quorum = 4294967297}
+ | ---
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
+ |     be greater than zero and less than maximal number of replicas'
+ | ...
+
+-- Use canonical majority formula
+box.cfg { replication_synchro_quorum = "N/2+1", replication_synchro_timeout = 1000 }
+ | ---
+ | ...
+match = 'set \'replication_synchro_quorum\' configuration option to \"N\\/2%+1'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - set 'replication_synchro_quorum' configuration option to "N\/2+1
+ | ...
+
+-- Create a sync space we will operate on
+_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
+ | ---
+ | ...
+_ = box.space.sync:create_index('pk')
+ | ---
+ | ...
+s = box.space.sync
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica-quorum-1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+-- 1 replica -> replication_synchro_quorum = 2/2 + 1 = 2
+match = 'update replication_synchro_quorum = 2'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - update replication_synchro_quorum = 2
+ | ...
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica-quorum-2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica2 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+-- 2 replicas -> replication_synchro_quorum = 3/2 + 1 = 2
+match = 'update replication_synchro_quorum = 2'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - update replication_synchro_quorum = 2
+ | ...
+
+test_run:cmd('create server replica3 with rpl_master=default,\
+              script="replication/replica-quorum-3.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica3 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+-- 3 replicas -> replication_synchro_quorum = 4/2 + 1 = 3
+match = 'update replication_synchro_quorum = 3'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - update replication_synchro_quorum = 3
+ | ...
+
+test_run:cmd('create server replica4 with rpl_master=default,\
+              script="replication/replica-quorum-4.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica4 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+-- 4 replicas -> replication_synchro_quorum = 5/2 + 1 = 3
+match = 'update replication_synchro_quorum = 3'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - update replication_synchro_quorum = 3
+ | ...
+
+test_run:cmd('create server replica5 with rpl_master=default,\
+              script="replication/replica-quorum-5.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica5 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('create server replica6 with rpl_master=default,\
+              script="replication/replica-quorum-6.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica6 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+-- 6 replicas -> replication_synchro_quorum = 7/2 + 1 = 4
+match = 'update replication_synchro_quorum = 4'
+ | ---
+ | ...
+test_run:wait_log('default', match, nil, 5)
+ | ---
+ | - update replication_synchro_quorum = 4
+ | ...
+
+-- 5 replicas left, the commit should pass
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+s:insert{2}
+ | ---
+ | - [2]
+ | ...
+
+-- 4 replicas left,the commit should pass
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
+s:insert{3}
+ | ---
+ | - [3]
+ | ...
+
+-- 3 replicas left, the commit should pass
+test_run:cmd('stop server replica3')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica3')
+ | ---
+ | - true
+ | ...
+s:insert{4}
+ | ---
+ | - [4]
+ | ...
+
+-- 2 replicas left, the commit should NOT pass
+--
+-- The replication_synchro_timeout set to a small value to not wait
+-- for very long for the case where we know the commit should
+-- not pass since replicas are stopped.
+box.cfg { replication_synchro_timeout = 0.5 }
+ | ---
+ | ...
+test_run:cmd('stop server replica4')
+ | ---
+ | - true
+ | ...
+s:insert{5}
+ | ---
+ | - error: Quorum collection for a synchronous transaction is timed out
+ | ...
+-- restore it back and retry
+test_run:cmd('start server replica4 with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+box.cfg { replication_synchro_timeout = 1000 }
+ | ---
+ | ...
+s:insert{5}
+ | ---
+ | - [5]
+ | ...
+test_run:cmd('stop server replica4')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica4')
+ | ---
+ | - true
+ | ...
+
+-- cleanup leftovers
+
+test_run:cmd('stop server replica5')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica5')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica6')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica6')
+ | ---
+ | - true
+ | ...
+
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
diff --git a/test/replication/gh-5446-qsync-eval-quorum.test.lua b/test/replication/gh-5446-qsync-eval-quorum.test.lua
new file mode 100644
index 000000000..b7ce9a9ca
--- /dev/null
+++ b/test/replication/gh-5446-qsync-eval-quorum.test.lua
@@ -0,0 +1,110 @@
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+box.schema.user.grant('guest', 'replication')
+
+-- Test syntax error
+box.cfg{replication_synchro_quorum = "aaa"}
+
+-- Test out of bounds values
+box.cfg{replication_synchro_quorum = "N+1"}
+box.cfg{replication_synchro_quorum = "N-1"}
+
+-- Test big number value
+box.cfg{replication_synchro_quorum = '4294967297'}
+box.cfg{replication_synchro_quorum = 4294967297}
+
+-- Use canonical majority formula
+box.cfg { replication_synchro_quorum = "N/2+1", replication_synchro_timeout = 1000 }
+match = 'set \'replication_synchro_quorum\' configuration option to \"N\\/2%+1'
+test_run:wait_log('default', match, nil, 5)
+
+-- Create a sync space we will operate on
+_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
+_ = box.space.sync:create_index('pk')
+s = box.space.sync
+s:insert{1}
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica-quorum-1.lua"')
+test_run:cmd('start server replica1 with wait=True, wait_load=True')
+
+-- 1 replica -> replication_synchro_quorum = 2/2 + 1 = 2
+match = 'update replication_synchro_quorum = 2'
+test_run:wait_log('default', match, nil, 5)
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica-quorum-2.lua"')
+test_run:cmd('start server replica2 with wait=True, wait_load=True')
+
+-- 2 replicas -> replication_synchro_quorum = 3/2 + 1 = 2
+match = 'update replication_synchro_quorum = 2'
+test_run:wait_log('default', match, nil, 5)
+
+test_run:cmd('create server replica3 with rpl_master=default,\
+              script="replication/replica-quorum-3.lua"')
+test_run:cmd('start server replica3 with wait=True, wait_load=True')
+
+-- 3 replicas -> replication_synchro_quorum = 4/2 + 1 = 3
+match = 'update replication_synchro_quorum = 3'
+test_run:wait_log('default', match, nil, 5)
+
+test_run:cmd('create server replica4 with rpl_master=default,\
+              script="replication/replica-quorum-4.lua"')
+test_run:cmd('start server replica4 with wait=True, wait_load=True')
+
+-- 4 replicas -> replication_synchro_quorum = 5/2 + 1 = 3
+match = 'update replication_synchro_quorum = 3'
+test_run:wait_log('default', match, nil, 5)
+
+test_run:cmd('create server replica5 with rpl_master=default,\
+              script="replication/replica-quorum-5.lua"')
+test_run:cmd('start server replica5 with wait=True, wait_load=True')
+
+test_run:cmd('create server replica6 with rpl_master=default,\
+              script="replication/replica-quorum-6.lua"')
+test_run:cmd('start server replica6 with wait=True, wait_load=True')
+
+-- 6 replicas -> replication_synchro_quorum = 7/2 + 1 = 4
+match = 'update replication_synchro_quorum = 4'
+test_run:wait_log('default', match, nil, 5)
+
+-- 5 replicas left, the commit should pass
+test_run:cmd('stop server replica1')
+test_run:cmd('delete server replica1')
+s:insert{2}
+
+-- 4 replicas left,the commit should pass
+test_run:cmd('stop server replica2')
+test_run:cmd('delete server replica2')
+s:insert{3}
+
+-- 3 replicas left, the commit should pass
+test_run:cmd('stop server replica3')
+test_run:cmd('delete server replica3')
+s:insert{4}
+
+-- 2 replicas left, the commit should NOT pass
+--
+-- The replication_synchro_timeout set to a small value to not wait
+-- for very long for the case where we know the commit should
+-- not pass since replicas are stopped.
+box.cfg { replication_synchro_timeout = 0.5 }
+test_run:cmd('stop server replica4')
+s:insert{5}
+-- restore it back and retry
+test_run:cmd('start server replica4 with wait=True, wait_load=True')
+box.cfg { replication_synchro_timeout = 1000 }
+s:insert{5}
+test_run:cmd('stop server replica4')
+test_run:cmd('delete server replica4')
+
+-- cleanup leftovers
+
+test_run:cmd('stop server replica5')
+test_run:cmd('delete server replica5')
+
+test_run:cmd('stop server replica6')
+test_run:cmd('delete server replica6')
+
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/replica-quorum-1.lua b/test/replication/replica-quorum-1.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-1.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
diff --git a/test/replication/replica-quorum-2.lua b/test/replication/replica-quorum-2.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-2.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
diff --git a/test/replication/replica-quorum-3.lua b/test/replication/replica-quorum-3.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-3.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
diff --git a/test/replication/replica-quorum-4.lua b/test/replication/replica-quorum-4.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-4.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
diff --git a/test/replication/replica-quorum-5.lua b/test/replication/replica-quorum-5.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-5.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
diff --git a/test/replication/replica-quorum-6.lua b/test/replication/replica-quorum-6.lua
new file mode 120000
index 000000000..da69ac81c
--- /dev/null
+++ b/test/replication/replica-quorum-6.lua
@@ -0,0 +1 @@
+replica.lua
\ No newline at end of file
-- 
2.26.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
                   ` (4 preceding siblings ...)
  2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 5/5] test: add replication/gh-5446-qsync-eval-quorum.test.lua Cyrill Gorcunov
@ 2020-12-22 13:49 ` Vladislav Shpilevoy
  2020-12-22 13:53   ` Cyrill Gorcunov
                     ` (2 more replies)
  2020-12-23 17:52 ` Vladislav Shpilevoy
  6 siblings, 3 replies; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-12-22 13:49 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the patchset!

Looks good, except that the test is still flaky. I run it
with the same command as earlier, and got this:

[015] Test failed! Result content mismatch:
[015] --- replication/gh-5446-qsync-eval-quorum.result	Tue Dec 22 14:45:06 2020
[015] +++ var/rejects/replication/gh-5446-qsync-eval-quorum.reject	Tue Dec 22 14:47:31 2020
[015] @@ -50,7 +50,6 @@
[015]   | ...
[015]  test_run:wait_log('default', match, nil, 5)
[015]   | ---
[015] - | - set 'replication_synchro_quorum' configuration option to "N\/2+1
[015]   | ...
[015]  
[015]  -- Create a sync space we will operate on
[015] 

The command looks like this:

	python test-run.py gh-5446- gh-5446- gh-5446- gh-5446- gh-5446- gh-5446 ... <many many more>

Test-run says I run this test 258 times in the command above.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
@ 2020-12-22 13:53   ` Cyrill Gorcunov
  2020-12-22 14:32   ` Cyrill Gorcunov
  2020-12-22 17:13   ` Cyrill Gorcunov
  2 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 13:53 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Dec 22, 2020 at 02:49:06PM +0100, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patchset!
> 
> Looks good, except that the test is still flaky. I run it
> with the same command as earlier, and got this:

Thanks a huge, Vlad! I'll try to repeat the problem locally.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
  2020-12-22 13:53   ` Cyrill Gorcunov
@ 2020-12-22 14:32   ` Cyrill Gorcunov
  2020-12-22 17:13   ` Cyrill Gorcunov
  2 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 14:32 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Dec 22, 2020 at 02:49:06PM +0100, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patchset!
> 
> Looks good, except that the test is still flaky. I run it
> with the same command as earlier, and got this:

Vlad, I've increased timeout for console buffer flushing.
Could you please re-fetch the branch (I've force pushed it)
and test out?

The timeout trick doesn't look 100% reliable and I think
I will need to enhance the test to use new qsync statistics
(which is not yet done). Thus I'll make a linked issue for
this and once qsync stat is out we will fix the test for
sure.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
  2020-12-22 13:53   ` Cyrill Gorcunov
  2020-12-22 14:32   ` Cyrill Gorcunov
@ 2020-12-22 17:13   ` Cyrill Gorcunov
  2020-12-23 14:14     ` Vladislav Shpilevoy
  2 siblings, 1 reply; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-22 17:13 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Dec 22, 2020 at 02:49:06PM +0100, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patchset!
> 
> Looks good, except that the test is still flaky. I run it
> with the same command as earlier, and got this:

Vlad, I've reworked the test. Could you please take a look?
Pushed it into separate branch gorcunov/gh-5446-eval-quorum-7

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 17:13   ` Cyrill Gorcunov
@ 2020-12-23 14:14     ` Vladislav Shpilevoy
  2020-12-23 15:54       ` Cyrill Gorcunov
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-12-23 14:14 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

Hi! Thanks for the fixes!

> diff --git a/test/replication/gh-5446-qsync-eval-quorum.result b/test/replication/gh-5446-qsync-eval-quorum.result
> new file mode 100644
> index 000000000..832c3f6e5
> --- /dev/null
> +++ b/test/replication/gh-5446-qsync-eval-quorum.result
> @@ -0,0 +1,298 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +
> +-- Test syntax error
> +box.cfg{replication_synchro_quorum = "aaa"}
> + | ---
> + | - error: 'Incorrect value for option ''replication_synchro_quorum'': [string "local
> + |     expr = [[aaa]]..."]:7: Expression should return a number'
> + | ...
> +
> +-- Test out of bounds values
> +box.cfg{replication_synchro_quorum = "N+1"}
> + | ---
> + | - error: 'Incorrect value for option ''replication_synchro_quorum'': the formula is
> + |     evaluated to the quorum 32 for replica number 31, which is out of range [1;31]'
> + | ...
> +box.cfg{replication_synchro_quorum = "N-1"}
> + | ---
> + | - error: 'Incorrect value for option ''replication_synchro_quorum'': the formula is
> + |     evaluated to the quorum 0 for replica number 1, which is out of range [1;31]'
> + | ...
> +
> +-- Test big number value
> +box.cfg{replication_synchro_quorum = '4294967297'}

1. This test is exactly the same as the next line, because
both return true from cfg_isnumber(). So you didn't test overflow
in the formula result here.

> + | ---
> + | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
> + |     be greater than zero and less than maximal number of replicas'
> + | ...
> +box.cfg{replication_synchro_quorum = 4294967297}
> + | ---
> + | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
> + |     be greater than zero and less than maximal number of replicas'
> + | ...
> +
> +-- Timeouts for replication
> +function cfg_set_pass_tmo() box.cfg{replication_synchro_timeout = 1000} end
> + | ---
> + | ...
> +function cfg_set_fail_tmo() box.cfg{replication_synchro_timeout = 0.5} end
> + | ---
> + | ...
> +
> +-- Use canonical majority formula
> +box.cfg{replication_synchro_quorum = "N/2+1"}

2. I asked it in the previous email and in the chat, but
you didn't pay attention, did you? The option is never restored
to the initial value in the end of the test. Timeout too. This
could break the next tests running in the same test-run worker.

Also you didn't drop the space in the end.

Since we don't have much time, I fixed these issues, and pushed
on top of the branch in a separate commit. Please, review and
squash if you agree.

Also the diff is below:

====================
diff --git a/test/replication/gh-5446-qsync-eval-quorum.result b/test/replication/gh-5446-qsync-eval-quorum.result
index 832c3f6e5..2acd633b2 100644
--- a/test/replication/gh-5446-qsync-eval-quorum.result
+++ b/test/replication/gh-5446-qsync-eval-quorum.result
@@ -10,6 +10,13 @@ box.schema.user.grant('guest', 'replication')
  | ---
  | ...
 
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
 -- Test syntax error
 box.cfg{replication_synchro_quorum = "aaa"}
  | ---
@@ -30,10 +37,10 @@ box.cfg{replication_synchro_quorum = "N-1"}
  | ...
 
 -- Test big number value
-box.cfg{replication_synchro_quorum = '4294967297'}
+box.cfg{replication_synchro_quorum = '4294967296 + 1'}
  | ---
- | - error: 'Incorrect value for option ''replication_synchro_quorum'': the value must
- |     be greater than zero and less than maximal number of replicas'
+ | - error: 'Incorrect value for option ''replication_synchro_quorum'': the formula is
+ |     evaluated to the quorum 1 for replica number 1, which is out of range [1;31]'
  | ...
 box.cfg{replication_synchro_quorum = 4294967297}
  | ---
@@ -58,13 +65,10 @@ cfg_set_pass_tmo()
  | ...
 
 -- Create a sync space we will operate on
-_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
+s = box.schema.space.create('sync', {is_sync = true, engine = engine})
  | ---
  | ...
-_ = box.space.sync:create_index('pk')
- | ---
- | ...
-s = box.space.sync
+_ = s:create_index('pk')
  | ---
  | ...
 
@@ -293,6 +297,17 @@ test_run:cmd('delete server replica6')
  | - true
  | ...
 
+s:drop()
+ | ---
+ | ...
+
 box.schema.user.revoke('guest', 'replication')
  | ---
  | ...
+
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5446-qsync-eval-quorum.test.lua b/test/replication/gh-5446-qsync-eval-quorum.test.lua
index b905af3d9..e8c067246 100644
--- a/test/replication/gh-5446-qsync-eval-quorum.test.lua
+++ b/test/replication/gh-5446-qsync-eval-quorum.test.lua
@@ -3,6 +3,9 @@ engine = test_run:get_cfg('engine')
 
 box.schema.user.grant('guest', 'replication')
 
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
 -- Test syntax error
 box.cfg{replication_synchro_quorum = "aaa"}
 
@@ -11,7 +14,7 @@ box.cfg{replication_synchro_quorum = "N+1"}
 box.cfg{replication_synchro_quorum = "N-1"}
 
 -- Test big number value
-box.cfg{replication_synchro_quorum = '4294967297'}
+box.cfg{replication_synchro_quorum = '4294967296 + 1'}
 box.cfg{replication_synchro_quorum = 4294967297}
 
 -- Timeouts for replication
@@ -23,9 +26,8 @@ box.cfg{replication_synchro_quorum = "N/2+1"}
 cfg_set_pass_tmo()
 
 -- Create a sync space we will operate on
-_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
-_ = box.space.sync:create_index('pk')
-s = box.space.sync
+s = box.schema.space.create('sync', {is_sync = true, engine = engine})
+_ = s:create_index('pk')
 
 -- Only one master node -> 1/2 + 1 = 1
 s:insert{1} -- should pass
@@ -103,4 +105,11 @@ test_run:cmd('delete server replica5')
 test_run:cmd('stop server replica6')
 test_run:cmd('delete server replica6')
 
+s:drop()
+
 box.schema.user.revoke('guest', 'replication')
+
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-23 14:14     ` Vladislav Shpilevoy
@ 2020-12-23 15:54       ` Cyrill Gorcunov
  0 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov @ 2020-12-23 15:54 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Wed, Dec 23, 2020 at 03:14:15PM +0100, Vladislav Shpilevoy wrote:
> > +-- Test big number value
> > +box.cfg{replication_synchro_quorum = '4294967297'}
> 
> 1. This test is exactly the same as the next line, because
> both return true from cfg_isnumber(). So you didn't test overflow
> in the formula result here.

True :( You've fixed it.

> > +-- Use canonical majority formula
> > +box.cfg{replication_synchro_quorum = "N/2+1"}
> 
> 2. I asked it in the previous email and in the chat, but
> you didn't pay attention, did you? The option is never restored
> to the initial value in the end of the test. Timeout too. This
> could break the next tests running in the same test-run worker.
> 
> Also you didn't drop the space in the end.
> 
> Since we don't have much time, I fixed these issues, and pushed
> on top of the branch in a separate commit. Please, review and
> squash if you agree.

Thanks a huge, Vlad! I happen to miss that I need to do it.
Here is a diff over previous patches (test file only
to not spam alot). I pushed everything as

	gorcunov/gh-5446-eval-quorum-8
--
diff --git a/test/replication/gh-5446-qsync-eval-quorum.test.lua b/test/replication/gh-5446-qsync-eval-quorum.test.lua
index b905af3d9..62d87ddcb 100644
--- a/test/replication/gh-5446-qsync-eval-quorum.test.lua
+++ b/test/replication/gh-5446-qsync-eval-quorum.test.lua
@@ -3,6 +3,10 @@ engine = test_run:get_cfg('engine')
 
 box.schema.user.grant('guest', 'replication')
 
+-- Save old settings for cleanup sake
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
 -- Test syntax error
 box.cfg{replication_synchro_quorum = "aaa"}
 
@@ -11,7 +15,8 @@ box.cfg{replication_synchro_quorum = "N+1"}
 box.cfg{replication_synchro_quorum = "N-1"}
 
 -- Test big number value
-box.cfg{replication_synchro_quorum = '4294967297'}
+box.cfg{replication_synchro_quorum = '4294967296 + 1'}
+box.cfg{replication_synchro_quorum = 'N + 4294967296'}
 box.cfg{replication_synchro_quorum = 4294967297}
 
 -- Timeouts for replication
@@ -23,9 +28,8 @@ box.cfg{replication_synchro_quorum = "N/2+1"}
 cfg_set_pass_tmo()
 
 -- Create a sync space we will operate on
-_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
-_ = box.space.sync:create_index('pk')
-s = box.space.sync
+s = box.schema.space.create('sync', {is_sync = true, engine = engine})
+_ = s:create_index('pk')
 
 -- Only one master node -> 1/2 + 1 = 1
 s:insert{1} -- should pass
@@ -103,4 +107,11 @@ test_run:cmd('delete server replica5')
 test_run:cmd('stop server replica6')
 test_run:cmd('delete server replica6')
 
+s:drop()
+
 box.schema.user.revoke('guest', 'replication')
+
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
                   ` (5 preceding siblings ...)
  2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
@ 2020-12-23 17:52 ` Vladislav Shpilevoy
  2020-12-24 13:55   ` Vladislav Shpilevoy
  6 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-12-23 17:52 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml, Alexander V. Tikhonov

The latest branch is: gorcunov/gh-5446-eval-quorum-8.

The patchset LGTM.

But I can't see CI - gitlab jobs are not running, and Travis
is simply useless because needs thousand years even to start.

Alexander Ti., please, tell if I can push this.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically
  2020-12-23 17:52 ` Vladislav Shpilevoy
@ 2020-12-24 13:55   ` Vladislav Shpilevoy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-12-24 13:55 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml, Alexander V. Tikhonov

Pushed to master, 2.6, and 2.5.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-12-24 13:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-22 11:14 [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Cyrill Gorcunov
2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 1/5] cfg: add cfg_isnumber helper Cyrill Gorcunov
2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 2/5] cfg: rework box_check_replication_synchro_quorum Cyrill Gorcunov
2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 3/5] cfg: support symbolic evaluation of replication_synchro_quorum Cyrill Gorcunov
2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 4/5] cfg: more precise check for replication_synchro_quorum value Cyrill Gorcunov
2020-12-22 11:14 ` [Tarantool-patches] [PATCH v6 5/5] test: add replication/gh-5446-qsync-eval-quorum.test.lua Cyrill Gorcunov
2020-12-22 13:49 ` [Tarantool-patches] [PATCH v6 0/5] qsync: evaluate replication_synchro_quorum dynamically Vladislav Shpilevoy
2020-12-22 13:53   ` Cyrill Gorcunov
2020-12-22 14:32   ` Cyrill Gorcunov
2020-12-22 17:13   ` Cyrill Gorcunov
2020-12-23 14:14     ` Vladislav Shpilevoy
2020-12-23 15:54       ` Cyrill Gorcunov
2020-12-23 17:52 ` Vladislav Shpilevoy
2020-12-24 13:55   ` Vladislav Shpilevoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox