[Tarantool-patches] [PATCH v3 2/3] cfg: support symbolic evaluation of replication_synchro_quorum

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu Dec 10 02:22:28 MSK 2020


Hi! Thanks for the patch!

Your branch fails all the qsync tests in CI:
https://gitlab.com/tarantool/tarantool/-/pipelines/226651134

See 6 comments below.

> Subject: [PATCH 2/3] cfg: support symbolic evaluation of replication_synchro_quorum
> 
> When synchronous replication is used we prefer a user to specify
> a quorum number, ie the number of replicas where data must be
> replicated before the master node continue accepting new
> transactions.
> 
> This is not very convenient since a user may not know initially
> how many replicas will be used. Moreover the number of replicas
> may vary dynamically. For this sake we allow to specify the
> number of quorum in a symbolic way.
> 
> For example
> 
> box.cfg {
> 	replication_synchro_quorum = "N/2+1",
> }
> 
> where `N` is a number of registered replicas in a cluster.
> Once new replica attached or old one detached the number
> is renewed and propagated.
> 
> Internally on each replica_set_id() and replica_clear_id(),
> ie at moment when replica get registered or unregistered,
> we call box_update_replication_synchro_quorum() helper which
> finds out if evaluation of replication_synchro_quorum is
> needed and if so we calculate new replication_synchro_quorum
> value based on number of currently registered replicas. Then
> we notify dependent systems such as qsync and raft to update
> their guts.
> 
> Note: we do *not* change the default settings for this option,
> it remains 1 by default for now. Change the default option should
> be done as a separate commit once we make sure that everything is
> fine.
> 
> Closes #5446
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov at gmail.com>
> 
> @TarantoolBot document
> Title: Synchronous replication

1. Please, be more specific. Imagine if all github tickets about
qsync would have the same title 'Synchronous replication'.

> The plain integer value might be convenient for simple scenarios.

The plain integer for what? Please, keep in mind that the docteam
will see everything below "@TarantoolBot document" mark. It means,
at this sentence the reader is already lost, because no context at
all. This text will be read not by developers, and out of the
commit message context.

Only by replication_synchro_quorum below the reader may assume,
that as a 'plain integer' you mean the old way of specifying
replication_synchro_quorum. Which we know, but the docteam does
not remember, that replication_synchro_quorum was an integer before,
and now can be a string.

State explicitly what is the request is about, how it worked before,
what changed now, and why. 'Why' part is good - that it handles the
'dynamic' part for you.

> But in case if number of replicas is dynamic a better way should
> be to describe a quorum via formula. The formula should use symbol
> `N` to represent amount of registered replicas.
> 
> For example the canonical definition for a quorum (ie majority
> of members in a set) of `N` replicas is `N/2+1`. For such
> configuration one can define
> 
> ```
> box.cfg {replication_synchro_quorum = "N/2+1"}
> ```
> 
> Note that for sake of simplicity quorum evaluation never returns
> negative values thus for the case of formula say `N-2` the result
> will be 1 until number of replicas become 4 and more.
> ---
>  src/box/box.cc           | 121 +++++++++++++++++++++++++++++++++++++--
>  src/box/box.h            |   1 +
>  src/box/lua/load_cfg.lua |   2 +-
>  src/box/replication.cc   |   4 +-
>  4 files changed, 121 insertions(+), 7 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index a8bc3471d..2076a8dc3 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -554,10 +554,91 @@ box_check_replication_sync_lag(void)
>  	return lag;
>  }
>  
> +/**
> + * Evaluate replication syncro quorum number from a formula.
> + */
> +static int
> +eval_replication_synchro_quorum(int nr_replicas)

2. Number of replicas is always passed as replicaset.registered_count.
I suggest you to make this function take no args, and read
replicaset.registered_count internally.

Also would be good to rename it to box_eval_... . Because you
touch box things here. Such as cfg_gets("replication_synchro_quorum"),
for example. Which reads box.cfg.

> +{
> +	const char fmt[] =
> +		"local expr = [[%s]]\n"
> +		"local f, err = loadstring('return ('..expr..')')\n"
> +		"if not f then "
> +			"error(string.format('Failed to load \%\%s:"
> +			"\%\%s', expr, err)) "
> +		"end\n"
> +		"setfenv(f, {N = %d, math = {"
> +			"ceil = math.ceil,"
> +			"floor = math.floor,"
> +			"abs = math.abs,"
> +			"random = math.random,"
> +			"min = math.min,"
> +			"max = math.abs,"
> +			"sqrt = math.sqrt,"
> +			"fmod = math.fmod,"
> +		"}})\n"
> +		"local res = f()\n"
> +		"if type(res) ~= 'number' then\n"
> +			"error('Expression should return a number')\n"
> +		"end\n"
> +		"return math.floor(res)\n";
> +	int value = -1;
> +	const char *expr = cfg_gets("replication_synchro_quorum");
> +	const char *buf = tt_sprintf(fmt, expr, nr_replicas);

3. What is the result is >= TT_STATIC_BUF_LEN? I suspect a user will
get a surprising error message, or will even get no error, but the
expression will be just truncated. Does not look good.

Oh, shit. I just found that cfg_gets() also uses the static buffer.
Besides, it just truncates the string value to 256 chars. So whatever
you specify as replication_synchro_quorum, if it is longer than 256,
it is silently truncated. Also does not look good. But don't know
how to fix it, and if we want to fix it now.

> +
> +	luaL_loadstring(tarantool_L, buf);
> +	if (lua_pcall(tarantool_L, 0, 1, 0) != 0) {
> +		diag_set(ClientError, ER_CFG,
> +			 "replication_synchro_quorum",
> +			 lua_tostring(tarantool_L, -1));
> +		return -1;
> +	}
> +
> +	if (lua_isnumber(tarantool_L, -1))
> +		value = (int)lua_tonumber(tarantool_L, -1);
> +	lua_pop(tarantool_L, 1);
> +
> +	/*
> +	 * At least we should have 1 node to sync, thus
> +	 * if the formula has evaluated to some negative
> +	 * value (say it was n-2) do not treat it as an
> +	 * error but just yield a minimum valid magnitude.
> +	 */
> +	if (value >= VCLOCK_MAX) {
> +		const int value_max = VCLOCK_MAX - 1;
> +		say_warn("replication_synchro_quorum evaluated "
> +			 "to value %d, set to %d",
> +			 value, value_max);
> +		value = value_max;

4. When I said I want to see a warning when the quorum is
evaluated to 0, while number of replicas is > 0, I didn't mean to
delete the validation at all.

Your example about 'n-2' is a proof that a negative value means
an issue in user's code. Because if node count is 3, the quorum
will be 1, and synchro guarantees simply don't work.

But since we are here, there are two options:

- Delete the upper bound validation as well. Because it makes no
sense to check it if we allow to overflow the other bound. This
warning does not warn about all invalid values anyway. Moreover,
a "<= 0" quorum is more dangerous than a too big quorum IMO, as the
user will commit data but with weaker guarantees.

- Return the lower bound check and properly catch the case when the
quorum is 0 illegally. It is easy. If the formula returned a negative
value, it is always a warning. If the formula returned 0, but the
number of replicas is > 0, then this is a warning. Everything else is
correct (if no upper overflow). Including the case when the formula
returned 0, and the number of replicas is 0 (happens at bootstrap).
Also we could warn even in the latter case (0 quorum, 0 replicas).
Because it signals that the user does not use N/2+1 formula.

If user will want to do strange things like N*3/4 or N-2, then he
will see warnings, will think more, and will add 'if's into his
code or min/max calls, or will fix an issue in his code.


Another third special option is kinda stupid, but reliable as fuck.

When formula is changed, you can try it with all replica counts
from 0 to 31. And if any returns an out of range value, we return
an error saying on which size the bad value was returned. Then
during cluster size changes we will never get a bad value. And the
user won't need to read the logs to see the errors. Personally, I
would just do it from the beginning. Box.cfg is rare, and all 32
values will be checked in the order of ones of microseconds I think.

> +	}
> +
> +	/*
> +	 * We never return 0, even if we're in bootstrap
> +	 * stage were number of replicas equals zero we
> +	 * should consider the node itself as a minimum
> +	 * quorum number.
> +	 */
> +	return MAX(1, value);
> +}
> +
>  static int
>  box_check_replication_synchro_quorum(void)
>  {
> -	int quorum = cfg_geti("replication_synchro_quorum");
> +	int quorum = 0;
> +
> +	if (!cfg_isnumber("replication_synchro_quorum")) {
> +		/*
> +		 * The formula uses symbolic name 'N' as
> +		 * a number of currently registered replicas.
> +		 */
> +		int value = replicaset.registered_count;
> +		quorum = eval_replication_synchro_quorum(value);
> +		if (quorum < 0)
> +			return -1;> +	} else {
> +		quorum = cfg_geti("replication_synchro_quorum");
> +	}
> +
>  	if (quorum <= 0 || quorum >= VCLOCK_MAX) {
>  		diag_set(ClientError, ER_CFG, "replication_synchro_quorum",
>  			 "the value must be greater than zero and less than "
> @@ -910,15 +991,47 @@ box_set_replication_sync_lag(void)
>  	replication_sync_lag = box_check_replication_sync_lag();
>  }
>  
> +/**
> + * Renew replication_synchro_quorum value if defined
> + * as a formula and we need to recalculate it.
> + */
> +void
> +box_update_replication_synchro_quorum(void)
> +{
> +	if (cfg_isnumber("replication_synchro_quorum")) {
> +		/*
> +		 * Even if replication_synchro_quorum is a constant
> +		 * number the RAFT engine should be notified on
> +		 * change of replicas amount.
> +		 */
> +		box_raft_update_election_quorum();

5. Why don't you update the limbo? And why don't you change
replication_synchro_quorum global variable? It is not changed
anywhere now.

> +		return;
> +	}
> +
> +	/*
> +	 * The formula has been verified already on the bootstrap
> +	 * stage (and on dynamic reconfig as well), still there
> +	 * is a Lua call inside, heck knowns what could go wrong

6. knowns -> knows.

> +	 * there thus panic if we're screwed.
> +	 */
> +	int value = replicaset.registered_count;
> +	int quorum = eval_replication_synchro_quorum(value);
> +	if (quorum < 0 || quorum >= VCLOCK_MAX)
> +		panic("failed to eval replication_synchro_quorum");
> +	say_info("update replication_synchro_quorum = %d", quorum);
> +
> +	replication_synchro_quorum = quorum;
> +	txn_limbo_on_parameters_change(&txn_limbo);
> +	box_raft_update_election_quorum();
> +}
> +
>  int
>  box_set_replication_synchro_quorum(void)
>  {
>  	int value = box_check_replication_synchro_quorum();
>  	if (value < 0)
>  		return -1;
> -	replication_synchro_quorum = value;
> -	txn_limbo_on_parameters_change(&txn_limbo);
> -	box_raft_update_election_quorum();
> +	box_update_replication_synchro_quorum();
>  	return 0;
>  }


More information about the Tarantool-patches mailing list