[Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Tue Oct 5 00:04:10 MSK 2021
Hi! Thanks for working on this!
>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>> index 219ffa38d..89cda5599 100644
>>> --- a/src/box/box.cc
>>> +++ b/src/box/box.cc
>>> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>>> applier_delete(appliers[i]); /* doesn't affect diag */
>>> });
>>> - replicaset_connect(appliers, count, connect_quorum);
>>> + bool connect_quorum = strict;
>>> + bool keep_connect = !strict;
>>> + replicaset_connect(appliers, count, connect_quorum, keep_connect);
>> 1. How about passing both these parameters explicitly to box_sync_replication?
>> I don't understand the link between them so that they could be one.
>>
>> It seems the only case when you need to drop the old connections is when
>> you turn anon to normal. Why should they be fully reset otherwise?
>
> Yes, it's true. anon to normal is the only place where existing
> connections should be reset.
>
> For both bootstrap and local recovery (first ever box.cfg) keep_connect
> doesn't make sense at all, because there are no previous connections to
> keep.
>
> So the only two (out of 5) box_sync_replication() calls that need
> keep_connect are replication reconfiguration (keep_connect = true) and
> anon replica reconfiguration (keep_connect = false).
>
> Speaking of the relation between keep_connect and connect_quorum:
> We don't care about keep_connect in 3 calls (bootstrap and recovery),
> and when keep_connect is important, it's equal to !connect_quorum.
> I thought it might be nice to replace them with a single parameter.
>
> I tried to pass both parameters to box_sync_replication() at first.
> This looked rather ugly IMO:
> box_sync_replication(true, false), box_sync_replication(false, true);
> Two boolean parameters which are responsible for God knows what are
> worse than one parameter.
>
> I'm not 100% happy with my solution, but it at least hides the second
> parameter. And IMO box_sync_replication(strict) is rather easy to
> understand: when strict = true, you want to connect to quorum, and
> you want to reset the connections. And vice versa when strict = false.
This can be resolved with a couple of wrappers, like in this diff:
====================
diff --git a/src/box/box.cc b/src/box/box.cc
index 89cda5599..c1216172d 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
* don't start appliers.
*/
static void
-box_sync_replication(bool strict)
+box_sync_replication(bool do_quorum, bool do_reuse)
{
int count = 0;
struct applier **appliers = cfg_get_replication(&count);
@@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
for (int i = 0; i < count; i++)
applier_delete(appliers[i]); /* doesn't affect diag */
});
-
- bool connect_quorum = strict;
- bool keep_connect = !strict;
- replicaset_connect(appliers, count, connect_quorum, keep_connect);
+ replicaset_connect(appliers, count, do_quorum, do_reuse);
guard.is_active = false;
}
+static inline void
+box_reset_replication(void)
+{
+ const bool do_quorum = true;
+ const bool do_reuse = false;
+ box_sync_replication(do_quorum, do_reuse);
+}
+
+static inline void
+box_update_replication(void)
+{
+ const bool do_quorum = false;
+ const bool do_reuse = true;
+ box_sync_replication(do_quorum, do_reuse);
+}
+
void
box_set_replication(void)
{
@@ -1286,7 +1299,7 @@ box_set_replication(void)
* Stay in orphan mode in case we fail to connect to at least
* 'replication_connect_quorum' remote instances.
*/
- box_sync_replication(false);
+ box_update_replication();
/* Follow replica */
replicaset_follow();
/* Wait until appliers are in sync */
@@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
* them can register and others resend a
* non-anonymous subscribe.
*/
- box_sync_replication(true);
+ box_reset_replication();
/*
* Wait until the master has registered this
* instance.
@@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
* with connecting to 'replication_connect_quorum' masters.
* If this also fails, throw an error.
*/
- box_sync_replication(true);
+ box_reset_replication();
struct replica *master = replicaset_find_join_master();
assert(master == NULL || master->applier != NULL);
@@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
if (wal_dir_lock >= 0) {
if (box_listen() != 0)
diag_raise();
- box_sync_replication(false);
+ box_update_replication();
struct replica *master;
if (replicaset_needs_rejoin(&master)) {
@@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
vclock_copy(&replicaset.vclock, &recovery->vclock);
if (box_listen() != 0)
diag_raise();
- box_sync_replication(false);
+ box_update_replication();
}
stream_guard.is_active = false;
recovery_finalize(recovery);
====================
Feel free to discard it if you don't like it. I am fine with the current
solution too.
Now that I've sent this diff, I realize box_restart_replication()
would be a better name than box_reset_replication(). Up to you as well.
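Purely for illustration, the rename would only touch the wrapper's name and
its callers; the body stays exactly as in the diff above:

static inline void
box_restart_replication(void)
{
	const bool do_quorum = true;
	const bool do_reuse = false;
	box_sync_replication(do_quorum, do_reuse);
}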
> diff --git a/test/instance_files/base_instance.lua b/test/instance_files/base_instance.lua
> index 45bdbc7e8..e579c3843 100755
> --- a/test/instance_files/base_instance.lua
> +++ b/test/instance_files/base_instance.lua
> @@ -5,7 +5,8 @@ local listen = os.getenv('TARANTOOL_LISTEN')
> box.cfg({
> work_dir = workdir,
> -- listen = 'localhost:3310'
> - listen = listen
> + listen = listen,
> + log = workdir..'/tarantool.log',
Do you really need it in this patch?
Other than that LGTM. You can send the next version to the next
reviewer. I suppose it can be Yan now.