[Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything

Serge Petrenko sergepetrenko at tarantool.org
Fri Dec 25 00:02:24 MSK 2020



24.12.2020 20:35, Vladislav Shpilevoy пишет:
> I've force pushed this diff:
>
> ====================
> @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft)
>   		uint32_t errcode = 0;
>   		do {
>   			rc = box_clear_synchro_queue(false);
> -			if (rc) {
> +			if (rc != 0) {
>   				struct error *err = diag_last_error(diag_get());
>   				errcode = box_error_code(err);
>   				diag_log();
> ====================
>
> The patchset looks good, but the test hangs if I run it a lot of times
> in parallel. I tried 132 times and after some number of runs all the
> workers hung.
>
> Couldn't find the reason right away.

I could reproduce the issue like this:
`./test-run.py $(yes 
replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head 
-n 512) -j 32`

Looks like I've found the issue. Sometimes the replica doesn't
receive is_leader notification from the new leader. So it ignores
everything that the leader sends it and never sends out acks.
I suppose it happens when the instance is just started and subscribes
to the candidate. You see, box_process_subscribe sends out
raft state unconditionally, and in our case it sends out 2's
vote request to 3. 3 responds immediately, and 2 becomes
leader, but relay isn't started yet, or is started but didn't have
time to set is_raft_enabled to true, so 3 never gets 2's is_leader
notification.

In other words it's a race between 2 becoming a leader and trying
to broadcast it's new state and 2's relay becoming ready to handle
raft broadcasts.

I couldn't find a way to ameliorate this in the test.
Can we push it as is then?

Two tiny fixes force pushed (not related to the test hang):
===================================================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8fe3930db..c80430afc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -36,7 +36,7 @@
      "gh-4730-applier-rollback.test.lua": {},
      "gh-4928-tx-boundaries.test.lua": {},
      "gh-5440-qsync-ro.test.lua": {},
-    "gh-5435-clear-synchro-queue-commit-all.test.lua": {},
+    "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
      "*": {
          "memtx": {"engine": "memtx"},
          "vinyl": {"engine": "vinyl"}

diff --git a/src/box/box.cc b/src/box/box.cc
index 28146a747..e1d8305c8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t 
target_lsn, int quorum,
         }
         if (ack_count < quorum) {
                 diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf(
-                        "timeout after %.2lf seconds, collected %d acks 
with "
-                        "%d quorum", timeout, ack_count, quorum));
+                        "timeout after %.2lf seconds, collected %d acks",
+                        timeout, ack_count, quorum));
                 return -1;
         }
         return 0;

-- 
Serge Petrenko



More information about the Tarantool-patches mailing list