[Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything
Serge Petrenko
sergepetrenko at tarantool.org
Fri Dec 25 00:02:24 MSK 2020
24.12.2020 20:35, Vladislav Shpilevoy пишет:
> I've force pushed this diff:
>
> ====================
> @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft)
> uint32_t errcode = 0;
> do {
> rc = box_clear_synchro_queue(false);
> - if (rc) {
> + if (rc != 0) {
> struct error *err = diag_last_error(diag_get());
> errcode = box_error_code(err);
> diag_log();
> ====================
>
> The patchset looks good, but the test hangs if I run it a lot of times
> in parallel. I tried 132 times and after some number of runs all the
> workers hung.
>
> Couldn't find the reason right away.
I could reproduce the issue like this:
`./test-run.py $(yes
replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head
-n 512) -j 32`
Looks like I've found the issue. Sometimes the replica doesn't
receive is_leader notification from the new leader. So it ignores
everything that the leader sends it and never sends out acks.
I suppose it happens when the instance is just started and subscribes
to the candidate. You see, box_process_subscribe sends out
raft state unconditionally, and in our case it sends out 2's
vote request to 3. 3 responds immediately, and 2 becomes
leader, but relay isn't started yet, or is started but didn't have
time to set is_raft_enabled to true, so 3 never gets 2's is_leader
notification.
In other words it's a race between 2 becoming a leader and trying
to broadcast it's new state and 2's relay becoming ready to handle
raft broadcasts.
I couldn't find a way to ameliorate this in the test.
Can we push it as is then?
Two tiny fixes force pushed (not related to the test hang):
===================================================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8fe3930db..c80430afc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -36,7 +36,7 @@
"gh-4730-applier-rollback.test.lua": {},
"gh-4928-tx-boundaries.test.lua": {},
"gh-5440-qsync-ro.test.lua": {},
- "gh-5435-clear-synchro-queue-commit-all.test.lua": {},
+ "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
"*": {
"memtx": {"engine": "memtx"},
"vinyl": {"engine": "vinyl"}
diff --git a/src/box/box.cc b/src/box/box.cc
index 28146a747..e1d8305c8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t
target_lsn, int quorum,
}
if (ack_count < quorum) {
diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf(
- "timeout after %.2lf seconds, collected %d acks
with "
- "%d quorum", timeout, ack_count, quorum));
+ "timeout after %.2lf seconds, collected %d acks",
+ timeout, ack_count, quorum));
return -1;
}
return 0;
--
Serge Petrenko
More information about the Tarantool-patches
mailing list