Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko <sergepetrenko@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything
Date: Fri, 25 Dec 2020 00:02:24 +0300	[thread overview]
Message-ID: <fc9f382e-fe35-b1b1-6c99-0ec8e96fd030@tarantool.org> (raw)
In-Reply-To: <62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org>



24.12.2020 20:35, Vladislav Shpilevoy пишет:
> I've force pushed this diff:
>
> ====================
> @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft)
>   		uint32_t errcode = 0;
>   		do {
>   			rc = box_clear_synchro_queue(false);
> -			if (rc) {
> +			if (rc != 0) {
>   				struct error *err = diag_last_error(diag_get());
>   				errcode = box_error_code(err);
>   				diag_log();
> ====================
>
> The patchset looks good, but the test hangs if I run it a lot of times
> in parallel. I tried 132 times and after some number of runs all the
> workers hung.
>
> Couldn't find the reason right away.

I could reproduce the issue like this:
`./test-run.py $(yes 
replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head 
-n 512) -j 32`

Looks like I've found the issue. Sometimes the replica doesn't
receive is_leader notification from the new leader. So it ignores
everything that the leader sends it and never sends out acks.
I suppose it happens when the instance is just started and subscribes
to the candidate. You see, box_process_subscribe sends out
raft state unconditionally, and in our case it sends out 2's
vote request to 3. 3 responds immediately, and 2 becomes
leader, but relay isn't started yet, or is started but didn't have
time to set is_raft_enabled to true, so 3 never gets 2's is_leader
notification.

In other words it's a race between 2 becoming a leader and trying
to broadcast it's new state and 2's relay becoming ready to handle
raft broadcasts.

I couldn't find a way to ameliorate this in the test.
Can we push it as is then?

Two tiny fixes force pushed (not related to the test hang):
===================================================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8fe3930db..c80430afc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -36,7 +36,7 @@
      "gh-4730-applier-rollback.test.lua": {},
      "gh-4928-tx-boundaries.test.lua": {},
      "gh-5440-qsync-ro.test.lua": {},
-    "gh-5435-clear-synchro-queue-commit-all.test.lua": {},
+    "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
      "*": {
          "memtx": {"engine": "memtx"},
          "vinyl": {"engine": "vinyl"}

diff --git a/src/box/box.cc b/src/box/box.cc
index 28146a747..e1d8305c8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t 
target_lsn, int quorum,
         }
         if (ack_count < quorum) {
                 diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf(
-                        "timeout after %.2lf seconds, collected %d acks 
with "
-                        "%d quorum", timeout, ack_count, quorum));
+                        "timeout after %.2lf seconds, collected %d acks",
+                        timeout, ack_count, quorum));
                 return -1;
         }
         return 0;

-- 
Serge Petrenko

  reply	other threads:[~2020-12-24 21:02 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-23 11:59 [Tarantool-patches] [PATCH v2 0/6] make clear_synchro_queue " Serge Petrenko
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 1/6] box: add a single execution guard to clear_synchro_queue Serge Petrenko
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 2/6] relay: introduce on_status_update trigger Serge Petrenko
2020-12-23 17:25   ` Vladislav Shpilevoy
2020-12-24 16:11     ` Serge Petrenko
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 3/6] txn_limbo: introduce txn_limbo_last_synchro_entry method Serge Petrenko
2020-12-23 17:25   ` Vladislav Shpilevoy
2020-12-24 16:13     ` Serge Petrenko
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything Serge Petrenko
2020-12-23 17:28   ` Vladislav Shpilevoy
2020-12-24 16:12     ` Serge Petrenko
2020-12-24 17:35       ` Vladislav Shpilevoy
2020-12-24 21:02         ` Serge Petrenko [this message]
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 5/6] test: fix replication/election_qsync_stress test Serge Petrenko
2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 6/6] txn_limbo: ignore CONFIRM/ROLLBACK for a foreign master Serge Petrenko
2020-12-23 17:28   ` Vladislav Shpilevoy
2020-12-24 16:13     ` Serge Petrenko
2020-12-25 10:04 ` [Tarantool-patches] [PATCH v2 0/6] make clear_synchro_queue commit everything Kirill Yukhin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fc9f382e-fe35-b1b1-6c99-0ec8e96fd030@tarantool.org \
    --to=sergepetrenko@tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox