From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp17.mail.ru (smtp17.mail.ru [94.100.176.154])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id D012C4765E0
	for ; Fri, 25 Dec 2020 00:02:25 +0300 (MSK)
References: <3136023eb90fd3c6a10cb288466a9f3c8f9d2c01.1608724239.git.sergepetrenko@tarantool.org>
	<37c539ca-cdce-a4e9-5af9-3814b2c00131@tarantool.org>
	<62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org>
From: Serge Petrenko
Date: Fri, 25 Dec 2020 00:02:24 +0300
MIME-Version: 1.0
In-Reply-To: <62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
Subject: Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything
List-Id: Tarantool development patches
To: Vladislav Shpilevoy , gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org

24.12.2020 20:35, Vladislav Shpilevoy wrote:
> I've force pushed this diff:
>
> ====================
> @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft)
>  	uint32_t errcode = 0;
>  	do {
>  		rc = box_clear_synchro_queue(false);
> -		if (rc) {
> +		if (rc != 0) {
>  			struct error *err = diag_last_error(diag_get());
>  			errcode = box_error_code(err);
>  			diag_log();
> ====================
>
> The patchset looks good, but the test hangs if I run it a lot of times
> in parallel. I tried 132 times and after some number of runs all the
> workers hung.
>
> Couldn't find the reason right away.

I could reproduce the issue like this:

`./test-run.py $(yes replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head -n 512) -j 32`

Looks like I've found the issue: sometimes the replica doesn't receive
the is_leader notification from the new leader.
So it ignores everything the leader sends it and never sends out acks.

I suppose this happens when the instance has just started and subscribes
to the candidate. You see, box_process_subscribe sends out the raft state
unconditionally, and in our case it sends 2's vote request to 3.
3 responds immediately, and 2 becomes leader, but 2's relay isn't started
yet, or is started but hasn't had time to set is_raft_enabled to true,
so 3 never gets 2's is_leader notification.

In other words, it's a race between 2 becoming leader and broadcasting
its new state, and 2's relay becoming ready to handle raft broadcasts.

I couldn't find a way to ameliorate this in the test. Can we push it
as is then?

Two tiny fixes force pushed (not related to the test hang):

===================================================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8fe3930db..c80430afc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -36,7 +36,7 @@
     "gh-4730-applier-rollback.test.lua": {},
     "gh-4928-tx-boundaries.test.lua": {},
     "gh-5440-qsync-ro.test.lua": {},
-    "gh-5435-clear-synchro-queue-commit-all.test.lua": {},
+    "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}

diff --git a/src/box/box.cc b/src/box/box.cc
index 28146a747..e1d8305c8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 	}
 	if (ack_count < quorum) {
 		diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf(
-			"timeout after %.2lf seconds, collected %d acks with "
-			"%d quorum", timeout, ack_count, quorum));
+			"timeout after %.2lf seconds, collected %d acks",
+			timeout, ack_count, quorum));
 		return -1;
 	}
 	return 0;

-- 
Serge Petrenko