From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp17.mail.ru (smtp17.mail.ru [94.100.176.154])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id D012C4765E0
	for ; Fri, 25 Dec 2020 00:02:25 +0300 (MSK)
References: <3136023eb90fd3c6a10cb288466a9f3c8f9d2c01.1608724239.git.sergepetrenko@tarantool.org>
	<37c539ca-cdce-a4e9-5af9-3814b2c00131@tarantool.org>
	<62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org>
From: Serge Petrenko
Date: Fri, 25 Dec 2020 00:02:24 +0300
MIME-Version: 1.0
In-Reply-To: <62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
Subject: Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything
List-Id: Tarantool development patches
To: Vladislav Shpilevoy , gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org

24.12.2020 20:35, Vladislav Shpilevoy wrote:
> I've force pushed this diff:
>
> ====================
> @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft)
>  	uint32_t errcode = 0;
>  	do {
>  		rc = box_clear_synchro_queue(false);
> -		if (rc) {
> +		if (rc != 0) {
>  			struct error *err = diag_last_error(diag_get());
>  			errcode = box_error_code(err);
>  			diag_log();
> ====================
>
> The patchset looks good, but the test hangs if I run it a lot of times
> in parallel. I tried 132 times and after some number of runs all the
> workers hung.
>
> Couldn't find the reason right away.

I could reproduce the issue like this:

`./test-run.py $(yes replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head -n 512) -j 32`

Looks like I've found the issue: sometimes the replica doesn't receive
the is_leader notification from the new leader.
So it ignores everything the leader sends it and never sends out acks.

I suppose this happens when the instance has just started and subscribes
to the candidate. You see, box_process_subscribe sends out the raft state
unconditionally, and in our case it sends 2's vote request to 3.
3 responds immediately, and 2 becomes leader, but 2's relay isn't started
yet, or is started but hasn't had time to set is_raft_enabled to true,
so 3 never gets 2's is_leader notification.

In other words, it's a race between 2 becoming leader and broadcasting
its new state, and 2's relay becoming ready to handle raft broadcasts.

I couldn't find a way to ameliorate this in the test. Can we push it
as is then?

Two tiny fixes force pushed (not related to the test hang):

===================================================
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8fe3930db..c80430afc 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -36,7 +36,7 @@
     "gh-4730-applier-rollback.test.lua": {},
     "gh-4928-tx-boundaries.test.lua": {},
     "gh-5440-qsync-ro.test.lua": {},
-    "gh-5435-clear-synchro-queue-commit-all.test.lua": {},
+    "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}

diff --git a/src/box/box.cc b/src/box/box.cc
index 28146a747..e1d8305c8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 	}
 	if (ack_count < quorum) {
 		diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf(
-			"timeout after %.2lf seconds, collected %d acks with "
-			"%d quorum", timeout, ack_count, quorum));
+			"timeout after %.2lf seconds, collected %d acks",
+			timeout, ack_count, quorum));
 		return -1;
 	}
 	return 0;

-- 
Serge Petrenko