From: Serge Petrenko <sergepetrenko@tarantool.org> To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com Cc: tarantool-patches@dev.tarantool.org Subject: Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything Date: Fri, 25 Dec 2020 00:02:24 +0300 [thread overview] Message-ID: <fc9f382e-fe35-b1b1-6c99-0ec8e96fd030@tarantool.org> (raw) In-Reply-To: <62b8a48a-f627-1155-246f-1493ea4e9459@tarantool.org> 24.12.2020 20:35, Vladislav Shpilevoy пишет: > I've force pushed this diff: > > ==================== > @@ -98,7 +98,7 @@ box_raft_update_synchro_queue(struct raft *raft) > uint32_t errcode = 0; > do { > rc = box_clear_synchro_queue(false); > - if (rc) { > + if (rc != 0) { > struct error *err = diag_last_error(diag_get()); > errcode = box_error_code(err); > diag_log(); > ==================== > > The patchset looks good, but the test hangs if I run it a lot of times > in parallel. I tried 132 times and after some number of runs all the > workers hung. > > Couldn't find the reason right away. I could reproduce the issue like this: `./test-run.py $(yes replication/gh-5435-qsync-clear-synchro-queue-commit-all.test.lua | head -n 512) -j 32` Looks like I've found the issue. Sometimes the replica doesn't receive is_leader notification from the new leader. So it ignores everything that the leader sends it and never sends out acks. I suppose it happens when the instance is just started and subscribes to the candidate. You see, box_process_subscribe sends out raft state unconditionally, and in our case it sends out 2's vote request to 3. 3 responds immediately, and 2 becomes leader, but relay isn't started yet, or is started but didn't have time to set is_raft_enabled to true, so 3 never gets 2's is_leader notification. In other words it's a race between 2 becoming a leader and trying to broadcast it's new state and 2's relay becoming ready to handle raft broadcasts. I couldn't find a way to ameliorate this in the test. Can we push it as is then? Two tiny fixes force pushed (not related to the test hang): =================================================== diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 8fe3930db..c80430afc 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -36,7 +36,7 @@ "gh-4730-applier-rollback.test.lua": {}, "gh-4928-tx-boundaries.test.lua": {}, "gh-5440-qsync-ro.test.lua": {}, - "gh-5435-clear-synchro-queue-commit-all.test.lua": {}, + "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {}, "*": { "memtx": {"engine": "memtx"}, "vinyl": {"engine": "vinyl"} diff --git a/src/box/box.cc b/src/box/box.cc index 28146a747..e1d8305c8 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1228,8 +1228,8 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum, } if (ack_count < quorum) { diag_set(ClientError, ER_QUORUM_WAIT, quorum, tt_sprintf( - "timeout after %.2lf seconds, collected %d acks with " - "%d quorum", timeout, ack_count, quorum)); + "timeout after %.2lf seconds, collected %d acks", + timeout, ack_count, quorum)); return -1; } return 0; -- Serge Petrenko
next prev parent reply other threads:[~2020-12-24 21:02 UTC|newest] Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-12-23 11:59 [Tarantool-patches] [PATCH v2 0/6] make clear_synchro_queue " Serge Petrenko 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 1/6] box: add a single execution guard to clear_synchro_queue Serge Petrenko 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 2/6] relay: introduce on_status_update trigger Serge Petrenko 2020-12-23 17:25 ` Vladislav Shpilevoy 2020-12-24 16:11 ` Serge Petrenko 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 3/6] txn_limbo: introduce txn_limbo_last_synchro_entry method Serge Petrenko 2020-12-23 17:25 ` Vladislav Shpilevoy 2020-12-24 16:13 ` Serge Petrenko 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything Serge Petrenko 2020-12-23 17:28 ` Vladislav Shpilevoy 2020-12-24 16:12 ` Serge Petrenko 2020-12-24 17:35 ` Vladislav Shpilevoy 2020-12-24 21:02 ` Serge Petrenko [this message] 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 5/6] test: fix replication/election_qsync_stress test Serge Petrenko 2020-12-23 11:59 ` [Tarantool-patches] [PATCH v2 6/6] txn_limbo: ignore CONFIRM/ROLLBACK for a foreign master Serge Petrenko 2020-12-23 17:28 ` Vladislav Shpilevoy 2020-12-24 16:13 ` Serge Petrenko 2020-12-25 10:04 ` [Tarantool-patches] [PATCH v2 0/6] make clear_synchro_queue commit everything Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=fc9f382e-fe35-b1b1-6c99-0ec8e96fd030@tarantool.org \ --to=sergepetrenko@tarantool.org \ --cc=gorcunov@gmail.com \ --cc=tarantool-patches@dev.tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v2 4/6] box: rework clear_synchro_queue to commit everything' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox