From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko
To: Vladislav Shpilevoy, Cyrill Gorcunov
Cc: tarantool-patches@dev.tarantool.org
Date: Fri, 9 Apr 2021 11:25:18 +0300
Subject: Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
References: <00c39bc5e3090383a5f0e732b5a1f32130191cf1.1617835503.git.v.shpilevoy@tarantool.org> <71ee3907-a8fd-d916-fc6a-3205e66f2d29@tarantool.org> <72609da5-e79f-dd29-c69e-77090ea06df2@tarantool.org>

09.04.2021 01:57, Vladislav Shpilevoy writes:
> On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
>> 08.04.2021 13:19, Cyrill Gorcunov writes:
>>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>>> Thanks for the patch!
>>>>
>>>> I'm a bit worried about two different synchro rows coming from two appliers. Is everything ok in this case?
>>> Serge, you mean the scenario when some instances in the replicaset have the patch applied and some do not?
>> No. Let's suppose everyone has this patch applied. Now look at one particular instance. It may happen that while one of its appliers is writing this synchro row (either CONFIRM or ROLLBACK, doesn't matter), some other applier may still apply requests coming from other replicaset members.
>>
>> I was wondering what would happen if someone else sent this instance another synchro row. Looks like nothing bad, but I just wanted to double-check.
>>
>> And it looks like there's a bug, which I'm speaking of below. It's about someone sending us normal rows (either synchronous or asynchronous transactions, but not CONFIRM/ROLLBACK entries) while we're waiting for a synchro row's write to end.
>>
>> Say, the limbo was owned by instance 1, and instance 2 has written CONFIRM for everything there was. While we wait for 2's CONFIRM to be written to WAL, we may receive some rows from instance 3, which has already applied 2's CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it, and the limbo on our instance still isn't empty. All the rows coming from 3 will get rejected and replication between 3 and us will be broken.
>
> ## Regarding the described issue
>
> This is a super good question, I am glad you noticed. I first thought it was a bug, but then I realized it is not possible.
>
> If instance 3 already wrote that CONFIRM from instance_id 1, it would send it to us too before its own rows, and it would block on the latch of the struct replica with instance_id 1 until the CONFIRM is finally written. Then the second attempt to write it would be skipped and the rows of instance_id 3 would be applied fine.
>
> I made a test which replayed the described situation and all worked fine. I added it to the patch, and I also provide it below for easy reproduction in 3 consoles.

Yes, indeed. Thanks for the explanation!

>
> ## A problem not related to the patch
>
> However, a similar issue might happen, not related to the fixed bug. Assume the limbo belongs to instance 1. Then instance 1 is gone, and instance 2 tries to write CONFIRM on behalf of instance 1 using box.ctl.clear_synchro_queue().
>
> While the CONFIRM is in progress, it goes to WAL, from WAL it goes to replication, and from there it goes to instance 3. That one applies it, and the limbo becomes empty on instance 3. It writes a new row, which is written to its WAL and is sent back to instance 2. Instance 2 still hasn't finished the original CONFIRM WAL write, so its TX thread hasn't seen it yet. So its limbo still has rows from instance 1, and the latch is not locked.
Now the replication > between instances 2 and 3 is broken. Yes, that's possible. I've also thought of a lock on CONFIRM, etc. > > I didn't try to reproduce this craziness though. However I have an idea how > we could fix it "for good". Introduce a fiber read-write lock. All the > transactions take read-lock on the limbo. All CONFIRM/ROLLBACK/PROMOTE take > a write-lock on the limbo. Then all the limbo owner transitions would be > fixed. Sounds good. Isn't it enough to only have a write lock though? > > I also thought more about how to evolve my idea about waiting for WAL write > only if we get a conflicting transaction. We need to know transaction's > origin ID before calling txn_commit/txn_commit_try_async. To make it fast we > need to calculate it while create txn_stmt objects so it is already known > right before commit. And here it is tricky, because locally-created > transactions might rollback individual statements using savepoints. That > makes the on-fly calculation of the origin_id not feasible. So far I couldn't > find what to do with that. > > ## The console test > > It reproduces your scenario and all works fine, because the latch saves > our asses. You need 3 instances in different consoles and need to do the > steps in the order as they are marked in the comments. The test's fine. Thanks for checking this! > > -- > -- Instance 1 > -- > -- Step 1 > box.cfg{ > listen = 3313, > replication_synchro_quorum = 4, > replication_synchro_timeout = 1000, > } > box.schema.user.grant('guest', 'super') > s = box.schema.create_space('test', {is_sync = true}) > _ = s:create_index('pk') > > -- Step 4 > fiber = require('fiber') > f = fiber.new(function() s:replace{1} end) > > -- Step 8 > box.cfg{replication_synchro_quorum = 3} > while f:status() ~= 'dead' do fiber.sleep(0.1) end > > > > -- > -- Instance 2 > -- > -- Step 2 > box.cfg{ > listen = 3314, > replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'}, > read_only = true, > } > > -- Step 5 > -- Wait vclock. > > -- Step 7 > box.error.injection.set("ERRINJ_WAL_DELAY", true) > > -- Step 10 > box.error.injection.set("ERRINJ_WAL_DELAY", false) > > > > -- > -- Instance 3 > -- > -- Step 3 > box.cfg{ > listen = 3315, > replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'}, > read_only = true, > } > > -- Step 6 > -- Wait vclock. > > -- Step 9 > box.cfg{read_only = false} > s = box.space.test > box.cfg{ > replication_synchro_quorum = 2, > replication_synchro_timeout = 1000, > } > s:replace{2} > > ## The patch > > Here is a full diff of the patch. It includes the test above as a > new .test.lua file, and all review fixes. > > ==================== > > applier: process synchro rows after WAL write > > Applier used to process synchronous rows CONFIRM and ROLLBACK > right after receipt before they are written to WAL. > > That led to a bug that the confirmed data became visible, might be > accessed by user requests, then the node restarted before CONFIRM > finished its WAL write, and the data was not visible again. That > is just like if it would be rolled back, which is not acceptable. > > Another case - CONFIRM WAL write could simply fail due to any > reason (no disk space, OOM), but the transactions would remain > confirmed anyway. > > Also that produced some hacks in the limbo's code to support the > confirmation and rollback of transactions not yet written to WAL. > > The patch makes the synchro rows processed only after they are > written to WAL. 
Although the 'rollback' case above might still > happen if the xlogs were in the kernel caches, and the machine was > powered off before they were flushed to disk. But that is not > related to qsync specifically. > > To handle the synchro rows after WAL write the patch makes them go > to WAL in a blocking way (journal_write() instead of > journal_write_try_async()). Otherwise it could happen that a > CONFIRM/ROLLBACK is being written to WAL and would clear the limbo > afterwards, but a new transaction arrives with a different owner, > and it conflicts with the current limbo owner. > > Closes #5213 > > diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md > new file mode 100644 > index 000000000..10b9ade80 > --- /dev/null > +++ b/changelogs/unreleased/applier-qsync-write-order.md > @@ -0,0 +1,6 @@ > +## bugfix/replication > + > +* Fix the bug when a synchronous transaction could be confirmed and visible on > + a replica, but then not confirmed / invisible again after restart. Could > + happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled > + (gh-5213). > diff --git a/src/box/applier.cc b/src/box/applier.cc > index 971b2e64c..4898f9f7b 100644 > --- a/src/box/applier.cc > +++ b/src/box/applier.cc > @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event) > } > > struct synchro_entry { > - /** Encoded form of a synchro record. */ > - struct synchro_body_bin body_bin; > - > - /** xrow to write, used by the journal engine. */ > - struct xrow_header row; > - > + /** Request to process when WAL write is done. */ > + struct synchro_request *req; > + /** Fiber created the entry. To wakeup when WAL write is done. */ > + struct fiber *owner; > /** > - * The journal entry itself. Note since > - * it has unsized array it must be the > - * last entry in the structure. > + * The base journal entry. It has unsized array and then must be the > + * last entry in the structure. But can workaround it via a union > + * adding the needed tail as char[]. > */ > - struct journal_entry journal_entry; > + union { > + struct journal_entry base; > + char base_buf[sizeof(base) + sizeof(base.rows[0])]; > + }; > }; > > -static void > -synchro_entry_delete(struct synchro_entry *entry) > -{ > - free(entry); > -} > - > /** > * Async write journal completion. > */ > @@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry) > assert(entry->complete_data != NULL); > struct synchro_entry *synchro_entry = > (struct synchro_entry *)entry->complete_data; > - if (entry->res < 0) > + if (entry->res < 0) { > applier_rollback_by_wal_io(); > - else > + } else { > + txn_limbo_process(&txn_limbo, synchro_entry->req); > trigger_run(&replicaset.applier.on_wal_write, NULL); > - > - synchro_entry_delete(synchro_entry); > -} > - > -/** > - * Allocate a new synchro_entry to be passed to > - * the journal engine in async write way. > - */ > -static struct synchro_entry * > -synchro_entry_new(struct xrow_header *applier_row, > - struct synchro_request *req) > -{ > - struct synchro_entry *entry; > - size_t size = sizeof(*entry) + sizeof(struct xrow_header *); > - > - /* > - * For simplicity we use malloc here but > - * probably should provide some cache similar > - * to txn cache. 
> - */ > - entry = (struct synchro_entry *)malloc(size); > - if (entry == NULL) { > - diag_set(OutOfMemory, size, "malloc", "synchro_entry"); > - return NULL; > } > - > - struct journal_entry *journal_entry = &entry->journal_entry; > - struct synchro_body_bin *body_bin = &entry->body_bin; > - struct xrow_header *row = &entry->row; > - > - journal_entry->rows[0] = row; > - > - xrow_encode_synchro(row, body_bin, req); > - > - row->lsn = applier_row->lsn; > - row->replica_id = applier_row->replica_id; > - > - journal_entry_create(journal_entry, 1, xrow_approx_len(row), > - apply_synchro_row_cb, entry); > - return entry; > + /* The fiber is the same on final join. */ > + if (synchro_entry->owner != fiber()) > + fiber_wakeup(synchro_entry->owner); > } > > /** Process a synchro request. */ > @@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row) > if (xrow_decode_synchro(row, &req) != 0) > goto err; > > - txn_limbo_process(&txn_limbo, &req); > - > - struct synchro_entry *entry; > - entry = synchro_entry_new(row, &req); > - if (entry == NULL) > - goto err; > - > - if (journal_write_try_async(&entry->journal_entry) != 0) { > + struct synchro_entry entry; > + /* > + * Rows array is cast from *[] to **, because otherwise g++ complains > + * about out of array bounds access. > + */ > + struct xrow_header **rows; > + rows = entry.base.rows; > + rows[0] = row; > ==================== > This is how I fixed the compilation issue. I couldn't reproduce it > locally in my GCC installed via brew, but it was failing in CI in > one job. A simple cast didn't help. Only creation of a new variable > did help. Also it is not assigned right away - the declaration is > separate. Otherwise 'goto's before that line do not work. > ==================== > + journal_entry_create(&entry.base, 1, xrow_approx_len(row), > + apply_synchro_row_cb, &entry); > + entry.req = &req; > + entry.owner = fiber(); > + /* > + * The WAL write is blocking. Otherwise it might happen that a CONFIRM > + * or ROLLBACK is sent to WAL, and it would empty the limbo, but before > + * it is written, more transactions arrive with a different owner. They > + * won't be able to enter the limbo due to owner ID mismatch. Hence the > + * synchro rows must block receipt of new transactions. > + * > + * Don't forget to return -1 both if the journal write failed right > + * away, and if it failed inside of WAL thread (res < 0). Otherwise the > + * caller would propagate committed vclock to this row thinking it was > + * a success. > + * > + * XXX: in theory it could be done vice-versa. The write could be made > + * non-blocking, and instead the potentially conflicting transactions > + * could try to wait for all the current synchro WAL writes to end > + * before trying to commit. But that requires extra steps from the > + * transactions side, including the async ones. > + */ > + if (journal_write(&entry.base) != 0 || entry.base.res < 0) { > diag_set(ClientError, ER_WAL_IO); > goto err; > } > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index cf0ad9350..c63211191 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) > txn_limbo_remove(limbo, e); > txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK); > /* > - * If already written to WAL by now, finish tx processing. > - * Otherwise just clear the sync flags. Tx procesing will finish > - * automatically once the tx is written to WAL. 
> - * > - * XXX: Normally at this point all transactions covered by this > - * CONFIRM should be in WAL already, but there is a bug, that > - * replica always processes received synchro requests *before* > - * writing them to WAL. So it can happen, that a CONFIRM is > - * 'read', but the transaction is not written yet. Should be > - * fixed when the replica will behave properly, and then this > - * branch won't exist. > + * Should be written to WAL by now. Confirm is always written > + * after the affected transactions. > */ > - if (e->txn->signature >= 0) > - txn_complete_success(e->txn); > + assert(e->txn->signature >= 0); > + txn_complete_success(e->txn); > } > /* Update is_ro once the limbo is clear. */ > if (txn_limbo_is_empty(limbo)) > @@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn) > rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) { > txn_limbo_abort(limbo, e); > txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK); > - if (e->txn->signature >= 0) { > - /* Rollback the transaction. */ > - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > - txn_complete_fail(e->txn); > - } else { > - /* > - * Rollback the transaction, but don't free it yet. It > - * will be freed after its WAL write is completed. > - * > - * XXX: Normally at this point all transactions covered > - * by this ROLLBACK should be in WAL already, but there > - * is a bug, that replica always processes received > - * synchro requests *before* writing them to WAL. So it > - * can happen, that a ROLLBACK is 'read', but the > - * transaction is not written yet. Should be fixed when > - * the replica will behave properly, and then this > - * branch won't exist. > - */ > - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > - struct fiber *fiber = e->txn->fiber; > - e->txn->fiber = fiber(); > - txn_complete_fail(e->txn); > - e->txn->fiber = fiber; > - } > + /* > + * Should be written to WAL by now. Rollback is always written > + * after the affected transactions. > + */ > + assert(e->txn->signature >= 0); > + e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > + txn_complete_fail(e->txn); > if (e == last_rollback) > break; > } > diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result > new file mode 100644 > index 000000000..bcb18b5c0 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order-3.result > @@ -0,0 +1,263 @@ > +-- test-run result file version 2 > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +-- To fix that there was a patch making synchro rows processing after WAL write. > +-- As a result, the following situation could happen. Instance 1 owns the limbo, > +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1 > +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write > +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro > +-- transaction, owns the local limbo, and sends the transaction to the instance > +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that > +-- the instance 3's new transaction is rejected. 
Because firstly instance 3 will > +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier > +-- in WAL of instance 3. Then on instance 2 it will block on a latch with > +-- replica_id 1 until the original CONFIRM received from the instance 1 is > +-- finished. Afterwards the instance 3's transaction is applied just fine - the > +-- limbo on instance 2 is empty now. > +-- > +-- It is not related to read-views, but could break the replication. > +-- > +test_run = require('test_run').new() > + | --- > + | ... > +fiber = require('fiber') > + | --- > + | ... > +old_synchro_quorum = box.cfg.replication_synchro_quorum > + | --- > + | ... > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + | --- > + | ... > + > +box.schema.user.grant('guest', 'super') > + | --- > + | ... > + > +s = box.schema.space.create('test', {is_sync = true}) > + | --- > + | ... > +_ = s:create_index('pk') > + | --- > + | ... > + > +test_run:cmd('create server replica1 with rpl_master=default,\ > + script="replication/replica1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica1') > + | --- > + | - true > + | ... > + > +test_run:cmd('create server replica2 with rpl_master=default,\ > + script="replication/replica2.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica2') > + | --- > + | - true > + | ... > + > +-- Build mutual replication between replica1 and replica2. > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +replication = box.cfg.replication > + | --- > + | ... > +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1]) > + | --- > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > + > +test_run:switch('replica2') > + | --- > + | - true > + | ... > +replication = box.cfg.replication > + | --- > + | ... > +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1]) > + | --- > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +fiber = require('fiber') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 4, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +-- Send a transaction to all 3 nodes. The limbo is owned by the default node > +-- everywhere. > +f = fiber.new(function() s:replace{1} end) > + | --- > + | ... > +test_run:wait_lsn('replica1', 'default') > + | --- > + | ... > +test_run:wait_lsn('replica2', 'default') > + | --- > + | ... > + > +-- Make so the replica1 will apply CONFIRM from the default instance for a long > +-- time. > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + | --- > + | - ok > + | ... > + > +-- Emit the CONFIRM. > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 3} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +-- It hangs on the replica1. > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + | --- > + | - true > + | ... > + > +-- But is applied on the replica2. The limbo is empty here now. > +test_run:switch('replica2') > + | --- > + | - true > + | ... 
> +test_run:wait_lsn('replica2', 'default') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 2, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +-- Replica2 takes the limbo ownership and sends the transaction to the replica1. > +-- Along with the CONFIRM from the default node, which is still not applied > +-- on the replica1. > +fiber = require('fiber') > + | --- > + | ... > +f = fiber.new(function() box.space.test:replace{2} end) > + | --- > + | ... > + > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +fiber = require('fiber') > + | --- > + | ... > +-- WAL write of the CONFIRM from the default node still is not done. Give it > +-- some time to get the new rows from the replica2 and block on the latch. > +-- Can't catch it anyhow via conds, so the only way is to sleep a bit. > +fiber.sleep(0.1) > + | --- > + | ... > +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied > +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is > +-- applied and persisted. > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +test_run:wait_lsn('replica1', 'replica2') > + | --- > + | ... > +box.space.test:get({2}) > + | --- > + | - [2] > + | ... > + > +-- Ensure the replication works fine, nothing is broken. > +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'}) > + | --- > + | - true > + | ... > + > +test_run:switch('replica2') > + | --- > + | - true > + | ... > +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'}) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica2') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica2') > + | --- > + | - true > + | ... > +s:drop() > + | --- > + | ... > +box.schema.user.revoke('guest', 'super') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > + | --- > + | ... > diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua > new file mode 100644 > index 000000000..37b569da7 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua > @@ -0,0 +1,125 @@ > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +-- To fix that there was a patch making synchro rows processing after WAL write. > +-- As a result, the following situation could happen. Instance 1 owns the limbo, > +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1 > +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write > +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro > +-- transaction, owns the local limbo, and sends the transaction to the instance > +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that > +-- the instance 3's new transaction is rejected. 
Because firstly instance 3 will > +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier > +-- in WAL of instance 3. Then on instance 2 it will block on a latch with > +-- replica_id 1 until the original CONFIRM received from the instance 1 is > +-- finished. Afterwards the instance 3's transaction is applied just fine - the > +-- limbo on instance 2 is empty now. > +-- > +-- It is not related to read-views, but could break the replication. > +-- > +test_run = require('test_run').new() > +fiber = require('fiber') > +old_synchro_quorum = box.cfg.replication_synchro_quorum > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + > +box.schema.user.grant('guest', 'super') > + > +s = box.schema.space.create('test', {is_sync = true}) > +_ = s:create_index('pk') > + > +test_run:cmd('create server replica1 with rpl_master=default,\ > + script="replication/replica1.lua"') > +test_run:cmd('start server replica1') > + > +test_run:cmd('create server replica2 with rpl_master=default,\ > + script="replication/replica2.lua"') > +test_run:cmd('start server replica2') > + > +-- Build mutual replication between replica1 and replica2. > +test_run:switch('replica1') > +replication = box.cfg.replication > +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1]) > +box.cfg{replication = {}} > +box.cfg{replication = replication} > + > +test_run:switch('replica2') > +replication = box.cfg.replication > +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1]) > +box.cfg{replication = {}} > +box.cfg{replication = replication} > + > +test_run:switch('default') > +fiber = require('fiber') > +box.cfg{ \ > + replication_synchro_quorum = 4, \ > + replication_synchro_timeout = 1000, \ > +} > +-- Send a transaction to all 3 nodes. The limbo is owned by the default node > +-- everywhere. > +f = fiber.new(function() s:replace{1} end) > +test_run:wait_lsn('replica1', 'default') > +test_run:wait_lsn('replica2', 'default') > + > +-- Make so the replica1 will apply CONFIRM from the default instance for a long > +-- time. > +test_run:switch('replica1') > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + > +-- Emit the CONFIRM. > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 3} > +test_run:wait_cond(function() return f:status() == 'dead' end) > + > +-- It hangs on the replica1. > +test_run:switch('replica1') > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + > +-- But is applied on the replica2. The limbo is empty here now. > +test_run:switch('replica2') > +test_run:wait_lsn('replica2', 'default') > +box.cfg{ \ > + replication_synchro_quorum = 2, \ > + replication_synchro_timeout = 1000, \ > +} > +-- Replica2 takes the limbo ownership and sends the transaction to the replica1. > +-- Along with the CONFIRM from the default node, which is still not applied > +-- on the replica1. > +fiber = require('fiber') > +f = fiber.new(function() box.space.test:replace{2} end) > + > +test_run:switch('replica1') > +fiber = require('fiber') > +-- WAL write of the CONFIRM from the default node still is not done. Give it > +-- some time to get the new rows from the replica2 and block on the latch. > +-- Can't catch it anyhow via conds, so the only way is to sleep a bit. > +fiber.sleep(0.1) > +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied > +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is > +-- applied and persisted. 
> +box.error.injection.set('ERRINJ_WAL_DELAY', false) > +test_run:wait_lsn('replica1', 'replica2') > +box.space.test:get({2}) > + > +-- Ensure the replication works fine, nothing is broken. > +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'}) > + > +test_run:switch('replica2') > +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'}) > + > +test_run:switch('default') > +test_run:cmd('stop server replica1') > +test_run:cmd('delete server replica1') > +test_run:cmd('stop server replica2') > +test_run:cmd('delete server replica2') > +s:drop() > +box.schema.user.revoke('guest', 'super') > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result > new file mode 100644 > index 000000000..a8c24c289 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order.result > @@ -0,0 +1,302 @@ > +-- test-run result file version 2 > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +test_run = require('test_run').new() > + | --- > + | ... > +fiber = require('fiber') > + | --- > + | ... > +old_synchro_quorum = box.cfg.replication_synchro_quorum > + | --- > + | ... > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + | --- > + | ... > + > +box.schema.user.grant('guest', 'super') > + | --- > + | ... > + > +s = box.schema.space.create('test', {is_sync = true}) > + | --- > + | ... > +_ = s:create_index('pk') > + | --- > + | ... > + > +test_run:cmd('create server replica with rpl_master=default,\ > + script="replication/gh-5213-replica.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica') > + | --- > + | - true > + | ... > + > +fiber = require('fiber') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 3, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +f = fiber.new(function() s:replace{1} end) > + | --- > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +-- Wait when the transaction is written to WAL. > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +s = box.space.test > + | --- > + | ... > +-- But not visible yet. Because MVCC is on, and no CONFIRM yet. > +assert(s:get({1}) == nil) > + | --- > + | - true > + | ... > +-- Block the incoming CONFIRM to be able to ensure the data is not visible until > +-- WAL write ends. > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + | --- > + | - ok > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 2} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +-- Wait when CONFIRM is in the WAL thread. > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + | --- > + | - true > + | ... > +assert(s:get({1}) == nil) > + | --- > + | - true > + | ... 
> +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +-- After CONFIRM is in WAL, the transaction is committed and its data is > +-- visible. > +test_run:wait_cond(function() return s:get({1}) ~= nil end) > + | --- > + | - true > + | ... > + > +-- > +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL > +-- write. The data remains not confirmed until WAL write succeeds. > +-- > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 3} > + | --- > + | ... > +f = fiber.new(function() s:replace{2} end) > + | --- > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +assert(s:get({2}) == nil) > + | --- > + | - true > + | ... > +-- Make journal write fail immediately. > +box.error.injection.set('ERRINJ_WAL_IO', true) > + | --- > + | - ok > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 2} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_upstream(1, {status = 'stopped'}) > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_IO', false) > + | --- > + | - ok > + | ... > +assert(s:get({2}) == nil) > + | --- > + | - true > + | ... > +-- Re-subscribe. > +replication = box.cfg.replication > + | --- > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +test_run:wait_upstream(1, {status = 'follow'}) > + | --- > + | - true > + | ... > +assert(s:get({2}) ~= nil) > + | --- > + | - true > + | ... > + > +-- > +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL > +-- thread. The data remains not confirmed until WAL write succeeds. > +-- > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 3} > + | --- > + | ... > +f = fiber.new(function() s:replace{3} end) > + | --- > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +assert(s:get({3}) == nil) > + | --- > + | - true > + | ... > +-- Journal write start is going to succeed, but it will fail later on return > +-- from the WAL thread. > +box.error.injection.set('ERRINJ_WAL_ROTATE', true) > + | --- > + | - ok > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 2} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_upstream(1, {status = 'stopped'}) > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_ROTATE', false) > + | --- > + | - ok > + | ... > +assert(s:get({3}) == nil) > + | --- > + | - true > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +test_run:wait_upstream(1, {status = 'follow'}) > + | --- > + | - true > + | ... > +assert(s:get({3}) ~= nil) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... 
> +test_run:cmd('stop server replica') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica') > + | --- > + | - true > + | ... > + > +s:drop() > + | --- > + | ... > +box.schema.user.revoke('guest', 'super') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > + | --- > + | ... > diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua > new file mode 100644 > index 000000000..f1eccfa84 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order.test.lua > @@ -0,0 +1,122 @@ > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +test_run = require('test_run').new() > +fiber = require('fiber') > +old_synchro_quorum = box.cfg.replication_synchro_quorum > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + > +box.schema.user.grant('guest', 'super') > + > +s = box.schema.space.create('test', {is_sync = true}) > +_ = s:create_index('pk') > + > +test_run:cmd('create server replica with rpl_master=default,\ > + script="replication/gh-5213-replica.lua"') > +test_run:cmd('start server replica') > + > +fiber = require('fiber') > +box.cfg{ \ > + replication_synchro_quorum = 3, \ > + replication_synchro_timeout = 1000, \ > +} > +f = fiber.new(function() s:replace{1} end) > + > +test_run:switch('replica') > +-- Wait when the transaction is written to WAL. > +test_run:wait_lsn('replica', 'default') > +s = box.space.test > +-- But not visible yet. Because MVCC is on, and no CONFIRM yet. > +assert(s:get({1}) == nil) > +-- Block the incoming CONFIRM to be able to ensure the data is not visible until > +-- WAL write ends. > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 2} > +test_run:wait_cond(function() return f:status() == 'dead' end) > + > +test_run:switch('replica') > +-- Wait when CONFIRM is in the WAL thread. > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > +assert(s:get({1}) == nil) > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > +-- After CONFIRM is in WAL, the transaction is committed and its data is > +-- visible. > +test_run:wait_cond(function() return s:get({1}) ~= nil end) > + > +-- > +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL > +-- write. The data remains not confirmed until WAL write succeeds. > +-- > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 3} > +f = fiber.new(function() s:replace{2} end) > + > +test_run:switch('replica') > +test_run:wait_lsn('replica', 'default') > +assert(s:get({2}) == nil) > +-- Make journal write fail immediately. > +box.error.injection.set('ERRINJ_WAL_IO', true) > + > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 2} > +test_run:wait_cond(function() return f:status() == 'dead' end) > + > +test_run:switch('replica') > +test_run:wait_upstream(1, {status = 'stopped'}) > +box.error.injection.set('ERRINJ_WAL_IO', false) > +assert(s:get({2}) == nil) > +-- Re-subscribe. 
> +replication = box.cfg.replication > +box.cfg{replication = {}} > +box.cfg{replication = replication} > +test_run:wait_lsn('replica', 'default') > +test_run:wait_upstream(1, {status = 'follow'}) > +assert(s:get({2}) ~= nil) > + > +-- > +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL > +-- thread. The data remains not confirmed until WAL write succeeds. > +-- > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 3} > +f = fiber.new(function() s:replace{3} end) > + > +test_run:switch('replica') > +test_run:wait_lsn('replica', 'default') > +assert(s:get({3}) == nil) > +-- Journal write start is going to succeed, but it will fail later on return > +-- from the WAL thread. > +box.error.injection.set('ERRINJ_WAL_ROTATE', true) > + > +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 2} > +test_run:wait_cond(function() return f:status() == 'dead' end) > + > +test_run:switch('replica') > +test_run:wait_upstream(1, {status = 'stopped'}) > +box.error.injection.set('ERRINJ_WAL_ROTATE', false) > +assert(s:get({3}) == nil) > +box.cfg{replication = {}} > +box.cfg{replication = replication} > +test_run:wait_lsn('replica', 'default') > +test_run:wait_upstream(1, {status = 'follow'}) > +assert(s:get({3}) ~= nil) > + > +test_run:switch('default') > +test_run:cmd('stop server replica') > +test_run:cmd('delete server replica') > + > +s:drop() > +box.schema.user.revoke('guest', 'super') > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua > new file mode 100644 > index 000000000..3ba90d7ab > --- /dev/null > +++ b/test/replication/gh-5213-replica.lua > @@ -0,0 +1,10 @@ > +#!/usr/bin/env tarantool > + > +box.cfg({ > + listen = os.getenv("LISTEN"), > + replication = os.getenv("MASTER"), > + read_only = true, > + memtx_use_mvcc_engine = true, > +}) > + > +require('console').listen(os.getenv('ADMIN')) > diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg > index aff5fda26..4a9ca0a46 100644 > --- a/test/replication/suite.cfg > +++ b/test/replication/suite.cfg > @@ -15,6 +15,8 @@ > "gh-3760-misc-return-on-quorum-0.test.lua": {}, > "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {}, > "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {}, > + "gh-5213-qsync-applier-order.test.lua": {}, > + "gh-5213-qsync-applier-order-3.test.lua": {}, > "gh-5426-election-on-off.test.lua": {}, > "gh-5433-election-restart-recovery.test.lua": {}, > "gh-5506-election-on-off.test.lua": {}, > diff --git a/test/replication/suite.ini b/test/replication/suite.ini > index a9e44e8cf..1d9c0a4ae 100644 > --- a/test/replication/suite.ini > +++ b/test/replication/suite.ini > @@ -3,7 +3,7 @@ core = tarantool > script = master.lua > description = tarantool/box, replication > disabled = consistent.test.lua > -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua > +release_disabled = catch.test.lua errinj.test.lua 
gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua > config = suite.cfg > lua_libs = lua/fast_replica.lua lua/rlimit.lua > use_unix_sockets = True -- Serge Petrenko
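
P.S. To make the limbo read-write lock idea from the discussion above a bit more concrete, here is a minimal sketch in Lua on top of the public fiber.cond API. It only illustrates the read/write split between ordinary transactions and CONFIRM/ROLLBACK/PROMOTE processing; the rwlock table and the read_lock/write_lock names are made up for the example and are not the actual limbo code.

local fiber = require('fiber')

local rwlock = {}
rwlock.__index = rwlock

local function rwlock_new()
    -- readers counts transactions currently working with the limbo,
    -- writer is set while a synchro row is being written to WAL.
    return setmetatable({readers = 0, writer = false,
                         cond = fiber.cond()}, rwlock)
end

-- Ordinary transactions would take the read lock: many of them may be
-- in flight at the same time, but none while a synchro row is in progress.
function rwlock:read_lock()
    while self.writer do
        self.cond:wait()
    end
    self.readers = self.readers + 1
end

function rwlock:read_unlock()
    self.readers = self.readers - 1
    if self.readers == 0 then
        self.cond:broadcast()
    end
end

-- CONFIRM/ROLLBACK/PROMOTE would take the write lock: wait until all
-- pending transactions leave, and keep new ones out until the WAL write
-- of the synchro row is finished.
function rwlock:write_lock()
    while self.writer or self.readers > 0 do
        self.cond:wait()
    end
    self.writer = true
end

function rwlock:write_unlock()
    self.writer = false
    self.cond:broadcast()
end

-- Usage sketch:
-- local limbo_lock = rwlock_new()
-- transactions:     limbo_lock:read_lock()  ... limbo_lock:read_unlock()
-- synchro requests: limbo_lock:write_lock() ... limbo_lock:write_unlock()

The read side is what would keep ordinary transactions concurrent with each other; a single exclusive lock taken by every transaction would serialize all of them, which is presumably why both lock modes are proposed rather than a write lock alone.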