From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, Cyrill Gorcunov <gorcunov@gmail.com>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
Date: Fri, 9 Apr 2021 11:25:18 +0300
Message-ID: <bce9294c-36cc-daf9-2546-4d1fe2e2cdac@tarantool.org>
In-Reply-To: <bfdc702b-2afd-681d-4044-179c958cc6db@tarantool.org>

09.04.2021 01:57, Vladislav Shpilevoy wrote:
> On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
>> 08.04.2021 13:19, Cyrill Gorcunov wrote:
>>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>>> Thanks for the patch!
>>>>
>>>> I'm a bit worried about two different synchro rows coming from two
>>>> appliers. Is everything ok in this case?
>>> Serge, you mean the scenario when some instances in the replicaset
>>> have the patch applied and some are not?
>> No. Let's suppose everyone has this patch applied.
>> Now look at one particular instance. It may happen that while one of
>> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
>> doesn't matter), some other applier may still apply requests coming
>> from other replicaset members.
>>
>> I was wondering what would happen if someone else sent this instance
>> another synchro row. Looks like nothing bad, but I just wanted to
>> double-check.
>>
>> And it looks like there's a bug, which I'm speaking of below. It's about
>> someone sending us normal rows (either synchronous transactions or
>> asynchronous, but not CONFIRM/ROLLBACK entries) while we're waiting for
>> the synchro row's write to end.
>>
>> Say, the limbo was owned by instance 1, and instance 2 has written CONFIRM
>> for everything there was. While we wait for 2's CONFIRM to be written to
>> WAL, we may receive some rows from instance 3, which has already applied 2's
>> CONFIRM.
Since we haven't written the CONFIRM yet, we haven't applied it, >> and the limbo on our instance still isn't empty. All the rows coming from >> 3 will get rejected and replication between 3 and us will be broken. > ## Regarding the described issue > > This is a super good question, I am glad you noticed. I firstly thought > it is a bug, but then I realized it is not possible. > > If the instance 3 already wrote that CONFIRM from instance_id 1, it would > send it us too before its own rows, and it would block on the latch of the > struct replica with instance_id 1 until the CONFIRM is finally written. Then > the second attempt to write it would be skipped and the rows of instance_id 3 > would be applied fine. > > I made a test which replayed the described situation and all worked fine. I > added it to the patch, and also I provide it below for easy reproduction in > 3 consoles. Yes, indeed. Thanks for the explanation! > > ## A problem not related to the patch > > However there might happen a similar issue, not related to the fixed bug. > Assume the limbo belongs to instance 1. Then instance 1 is gone, and > instance 2 tries to write CONFIRM on behalf of instance 1 using > box.ctl.clear_synchro_queue(). > > While the CONFIRM is in progress, it goes to WAL, from WAL it goes to > replication, from there it goes to instance 3. This one applies it, the limbo > becomes empty on the instance 3. It writes a new row, which is written to > its WAL, and is sent back to instance 2. The instance 2 still didn't finish > the original CONFIRM WAL write, TX thread didn't see it yet. So its limbo still > has rows from instance 1, and the latch is not locked. Now the replication > between instances 2 and 3 is broken. Yes, that's possible. I've also thought of a lock on CONFIRM, etc. > > I didn't try to reproduce this craziness though. However I have an idea how > we could fix it "for good". Introduce a fiber read-write lock. All the > transactions take read-lock on the limbo. 
All CONFIRM/ROLLBACK/PROMOTE take
> a write-lock on the limbo. Then all the limbo owner transitions would be
> fixed.

Sounds good. Isn't it enough to only have a write lock though?

> I also thought more about how to evolve my idea about waiting for the WAL
> write only if we get a conflicting transaction. We need to know the
> transaction's origin ID before calling txn_commit/txn_commit_try_async. To
> make it fast we need to calculate it while creating txn_stmt objects, so it
> is already known right before commit. And here it is tricky, because
> locally-created transactions might roll back individual statements using
> savepoints. That makes the on-the-fly calculation of the origin_id not
> feasible. So far I couldn't find what to do with that.
>
> ## The console test
>
> It reproduces your scenario and all works fine, because the latch saves
> our asses. You need 3 instances in different consoles and need to do the
> steps in the order they are marked in the comments.

The test's fine. Thanks for checking this!

> --
> -- Instance 1
> --
> -- Step 1
> box.cfg{
>     listen = 3313,
>     replication_synchro_quorum = 4,
>     replication_synchro_timeout = 1000,
> }
> box.schema.user.grant('guest', 'super')
> s = box.schema.create_space('test', {is_sync = true})
> _ = s:create_index('pk')
>
> -- Step 4
> fiber = require('fiber')
> f = fiber.new(function() s:replace{1} end)
>
> -- Step 8
> box.cfg{replication_synchro_quorum = 3}
> while f:status() ~= 'dead' do fiber.sleep(0.1) end
>
>
>
> --
> -- Instance 2
> --
> -- Step 2
> box.cfg{
>     listen = 3314,
>     replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
>     read_only = true,
> }
>
> -- Step 5
> -- Wait vclock.
> > -- Step 7 > box.error.injection.set("ERRINJ_WAL_DELAY", true) > > -- Step 10 > box.error.injection.set("ERRINJ_WAL_DELAY", false) > > > > -- > -- Instance 3 > -- > -- Step 3 > box.cfg{ > listen = 3315, > replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'}, > read_only = true, > } > > -- Step 6 > -- Wait vclock. > > -- Step 9 > box.cfg{read_only = false} > s = box.space.test > box.cfg{ > replication_synchro_quorum = 2, > replication_synchro_timeout = 1000, > } > s:replace{2} > > ## The patch > > Here is a full diff of the patch. It includes the test above as a > new .test.lua file, and all review fixes. > > ==================== > > applier: process synchro rows after WAL write > > Applier used to process synchronous rows CONFIRM and ROLLBACK > right after receipt before they are written to WAL. > > That led to a bug that the confirmed data became visible, might be > accessed by user requests, then the node restarted before CONFIRM > finished its WAL write, and the data was not visible again. That > is just like if it would be rolled back, which is not acceptable. > > Another case - CONFIRM WAL write could simply fail due to any > reason (no disk space, OOM), but the transactions would remain > confirmed anyway. > > Also that produced some hacks in the limbo's code to support the > confirmation and rollback of transactions not yet written to WAL. > > The patch makes the synchro rows processed only after they are > written to WAL. Although the 'rollback' case above might still > happen if the xlogs were in the kernel caches, and the machine was > powered off before they were flushed to disk. But that is not > related to qsync specifically. > > To handle the synchro rows after WAL write the patch makes them go > to WAL in a blocking way (journal_write() instead of > journal_write_try_async()). 
Otherwise it could happen that a > CONFIRM/ROLLBACK is being written to WAL and would clear the limbo > afterwards, but a new transaction arrives with a different owner, > and it conflicts with the current limbo owner. > > Closes #5213 > > diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md > new file mode 100644 > index 000000000..10b9ade80 > --- /dev/null > +++ b/changelogs/unreleased/applier-qsync-write-order.md > @@ -0,0 +1,6 @@ > +## bugfix/replication > + > +* Fix the bug when a synchronous transaction could be confirmed and visible on > + a replica, but then not confirmed / invisible again after restart. Could > + happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled > + (gh-5213). > diff --git a/src/box/applier.cc b/src/box/applier.cc > index 971b2e64c..4898f9f7b 100644 > --- a/src/box/applier.cc > +++ b/src/box/applier.cc > @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event) > } > > struct synchro_entry { > - /** Encoded form of a synchro record. */ > - struct synchro_body_bin body_bin; > - > - /** xrow to write, used by the journal engine. */ > - struct xrow_header row; > - > + /** Request to process when WAL write is done. */ > + struct synchro_request *req; > + /** Fiber created the entry. To wakeup when WAL write is done. */ > + struct fiber *owner; > /** > - * The journal entry itself. Note since > - * it has unsized array it must be the > - * last entry in the structure. > + * The base journal entry. It has unsized array and then must be the > + * last entry in the structure. But can workaround it via a union > + * adding the needed tail as char[]. 
> */ > - struct journal_entry journal_entry; > + union { > + struct journal_entry base; > + char base_buf[sizeof(base) + sizeof(base.rows[0])]; > + }; > }; > > -static void > -synchro_entry_delete(struct synchro_entry *entry) > -{ > - free(entry); > -} > - > /** > * Async write journal completion. > */ > @@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry) > assert(entry->complete_data != NULL); > struct synchro_entry *synchro_entry = > (struct synchro_entry *)entry->complete_data; > - if (entry->res < 0) > + if (entry->res < 0) { > applier_rollback_by_wal_io(); > - else > + } else { > + txn_limbo_process(&txn_limbo, synchro_entry->req); > trigger_run(&replicaset.applier.on_wal_write, NULL); > - > - synchro_entry_delete(synchro_entry); > -} > - > -/** > - * Allocate a new synchro_entry to be passed to > - * the journal engine in async write way. > - */ > -static struct synchro_entry * > -synchro_entry_new(struct xrow_header *applier_row, > - struct synchro_request *req) > -{ > - struct synchro_entry *entry; > - size_t size = sizeof(*entry) + sizeof(struct xrow_header *); > - > - /* > - * For simplicity we use malloc here but > - * probably should provide some cache similar > - * to txn cache. > - */ > - entry = (struct synchro_entry *)malloc(size); > - if (entry == NULL) { > - diag_set(OutOfMemory, size, "malloc", "synchro_entry"); > - return NULL; > } > - > - struct journal_entry *journal_entry = &entry->journal_entry; > - struct synchro_body_bin *body_bin = &entry->body_bin; > - struct xrow_header *row = &entry->row; > - > - journal_entry->rows[0] = row; > - > - xrow_encode_synchro(row, body_bin, req); > - > - row->lsn = applier_row->lsn; > - row->replica_id = applier_row->replica_id; > - > - journal_entry_create(journal_entry, 1, xrow_approx_len(row), > - apply_synchro_row_cb, entry); > - return entry; > + /* The fiber is the same on final join. 
*/ > + if (synchro_entry->owner != fiber()) > + fiber_wakeup(synchro_entry->owner); > } > > /** Process a synchro request. */ > @@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row) > if (xrow_decode_synchro(row, &req) != 0) > goto err; > > - txn_limbo_process(&txn_limbo, &req); > - > - struct synchro_entry *entry; > - entry = synchro_entry_new(row, &req); > - if (entry == NULL) > - goto err; > - > - if (journal_write_try_async(&entry->journal_entry) != 0) { > + struct synchro_entry entry; > + /* > + * Rows array is cast from *[] to **, because otherwise g++ complains > + * about out of array bounds access. > + */ > + struct xrow_header **rows; > + rows = entry.base.rows; > + rows[0] = row; > ==================== > This is how I fixed the compilation issue. I couldn't reproduce it > locally in my GCC installed via brew, but it was failing in CI in > one job. A simple cast didn't help. Only creation of a new variable > did help. Also it is not assigned right away - the declaration is > separate. Otherwise 'goto's before that line do not work. > ==================== > + journal_entry_create(&entry.base, 1, xrow_approx_len(row), > + apply_synchro_row_cb, &entry); > + entry.req = &req; > + entry.owner = fiber(); > + /* > + * The WAL write is blocking. Otherwise it might happen that a CONFIRM > + * or ROLLBACK is sent to WAL, and it would empty the limbo, but before > + * it is written, more transactions arrive with a different owner. They > + * won't be able to enter the limbo due to owner ID mismatch. Hence the > + * synchro rows must block receipt of new transactions. > + * > + * Don't forget to return -1 both if the journal write failed right > + * away, and if it failed inside of WAL thread (res < 0). Otherwise the > + * caller would propagate committed vclock to this row thinking it was > + * a success. > + * > + * XXX: in theory it could be done vice-versa. 
The write could be made > + * non-blocking, and instead the potentially conflicting transactions > + * could try to wait for all the current synchro WAL writes to end > + * before trying to commit. But that requires extra steps from the > + * transactions side, including the async ones. > + */ > + if (journal_write(&entry.base) != 0 || entry.base.res < 0) { > diag_set(ClientError, ER_WAL_IO); > goto err; > } > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index cf0ad9350..c63211191 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) > txn_limbo_remove(limbo, e); > txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK); > /* > - * If already written to WAL by now, finish tx processing. > - * Otherwise just clear the sync flags. Tx procesing will finish > - * automatically once the tx is written to WAL. > - * > - * XXX: Normally at this point all transactions covered by this > - * CONFIRM should be in WAL already, but there is a bug, that > - * replica always processes received synchro requests *before* > - * writing them to WAL. So it can happen, that a CONFIRM is > - * 'read', but the transaction is not written yet. Should be > - * fixed when the replica will behave properly, and then this > - * branch won't exist. > + * Should be written to WAL by now. Confirm is always written > + * after the affected transactions. > */ > - if (e->txn->signature >= 0) > - txn_complete_success(e->txn); > + assert(e->txn->signature >= 0); > + txn_complete_success(e->txn); > } > /* Update is_ro once the limbo is clear. */ > if (txn_limbo_is_empty(limbo)) > @@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn) > rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) { > txn_limbo_abort(limbo, e); > txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK); > - if (e->txn->signature >= 0) { > - /* Rollback the transaction. 
*/ > - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > - txn_complete_fail(e->txn); > - } else { > - /* > - * Rollback the transaction, but don't free it yet. It > - * will be freed after its WAL write is completed. > - * > - * XXX: Normally at this point all transactions covered > - * by this ROLLBACK should be in WAL already, but there > - * is a bug, that replica always processes received > - * synchro requests *before* writing them to WAL. So it > - * can happen, that a ROLLBACK is 'read', but the > - * transaction is not written yet. Should be fixed when > - * the replica will behave properly, and then this > - * branch won't exist. > - */ > - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > - struct fiber *fiber = e->txn->fiber; > - e->txn->fiber = fiber(); > - txn_complete_fail(e->txn); > - e->txn->fiber = fiber; > - } > + /* > + * Should be written to WAL by now. Rollback is always written > + * after the affected transactions. > + */ > + assert(e->txn->signature >= 0); > + e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK; > + txn_complete_fail(e->txn); > if (e == last_rollback) > break; > } > diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result > new file mode 100644 > index 000000000..bcb18b5c0 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order-3.result > @@ -0,0 +1,263 @@ > +-- test-run result file version 2 > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +-- To fix that there was a patch making synchro rows processing after WAL write. > +-- As a result, the following situation could happen. 
Instance 1 owns the limbo, > +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1 > +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write > +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro > +-- transaction, owns the local limbo, and sends the transaction to the instance > +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that > +-- the instance 3's new transaction is rejected. Because firstly instance 3 will > +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier > +-- in WAL of instance 3. Then on instance 2 it will block on a latch with > +-- replica_id 1 until the original CONFIRM received from the instance 1 is > +-- finished. Afterwards the instance 3's transaction is applied just fine - the > +-- limbo on instance 2 is empty now. > +-- > +-- It is not related to read-views, but could break the replication. > +-- > +test_run = require('test_run').new() > + | --- > + | ... > +fiber = require('fiber') > + | --- > + | ... > +old_synchro_quorum = box.cfg.replication_synchro_quorum > + | --- > + | ... > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + | --- > + | ... > + > +box.schema.user.grant('guest', 'super') > + | --- > + | ... > + > +s = box.schema.space.create('test', {is_sync = true}) > + | --- > + | ... > +_ = s:create_index('pk') > + | --- > + | ... > + > +test_run:cmd('create server replica1 with rpl_master=default,\ > + script="replication/replica1.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica1') > + | --- > + | - true > + | ... > + > +test_run:cmd('create server replica2 with rpl_master=default,\ > + script="replication/replica2.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica2') > + | --- > + | - true > + | ... > + > +-- Build mutual replication between replica1 and replica2. > +test_run:switch('replica1') > + | --- > + | - true > + | ... 
> +replication = box.cfg.replication > + | --- > + | ... > +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1]) > + | --- > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > + > +test_run:switch('replica2') > + | --- > + | - true > + | ... > +replication = box.cfg.replication > + | --- > + | ... > +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1]) > + | --- > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > +box.cfg{replication = replication} > + | --- > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +fiber = require('fiber') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 4, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +-- Send a transaction to all 3 nodes. The limbo is owned by the default node > +-- everywhere. > +f = fiber.new(function() s:replace{1} end) > + | --- > + | ... > +test_run:wait_lsn('replica1', 'default') > + | --- > + | ... > +test_run:wait_lsn('replica2', 'default') > + | --- > + | ... > + > +-- Make so the replica1 will apply CONFIRM from the default instance for a long > +-- time. > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + | --- > + | - ok > + | ... > + > +-- Emit the CONFIRM. > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 3} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +-- It hangs on the replica1. > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + | --- > + | - true > + | ... > + > +-- But is applied on the replica2. The limbo is empty here now. 
> +test_run:switch('replica2') > + | --- > + | - true > + | ... > +test_run:wait_lsn('replica2', 'default') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 2, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +-- Replica2 takes the limbo ownership and sends the transaction to the replica1. > +-- Along with the CONFIRM from the default node, which is still not applied > +-- on the replica1. > +fiber = require('fiber') > + | --- > + | ... > +f = fiber.new(function() box.space.test:replace{2} end) > + | --- > + | ... > + > +test_run:switch('replica1') > + | --- > + | - true > + | ... > +fiber = require('fiber') > + | --- > + | ... > +-- WAL write of the CONFIRM from the default node still is not done. Give it > +-- some time to get the new rows from the replica2 and block on the latch. > +-- Can't catch it anyhow via conds, so the only way is to sleep a bit. > +fiber.sleep(0.1) > + | --- > + | ... > +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied > +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is > +-- applied and persisted. > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +test_run:wait_lsn('replica1', 'replica2') > + | --- > + | ... > +box.space.test:get({2}) > + | --- > + | - [2] > + | ... > + > +-- Ensure the replication works fine, nothing is broken. > +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'}) > + | --- > + | - true > + | ... > + > +test_run:switch('replica2') > + | --- > + | - true > + | ... > +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'}) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica1') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica2') > + | --- > + | - true > + | ... 
> +test_run:cmd('delete server replica2') > + | --- > + | - true > + | ... > +s:drop() > + | --- > + | ... > +box.schema.user.revoke('guest', 'super') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > + | --- > + | ... > diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua > new file mode 100644 > index 000000000..37b569da7 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua > @@ -0,0 +1,125 @@ > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +-- To fix that there was a patch making synchro rows processing after WAL write. > +-- As a result, the following situation could happen. Instance 1 owns the limbo, > +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1 > +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write > +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro > +-- transaction, owns the local limbo, and sends the transaction to the instance > +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that > +-- the instance 3's new transaction is rejected. Because firstly instance 3 will > +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier > +-- in WAL of instance 3. Then on instance 2 it will block on a latch with > +-- replica_id 1 until the original CONFIRM received from the instance 1 is > +-- finished. 
Afterwards the instance 3's transaction is applied just fine - the > +-- limbo on instance 2 is empty now. > +-- > +-- It is not related to read-views, but could break the replication. > +-- > +test_run = require('test_run').new() > +fiber = require('fiber') > +old_synchro_quorum = box.cfg.replication_synchro_quorum > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + > +box.schema.user.grant('guest', 'super') > + > +s = box.schema.space.create('test', {is_sync = true}) > +_ = s:create_index('pk') > + > +test_run:cmd('create server replica1 with rpl_master=default,\ > + script="replication/replica1.lua"') > +test_run:cmd('start server replica1') > + > +test_run:cmd('create server replica2 with rpl_master=default,\ > + script="replication/replica2.lua"') > +test_run:cmd('start server replica2') > + > +-- Build mutual replication between replica1 and replica2. > +test_run:switch('replica1') > +replication = box.cfg.replication > +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1]) > +box.cfg{replication = {}} > +box.cfg{replication = replication} > + > +test_run:switch('replica2') > +replication = box.cfg.replication > +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1]) > +box.cfg{replication = {}} > +box.cfg{replication = replication} > + > +test_run:switch('default') > +fiber = require('fiber') > +box.cfg{ \ > + replication_synchro_quorum = 4, \ > + replication_synchro_timeout = 1000, \ > +} > +-- Send a transaction to all 3 nodes. The limbo is owned by the default node > +-- everywhere. > +f = fiber.new(function() s:replace{1} end) > +test_run:wait_lsn('replica1', 'default') > +test_run:wait_lsn('replica2', 'default') > + > +-- Make so the replica1 will apply CONFIRM from the default instance for a long > +-- time. > +test_run:switch('replica1') > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + > +-- Emit the CONFIRM. 
> +test_run:switch('default') > +box.cfg{replication_synchro_quorum = 3} > +test_run:wait_cond(function() return f:status() == 'dead' end) > + > +-- It hangs on the replica1. > +test_run:switch('replica1') > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + > +-- But is applied on the replica2. The limbo is empty here now. > +test_run:switch('replica2') > +test_run:wait_lsn('replica2', 'default') > +box.cfg{ \ > + replication_synchro_quorum = 2, \ > + replication_synchro_timeout = 1000, \ > +} > +-- Replica2 takes the limbo ownership and sends the transaction to the replica1. > +-- Along with the CONFIRM from the default node, which is still not applied > +-- on the replica1. > +fiber = require('fiber') > +f = fiber.new(function() box.space.test:replace{2} end) > + > +test_run:switch('replica1') > +fiber = require('fiber') > +-- WAL write of the CONFIRM from the default node still is not done. Give it > +-- some time to get the new rows from the replica2 and block on the latch. > +-- Can't catch it anyhow via conds, so the only way is to sleep a bit. > +fiber.sleep(0.1) > +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied > +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is > +-- applied and persisted. > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > +test_run:wait_lsn('replica1', 'replica2') > +box.space.test:get({2}) > + > +-- Ensure the replication works fine, nothing is broken. 
> +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'}) > + > +test_run:switch('replica2') > +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'}) > + > +test_run:switch('default') > +test_run:cmd('stop server replica1') > +test_run:cmd('delete server replica1') > +test_run:cmd('stop server replica2') > +test_run:cmd('delete server replica2') > +s:drop() > +box.schema.user.revoke('guest', 'super') > +box.cfg{ \ > + replication_synchro_quorum = old_synchro_quorum, \ > + replication_synchro_timeout = old_synchro_timeout, \ > +} > diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result > new file mode 100644 > index 000000000..a8c24c289 > --- /dev/null > +++ b/test/replication/gh-5213-qsync-applier-order.result > @@ -0,0 +1,302 @@ > +-- test-run result file version 2 > +-- > +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL. > +-- As a result it could happen that the transactions became visible on CONFIRM, > +-- then somehow weren't written to WAL, and on restart the data might not be > +-- visible again. Which means rollback of confirmed data and is not acceptable > +-- (on the contrary with commit after rollback). > +-- > +test_run = require('test_run').new() > + | --- > + | ... > +fiber = require('fiber') > + | --- > + | ... > +old_synchro_quorum = box.cfg.replication_synchro_quorum > + | --- > + | ... > +old_synchro_timeout = box.cfg.replication_synchro_timeout > + | --- > + | ... > + > +box.schema.user.grant('guest', 'super') > + | --- > + | ... > + > +s = box.schema.space.create('test', {is_sync = true}) > + | --- > + | ... > +_ = s:create_index('pk') > + | --- > + | ... > + > +test_run:cmd('create server replica with rpl_master=default,\ > + script="replication/gh-5213-replica.lua"') > + | --- > + | - true > + | ... > +test_run:cmd('start server replica') > + | --- > + | - true > + | ... 
> + > +fiber = require('fiber') > + | --- > + | ... > +box.cfg{ \ > + replication_synchro_quorum = 3, \ > + replication_synchro_timeout = 1000, \ > +} > + | --- > + | ... > +f = fiber.new(function() s:replace{1} end) > + | --- > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +-- Wait when the transaction is written to WAL. > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +s = box.space.test > + | --- > + | ... > +-- But not visible yet. Because MVCC is on, and no CONFIRM yet. > +assert(s:get({1}) == nil) > + | --- > + | - true > + | ... > +-- Block the incoming CONFIRM to be able to ensure the data is not visible until > +-- WAL write ends. > +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) > + | --- > + | - ok > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 2} > + | --- > + | ... > +test_run:wait_cond(function() return f:status() == 'dead' end) > + | --- > + | - true > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +-- Wait when CONFIRM is in the WAL thread. > +test_run:wait_cond(function() \ > + return box.error.injection.get('ERRINJ_WAL_DELAY') \ > +end) > + | --- > + | - true > + | ... > +assert(s:get({1}) == nil) > + | --- > + | - true > + | ... > +box.error.injection.set('ERRINJ_WAL_DELAY', false) > + | --- > + | - ok > + | ... > +-- After CONFIRM is in WAL, the transaction is committed and its data is > +-- visible. > +test_run:wait_cond(function() return s:get({1}) ~= nil end) > + | --- > + | - true > + | ... > + > +-- > +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL > +-- write. The data remains not confirmed until WAL write succeeds. > +-- > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum = 3} > + | --- > + | ... > +f = fiber.new(function() s:replace{2} end) > + | --- > + | ... 
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Re-subscribe.
> +replication = box.cfg.replication
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({2}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
> +-- thread. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{3} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({3}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
> +
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> + | ---
> + | ...
> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
> new file mode 100644
> index 000000000..f1eccfa84
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
> @@ -0,0 +1,122 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> +              script="replication/gh-5213-replica.lua"')
> +test_run:cmd('start server replica')
> +
> +fiber = require('fiber')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 3,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> +f = fiber.new(function() s:replace{1} end)
> +
> +test_run:switch('replica')
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_lsn('replica', 'default')
> +s = box.space.test
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function()                                                   \
> +    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
> +end)
> +assert(s:get({1}) == nil)
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> +
> +--
> +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
> +-- write. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{2} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({2}) == nil)
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> +assert(s:get({2}) == nil)
> +-- Re-subscribe.
> +replication = box.cfg.replication
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({2}) ~= nil)
> +
> +--
> +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
> +-- thread. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{3} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({3}) == nil)
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> +assert(s:get({3}) == nil)
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({3}) ~= nil)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
> new file mode 100644
> index 000000000..3ba90d7ab
> --- /dev/null
> +++ b/test/replication/gh-5213-replica.lua
> @@ -0,0 +1,10 @@
> +#!/usr/bin/env tarantool
> +
> +box.cfg({
> +    listen = os.getenv("LISTEN"),
> +    replication = os.getenv("MASTER"),
> +    read_only = true,
> +    memtx_use_mvcc_engine = true,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index aff5fda26..4a9ca0a46 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -15,6 +15,8 @@
>      "gh-3760-misc-return-on-quorum-0.test.lua": {},
>      "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
>      "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
> +    "gh-5213-qsync-applier-order.test.lua": {},
> +    "gh-5213-qsync-applier-order-3.test.lua": {},
>      "gh-5426-election-on-off.test.lua": {},
>      "gh-5433-election-restart-recovery.test.lua": {},
>      "gh-5506-election-on-off.test.lua": {},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index a9e44e8cf..1d9c0a4ae 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>  script = master.lua
>  description = tarantool/box, replication
>  disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
>  config = suite.cfg
>  lua_libs = lua/fast_replica.lua lua/rlimit.lua
>  use_unix_sockets = True

-- 
Serge Petrenko
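The invariant the test above exercises can be summarized in a small, language-neutral model (a sketch in Python; the `Limbo` class and its methods are hypothetical names for illustration, not Tarantool's actual applier code): transactions parked in the limbo become visible only after the CONFIRM row's local WAL write succeeds, and a failed write leaves them invisible, which is exactly what the `s:get(...) == nil` assertions around the error injections check.

```python
# Simplified model of "process synchro rows after WAL write" (gh-5213).
# All names here are hypothetical; this is not Tarantool code.

class Limbo:
    def __init__(self):
        self.pending = []   # txns written to WAL but awaiting CONFIRM
        self.visible = []   # txns whose CONFIRM reached the WAL

    def add(self, txn):
        """A replicated synchronous txn lands in the limbo, not yet visible."""
        self.pending.append(txn)

    def apply_confirm(self, wal_write_ok):
        """Process a replicated CONFIRM row.

        The fix modeled here: the row is first written to the local WAL,
        and only a *successful* write makes the pending txns visible.
        The buggy order would move pending -> visible before the write.
        """
        if not wal_write_ok:
            # WAL write failed: nothing becomes visible, txns stay pending.
            return False
        self.visible.extend(self.pending)
        self.pending.clear()
        return True

limbo = Limbo()
limbo.add("txn-1")
# Failed CONFIRM WAL write: the txn must stay invisible.
assert not limbo.apply_confirm(wal_write_ok=False)
assert limbo.visible == [] and limbo.pending == ["txn-1"]
# Successful retry (the test models this with a re-subscribe): now visible.
assert limbo.apply_confirm(wal_write_ok=True)
assert limbo.visible == ["txn-1"] and limbo.pending == []
```

The same ordering explains all three scenarios in the test: a delayed write (`ERRINJ_WAL_DELAY`), a write that fails to start (`ERRINJ_WAL_IO`), and one that fails inside the WAL thread (`ERRINJ_WAL_ROTATE`) all keep the data unconfirmed until a write finally succeeds.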
Thread overview: 10+ messages (newest as of 2021-04-09 08:25 UTC)

2021-04-07 22:47 Vladislav Shpilevoy via Tarantool-patches
2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
2021-04-08 10:19 ` Cyrill Gorcunov via Tarantool-patches
2021-04-08 10:32 ` Serge Petrenko via Tarantool-patches
2021-04-08 10:46 ` Cyrill Gorcunov via Tarantool-patches
2021-04-08 22:57 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-09  8:25 ` Serge Petrenko via Tarantool-patches [this message]
2021-04-09 21:32 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-08 22:56 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-09  8:26 ` Serge Petrenko via Tarantool-patches