* [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering
@ 2021-09-10 15:29 Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
  ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Guys, please take a look once time permits. The questionable moments:

 - we use the filter disabling procedure for join/recovery: this is done
   because the snapshot carries a PROMOTE record which fills the initial
   limbo state
 - we need more tests to cover all possible scenarios
 - I keep filter_confirm_rollback() as is, but rereading Vlad's comment

   >
   > 9. What if rollback is for LSN > limbo's last LSN? It
   > also means nothing to do. The same for confirm LSN < limbo's
   > first LSN.
   >

   I presume I need to traverse the limbo and test whether the incoming
   LSN is present inside the current queue.

Anyway, I am sending this version early to gather more comments; I hope
there is not much left to implement before it is ready for merging.

previous series https://lists.tarantool.org/tarantool-patches/20210804190752.488147-1-gorcunov@gmail.com/
branch gorcunov/gh-6036-rollback-confirm-14
issue https://github.com/tarantool/tarantool/issues/6036

v6:
 - use the txn_limbo_terms name for the structure
 - rebase on the fresh sp/gh-6034-empty-limbo-transition branch
 - rework filtering chains
v8:
 - add the ability to disable filtering for local recovery and join stages
 - update tests
v9:
 - opencode terms tracking
 - fix tests to use a wait function, since log output might be deferred by the OS
v10:
 - rework FILTER_IN and FILTER_PROMOTE chains with more detailed packet inspection
 - preserve old naming for term manipulations
 - require the packet's replica_id to match the limbo owner_id all the time
v11-13: internal
v14:
 - use straightforward packet inspection by type, without more general type routing
 - tried to hide the locking API inside the limbo level, but since journal
   completion is called from inside the sched fiber the lock owner gets
   migrated, which causes an error; thus explicit locking is kept instead
 - added updating of limbo::confirmed_lsn, since we need it for proper validation
 - added a new error code to distinguish filter errors from anything else
 - use say_error instead of say_info
 - keep filtration disabled during initial join/recovery because we're
   filling the initial limbo state

Cyrill Gorcunov (6):
  qsync: track confirmed lsn number on reads
  qsync: update confirmed lsn on initial promote request
  latch: add latch_is_locked helper
  qsync: order access to the limbo terms
  qsync: filter incoming synchro requests
  test: add replication/gh-6036-rollback-confirm

 .../gh-6036-qsync-filter-packets.md           |   9 +
 src/box/applier.cc                            |  26 +-
 src/box/box.cc                                |  30 +-
 src/box/errcode.h                             |   1 +
 src/box/memtx_engine.cc                       |   3 +-
 src/box/txn_limbo.c                           | 337 +++++++++++++++---
 src/box/txn_limbo.h                           |  85 ++++-
 src/lib/core/latch.h                          |  11 +
 test/box/error.result                         |   1 +
 test/replication/gh-6036-master.lua           |   1 +
 test/replication/gh-6036-node.lua             |  33 ++
 test/replication/gh-6036-replica.lua          |   1 +
 .../gh-6036-rollback-confirm.result           | 180 ++++++++++
 .../gh-6036-rollback-confirm.test.lua         |  92 +++++
 14 files changed, 747 insertions(+), 63 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6036-qsync-filter-packets.md
 create mode 120000 test/replication/gh-6036-master.lua
 create mode 100644 test/replication/gh-6036-node.lua
 create mode 120000 test/replication/gh-6036-replica.lua
 create mode 100644 test/replication/gh-6036-rollback-confirm.result
 create mode 100644 test/replication/gh-6036-rollback-confirm.test.lua

base-commit: b0431cf8f47e9d081f6a402bc18edb1d6ad49847
-- 
2.31.1

^ permalink raw reply	[flat|nested] 21+ messages in thread
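To make the open question above concrete, here is a minimal, self-contained
sketch of the traversal idea ("test if the incoming LSN is present inside the
current queue"). This is not Tarantool code: struct entry and
queue_covers_lsn are invented names, and the real limbo keeps its entries in
an rlist of struct txn_limbo_entry; the sketch only shows the membership test
a CONFIRM/ROLLBACK LSN would have to pass.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Hypothetical stand-in for a queued synchronous transaction. */
	struct entry {
		int64_t lsn;
		struct entry *next;
	};

	/*
	 * Return true when @lsn points at some queued entry, i.e. a
	 * CONFIRM/ROLLBACK carrying it still has work to do in the queue.
	 */
	static bool
	queue_covers_lsn(const struct entry *first, int64_t lsn)
	{
		for (const struct entry *e = first; e != NULL; e = e->next) {
			if (e->lsn == lsn)
				return true;
		}
		return false;
	}

	int
	main(void)
	{
		struct entry e3 = {30, NULL}, e2 = {20, &e3}, e1 = {10, &e2};
		/* LSN 20 is queued, LSN 40 is not -- the latter would be rejected. */
		printf("%d %d\n", queue_covers_lsn(&e1, 20), queue_covers_lsn(&e1, 40));
		return 0;
	}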
* [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-13 8:50 ` Serge Petrenko via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches ` (5 subsequent siblings) 6 siblings, 2 replies; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw) To: tml; +Cc: Vladislav Shpilevoy We will use this lsn for requests validation in next patches for sake of split-brain detection. Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/txn_limbo.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 70447caaf..cca2ce493 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) assert(e->txn->signature >= 0); txn_complete_success(e->txn); } + + /* + * We use confirmed lsn number to verify requests and + * reject ones coming from split-brain cluster configurations, + * so update it even if there were no entries to process. + */ + limbo->confirmed_lsn = lsn; } /** -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-12 22:18 ` Cyrill Gorcunov via Tarantool-patches 2021-09-13 8:33 ` Serge Petrenko via Tarantool-patches 2021-09-13 8:50 ` Serge Petrenko via Tarantool-patches 1 sibling, 2 replies; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw) To: Cyrill Gorcunov, tml Thanks for the patch! On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote: > We will use this lsn for requests validation > in next patches for sake of split-brain detection. I don't understand. How exactly will it help? > Part-of #6036 > > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> > --- > src/box/txn_limbo.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index 70447caaf..cca2ce493 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) > assert(e->txn->signature >= 0); > txn_complete_success(e->txn); > } > + > + /* > + * We use confirmed lsn number to verify requests and > + * reject ones coming from split-brain cluster configurations, > + * so update it even if there were no entries to process. > + */ > + limbo->confirmed_lsn = lsn; ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-12 22:18     ` Cyrill Gorcunov via Tarantool-patches
  2021-09-13  8:33     ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 22:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:00PM +0200, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote:
> > We will use this lsn for requests validation
> > in next patches for sake of split-brain detection.
> 
> I don't understand. How exactly will it help?

Sorry for not giving a more detailed explanation. Here it is: we have the test

  ./test-run replication/qsync_advanced.test.lua

where the limbo owner is migrated; as a result our filter refuses to accept
the new limbo owner:

> txn_limbo.c:872 E> RAFT: rejecting PROMOTE (31) request from origin_id 2 replica_id 1 term 3. confirmed_lsn 72 > promote_lsn 0
> ER_CLUSTER_SPLIT: Cluster split detected. Backward promote LSN

This is because the promote request comes in with LSN = 0 while confirmed_lsn
is bigger, which in turn happens because we update the LSN on write
operations only.

In this test we have two nodes, "default" and "replica". Initially the
"default" node is the limbo owner and writes some data into a sync space.
Then we wait until this sync data gets replicated (note that "default" has
confirmed_lsn > 0 because it has been writing the data). Then we jump to the
"replica" node and call box.promote() there, which initiates a PROMOTE
request with lsn = 0 and sends it back to the "default" node, which was the
limbo owner before and has confirmed_lsn = 72. When this request comes in,
the filtration fails (the replica node didn't write _anything_ locally, so
its confirmed_lsn = 0, which is what we send in the promote body).

^ permalink raw reply	[flat|nested] 21+ messages in thread
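For illustration only, a minimal model of the comparison that fires in the
log above. The function name promote_lsn_is_acceptable is made up; the real
check lives in the filter chain added later in this series. The old owner
compares the incoming promote LSN against its locally tracked confirmed_lsn,
and 0 < 72 is treated as a backward promote; tracking confirmed_lsn on reads
is what lets the replica send a non-zero value instead.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Model of the "backward promote LSN" rejection: a PROMOTE whose LSN
	 * is below what the receiver has already confirmed is refused.
	 */
	static bool
	promote_lsn_is_acceptable(int64_t confirmed_lsn, int64_t promote_lsn)
	{
		return promote_lsn >= confirmed_lsn;
	}

	int
	main(void)
	{
		/* The scenario from the log: old owner confirmed 72, promote says 0. */
		printf("accept=%d\n", promote_lsn_is_acceptable(72, 0));
		/* With confirmed_lsn tracked on reads the replica can send 72. */
		printf("accept=%d\n", promote_lsn_is_acceptable(72, 72));
		return 0;
	}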
* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-12 22:18 ` Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 8:33 ` Serge Petrenko via Tarantool-patches 1 sibling, 0 replies; 21+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2021-09-13 8:33 UTC (permalink / raw) To: Vladislav Shpilevoy, Cyrill Gorcunov, tml 12.09.2021 18:44, Vladislav Shpilevoy via Tarantool-patches пишет: > Thanks for the patch! > > On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote: >> We will use this lsn for requests validation >> in next patches for sake of split-brain detection. > I don't understand. How exactly will it help? Confirmed_lsn wasn't tracked during recovery and while following a master. So, essentially, only the living master could detect splitbrains by comparing confirmed_lsn to something else. > >> Part-of #6036 >> >> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> >> --- >> src/box/txn_limbo.c | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c >> index 70447caaf..cca2ce493 100644 >> --- a/src/box/txn_limbo.c >> +++ b/src/box/txn_limbo.c >> @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) >> assert(e->txn->signature >= 0); >> txn_complete_success(e->txn); >> } >> + >> + /* >> + * We use confirmed lsn number to verify requests and >> + * reject ones coming from split-brain cluster configurations, >> + * so update it even if there were no entries to process. >> + */ >> + limbo->confirmed_lsn = lsn; -- Serge Petrenko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-09-13 8:50 ` Serge Petrenko via Tarantool-patches 1 sibling, 0 replies; 21+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2021-09-13 8:50 UTC (permalink / raw) To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy 10.09.2021 18:29, Cyrill Gorcunov пишет: > We will use this lsn for requests validation > in next patches for sake of split-brain detection. > > Part-of #6036 > > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> > --- > src/box/txn_limbo.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index 70447caaf..cca2ce493 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) > assert(e->txn->signature >= 0); > txn_complete_success(e->txn); > } > + > + /* > + * We use confirmed lsn number to verify requests and > + * reject ones coming from split-brain cluster configurations, > + * so update it even if there were no entries to process. > + */ > + limbo->confirmed_lsn = lsn; > } > > /** I guess there'll be problems on master with this approach. Say, a pair of CONFIRM requests is written, with lsns N and N+1. So you first enter write_confirm(N), then write_confirm(N+1). Now both fibers issuing the requests yield waiting for the write to happen, and confirmed_lsn is N+1. Once the first CONFIRM (N) is written, you reset confirmed_lsn to N right in read_confirm. So until the second CONFIRM (N+1) is written, there's a window when confirmed_lsn is N, but it should be N+1. I think read_confirm should set confirmed_lsn on replica only. On master this task is performed by write_confirm. You may split read_confirm in two parts: set confirmed lsn (used only on replica) and apply_confirm (everything read_confirm did before your patch) -- Serge Petrenko ^ permalink raw reply [flat|nested] 21+ messages in thread
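Below is a rough, hedged sketch of the split Serge proposes, not the actual
patch. The toy_limbo_* names are invented and the queue handling is elided;
the point is only the control flow: the replica path records the CONFIRM's
LSN, while on the master confirmed_lsn is already advanced by write_confirm,
so only the apply part runs and the N/N+1 window described above closes.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Toy limbo model, just enough to show the proposed split. */
	struct toy_limbo {
		int64_t confirmed_lsn;
		/* ... queue of pending entries, owner id, and so on ... */
	};

	/* Replica-only part: remember the LSN carried by the CONFIRM record. */
	static void
	toy_limbo_set_confirmed_lsn(struct toy_limbo *limbo, int64_t lsn)
	{
		limbo->confirmed_lsn = lsn;
	}

	/* Everything read_confirm did before: complete the queued transactions. */
	static void
	toy_limbo_apply_confirm(struct toy_limbo *limbo, int64_t lsn)
	{
		(void)limbo;
		(void)lsn;
		/* Walk the queue and complete entries with entry->lsn <= lsn. */
	}

	static void
	toy_limbo_read_confirm(struct toy_limbo *limbo, int64_t lsn, bool is_replica)
	{
		/* On the master write_confirm already advanced confirmed_lsn. */
		if (is_replica)
			toy_limbo_set_confirmed_lsn(limbo, lsn);
		toy_limbo_apply_confirm(limbo, lsn);
	}

	int
	main(void)
	{
		struct toy_limbo limbo = { .confirmed_lsn = 0 };
		toy_limbo_read_confirm(&limbo, 72, true);
		printf("confirmed_lsn=%lld\n", (long long)limbo.confirmed_lsn);
		return 0;
	}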
* [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches ` (4 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw) To: tml; +Cc: Vladislav Shpilevoy When promote request is handled we drop last confirmed lsn to zero because its value make sense for sync queue owner only. Still the case where we become queue owner for the first time is special - we need to fetch the obtained lsn from the request and remember it so we will be able to filter any next malformed requests with wrong lsn numbers (see queue filtering procedure in next patch). Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/txn_limbo.c | 8 +++++++- src/box/txn_limbo.h | 6 ++++++ 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index cca2ce493..08463219d 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -50,6 +50,7 @@ txn_limbo_create(struct txn_limbo *limbo) limbo->confirmed_lsn = 0; limbo->rollback_count = 0; limbo->is_in_rollback = false; + limbo->has_initial_promote = false; } bool @@ -521,8 +522,13 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id, txn_limbo_read_rollback(limbo, lsn + 1); assert(txn_limbo_is_empty(&txn_limbo)); limbo->owner_id = replica_id; + if (likely(limbo->has_initial_promote)) { + limbo->confirmed_lsn = 0; + } else { + limbo->confirmed_lsn = lsn; + limbo->has_initial_promote = true; + } box_update_ro_summary(); - limbo->confirmed_lsn = 0; } void diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index 53e52f676..e0d17de4b 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -179,6 +179,12 @@ struct txn_limbo { * by the 'reversed rollback order' rule - contradiction. */ bool is_in_rollback; + /** + * Whether the limbo received initial PROMOTE request. It is needed to + * update confirmed_lsn appropriately and pass packet validation/filtering + * procedure. + */ + bool has_initial_promote; }; /** -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-12 22:25 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 1 reply; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw) To: Cyrill Gorcunov, tml Thanks for the patch! On 10.09.2021 17:29, Cyrill Gorcunov wrote: > When promote request is handled we drop last confirmed > lsn to zero because its value make sense for sync queue > owner only. Still the case where we become queue owner > for the first time is special - we need to fetch the > obtained lsn from the request and remember it so we > will be able to filter any next malformed requests > with wrong lsn numbers (see queue filtering procedure > in next patch). I don't understand anything. Why isn't it needed always? And how exactly will it help to filter stuff? > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index cca2ce493..08463219d 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -521,8 +522,13 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id, > txn_limbo_read_rollback(limbo, lsn + 1); > assert(txn_limbo_is_empty(&txn_limbo)); > limbo->owner_id = replica_id; > + if (likely(limbo->has_initial_promote)) { > + limbo->confirmed_lsn = 0; > + } else { > + limbo->confirmed_lsn = lsn; > + limbo->has_initial_promote = true; > + } > box_update_ro_summary(); > - limbo->confirmed_lsn = 0; > } ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-12 22:25     ` Cyrill Gorcunov via Tarantool-patches
  2021-09-13  8:52       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 22:25 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:04PM +0200, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> On 10.09.2021 17:29, Cyrill Gorcunov wrote:
> > When promote request is handled we drop last confirmed
> > lsn to zero because its value make sense for sync queue
> > owner only. Still the case where we become queue owner
> > for the first time is special - we need to fetch the
> > obtained lsn from the request and remember it so we
> > will be able to filter any next malformed requests
> > with wrong lsn numbers (see queue filtering procedure
> > in next patch).
> 
> I don't understand anything. Why isn't it needed always? And
> how exactly will it help to filter stuff?

This problem is revealed by a run of
./test-run replication/gh-6034-qsync-limbo-ownership.test.lua with filtration
turned on. The confirmed_lsn makes sense only in connection with the limbo
owner, as far as I understand. In the test we have two nodes, "default" and
"replica". Initially "default" comes up and fills a sync space with some
data, and then we start up a replica node. The replica gets subscribed and
then we call box.promote() on it; since the replica itself has not written
anything, its confirmed_lsn = 0, which we send back to "default" inside the
promote request body. And it gets rejected because "default" has a non-zero
confirmed_lsn. I've been talking to Serge a lot about this problem, and if
I'm not missing something obvious, updating this field on the first promote
arrival is mostly the correct way to handle the issue. I presume we should
have a meeting and talk again.

Or maybe better via email. Serge, could you please write the details here?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request 2021-09-12 22:25 ` Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 8:52 ` Serge Petrenko via Tarantool-patches 2021-09-13 14:20 ` [Tarantool-patches] [RFC] qsync: overall design Cyrill Gorcunov via Tarantool-patches 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2021-09-13 8:52 UTC (permalink / raw) To: Cyrill Gorcunov, Vladislav Shpilevoy; +Cc: tml 13.09.2021 01:25, Cyrill Gorcunov пишет: > On Sun, Sep 12, 2021 at 05:44:04PM +0200, Vladislav Shpilevoy wrote: >> Thanks for the patch! >> >> On 10.09.2021 17:29, Cyrill Gorcunov wrote: >>> When promote request is handled we drop last confirmed >>> lsn to zero because its value make sense for sync queue >>> owner only. Still the case where we become queue owner >>> for the first time is special - we need to fetch the >>> obtained lsn from the request and remember it so we >>> will be able to filter any next malformed requests >>> with wrong lsn numbers (see queue filtering procedure >>> in next patch). >> I don't understand anything. Why isn't it needed always? And >> how exactly will it help to filter stuff? > This problem is revealed when run of ./test-run replication/gh-6034-qsync-limbo-ownership.test.lua > with filteration turned on. The confirmed_lsn make sence in bound > with limbo owner as far as I understand. And in test we have > two nodes "default" and "replica". Initially default gets up, filled > with some data into sync space and then we start up a replica node. > The replica get subscribed and then we call box.promote() on it, > since replica itself has not been writting anything its confirmed_lsn = 0, > which we send back to the "default" inside promote request body. And > it get rejected because "default" has non-zero confirmed_lsn. I've > been talking to Serge a lot about this problem and if I'm not missing > somthing obvious updating this filed on first promote arrival is > mostly correct way to handle the issue. I presume we should get > a meeting and talk again. > > Or maybe better via email. Serge could you please write the details here? It would be easier to discuss this verbally, I think. -- Serge Petrenko ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [RFC] qsync: overall design
  2021-09-13  8:52             ` Serge Petrenko via Tarantool-patches
@ 2021-09-13 14:20               ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 14:20 UTC (permalink / raw)
  To: Serge Petrenko, Vladislav Shpilevoy; +Cc: tml

On Mon, Sep 13, 2021 at 11:52:39AM +0300, Serge Petrenko wrote:
> > 
> > Or maybe better via email. Serge could you please write the details here?
> 
> It would be easier to discuss this verbally, I think.

Verbal meetings are good indeed, but maybe I could summarize all the problems
found so far and write them down here. Guys, please comment, I would really
appreciate it.

Terms access ordering
---------------------

We've found that fibers can read old terms which are already updated but not
yet written into the WAL. To address this we order term reading, so appliers
wait until the write to the WAL is complete. While everyone agrees that there
is an issue and that ordering solves it, we're not yet completely clear about
the internal design. I proposed to use explicit locking via
txn_limbo_term_lock/unlock calls. The calls are used inside the applier's code:

	apply_synchro_row
		txn_limbo_term_lock
			journal_write
		txn_limbo_term_unlock

The key moment is the journal_write() call, which queues a completion to run,
and the completion code is called from inside the sched() fiber, i.e. not the
fiber which took the lock (and such lock migration is prohibited in our
latch-lock engine).

The proposal was to hide the locking mechanism inside the limbo internals
completely, so the callers won't know about it. When I tried to do so I hit
the problem with lock context migration and had to step back to explicit
locks as in the code above. Still, Vlad's questions remain:

| 2. As for your complaint about the begin/commit/rollback API
| being not working because you can't unlock from a non-owner
| fiber - well, it works in your patch somehow, doesn't it?

I already explained why it works with explicit locking:
https://lists.tarantool.org/tarantool-patches/YT8tZ0CuIDKwzcC4@grain/
In short - we take and release the lock in the same context.

|
| Why do you in your patch unlock here, but in the newly proposed
| API you only tried to unlock in the trigger?

Because our commit/rollback are called from inside the sched() fiber, and we
would have to provide some helper like a completion of a completion, where
the second completion is called from inside the applier context to unlock the
terms. To me this is way more messy than the explicit locking scheme.

|
| You could call commit/rollback from this function, like you
| do with unlock now.

This moment I don't understand. We already have commit/rollback helpers, so I
ask Vlad to write some pseudocode to figure out what exactly he has in mind.

Limbo's confirmed_lsn update upon CONFIRM request read
------------------------------------------------------

Currently we update limbo::confirmed_lsn when a node _writes_ this request
into the WAL. This is done on the limbo owner node only, i.e. the transaction
initiator. As a result, when a node which has never been the leader takes
limbo ownership, it sends its own "confirmed_lsn = 0" inside the PROMOTE
request, and when this request reaches the previous leader node we don't
allow it to proceed (due to our filtration rules, which require the LSN to be
not less than the current confirmed_lsn). Also Serge pointed out:

a)
| Confirmed_lsn wasn't tracked during recovery and while
| following a master. So, essentially, only the living master could
| detect splitbrains by comparing confirmed_lsn to something else.

b)
| Say, a pair of CONFIRM requests is written, with lsns
| N and N+1. So you first enter write_confirm(N), then
| write_confirm(N+1). Now both fibers issuing the requests yield
| waiting for the write to happen, and confirmed_lsn is N+1.
|
| Once the first CONFIRM (N) is written, you reset confirmed_lsn to N
| right in read_confirm.
|
| So until the second CONFIRM (N+1) is written, there's a window
| when confirmed_lsn is N, but it should be N+1.
|
| I think read_confirm should set confirmed_lsn on replica only.
| On master this task is performed by write_confirm.
| You may split read_confirm in two parts:
| - set confirmed lsn (used only on replica) and
| - apply_confirm (everything read_confirm did before your patch)

Thus it seems I need to rework this aspect.

Update confirmed_lsn on first PROMOTE request arrival
-----------------------------------------------------

A detailed explanation of what I've seen is here:
https://lists.tarantool.org/tarantool-patches/YT5+YqCJuAh0HAQg@grain/
I must confess I don't like this moment much either, since it is a bit of a
vague point for me, so we are going to look into it soon at a meeting.

Filtration procedure itself (split detector)
--------------------------------------------

When a CONFIRM or ROLLBACK packet comes in, it is not enough to test for
limbo emptiness only. We should rather traverse the queue and figure out
whether the LSN inside the packet belongs to the current queue. So the
*preliminary* conclusion is the following: when a CONFIRM or ROLLBACK comes
in

 a) the queue is empty -- then such a request is invalid and we should exit
    with an error;
 b) the queue is not empty -- then the LSN should belong to the range covered
    by the queue;
 c) it is unclear how to test this scenario.

Filtration disabling for joining and local recovery
----------------------------------------------------

When joining or recovery happens the limbo is in an empty state, and then our
filtration starts triggering false positives. For example:

> autobootstrap1.sock I> limbo: filter PROMOTE replica_id 0 origin_id 0
> term 0 lsn 0, queue owner_id 0 len 0 promote_greatest_term 0 confirmed_lsn 0

This is because we require the term to be nonzero when the cluster is
running:

	/*
	 * PROMOTE and DEMOTE packets must not have zero
	 * term supplied, otherwise it is a broken packet.
	 */
	if (req->term == 0) {
		say_error("%s. Zero term detected", reject_str(req));
		diag_set(ClientError, ER_CLUSTER_SPLIT,
			 "Request with zero term");
		return -1;
	}

If we don't want to disable filtration at all, then we need to introduce some
state machine which would cover the initial -> working state transition. I
think it is better to start with the simpler approach, where we don't verify
data on join/recovery, and then extend the filtration if needed.

^ permalink raw reply	[flat|nested] 21+ messages in thread
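As a rough illustration of the "state machine covering the initial -> working
state" alternative mentioned in the last section (the series instead uses a
plain is_filtering boolean), here is a minimal self-contained sketch. The
names enum limbo_filter_state and term_is_acceptable are invented; only the
zero-term rule is modeled.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Hypothetical filter states: while the limbo is being seeded from a
	 * snapshot or the WAL the zero-term rule is relaxed; once the node
	 * switches to the working state the full rule set applies.
	 */
	enum limbo_filter_state {
		LIMBO_FILTER_INITIAL,
		LIMBO_FILTER_WORKING,
	};

	static bool
	term_is_acceptable(enum limbo_filter_state state, uint64_t term)
	{
		if (state == LIMBO_FILTER_INITIAL)
			return true;
		/* In the working state a zero term marks a broken packet. */
		return term != 0;
	}

	int
	main(void)
	{
		printf("%d\n", term_is_acceptable(LIMBO_FILTER_INITIAL, 0)); /* 1 */
		printf("%d\n", term_is_acceptable(LIMBO_FILTER_WORKING, 0)); /* 0 */
		printf("%d\n", term_is_acceptable(LIMBO_FILTER_WORKING, 3)); /* 1 */
		return 0;
	}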
* [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches ` (3 subsequent siblings) 6 siblings, 0 replies; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw) To: tml; +Cc: Vladislav Shpilevoy To test if latch is locked. In-scope-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/lib/core/latch.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/src/lib/core/latch.h b/src/lib/core/latch.h index 49c59cf63..0aaa8b634 100644 --- a/src/lib/core/latch.h +++ b/src/lib/core/latch.h @@ -95,6 +95,17 @@ latch_owner(struct latch *l) return l->owner; } +/** + * Return true if the latch is locked. + * + * @param l - latch to be tested. + */ +static inline bool +latch_is_locked(const struct latch *l) +{ + return l->owner != NULL; +} + /** * Lock a latch. If the latch is already locked by another fiber, * waits for timeout. -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
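A hedged usage sketch, not part of the series: the new helper is convenient
for documenting locking contracts with assertions, which mirrors the
assert(latch_is_locked(&limbo->promote_latch)) checks added by the later
patches of this set. The function below is a made-up example and assumes
Tarantool's source-tree include paths.

	#include <assert.h>
	#include "latch.h"	/* assumes Tarantool's include paths */

	/*
	 * Hypothetical helper that must only be called while the caller
	 * already holds the latch protecting the shared state.
	 */
	static void
	process_pending_locked(struct latch *l)
	{
		/* The caller, not this function, must own the latch. */
		assert(latch_is_locked(l));
		/* ... work on the data protected by the latch ... */
		(void)l;
	}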
* [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches ` (2 preceding siblings ...) 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches ` (2 subsequent siblings) 6 siblings, 1 reply; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw) To: tml; +Cc: Vladislav Shpilevoy Limbo terms tracking is shared between appliers and when one of appliers is waiting for write to complete inside journal_write() routine, an other may need to access read term value to figure out if promote request is valid to apply. Due to cooperative multitasking access to the terms is not consistent so we need to be sure that other fibers either read up to date terms (ie written to the WAL). For this sake we use latching mechanism, when one fiber took terms lock for updating other readers are waiting until the operation is complete. For example here is a call graph of two appliers applier 1 --------- applier_apply_tx (promote term = 3 current max term = 2) applier_synchro_filter_tx apply_synchro_row journal_write (sleeping) at this moment another applier comes in with obsolete data and term 2 applier 2 --------- applier_apply_tx (term 2) applier_synchro_filter_tx txn_limbo_is_replica_outdated -> false journal_write (sleep) applier 1 --------- journal wakes up apply_synchro_row_cb set max term to 3 So the applier 2 didn't notice that term 3 is already seen and wrote obsolete data. With locking the applier 2 will wait until applier 1 has finished its write. 
Part-of #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/applier.cc | 5 ++++- src/box/txn_limbo.c | 17 ++++++++++++++-- src/box/txn_limbo.h | 48 ++++++++++++++++++++++++++++++++++++++++----- 3 files changed, 62 insertions(+), 8 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index b981bd436..845a7d015 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -857,7 +857,7 @@ apply_synchro_row_cb(struct journal_entry *entry) applier_rollback_by_wal_io(entry->res); } else { replica_txn_wal_write_cb(synchro_entry->rcb); - txn_limbo_process(&txn_limbo, synchro_entry->req); + txn_limbo_process_locked(&txn_limbo, synchro_entry->req); trigger_run(&replicaset.applier.on_wal_write, NULL); } fiber_wakeup(synchro_entry->owner); @@ -873,6 +873,7 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) if (xrow_decode_synchro(row, &req) != 0) goto err; + txn_limbo_term_lock(&txn_limbo); struct replica_cb_data rcb_data; struct synchro_entry entry; /* @@ -915,8 +916,10 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) diag_set_journal_res(entry.base.res); goto err; } + txn_limbo_term_unlock(&txn_limbo); return 0; err: + txn_limbo_term_unlock(&txn_limbo); diag_log(); return -1; } diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 08463219d..65fbd0cac 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -47,6 +47,7 @@ txn_limbo_create(struct txn_limbo *limbo) vclock_create(&limbo->vclock); vclock_create(&limbo->promote_term_map); limbo->promote_greatest_term = 0; + latch_create(&limbo->promote_latch); limbo->confirmed_lsn = 0; limbo->rollback_count = 0; limbo->is_in_rollback = false; @@ -737,11 +738,14 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout) } void -txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) +txn_limbo_process_locked(struct txn_limbo *limbo, + const struct synchro_request *req) { + assert(latch_is_locked(&limbo->promote_latch)); + uint64_t term = req->term; uint32_t origin = req->origin_id; - if (txn_limbo_replica_term(limbo, origin) < term) { + if (txn_limbo_replica_term_locked(limbo, origin) < term) { vclock_follow(&limbo->promote_term_map, origin, term); if (term > limbo->promote_greatest_term) limbo->promote_greatest_term = term; @@ -799,6 +803,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) return; } +void +txn_limbo_process(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + txn_limbo_term_lock(limbo); + txn_limbo_process_locked(limbo, req); + txn_limbo_term_unlock(limbo); +} + void txn_limbo_on_parameters_change(struct txn_limbo *limbo) { diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index e0d17de4b..1ee815d1c 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -31,6 +31,7 @@ */ #include "small/rlist.h" #include "vclock/vclock.h" +#include "latch.h" #include <stdint.h> @@ -147,6 +148,10 @@ struct txn_limbo { * limbo and raft are in sync and the terms are the same. */ uint64_t promote_greatest_term; + /** + * To order access to the promote data. + */ + struct latch promote_latch; /** * Maximal LSN gathered quorum and either already confirmed in WAL, or * whose confirmation is in progress right now. Any attempt to confirm @@ -217,14 +222,39 @@ txn_limbo_last_entry(struct txn_limbo *limbo) in_queue); } +/** Lock promote data. */ +static inline void +txn_limbo_term_lock(struct txn_limbo *limbo) +{ + latch_lock(&limbo->promote_latch); +} + +/** Unlock promote data. 
*/ +static inline void +txn_limbo_term_unlock(struct txn_limbo *limbo) +{ + latch_unlock(&limbo->promote_latch); +} + +/** Fetch replica's term with lock taken. */ +static inline uint64_t +txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id) +{ + assert(latch_is_locked(&limbo->promote_latch)); + return vclock_get(&limbo->promote_term_map, replica_id); +} + /** * Return the latest term as seen in PROMOTE requests from instance with id * @a replica_id. */ static inline uint64_t -txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id) +txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id) { - return vclock_get(&limbo->promote_term_map, replica_id); + txn_limbo_term_lock(limbo); + uint64_t v = txn_limbo_replica_term_locked(limbo, replica_id); + txn_limbo_term_unlock(limbo); + return v; } /** @@ -232,11 +262,14 @@ txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id) * data from it. The check is only valid when elections are enabled. */ static inline bool -txn_limbo_is_replica_outdated(const struct txn_limbo *limbo, +txn_limbo_is_replica_outdated(struct txn_limbo *limbo, uint32_t replica_id) { - return txn_limbo_replica_term(limbo, replica_id) < - limbo->promote_greatest_term; + txn_limbo_term_lock(limbo); + bool res = txn_limbo_replica_term_locked(limbo, replica_id) < + limbo->promote_greatest_term; + txn_limbo_term_unlock(limbo); + return res; } /** @@ -308,6 +341,11 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry); /** Execute a synchronous replication request. */ void +txn_limbo_process_locked(struct txn_limbo *limbo, + const struct synchro_request *req); + +/** Lock limbo terms and execute a synchronous replication request. */ +void txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req); /** -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-13 10:52 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 1 reply; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw) To: Cyrill Gorcunov, tml Thanks for the patch! See 3 comments below. > diff --git a/src/box/applier.cc b/src/box/applier.cc > index b981bd436..845a7d015 100644 > --- a/src/box/applier.cc > +++ b/src/box/applier.cc > @@ -915,8 +916,10 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) > diag_set_journal_res(entry.base.res); > goto err; > } > + txn_limbo_term_unlock(&txn_limbo); > return 0; > err: > + txn_limbo_term_unlock(&txn_limbo); > diag_log(); 1. This function can go to 'err' before the lock is taken. 2. As for your complaint about the begin/commit/rollback API being not working because you can't unlock from a non-owner fiber - well, it works in your patch somehow, doesn't it? Why do you in your patch unlock here, but in the newly proposed API you only tried to unlock in the trigger? You could call commit/rollback from this function, like you do with unlock now. > return -1; > } > diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h > index e0d17de4b..1ee815d1c 100644 > --- a/src/box/txn_limbo.h > +++ b/src/box/txn_limbo.h > @@ -217,14 +222,39 @@ txn_limbo_last_entry(struct txn_limbo *limbo) > in_queue); > } > > +/** Lock promote data. */ > +static inline void > +txn_limbo_term_lock(struct txn_limbo *limbo) > +{ > + latch_lock(&limbo->promote_latch); > +} > + > +/** Unlock promote data. */ > +static inline void > +txn_limbo_term_unlock(struct txn_limbo *limbo) > +{ > + latch_unlock(&limbo->promote_latch); > +} > + > +/** Fetch replica's term with lock taken. */ > +static inline uint64_t > +txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id) 3. Limbo can be made const here. > +{ > + assert(latch_is_locked(&limbo->promote_latch)); > + return vclock_get(&limbo->promote_term_map, replica_id); > +} ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-09-13 10:52 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 0 replies; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 10:52 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tml On Sun, Sep 12, 2021 at 05:44:11PM +0200, Vladislav Shpilevoy wrote: > > err: > > + txn_limbo_term_unlock(&txn_limbo); > > diag_log(); > > 1. This function can go to 'err' before the lock is taken. yup, thanks! > > > 2. As for your complaint about the begin/commit/rollback API > being not working because you can't unlock from a non-owner > fiber - well, it works in your patch somehow, doesn't it? Yes, it works with my patch because journal_write() are ordered, we take and release lock explicitly inside caller code, ie inside one same fiber(). Imagine two appliers running simultaneously applier 1 applier 2 sched --------- --------- ----- apply_synchro_row txn_limbo_term_lock journal_write context-switch --> apply_synchro_row txn_limbo_term_lock wait for release apply_synchro_row_cb context-switch txn_limbo_term_unlock <-- --+ return txn_limbo_term_lock finishes ... > Why do you in your patch unlock here, but in the newly proposed > API you only tried to unlock in the trigger? Because you proposed to *hide* locking completely inside limbo code, so the caller won't know anything about it. So our commit/rollback would handle locking internally, unfortunately this doesn't work. > > You could call commit/rollback from this function, like you > do with unlock now. Not sure if I follow you here. Our journal engine implies completions to be called. We pass such completion into journal entry creation. With my patch everything remains as is except we take a lock explicitly and release it then. Could you please point more explicitly what you've in mind? > > + > > +/** Fetch replica's term with lock taken. */ > > +static inline uint64_t > > +txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id) > > 3. Limbo can be made const here. ok ^ permalink raw reply [flat|nested] 21+ messages in thread
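One possible shape of the fix acknowledged above is sketched below. It is an
assumption about how the review comment could be addressed, not the actual
follow-up patch: failures that happen before the latch is taken jump to a
label that skips the unlock, so txn_limbo_term_unlock() never runs for a lock
that was not acquired. Names follow the patches in this thread.

	/* Sketch only: apply_synchro_row() with two error labels. */
	static int
	apply_synchro_row_sketch(uint32_t replica_id, struct xrow_header *row)
	{
		(void)replica_id;
		struct synchro_request req;
		/* Decoding failure happens before the lock is taken. */
		if (xrow_decode_synchro(row, &req) != 0)
			goto err;

		txn_limbo_term_lock(&txn_limbo);
		/* From here on every failure must release the latch. */
		if (txn_limbo_filter_locked(&txn_limbo, &req) != 0)
			goto err_unlock;

		/* ... build the journal entry and write it, as in the patch ... */

		txn_limbo_term_unlock(&txn_limbo);
		return 0;
	err_unlock:
		txn_limbo_term_unlock(&txn_limbo);
	err:
		diag_log();
		return -1;
	}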
* [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches
  ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

When we receive synchro requests we can't just apply them blindly, because in
the worst case they may come from a split-brain configuration (where a
cluster has split into several sub-clusters, each one elected its own leader,
and then the sub-clusters try to merge back into the original cluster). We
need to do our best to detect such disunity and force these nodes to rejoin
from scratch for the sake of data consistency.

Thus when we're processing requests we pass them to a packet filter first,
which validates their contents and refuses to apply them if they don't match.

Filter logic depends on the request type.

First there is a common chain for any synchro packet, a kind of general
pre-validation:

 1) Zero LSN is allowed for PROMOTE | DEMOTE packets only, since
    CONFIRM | ROLLBACK have to refer to some real data with an LSN already
    assigned.

 2) request:replica_id = 0 is allowed for PROMOTE requests only.

 3) request:replica_id should match limbo:owner_id, in other words a limbo
    migration should be noticed by all instances in the cluster.

For CONFIRM and ROLLBACK packets:

 1) Neither of them can be considered if the limbo is already empty, i.e.
    there is no data in the local queue and everything has been processed
    already. Such a request obviously comes from a node which is out of date.

For PROMOTE and DEMOTE packets:

 1) The request should come in with a nonzero term, otherwise the packet is
    corrupted.

 2) The request's term should not be less than the maximal known one, in
    other words it should not come from a node which didn't notice raft
    epoch changes and is living in the past.

 3) If the LSN of the request matches the current confirmed LSN, the packet
    is obviously correct to process.

 4) If the LSN is less than the confirmed LSN, the request is wrong: we have
    processed the requested LSN already.

 5) If the LSN is greater than the confirmed LSN, then:
    a) if the limbo is empty we can't do anything, since the data is already
       processed, and we should issue an error;
    b) if there is some data in the limbo, the requested LSN should be in the
       range of the limbo's [first; last] LSNs, so that the request is able
       to confirm or roll back the limbo queue.

Because a snapshot carries a promote packet, we disable filtering at the
moment of joining the leader node, and similarly during local recovery. The
thing is that our filtration procedure implies that the limbo is already
initialized to some valid state; otherwise we would have to distinguish
initial states from working ones, which can actually be done but would make
the code more complex. Thus for now let's keep turning filtration on and off.
Closes #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- .../gh-6036-qsync-filter-packets.md | 9 + src/box/applier.cc | 21 +- src/box/box.cc | 30 +- src/box/errcode.h | 1 + src/box/memtx_engine.cc | 3 +- src/box/txn_limbo.c | 309 +++++++++++++++--- src/box/txn_limbo.h | 33 +- test/box/error.result | 1 + 8 files changed, 350 insertions(+), 57 deletions(-) create mode 100644 changelogs/unreleased/gh-6036-qsync-filter-packets.md diff --git a/changelogs/unreleased/gh-6036-qsync-filter-packets.md b/changelogs/unreleased/gh-6036-qsync-filter-packets.md new file mode 100644 index 000000000..0db629e83 --- /dev/null +++ b/changelogs/unreleased/gh-6036-qsync-filter-packets.md @@ -0,0 +1,9 @@ +## feature/replication + +* Implemented incoming synchronous packets filtration to discard + requests from outdated cluster nodes. This can happen when + replication cluster is partitioned on a transport level and + two or more sub-clusters are running simultaneously for some + time, then they are trying to merge back. Since the subclusters + had own leaders they should not be able to form original cluster + because data is not longer consistent (gh-6036). diff --git a/src/box/applier.cc b/src/box/applier.cc index 845a7d015..45098e3dd 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier) struct synchro_request req; if (xrow_decode_synchro(&row, &req) != 0) diag_raise(); - txn_limbo_process(&txn_limbo, &req); + if (txn_limbo_process(&txn_limbo, &req) != 0) + diag_raise(); } else if (iproto_type_is_raft_request(row.type)) { struct raft_request req; if (xrow_decode_raft(&row, &req, NULL) != 0) @@ -514,6 +515,11 @@ applier_fetch_snapshot(struct applier *applier) struct ev_io *coio = &applier->io; struct xrow_header row; + txn_limbo_filter_disable(&txn_limbo); + auto filter_guard = make_scoped_guard([&]{ + txn_limbo_filter_enable(&txn_limbo); + }); + memset(&row, 0, sizeof(row)); row.type = IPROTO_FETCH_SNAPSHOT; coio_write_xrow(coio, &row); @@ -587,6 +593,11 @@ applier_register(struct applier *applier, bool was_anon) struct ev_io *coio = &applier->io; struct xrow_header row; + txn_limbo_filter_disable(&txn_limbo); + auto filter_guard = make_scoped_guard([&]{ + txn_limbo_filter_enable(&txn_limbo); + }); + memset(&row, 0, sizeof(row)); /* * Send this instance's current vclock together @@ -620,6 +631,11 @@ applier_join(struct applier *applier) struct xrow_header row; uint64_t row_count; + txn_limbo_filter_disable(&txn_limbo); + auto filter_guard = make_scoped_guard([&]{ + txn_limbo_filter_enable(&txn_limbo); + }); + xrow_encode_join_xc(&row, &INSTANCE_UUID); coio_write_xrow(coio, &row); @@ -874,6 +890,9 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row) goto err; txn_limbo_term_lock(&txn_limbo); + if (txn_limbo_filter_locked(&txn_limbo, &req) != 0) + goto err; + struct replica_cb_data rcb_data; struct synchro_entry entry; /* diff --git a/src/box/box.cc b/src/box/box.cc index 7b11d56d6..f134dc8bb 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -424,8 +424,7 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row) say_error("couldn't decode a synchro request"); return -1; } - txn_limbo_process(&txn_limbo, &syn_req); - return 0; + return txn_limbo_process(&txn_limbo, &syn_req); } static int @@ -1671,7 +1670,7 @@ box_wait_limbo_acked(double timeout) } /** Write and process a PROMOTE request. 
*/ -static void +static int box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) { struct raft *raft = box_raft(); @@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) .lsn = promote_lsn, .term = raft->term, }; - txn_limbo_process(&txn_limbo, &req); + if (txn_limbo_process(&txn_limbo, &req) != 0) + return -1; assert(txn_limbo_is_empty(&txn_limbo)); + return 0; } /** A guard to block multiple simultaneous promote()/demote() invocations. */ static bool is_in_box_promote = false; /** Write and process a DEMOTE request. */ -static void +static int box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn) { assert(box_raft()->volatile_term == box_raft()->term); @@ -1708,8 +1709,10 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn) .lsn = promote_lsn, .term = box_raft()->term, }; - txn_limbo_process(&txn_limbo, &req); + if (txn_limbo_process(&txn_limbo, &req) != 0) + return -1; assert(txn_limbo_is_empty(&txn_limbo)); + return 0; } int @@ -1732,8 +1735,7 @@ box_promote_qsync(void) diag_set(ClientError, ER_NOT_LEADER, raft->leader); return -1; } - box_issue_promote(txn_limbo.owner_id, wait_lsn); - return 0; + return box_issue_promote(txn_limbo.owner_id, wait_lsn); } int @@ -1789,9 +1791,7 @@ box_promote(void) if (wait_lsn < 0) return -1; - box_issue_promote(txn_limbo.owner_id, wait_lsn); - - return 0; + return box_issue_promote(txn_limbo.owner_id, wait_lsn); } int @@ -1826,8 +1826,7 @@ box_demote(void) int64_t wait_lsn = box_wait_limbo_acked(replication_synchro_timeout); if (wait_lsn < 0) return -1; - box_issue_demote(txn_limbo.owner_id, wait_lsn); - return 0; + return box_issue_demote(txn_limbo.owner_id, wait_lsn); } int @@ -3296,6 +3295,11 @@ local_recovery(const struct tt_uuid *instance_uuid, say_info("instance uuid %s", tt_uuid_str(&INSTANCE_UUID)); + txn_limbo_filter_disable(&txn_limbo); + auto filter_guard = make_scoped_guard([&]{ + txn_limbo_filter_enable(&txn_limbo); + }); + struct wal_stream wal_stream; wal_stream_create(&wal_stream); auto stream_guard = make_scoped_guard([&]{ diff --git a/src/box/errcode.h b/src/box/errcode.h index a6f096698..002fcc1e1 100644 --- a/src/box/errcode.h +++ b/src/box/errcode.h @@ -283,6 +283,7 @@ struct errcode_record { /*228 */_(ER_SYNC_QUEUE_FOREIGN, "The synchronous transaction queue belongs to other instance with id %u")\ /*226 */_(ER_UNABLE_TO_PROCESS_IN_STREAM, "Unable to process %s request in stream") \ /*227 */_(ER_UNABLE_TO_PROCESS_OUT_OF_STREAM, "Unable to process %s request out of stream") \ + /*228 */_(ER_CLUSTER_SPLIT, "Cluster split detected. %s") \ /* * !IMPORTANT! Please follow instructions at start of the file diff --git a/src/box/memtx_engine.cc b/src/box/memtx_engine.cc index de918c335..de4298929 100644 --- a/src/box/memtx_engine.cc +++ b/src/box/memtx_engine.cc @@ -238,7 +238,8 @@ memtx_engine_recover_synchro(const struct xrow_header *row) * because all its rows have a zero replica_id. 
*/ req.origin_id = req.replica_id; - txn_limbo_process(&txn_limbo, &req); + if (txn_limbo_process(&txn_limbo, &req) != 0) + return -1; return 0; } diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 65fbd0cac..925f401e7 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -52,6 +52,7 @@ txn_limbo_create(struct txn_limbo *limbo) limbo->rollback_count = 0; limbo->is_in_rollback = false; limbo->has_initial_promote = false; + limbo->is_filtering = true; } bool @@ -737,6 +738,261 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout) return 0; } +/** + * Fill the reject reason with request data. + * The function is not reenterable, use with caution. + */ +static char * +reject_str(const struct synchro_request *req) +{ + static char prefix[128]; + + snprintf(prefix, sizeof(prefix), "RAFT: rejecting %s (%d) " + "request from origin_id %u replica_id %u term %llu", + iproto_type_name(req->type), req->type, + req->origin_id, req->replica_id, + (long long)req->term); + + return prefix; +} + +/** + * Common chain for any incoming packet. + */ +static int +filter_in(struct txn_limbo *limbo, const struct synchro_request *req) +{ + assert(limbo->is_filtering); + + /* + * Zero LSN are allowed for PROMOTE + * and DEMOTE requests only. + */ + if (req->lsn == 0 && !iproto_type_is_promote_request(req->type)) { + say_error("%s. Zero lsn detected", reject_str(req)); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Zero LSN on promote/demote"); + return -1; + } + + /* + * Zero @a replica_id is allowed for PROMOTE packets only. + */ + if (req->replica_id == REPLICA_ID_NIL && + req->type != IPROTO_RAFT_PROMOTE) { + say_error("%s. Zero replica_id detected", + reject_str(req)); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Zero replica_id"); + return -1; + } + + /* + * Incoming packets should esteem limbo owner, + * if it doesn't match it means the sender + * missed limbo owner migrations and out of date. + */ + if (req->replica_id != limbo->owner_id) { + say_error("%s. Limbo owner mismatch, owner_id %u", + reject_str(req), limbo->owner_id); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Sync queue silent owner migration"); + return -1; + } + + return 0; +} + +/** + * Filter CONFIRM and ROLLBACK packets. + */ +static int +filter_confirm_rollback(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + assert(limbo->is_filtering); + + /* + * When limbo is empty we have nothing to + * confirm/commit and if this request comes + * in it means the split brain has happened. + */ + if (!txn_limbo_is_empty(limbo)) + return 0; + + say_error("%s. Empty limbo detected", reject_str(req)); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Confirm/rollback with empty limbo"); + return -1; +} + +/** + * Filter PROMOTE and DEMOTE packets. + */ +static int +filter_promote_demote(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + assert(limbo->is_filtering); + + /* + * PROMOTE and DEMOTE packets must not have zero + * term supplied, otherwise it is a broken packet. + */ + if (req->term == 0) { + say_error("%s. Zero term detected", reject_str(req)); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Request with zero term"); + return -1; + } + + /* + * If the term is already seen it means it comes + * from a node which didn't notice new elections, + * thus been living in subdomain and its data is + * no longer consistent. + */ + if (limbo->promote_greatest_term > req->term) { + say_error("%s. 
Max term seen is %llu", reject_str(req), + (long long)limbo->promote_greatest_term); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Obsolete terms"); + return -1; + } + + int64_t promote_lsn = req->lsn; + + /* + * Easy case -- processed LSN matches the new + * one which comes inside request, everything + * is consistent. + */ + if (limbo->confirmed_lsn == promote_lsn) + return 0; + + /* + * Explicit split brain situation. Promote + * comes in with an old LSN which we've already + * processed. + */ + if (limbo->confirmed_lsn > promote_lsn) { + say_error("%s. confirmed_lsn %lld > promote_lsn %lld", + reject_str(req), + (long long)limbo->confirmed_lsn, + (long long)promote_lsn); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Backward promote LSN"); + return -1; + } + + /* + * The last case requires a few subcases. + */ + assert(limbo->confirmed_lsn < promote_lsn); + + if (txn_limbo_is_empty(limbo)) { + /* + * Transactions are rolled back already, + * since the limbo is empty. + */ + say_error("%s. confirmed_lsn %lld < promote_lsn %lld " + "and empty limbo", reject_str(req), + (long long)limbo->confirmed_lsn, + (long long)promote_lsn); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Forward promote LSN"); + return -1; + } + + /* + * Some entries are present in the limbo, + * we need to make sure the @a promote_lsn + * lays inside limbo [first; last] range. + * So that the promote request has some + * queued data to process, otherwise it + * means the request comes from split + * brained node. + */ + struct txn_limbo_entry *first, *last; + + first = txn_limbo_first_entry(limbo); + last = txn_limbo_last_entry(limbo); + + if (first->lsn > promote_lsn || last->lsn < promote_lsn) { + say_error("%s. promote_lsn %lld out of " + "range [%lld; %lld]", + reject_str(req), + (long long)promote_lsn, + (long long)first->lsn, + (long long)last->lsn); + diag_set(ClientError, ER_CLUSTER_SPLIT, + "Promote LSN out of queue range"); + return -1; + } + + return 0; +} + +int +txn_limbo_filter_locked(struct txn_limbo *limbo, + const struct synchro_request *req) +{ + assert(latch_is_locked(&limbo->promote_latch)); + +#ifndef NDEBUG + say_info("limbo: filter %s replica_id %u origin_id %u " + "term %lld lsn %lld, queue owner_id %u len %lld " + "promote_greatest_term %lld confirmed_lsn %lld (%s)", + iproto_type_name(req->type), + req->replica_id, req->origin_id, + (long long)req->term, (long long)req->lsn, + limbo->owner_id, (long long)limbo->len, + (long long)limbo->promote_greatest_term, + (long long)limbo->confirmed_lsn, + limbo->is_filtering ? "on" : "off"); +#endif + + /* + * Our filtering engine implies that limbo is + * in "steady" state where variables are initialized, + * thus filtering prevent wrong data to step in. Still + * there are stages such as local recovery and joining + * to another leader node where we fetch an initial state + * of the limbo such as we can't apply the filtering rules + * at this moment. 
+ */ + if (!limbo->is_filtering) + return 0; + + if (filter_in(limbo, req) != 0) + return -1; + + switch (req->type) { + case IPROTO_RAFT_CONFIRM: + case IPROTO_RAFT_ROLLBACK: + if (filter_confirm_rollback(limbo, req) != 0) + return -1; + break; + case IPROTO_RAFT_PROMOTE: + case IPROTO_RAFT_DEMOTE: + if (filter_promote_demote(limbo, req) != 0) + return -1; + break; + default: + say_error("RAFT: rejecting unexpected %d " + "request from instance id %u " + "for term %llu.", + req->type, req->origin_id, + (long long)req->term); + diag_set(ClientError, ER_UNSUPPORTED, + "Replication", + "unexpected request type"); + return -1; + } + + return 0; +} + void txn_limbo_process_locked(struct txn_limbo *limbo, const struct synchro_request *req) @@ -745,71 +1001,42 @@ txn_limbo_process_locked(struct txn_limbo *limbo, uint64_t term = req->term; uint32_t origin = req->origin_id; + if (txn_limbo_replica_term_locked(limbo, origin) < term) { vclock_follow(&limbo->promote_term_map, origin, term); if (term > limbo->promote_greatest_term) limbo->promote_greatest_term = term; - } else if (iproto_type_is_promote_request(req->type) && - limbo->promote_greatest_term > 1) { - /* PROMOTE for outdated term. Ignore. */ - say_info("RAFT: ignoring %s request from instance " - "id %u for term %llu. Greatest term seen " - "before (%llu) is bigger.", - iproto_type_name(req->type), origin, (long long)term, - (long long)limbo->promote_greatest_term); - return; } - int64_t lsn = req->lsn; - if (req->replica_id == REPLICA_ID_NIL) { - /* - * The limbo was empty on the instance issuing the request. - * This means this instance must empty its limbo as well. - */ - assert(lsn == 0 && iproto_type_is_promote_request(req->type)); - } else if (req->replica_id != limbo->owner_id) { - /* - * Ignore CONFIRM/ROLLBACK messages for a foreign master. - * These are most likely outdated messages for already confirmed - * data from an old leader, who has just started and written - * confirm right on synchronous transaction recovery. - */ - if (!iproto_type_is_promote_request(req->type)) - return; - /* - * Promote has a bigger term, and tries to steal the limbo. It - * means it probably was elected with a quorum, and it makes no - * sense to wait here for confirmations. The other nodes already - * elected a new leader. Rollback all the local txns. - */ - lsn = 0; - } switch (req->type) { case IPROTO_RAFT_CONFIRM: - txn_limbo_read_confirm(limbo, lsn); + txn_limbo_read_confirm(limbo, req->lsn); break; case IPROTO_RAFT_ROLLBACK: - txn_limbo_read_rollback(limbo, lsn); + txn_limbo_read_rollback(limbo, req->lsn); break; case IPROTO_RAFT_PROMOTE: - txn_limbo_read_promote(limbo, req->origin_id, lsn); + txn_limbo_read_promote(limbo, req->origin_id, req->lsn); break; case IPROTO_RAFT_DEMOTE: - txn_limbo_read_demote(limbo, lsn); + txn_limbo_read_demote(limbo, req->lsn); break; default: - unreachable(); + panic("limbo: unexpected request type %d", req->type); + break; } - return; } -void +int txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) { txn_limbo_term_lock(limbo); - txn_limbo_process_locked(limbo, req); + int rc = txn_limbo_filter_locked(limbo, req); + if (rc == 0) + txn_limbo_process_locked(limbo, req); txn_limbo_term_unlock(limbo); + return rc; } void diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h index 1ee815d1c..74c77c16b 100644 --- a/src/box/txn_limbo.h +++ b/src/box/txn_limbo.h @@ -190,6 +190,14 @@ struct txn_limbo { * procedure. 
*/ bool has_initial_promote; + /** + * Whether the limbo should filter incoming requests. + * The phases of local recovery from WAL file and on applier's + * join phase we are in complete trust of incoming data because + * this data forms an initial limbo state and should not + * filter out requests. + */ + bool is_filtering; }; /** @@ -339,15 +347,38 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn); int txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry); +/** + * Verify if the request is valid for processing. + */ +int +txn_limbo_filter_locked(struct txn_limbo *limbo, + const struct synchro_request *req); + /** Execute a synchronous replication request. */ void txn_limbo_process_locked(struct txn_limbo *limbo, const struct synchro_request *req); /** Lock limbo terms and execute a synchronous replication request. */ -void +int txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req); +/** Enable filtering of synchro requests. */ +static inline void +txn_limbo_filter_enable(struct txn_limbo *limbo) +{ + limbo->is_filtering = true; + say_info("limbo: filter enabled"); +} + +/** Disable filtering of synchro requests. */ +static inline void +txn_limbo_filter_disable(struct txn_limbo *limbo) +{ + limbo->is_filtering = false; + say_info("limbo: filter disabled"); +} + /** * Waiting for confirmation of all "sync" transactions * during confirm timeout or fail. diff --git a/test/box/error.result b/test/box/error.result index bc804197a..45ea7714c 100644 --- a/test/box/error.result +++ b/test/box/error.result @@ -449,6 +449,7 @@ t; | 228: box.error.SYNC_QUEUE_FOREIGN | 229: box.error.UNABLE_TO_PROCESS_IN_STREAM | 230: box.error.UNABLE_TO_PROCESS_OUT_OF_STREAM + | 231: box.error.CLUSTER_SPLIT | ... test_run:cmd("setopt delimiter ''"); -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
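A note on the filter switch introduced above: per the series, filtering is disabled for local recovery and for the initial join stage because those phases replay the snapshot's PROMOTE and thereby form the initial limbo state. The actual call sites are elsewhere in the series and are not quoted here; a rough, assumed sketch of the pattern is:

    /* Assumed call pattern, not part of the quoted diff. */
    txn_limbo_filter_disable(&txn_limbo);
    /*
     * ... replay local WAL rows or the initial join stream,
     * including the PROMOTE record stored in the snapshot ...
     */
    txn_limbo_filter_enable(&txn_limbo);

Once the limbo reaches its steady state, every later synchro request passes through txn_limbo_filter_locked() before txn_limbo_process_locked() applies it.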
* Re: [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-14 19:41 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 1 reply; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw) To: Cyrill Gorcunov, tml Thanks for the patch! See 8 comments below. On 10.09.2021 17:29, Cyrill Gorcunov wrote: > When we receive synchro requests we can't just apply > them blindly because in worse case they may come from > split-brain configuration (where a cluster split into > several clusters and each one has own leader elected, > then clusters are trying to merge back into the original > one). We need to do our best to detect such disunity > and force these nodes to rejoin from the scratch for > data consistency sake. > > Thus when we're processing requests we pass them to the > packet filter first which validates their contents and > refuse to apply if they are not matched. > > Filter logic depends on request type. > > First there is a common chain for any synchro packet, this > is kind of a general pre-validation: > 1) Zero LSN allowed for PROMOTE | DEMOTE packets, since > CONFIRM | ROLLBACK has to proceed some real data with > LSN already assigned. > 2) request:replica_id = 0 allowed for PROMOTE request only. > 3) request:replica_id should match limbo:owner_id, iow the > limbo migration should be noticed by all instances in the > cluster. > > For CONFIRM and ROLLBACK packets: > 1) Both of them can't be considered if limbo is already empty, > ie there is no data in a local queue and everything is > processed already. The request is obviously from the node which > is out of date. 1. It is not just about empty. They might try to touch a range of transactions out of the LSN range waiting in the limbo. Then their effect is also void. The question remains from the previous review. What is the resolution here? Besides, I don't know how could such requests happen, but I don't how to get the ones in your example either TBH. An theoretical examle? A test? > For PROMOTE and DEMOTE packets: > 1) The requests should come in with nonzero term, otherwise > the packet is corrupted. > 2) The request's term should not be less than maximal known > one, iow it should not come in from nodes which didn't notice > raft epoch changes and living in the past. > 3) If LSN of the request matches current confirmed LSN the packet > is obviously correct to process. > 4) If LSN is less than confirmed LSN then the request is wrong, > we have processed the requested LSN already. > 5) If LSN is less than confirmed LSN then 2. You didn't fix the typo from the previous review. Still two points say "less than confirmed LSN". Please, re-read the comments of the previous review and address them all. As I already told not once, it would be best if you would respond to the comments individually and with diff if possible. Otherwise you will continue missing them. > a) If limbo is empty we can't do anything, since data is already > processed and should issue an error; > b) If there is some data in the limbo then requested LSN should > be in range of limbo's [first; last] LSNs, thus the request > will be able to commit and rollback limbo queue. 
> > Because snapshot have promote packet we disable filtering at moment > of joining to the leader node and similarly due to recovery. The thing > is that our filtration procedure implies that limbo is already > initialized to some valid state otherwise we will have to distinguish > initial states from working ones, this can be done actuially but will > make code more complex. 3. How 'more complex'? And why do you distinguish between 'initial' and 'working' states? All states should work. Initial only means the limbo does not belong to anybody. Currently I only see complexity coming from the filtering being turned on/off. > Thus for now lets leave filtration on and off. Please, find the real reason why is it needed. All states should be working. 'Initial' state is not any different than for example when DEMOTE was called. > diff --git a/changelogs/unreleased/gh-6036-qsync-filter-packets.md b/changelogs/unreleased/gh-6036-qsync-filter-packets.md > new file mode 100644 > index 000000000..0db629e83 > --- /dev/null > +++ b/changelogs/unreleased/gh-6036-qsync-filter-packets.md > @@ -0,0 +1,9 @@ > +## feature/replication 4. It is a bugfix, not feature. > + > +* Implemented incoming synchronous packets filtration to discard > + requests from outdated cluster nodes. This can happen when > + replication cluster is partitioned on a transport level and > + two or more sub-clusters are running simultaneously for some > + time, then they are trying to merge back. Since the subclusters > + had own leaders they should not be able to form original cluster > + because data is not longer consistent (gh-6036).> diff --git a/src/box/box.cc b/src/box/box.cc > index 7b11d56d6..f134dc8bb 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) > .lsn = promote_lsn, > .term = raft->term, > }; > - txn_limbo_process(&txn_limbo, &req); > + if (txn_limbo_process(&txn_limbo, &req) != 0) > + return -1; 5. There was already done txn_limbo_write_promote() above. If you bail now, you have an inconsistent state - in WAL the promote is written, in the limbo it is not applied. What will happen on recovery? It seems you need to lock box_promote(), box_promote_qsync(), and box_demote(). Otherwise you have the exact same problem with local promotions vs coming from the applier as the one you tried to fix for applier vs applier. > assert(txn_limbo_is_empty(&txn_limbo)); > + return 0; > } > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index 65fbd0cac..925f401e7 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -737,6 +738,261 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout) > return 0; > } > > +/** > + * Fill the reject reason with request data. > + * The function is not reenterable, use with caution. > + */ > +static char * > +reject_str(const struct synchro_request *req) > +{ > + static char prefix[128]; 6. Please, don't try to re-invent the static buffer. Just use it. > + > + snprintf(prefix, sizeof(prefix), "RAFT: rejecting %s (%d) " > + "request from origin_id %u replica_id %u term %llu", > + iproto_type_name(req->type), req->type, > + req->origin_id, req->replica_id, > + (long long)req->term); > + > + return prefix; > +} > + > +/** > + * Common chain for any incoming packet. > + */ > +static int > +filter_in(struct txn_limbo *limbo, const struct synchro_request *req) > +{ > + assert(limbo->is_filtering); > + > + /* > + * Zero LSN are allowed for PROMOTE > + * and DEMOTE requests only. 
> + */ > + if (req->lsn == 0 && !iproto_type_is_promote_request(req->type)) { > + say_error("%s. Zero lsn detected", reject_str(req)); > + diag_set(ClientError, ER_CLUSTER_SPLIT, > + "Zero LSN on promote/demote"); > + return -1; > + } > + > + /* > + * Zero @a replica_id is allowed for PROMOTE packets only. 7. Why not for DEMOTE? > + */ > + if (req->replica_id == REPLICA_ID_NIL && > + req->type != IPROTO_RAFT_PROMOTE) { > + say_error("%s. Zero replica_id detected", > + reject_str(req)); > + diag_set(ClientError, ER_CLUSTER_SPLIT, > + "Zero replica_id"); > + return -1; > + } > + > + /* > + * Incoming packets should esteem limbo owner, > + * if it doesn't match it means the sender > + * missed limbo owner migrations and out of date. > + */ > + if (req->replica_id != limbo->owner_id) { > + say_error("%s. Limbo owner mismatch, owner_id %u", > + reject_str(req), limbo->owner_id); > + diag_set(ClientError, ER_CLUSTER_SPLIT, > + "Sync queue silent owner migration"); > + return -1; > + } > + > + return 0; > +} > + > +/** > + * Filter CONFIRM and ROLLBACK packets. > + */ > +static int > +filter_confirm_rollback(struct txn_limbo *limbo, 8. The limbo can be const in all filter functions. ^ permalink raw reply [flat|nested] 21+ messages in thread
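Regarding comment 1 above: a range-based check is one possible shape for filter_confirm_rollback(). The sketch below is only an illustration built from helpers that already appear in the patch (txn_limbo_first_entry(), txn_limbo_last_entry()); whether such out-of-range requests should be rejected or silently ignored is exactly the open question of the review.

    static int
    filter_confirm_rollback(struct txn_limbo *limbo,
                            const struct synchro_request *req)
    {
        /* Nothing is queued, so there is nothing to confirm or roll back. */
        if (txn_limbo_is_empty(limbo)) {
            say_error("%s. Empty limbo detected", reject_str(req));
            diag_set(ClientError, ER_CLUSTER_SPLIT,
                     "Confirm/rollback with empty limbo");
            return -1;
        }
        /*
         * Sketch: a CONFIRM below the first queued LSN or a ROLLBACK
         * above the last queued LSN does not touch any queued entry.
         * Here it is rejected; treating it as a no-op is the
         * alternative raised in the review.
         */
        int64_t first_lsn = txn_limbo_first_entry(limbo)->lsn;
        int64_t last_lsn = txn_limbo_last_entry(limbo)->lsn;
        if ((req->type == IPROTO_RAFT_CONFIRM && req->lsn < first_lsn) ||
            (req->type == IPROTO_RAFT_ROLLBACK && req->lsn > last_lsn)) {
            say_error("%s. LSN is out of queue range", reject_str(req));
            diag_set(ClientError, ER_CLUSTER_SPLIT,
                     "Confirm/rollback LSN out of queue range");
            return -1;
        }
        return 0;
    }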
* Re: [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches @ 2021-09-14 19:41 ` Cyrill Gorcunov via Tarantool-patches 0 siblings, 0 replies; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-14 19:41 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tml On Sun, Sep 12, 2021 at 05:44:15PM +0200, Vladislav Shpilevoy wrote: > > @@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) > > .lsn = promote_lsn, > > .term = raft->term, > > }; > > - txn_limbo_process(&txn_limbo, &req); > > + if (txn_limbo_process(&txn_limbo, &req) != 0) > > + return -1; > > 5. There was already done txn_limbo_write_promote() above. If you > bail now, you have an inconsistent state - in WAL the promote is > written, in the limbo it is not applied. What will happen on recovery? > > It seems you need to lock box_promote(), box_promote_qsync(), > and box_demote(). Otherwise you have the exact same problem with > local promotions vs coming from the applier as the one you tried > to fix for applier vs applier. That's a good point, but as you noted in previous reviews we should try to remove locking from the API (which I did in new, not yet sent patches), so we need some kind of a safe filter which takes the lock, filters the request, and then releases the lock... As for me, our attempt to hide the locking was a mistake in the first place: the locks I proposed simply serialize access to the terms and are unrelated to begin/commit/rollback semantics. For now I'm not sure which approach would fit your architecture ideas. Maybe expose some txn_limbo_filter() helper? ^ permalink raw reply [flat|nested] 21+ messages in thread
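To make the last suggestion concrete, such a helper would presumably be a thin locking wrapper around the filter from patch 5/6 — a sketch only, reusing the lock calls already present in txn_limbo_process() (txn_limbo_term_lock()/txn_limbo_term_unlock()):

    /* Hypothetical exposed helper: take the lock only to validate a request. */
    int
    txn_limbo_filter(struct txn_limbo *limbo, const struct synchro_request *req)
    {
        txn_limbo_term_lock(limbo);
        int rc = txn_limbo_filter_locked(limbo, req);
        txn_limbo_term_unlock(limbo);
        return rc;
    }

box_promote()/box_demote() could then validate a request before txn_limbo_write_promote() reaches the WAL, which addresses the ordering problem from comment 5; the limbo state can still change between this filter call and the later txn_limbo_process(), though, which is why the review suggests locking the whole box_promote()/box_demote() instead.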
* [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches ` (4 preceding siblings ...) 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 2021-09-12 15:43 ` [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Vladislav Shpilevoy via Tarantool-patches 6 siblings, 1 reply; 21+ messages in thread From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw) To: tml; +Cc: Vladislav Shpilevoy Follow-up #6036 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- test/replication/gh-6036-master.lua | 1 + test/replication/gh-6036-node.lua | 33 ++++ test/replication/gh-6036-replica.lua | 1 + .../gh-6036-rollback-confirm.result | 180 ++++++++++++++++++ .../gh-6036-rollback-confirm.test.lua | 92 +++++++++ 5 files changed, 307 insertions(+) create mode 120000 test/replication/gh-6036-master.lua create mode 100644 test/replication/gh-6036-node.lua create mode 120000 test/replication/gh-6036-replica.lua create mode 100644 test/replication/gh-6036-rollback-confirm.result create mode 100644 test/replication/gh-6036-rollback-confirm.test.lua diff --git a/test/replication/gh-6036-master.lua b/test/replication/gh-6036-master.lua new file mode 120000 index 000000000..65baed5de --- /dev/null +++ b/test/replication/gh-6036-master.lua @@ -0,0 +1 @@ +gh-6036-node.lua \ No newline at end of file diff --git a/test/replication/gh-6036-node.lua b/test/replication/gh-6036-node.lua new file mode 100644 index 000000000..ac701b7a2 --- /dev/null +++ b/test/replication/gh-6036-node.lua @@ -0,0 +1,33 @@ +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-(.+)%.lua") + +local function unix_socket(name) + return "unix/:./" .. name .. '.sock'; +end + +require('console').listen(os.getenv('ADMIN')) + +if INSTANCE_ID == "master" then + box.cfg({ + listen = unix_socket("master"), + replication_connect_quorum = 0, + election_mode = 'candidate', + replication_synchro_quorum = 3, + replication_synchro_timeout = 1000, + }) +elseif INSTANCE_ID == "replica" then + box.cfg({ + listen = unix_socket("replica"), + replication = { + unix_socket("master"), + unix_socket("replica") + }, + read_only = true, + election_mode = 'voter', + replication_synchro_quorum = 2, + replication_synchro_timeout = 1000, + }) +end + +box.once("bootstrap", function() + box.schema.user.grant('guest', 'super') +end) diff --git a/test/replication/gh-6036-replica.lua b/test/replication/gh-6036-replica.lua new file mode 120000 index 000000000..65baed5de --- /dev/null +++ b/test/replication/gh-6036-replica.lua @@ -0,0 +1 @@ +gh-6036-node.lua \ No newline at end of file diff --git a/test/replication/gh-6036-rollback-confirm.result b/test/replication/gh-6036-rollback-confirm.result new file mode 100644 index 000000000..e85f6af37 --- /dev/null +++ b/test/replication/gh-6036-rollback-confirm.result @@ -0,0 +1,180 @@ +-- test-run result file version 2 +-- +-- gh-6036: Test for record collision detection. We have a cluster +-- of two nodes: master and replica. The master initiates syncho write +-- but fails to gather a quorum. Before it rolls back the record the +-- network breakage occurs and replica lives with dirty data while +-- master node goes offline. 
The replica becomes a new raft leader +-- and commits the dirty data, same time master node rolls back this +-- record and tries to connect to the new raft leader back. Such +-- connection should be refused because old master node is not longer +-- consistent. +-- +test_run = require('test_run').new() + | --- + | ... + +test_run:cmd('create server master with script="replication/gh-6036-master.lua"') + | --- + | - true + | ... +test_run:cmd('create server replica with script="replication/gh-6036-replica.lua"') + | --- + | - true + | ... + +test_run:cmd('start server master') + | --- + | - true + | ... +test_run:cmd('start server replica') + | --- + | - true + | ... + +-- +-- Connect master to the replica and write a record. Since the quorum +-- value is bigger than number of nodes in a cluster it will be rolled +-- back later. +test_run:switch('master') + | --- + | - true + | ... +box.cfg({ \ + replication = { \ + "unix/:./master.sock", \ + "unix/:./replica.sock", \ + }, \ +}) + | --- + | ... +_ = box.schema.create_space('sync', {is_sync = true}) + | --- + | ... +_ = box.space.sync:create_index('pk') + | --- + | ... + +-- +-- Wait the record to appear on the master. +f = require('fiber').create(function() box.space.sync:replace{1} end) + | --- + | ... +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) + | --- + | - true + | ... +box.space.sync:select{} + | --- + | - - [1] + | ... + +-- +-- Wait the record from master get written and then +-- drop the replication. +test_run:switch('replica') + | --- + | - true + | ... +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) + | --- + | - true + | ... +box.space.sync:select{} + | --- + | - - [1] + | ... +box.cfg{replication = {}} + | --- + | ... + +-- +-- Then we jump back to the master and drop the replication, +-- thus unconfirmed record get rolled back. +test_run:switch('master') + | --- + | - true + | ... +box.cfg({ \ + replication = {}, \ + replication_synchro_timeout = 0.001, \ + election_mode = 'manual', \ +}) + | --- + | ... +while f:status() ~= 'dead' do require('fiber').sleep(0.1) end + | --- + | ... +test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100) + | --- + | - true + | ... + +-- +-- Force the replica to become a RAFT leader and +-- commit this new record. +test_run:switch('replica') + | --- + | - true + | ... +box.cfg({ \ + replication_synchro_quorum = 1, \ + election_mode = 'manual' \ +}) + | --- + | ... +box.ctl.promote() + | --- + | ... +box.space.sync:select{} + | --- + | - - [1] + | ... + +-- +-- Connect master back to the replica, it should +-- be refused. +test_run:switch('master') + | --- + | - true + | ... +box.cfg({ \ + replication = { \ + "unix/:./replica.sock", \ + }, \ +}) + | --- + | ... +box.space.sync:select{} + | --- + | - [] + | ... +test_run:wait_cond(function() return \ + test_run:grep_log('master', \ + 'rejecting PROMOTE') ~= nil end, 100) \ +test_run:wait_cond(function() return \ + test_run:grep_log('master', \ + 'ER_CLUSTER_SPLIT') ~= nil end, 100) + | --- + | ... + +test_run:switch('default') + | --- + | - true + | ... +test_run:cmd('stop server master') + | --- + | - true + | ... +test_run:cmd('delete server master') + | --- + | - true + | ... +test_run:cmd('stop server replica') + | --- + | - true + | ... +test_run:cmd('delete server replica') + | --- + | - true + | ... 
diff --git a/test/replication/gh-6036-rollback-confirm.test.lua b/test/replication/gh-6036-rollback-confirm.test.lua new file mode 100644 index 000000000..6eca23d8b --- /dev/null +++ b/test/replication/gh-6036-rollback-confirm.test.lua @@ -0,0 +1,92 @@ +-- +-- gh-6036: Test for record collision detection. We have a cluster +-- of two nodes: master and replica. The master initiates syncho write +-- but fails to gather a quorum. Before it rolls back the record the +-- network breakage occurs and replica lives with dirty data while +-- master node goes offline. The replica becomes a new raft leader +-- and commits the dirty data, same time master node rolls back this +-- record and tries to connect to the new raft leader back. Such +-- connection should be refused because old master node is not longer +-- consistent. +-- +test_run = require('test_run').new() + +test_run:cmd('create server master with script="replication/gh-6036-master.lua"') +test_run:cmd('create server replica with script="replication/gh-6036-replica.lua"') + +test_run:cmd('start server master') +test_run:cmd('start server replica') + +-- +-- Connect master to the replica and write a record. Since the quorum +-- value is bigger than number of nodes in a cluster it will be rolled +-- back later. +test_run:switch('master') +box.cfg({ \ + replication = { \ + "unix/:./master.sock", \ + "unix/:./replica.sock", \ + }, \ +}) +_ = box.schema.create_space('sync', {is_sync = true}) +_ = box.space.sync:create_index('pk') + +-- +-- Wait the record to appear on the master. +f = require('fiber').create(function() box.space.sync:replace{1} end) +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) +box.space.sync:select{} + +-- +-- Wait the record from master get written and then +-- drop the replication. +test_run:switch('replica') +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) +box.space.sync:select{} +box.cfg{replication = {}} + +-- +-- Then we jump back to the master and drop the replication, +-- thus unconfirmed record get rolled back. +test_run:switch('master') +box.cfg({ \ + replication = {}, \ + replication_synchro_timeout = 0.001, \ + election_mode = 'manual', \ +}) +while f:status() ~= 'dead' do require('fiber').sleep(0.1) end +test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100) + +-- +-- Force the replica to become a RAFT leader and +-- commit this new record. +test_run:switch('replica') +box.cfg({ \ + replication_synchro_quorum = 1, \ + election_mode = 'manual' \ +}) +box.ctl.promote() +box.space.sync:select{} + +-- +-- Connect master back to the replica, it should +-- be refused. +test_run:switch('master') +box.cfg({ \ + replication = { \ + "unix/:./replica.sock", \ + }, \ +}) +box.space.sync:select{} +test_run:wait_cond(function() return \ + test_run:grep_log('master', \ + 'rejecting PROMOTE') ~= nil end, 100) \ +test_run:wait_cond(function() return \ + test_run:grep_log('master', \ + 'ER_CLUSTER_SPLIT') ~= nil end, 100) + +test_run:switch('default') +test_run:cmd('stop server master') +test_run:cmd('delete server master') +test_run:cmd('stop server replica') +test_run:cmd('delete server replica') -- 2.31.1 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:44 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw) To: Cyrill Gorcunov, tml Thanks for the patch! See 3 comments below. > diff --git a/test/replication/gh-6036-rollback-confirm.result b/test/replication/gh-6036-rollback-confirm.result > new file mode 100644 > index 000000000..e85f6af37 > --- /dev/null > +++ b/test/replication/gh-6036-rollback-confirm.result <...> > +-- Connect master to the replica and write a record. Since the quorum > +-- value is bigger than number of nodes in a cluster it will be rolled > +-- back later. > +test_run:switch('master') > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./master.sock", \ > + "unix/:./replica.sock", \ > + }, \ > +}) > + | --- > + | ... > +_ = box.schema.create_space('sync', {is_sync = true}) > + | --- > + | ... > +_ = box.space.sync:create_index('pk') > + | --- > + | ... > + > +-- > +-- Wait the record to appear on the master. > +f = require('fiber').create(function() box.space.sync:replace{1} end) > + | --- > + | ... > +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) 1. Why do you need a custom wait_cond timeout? > + | --- > + | - true > + | ... > +box.space.sync:select{} > + | --- > + | - - [1] > + | ... > + > +-- > +-- Wait the record from master get written and then > +-- drop the replication. > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100) > + | --- > + | - true > + | ... > +box.space.sync:select{} 2. You don't need the waiting on the master if you wait for the same on the replica. It couldn't get there before master itself. > + | --- > + | - - [1] > + | ... > +box.cfg{replication = {}} > + | --- > + | ... > + > +-- > +-- Then we jump back to the master and drop the replication, > +-- thus unconfirmed record get rolled back. > +test_run:switch('master') > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = {}, \ > + replication_synchro_timeout = 0.001, \ > + election_mode = 'manual', \ > +}) > + | --- > + | ... > +while f:status() ~= 'dead' do require('fiber').sleep(0.1) end > + | --- > + | ... > +test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100) > + | --- > + | - true > + | ... > + > +-- > +-- Force the replica to become a RAFT leader and > +-- commit this new record. > +test_run:switch('replica') > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication_synchro_quorum = 1, \ > + election_mode = 'manual' \ > +}) > + | --- > + | ... > +box.ctl.promote() > + | --- > + | ... > +box.space.sync:select{} > + | --- > + | - - [1] > + | ... > + > +-- > +-- Connect master back to the replica, it should > +-- be refused. > +test_run:switch('master') > + | --- > + | - true > + | ... > +box.cfg({ \ > + replication = { \ > + "unix/:./replica.sock", \ > + }, \ > +}) > + | --- > + | ... > +box.space.sync:select{} > + | --- > + | - [] > + | ... > +test_run:wait_cond(function() return \ > + test_run:grep_log('master', \ > + 'rejecting PROMOTE') ~= nil end, 100) \ > +test_run:wait_cond(function() return \ > + test_run:grep_log('master', \ > + 'ER_CLUSTER_SPLIT') ~= nil end, 100) 3. 
Why do you need these 2 conds with \? The only reason for using \ between multiple statements is to prevent yields. Why can't you have yields between the two wait_cond() calls? Also could you make one cond with 'grep_log() and grep_log()'? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering 2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches ` (5 preceding siblings ...) 2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 15:43 ` Vladislav Shpilevoy via Tarantool-patches 6 siblings, 0 replies; 21+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:43 UTC (permalink / raw) To: Cyrill Gorcunov, tml Hi! Thanks for the fixes! On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote: > Guys, take a look please, once time permit. The questionable moments: > > - use filter disabling procedure for join/recovery: we make it so since > snapshot has promote record which fills initial limbo state I still don't understand why you need the disabling. Before the snapshot is recovered, the limbo is empty and not owned by anybody. It is fully valid and working, as if DEMOTE had been called. Correct? The snapshot's promote should make it belong to the owner persisted in that promote. Also correct? The next rows just replay already applied data. Also correct? It managed to apply the first time and must manage to do so again. Agree? Which of these statements contains the mistake that makes you disable the filtering? > - need more tests to cover all possible scenarios > > - I keep filter_confirm_rollback() as is but rereading Vlad's comment > > > > 9. What if rollback is for LSN > limbo's last LSN? It > > also means nothing to do. The same for confirm LSN < limbo's > > first LSN. > > > I presume I need to traverse limbo and test if incoming LSN is > present inside current queue. It should be enough to know the LSN range in there AFAIU. Additionally, I tried the test from the ticket again. It still does not behave as expected. As a reminder, on the last review I also tried: On top of the branch I tried the test I pasted in the ticket's description. I see the connection now breaks in one direction. But the new leader still follows the old leader for some reason. And you said: I'll take a more precise look, thanks! https://lists.tarantool.org/tarantool-patches/YRBUww6p1dUNL0mx@grain/ So what is the news on that? The new leader should not follow the old one. If anything, even the vice-versa situation would be fine I suppose - the old one following the new one. But the current way does not look valid. The old leader could send all kinds of irrelevant garbage and the new leader would happily swallow it. The same happens in this test (on top of the last commit of this patchset): https://github.com/tarantool/tarantool/issues/5295#issuecomment-912106680 The new leader still replicates from the old broken one. ^ permalink raw reply [flat|nested] 21+ messages in thread
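To make the disagreement concrete: the reviewer's position implies the filter could recognize an un-owned limbo by itself instead of being switched off during recovery and join. A purely illustrative sketch of such a branch in filter_in() (an assumption for discussion, not code from the series):

    /*
     * Hypothetical: a limbo owned by nobody (fresh instance, or after
     * DEMOTE) accepts PROMOTE/DEMOTE and rejects only CONFIRM/ROLLBACK,
     * so recovery and join would not need txn_limbo_filter_disable().
     */
    if (limbo->owner_id == REPLICA_ID_NIL) {
        if (iproto_type_is_promote_request(req->type))
            return 0;
        say_error("%s. Un-owned limbo", reject_str(req));
        diag_set(ClientError, ER_CLUSTER_SPLIT,
                 "Confirm/rollback for un-owned sync queue");
        return -1;
    }

Whether such a branch is sufficient to replay snapshot and WAL data without the on/off switch is the open question of this sub-thread.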