[Tarantool-patches] [PATCH v9 4/5] limbo: filter incoming synchro requests

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Tue Aug 3 02:50:49 MSK 2021

Thanks for the patch!

On 30.07.2021 13:35, Cyrill Gorcunov wrote:
> When we receive synchro requests we can't just apply
> them blindly because in worse case they may come from
> split-brain configuration (where a cluster splitted into

splitted -> split.

> several subclusters and each one has own leader elected,
> then subclisters are trying to merge back into original

subclisters -> subclusters.

> cluster). We need to do our best to detect such configs
> and force these nodes to rejoin from the scratch for
> data consistency sake.
> Thus when we're processing requests we pass them to the
> packet filter first which validates their contents and
> refuse to apply if they are not matched.
> Depending on request type each packet traverse an
> appropriate chain(s)
>  - Common chain for any synchro packet. We verify
>    that if replica_id is nil then it shall be
>    PROMOTE request with lsn 0 to migrate limbo owner

How can it be 0 for non PROMOTE/DEMOTE requests?
Do we ever encode such rows at all? Why isn't this

>  - Both confirm and rollback requests shall not come
>    with empty limbo since it measn the synchro queue

measn -> means.

>    is already processed and the peer didn't notice
>    that

Is it the only issue? What about a ROLLBACK coming to
an instance which has already made a PROMOTE on the rolled
back data? That is part of the original problem in the ticket.

>  - Promote request should come in with new terms only,
>    otherwise it means the peer didn't notice election
>  - If limbo's confirmed_lsn is equal to promote LSN then
>    it is a valid request to process
>  - If limbo's confirmed_lsn is bigger than requested then
>    it is valid in one case only -- limbo migration so the
>    queue shall be empty

I don't understand. How is it valid? PROMOTE(lsn) rolls
back everything > lsn. If the local confirmed_lsn > lsn, it
means that data can't be rolled back now and the data becomes

>  - If limbo's confirmed_lsn is less than promote LSN then
>    - If queue is empty then it means the transactions are
>      already rolled back and request is invalid
>    - If queue is not empty then its first entry might be
>      greater than promote LSN and it means that old data
>      either committed or rolled back already and request
>      is invalid

If the first entry's LSN in the limbo > promote LSN, it
means it wasn't committed yet. The promote will roll it back
and it is fine. This will make the data consistent.
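
To illustrate what I mean, here is a minimal sketch of how a PROMOTE
treats pending entries. It is not code from the patch: `apply_promote`
and the plain deque of LSNs are hypothetical stand-ins for the limbo
queue, and the "confirm everything <= lsn" half is my assumption about
the intended semantics. Only the rollback of everything > lsn is what
I state above.

```python
from collections import deque


def apply_promote(queue, promote_lsn):
    """Drain a queue of pending sync txn LSNs under PROMOTE(promote_lsn).

    Hypothetical model: entries with LSN <= promote_lsn are treated as
    confirmed (an assumption), entries with LSN > promote_lsn are rolled
    back (as described in the review text).
    """
    confirmed, rolled_back = [], []
    while queue:
        lsn = queue.popleft()
        if lsn <= promote_lsn:
            confirmed.append(lsn)
        else:
            rolled_back.append(lsn)
    return confirmed, rolled_back


# The first pending entry (11) is above the promote LSN (10), so none
# of the pending entries were committed and all of them are rolled back.
pending = deque([11, 12, 13])
confirmed, rolled_back = apply_promote(pending, 10)
```

In this model a first entry above the promote LSN is not an error by
itself: the queue is simply emptied by rollback.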

The problem appears if some other sync txns were rolled
back or even committed with quorum=1 before this hanging
txn. And I don't remember us figuring out a way to
distinguish between these situations. Did we?
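
For reference, the PROMOTE LSN checks as the commit message words them
could be written roughly as below. This is only a sketch of the quoted
description, including the rules I question above, and not the patch's
code: `filter_promote`, `Verdict` and the `queue_first_lsn` argument are
hypothetical names, and accepting the request when the first queued
entry is not above the promote LSN is my reading of the implicit
remaining case.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    REJECT = "reject"


def filter_promote(confirmed_lsn, promote_lsn, queue_first_lsn=None):
    """Check a PROMOTE request against the local limbo state.

    queue_first_lsn is the LSN of the first queued entry, or None when
    the queue is empty. The rules follow the quoted commit message.
    """
    queue_is_empty = queue_first_lsn is None

    if confirmed_lsn == promote_lsn:
        # Equal LSNs: described as a valid request.
        return Verdict.OK

    if confirmed_lsn > promote_lsn:
        # Described as valid only for limbo migration, i.e. empty queue.
        return Verdict.OK if queue_is_empty else Verdict.REJECT

    # From here on confirmed_lsn < promote_lsn.
    if queue_is_empty:
        # Described as: transactions are already rolled back.
        return Verdict.REJECT
    if queue_first_lsn > promote_lsn:
        # Described as: old data already committed or rolled back.
        return Verdict.REJECT
    # Remaining case is not spelled out in the message; assumed valid.
    return Verdict.OK
```

Note that the `queue_first_lsn > promote_lsn` branch is exactly the one
I argue should not be rejected on this condition alone.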

I didn't get to the code yet. Will do later.

>  - NOP, reserved for future use
> Closes #6036
