[Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"

Tue Apr 20 20:37:27 MSK 2021

20.04.2021 12:25, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 01:34, Vladislav Shpilevoy wrote:
>> Hi! Thanks for working on this!
>>
>> It seems starting from this commit the election stress test
>> hangs on my machine in 100% of cases. I didn't have time to
>> investigate why yet.
> Yes, you're correct. I also see this. It's not 100% of cases though.
>
> On my machine the test doesn't hang at all (at least the first 20 runs)
> until commit "txn_limbo: filter rows based on known peer terms"
>
> Starting with commit "txn_limbo: filter rows based on known peer terms"
> one or two of the 20 runs hang and get restarted.
>
> I need some time to investigate this. Will return once I have
> some results.
>

Ok, seems like the case is closed.

So, here are a few facts that led to the test hang:

1) The instance may still write CONFIRM for its own transactions after
    restart. It may do so even before receiving a CONFIRM from some
    remote instance that took ownership of the limbo later.
    This fact alone would be ok, but:
    a) the instance doesn't count its own WAL write as the first ack
       after restart, so if quorum is M it waits for M+1 acks from
       remote instances before writing CONFIRM
    b) the instance writes CONFIRM unconditionally, even before getting
       in sync with other replicas, which could have already written
       CONFIRM for its rows (this may be fine).
    There's an issue related to the cause, but it needs some reformulation:
    https://github.com/tarantool/tarantool/issues/5856

2) Any failure in txn_commit_try_async is treated as a WAL write error
    by mistake, and the actual reason for rollback is lost. I've opened
    a ticket for this:
    https://github.com/tarantool/tarantool/issues/6027
    ER_WAL_IO is unrecoverable and breaks the connection between master
    and replica.
    (We might make it recoverable as well? Why not retry the WAL write
     after some time? It may work out this time.)

3) NOPs are added to txn_limbo when it isn't empty.
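
To illustrate fact 2), here is a toy model in Python (not Tarantool
code; every name in it is made up for the illustration). It only shows
the shape of the problem: a rollback handler that reports every failure
as ER_WAL_IO loses the real reason, while one that forwards the original
error code keeps it.

```python
# Toy model only. None of these names exist in Tarantool.

class TxnError(Exception):
    """Carries a symbolic error code, e.g. "ER_WAL_IO"."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def commit_try_async(fail_with=None):
    """Stand-in for a commit attempt that may fail for any reason."""
    if fail_with is not None:
        raise TxnError(fail_with)


def apply_lossy(fail_with=None):
    """Report every failure as a WAL write error (the behaviour described)."""
    try:
        commit_try_async(fail_with)
    except TxnError:
        return "ER_WAL_IO"  # the original reason is dropped here
    return None


def apply_preserving(fail_with=None):
    """Report the failure with its original reason kept."""
    try:
        commit_try_async(fail_with)
    except TxnError as e:
        return e.code
    return None
```

With the lossy handler a caller cannot tell a limbo rejection from a real
disk failure, so it has to treat both as unrecoverable.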

And here's what happened when the test hung:

1) Some instance used to be the leader and got restarted before
    writing CONFIRM for its own transactions

2) Once the instance got restarted, its relays were faster than
    its appliers, meaning it first gathered 2 acks for the old
    transaction and wrote CONFIRM right away, and only later
    received CONFIRM from a remote instance

3) This instance was elected the leader once again. Once this
    happened, the other 2 instances started accepting rows from
    this instance

4) The first row remote instances got was this CONFIRM which the
    instance wrote after restart

5) The instance was considered outdated, because while it was an
    elected leader, it hadn't yet sent PROMOTE to the other
    instances (PROMOTE comes right after that notorious CONFIRM)

6) Like any row from an outdated instance, CONFIRM was replaced
    with a NOP

7) The other instances tried to insert that NOP into their limbos,
    which weren't empty due to the nature of the test (and would
    have been emptied by PROMOTE). Insertion failed with
    ER_UNCOMMITTED_FOREIGN_SYNC_TXNS

8) ER_UNCOMMITTED_FOREIGN_SYNC_TXNS was replaced with ER_WAL_IO by
    the applier's on_rollback trigger. This is an unrecoverable error,
    so both remote instances' appliers broke the connection to
    the leader.

9) Now there's an infinite loop of elections. This node never
    votes for any of the remote nodes, because they are behind it.
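
Steps 5) and 6) can be modelled roughly like this. It's a simplified
Python sketch, not the actual filter from "txn_limbo: filter rows based
on known peer terms": I assume here that a receiver remembers the term
of the last PROMOTE it saw from each peer and treats a sender as
outdated when that term is lower than the greatest one it knows. The
names (Row, filter_row, promote_terms) are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    replica_id: int  # instance that wrote the row
    type: str        # e.g. "CONFIRM", "PROMOTE", "NOP"


def filter_row(row, promote_terms):
    """Return the row unchanged, or a NOP if its sender looks outdated.

    promote_terms maps replica_id to the term of the last PROMOTE
    received from that instance (0 if none was seen). Assumption of
    this sketch, not necessarily how the patch tracks terms.
    """
    greatest_term = max(promote_terms.values(), default=0)
    sender_term = promote_terms.get(row.replica_id, 0)
    if sender_term < greatest_term:
        return Row(row.replica_id, "NOP")
    return row


# Instance 1 was the leader in term 1, instance 2 took over in term 2.
# Instance 1 is then re-elected, but its PROMOTE has not arrived yet.
seen = {1: 1, 2: 2}
confirm = Row(replica_id=1, type="CONFIRM")
filtered = filter_row(confirm, seen)
```

In this model the CONFIRM from the re-elected leader turns into a NOP
only because it arrives ahead of the PROMOTE that would have updated the
sender's known term.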

To fix this, I've allowed transactions that consist solely of NOPs
to pass through the limbo without waiting, even when it's
non-empty.
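
The idea of the fix, as a toy Python model (again, not the real
txn_limbo code; the Limbo class, its fields and the flag are made up to
show the before/after behaviour):

```python
class Limbo:
    """Toy limbo: a queue of pending transactions with a single owner."""

    def __init__(self, owner_id, nop_passthrough):
        self.owner_id = owner_id
        self.queue = []
        # Hypothetical switch between the old and the fixed behaviour.
        self.nop_passthrough = nop_passthrough

    def submit(self, origin_id, row_types):
        """Return "applied" or "queued"; raise if the txn is rejected."""
        only_nops = bool(row_types) and all(t == "NOP" for t in row_types)
        if self.nop_passthrough and only_nops:
            # Fixed behaviour: a NOP-only txn bypasses the limbo.
            return "applied"
        if self.queue and self.owner_id != origin_id:
            raise RuntimeError("ER_UNCOMMITTED_FOREIGN_SYNC_TXNS")
        self.owner_id = origin_id
        self.queue.append(list(row_types))
        return "queued"


def replay(nop_passthrough):
    """A NOP from instance 1 arrives while the limbo holds txns of 2."""
    limbo = Limbo(owner_id=2, nop_passthrough=nop_passthrough)
    limbo.submit(2, ["INSERT"])  # pending sync txn of the limbo owner
    try:
        return limbo.submit(1, ["NOP"])
    except RuntimeError as e:
        return str(e)
```

Here replay(False) reproduces the rejection from step 7), and
replay(True) lets the NOP through without touching the queue.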

The test's now rock-solid on my machine. 0 failures in 100 runs.
(with 1 worker, to be honest, but that's still better than a couple of
failures in 20 runs with 1 worker).

I've sent the new patch as [PATCH v4 14/12] in reply to this series.
Please take a look.

-- 
Serge Petrenko