[Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
Serge Petrenko
sergepetrenko at tarantool.org
Tue Apr 20 20:37:27 MSK 2021
20.04.2021 12:25, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 01:34, Vladislav Shpilevoy wrote:
>> Hi! Thanks for working on this!
>>
>> It seems starting from this commit the election stress test
>> hangs on my machine in 100% cases. I didn't have time to
>> investigate why yet.
> Yes, you're correct. I also see this. It's not 100% of cases though.
>
> On my machine the test doesn't hang at all (at least the first 20 runs)
> until commit "txn_limbo: filter rows based on known peer terms"
>
> Starting with commit "txn_limbo: filter rows based on known peer terms"
> one or two of the 20 runs hang and get restarted.
>
> I need some time to investigate this. Will return once I have some
> results.
>
Ok, seems like the case is closed.
So, here's a couple of facts that lead to the test hang:
1) The instance may still write CONFIRM for its own transactions after
   restart. It may do so even before receiving a CONFIRM from some remote
   instance which took ownership of the limbo later.
   This fact alone would be ok, but:
   a) the instance doesn't count its own WAL write as the first ack
      after restart, so if quorum is M it waits for M+1 acks from remote
      instances before writing CONFIRM
   b) the instance writes CONFIRM unconditionally, even before getting
      in sync with other replicas, which could have already written
      CONFIRM for its rows.
   (this may be fine).
There's an issue related to the cause, but it needs some reformulation:
https://github.com/tarantool/tarantool/issues/5856
2) Any failure in txn_commit_try_async is mistakenly treated as a WAL
   write error, and the actual reason for rollback is lost. I've opened
   a ticket for this:
   https://github.com/tarantool/tarantool/issues/6027
   ER_WAL_IO is unrecoverable and breaks the connection between master
   and replica.
   (We might make it recoverable as well? Why not retry the WAL write
   after some time? It may work out this time).
3) NOPs are added to txn_limbo when it isn't empty.
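To illustrate fact 2, here is a minimal Python sketch of how the original
rollback reason gets lost. It is not Tarantool code: commit_try_async and
reported_error are hypothetical stand-ins, and only the error names come
from this thread.

```python
class TxnError(Exception):
    """Toy error carrying a symbolic error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def commit_try_async(fail_with=None):
    # Hypothetical stand-in for txn_commit_try_async: it may fail for
    # any reason, not only because of a WAL write problem.
    if fail_with is not None:
        raise TxnError(fail_with)


def reported_error(fail_with, mask):
    """Return the error code the caller ends up seeing, or None."""
    try:
        commit_try_async(fail_with)
    except TxnError as e:
        # mask=True models the behaviour described above: every failure
        # is reported as ER_WAL_IO and the real reason is dropped.
        return "ER_WAL_IO" if mask else e.code
    return None
```

With mask=True a limbo rejection becomes indistinguishable from a real
WAL failure, which is why the connection gets torn down.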
And here's what happened when the test hung:
1) Some instance used to be the leader and got restarted before
writing CONFIRM for its own transactions
2) Once the instance got restarted, its relays were faster than
its appliers, meaning it first gathered 2 acks for the old
transaction, and wrote CONFIRM right away, and received CONFIRM
from a remote instance later
3) This instance was elected the leader once again. Once this
happened, the other 2 instances started accepting rows from this
instance
4) The first row remote instances got was this CONFIRM which the
instance wrote after restart
5) The instance was considered outdated, because while it was an
elected leader, it hadn't yet sent PROMOTE to the other
instances (PROMOTE comes right after that notorious CONFIRM)
6) Like any row from an outdated instance, CONFIRM was replaced
with a NOP
7) The other instances tried to insert that NOP into their limbos, which
weren't empty due to the nature of the test (and would have been
emptied by PROMOTE). The insertion failed with
ER_UNCOMMITTED_FOREIGN_SYNC_TXNS
8) ER_UNCOMMITTED_FOREIGN_SYNC_TXNS was replaced with ER_WAL_IO by the
applier's on_rollback trigger. This is an unrecoverable error,
so both remote instances' appliers broke the connection to
the leader.
9) Now there's an infinite loop of elections. This node never
votes for any of the remote nodes, because they are behind it.
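Steps 5-7 can be sketched as a toy model in Python. This is only an
illustration under simplifying assumptions, not the real applier or limbo
code: the Replica class, the known_terms map and the "outdated" check are
hypothetical, and control rows such as CONFIRM are assumed to be handled
without being queued in the limbo.

```python
from dataclasses import dataclass

ER_FOREIGN = "ER_UNCOMMITTED_FOREIGN_SYNC_TXNS"


@dataclass
class Row:
    origin: int  # id of the instance that produced the row
    kind: str    # "CONFIRM", "PROMOTE", "NOP", ...


class Replica:
    def __init__(self, limbo_owner, pending, known_terms):
        self.limbo_owner = limbo_owner        # who owns the queued txns
        self.pending = list(pending)          # non-empty limbo contents
        self.known_terms = dict(known_terms)  # origin id -> last known term

    def is_outdated(self, origin):
        # Assumption: a peer is outdated if its known term is behind
        # the greatest term this replica has seen.
        newest = max(self.known_terms.values(), default=0)
        return self.known_terms.get(origin, 0) < newest

    def filter(self, row):
        # Step 6: a row from an outdated peer is replaced with a NOP.
        if self.is_outdated(row.origin):
            return Row(row.origin, "NOP")
        return row

    def apply(self, row):
        row = self.filter(row)
        if row.kind in ("CONFIRM", "PROMOTE"):
            return None  # control rows are not queued in this model
        # Step 7: a foreign row cannot enter a non-empty limbo.
        if self.pending and row.origin != self.limbo_owner:
            return ER_FOREIGN
        self.pending.append(row)
        return None
```

Here a replica whose limbo is owned by instance 2 receives CONFIRM from
instance 1, whose PROMOTE has not arrived yet, so the row is turned into a
NOP and then rejected.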
To fix this, I've allowed transactions consisting solely of NOPs to
pass through the limbo without waiting, even when it's non-empty.
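A rough Python sketch of the idea behind the fix, with the old and the new
behaviour selectable by a flag. The names limbo_submit, is_nop_only and
nop_bypass are made up for the illustration and don't correspond to the
actual patch.

```python
ER_FOREIGN = "ER_UNCOMMITTED_FOREIGN_SYNC_TXNS"


def is_nop_only(txn_rows):
    """True if the transaction is non-empty and consists solely of NOPs."""
    return len(txn_rows) > 0 and all(kind == "NOP" for kind in txn_rows)


def limbo_submit(pending, limbo_owner, origin, txn_rows, nop_bypass):
    """Return None on success or an error code on rejection."""
    if nop_bypass and is_nop_only(txn_rows):
        # New behaviour: a NOP-only transaction passes through without
        # being queued and without waiting, even if the limbo is non-empty.
        return None
    if pending and origin != limbo_owner:
        # Old behaviour for everything else: foreign rows are rejected
        # while the limbo still holds someone else's transactions.
        return ER_FOREIGN
    pending.append((origin, tuple(txn_rows)))
    return None
```

Transactions that mix NOPs with real rows still go through the usual
check in this sketch; only the NOP-only case is let through.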
The test's now rock-solid on my machine. 0 failures in 100 runs.
(with 1 worker, to be honest, but that's still better than a couple of
failures in 20 runs with 1 worker).
I've sent the new patch as [PATCH v4 14/12] in reply to this series.
Please, take a look.
--
Serge Petrenko