Tarantool development patches archive
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
Date: Tue, 20 Apr 2021 20:37:27 +0300
Message-ID: <99a07dad-e38a-623e-303a-ecf412582be5@tarantool.org>
In-Reply-To: <858d30b2-f988-9fd0-ee75-3281721e54b1@tarantool.org>



20.04.2021 12:25, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 01:34, Vladislav Shpilevoy wrote:
>> Hi! Thanks for working on this!
>>
>> It seems starting from this commit the election stress test
>> hangs on my machine in 100% cases. I didn't have time to
>> investigate why yet.
> Yes, you're correct. I also see this. It's not 100% cases though.
>
> On my machine the test doesn't hang at all (at least the first 20 runs)
> until commit "txn_limbo: filter rows based on known peer terms"
>
> Starting with commit "txn_limbo: filter rows based on known peer terms"
> one or two of the 20 runs hang and get restarted.
>
> I need some time to investigate this. Will return once I have some 
> results.
>

Ok, seems like the case is closed.

So, here are the facts that lead to the test hang:

1) The instance may still write CONFIRM for its own transactions after
    restart. It may do so even before receiving a CONFIRM from some
    remote instance which took ownership of the limbo later.
    This fact alone would be ok, but:
    a) the instance doesn't count its own WAL write as the first ack
       after restart, so if quorum is M it waits for M+1 acks from
       remote instances before writing CONFIRM
    b) the instance writes CONFIRM unconditionally even before getting
       in sync with other replicas, which could have already written
       CONFIRM for its rows (this may be fine).
    There's an issue related to the cause, but it needs some reformulation:
    https://github.com/tarantool/tarantool/issues/5856

2) Any failure in txn_commit_try_async is treated as a WAL write error
    by mistake, and the actual reason for rollback is lost. I've opened
    a ticket for this:
    https://github.com/tarantool/tarantool/issues/6027
    ER_WAL_IO is unrecoverable and breaks the connection between master
    and replica.
    (We might make it recoverable as well? Why not retry the WAL write
    after some time? It may work out this time.)
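
To illustrate fact 2, here is a minimal Python sketch (not Tarantool code). The names TxnError, on_rollback_masking, on_rollback_preserving and breaks_replication are hypothetical and exist only for this example. The only property taken from the text above is that ER_WAL_IO is unrecoverable. Treating every other code as non-fatal is an assumption made for illustration.

```python
class TxnError(Exception):
    """Hypothetical error object carrying a Tarantool-style error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def on_rollback_masking(cause):
    # Models the behaviour described in fact 2: whatever the real cause,
    # the caller only ever sees a WAL write error.
    return TxnError("ER_WAL_IO")


def on_rollback_preserving(cause):
    # Alternative: hand the original reason back to the caller.
    return cause


# ER_WAL_IO being unrecoverable comes from the email. Treating all other
# codes as non-fatal is an assumption of this sketch.
UNRECOVERABLE = {"ER_WAL_IO"}


def breaks_replication(err):
    return err.code in UNRECOVERABLE


if __name__ == "__main__":
    cause = TxnError("ER_UNCOMMITTED_FOREIGN_SYNC_TXNS")
    print(on_rollback_masking(cause).code)
    print(on_rollback_preserving(cause).code)
```

With masking, the receiver cannot tell a limbo conflict from a disk failure, so it has to assume the worst.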

3) NOPs are added to txn_limbo when it isn't empty.


And here's what happened when the test hung:

1) Some instance used to be the leader and got restarted before
    writing CONFIRM for its own transactions

2) Once the instance got restarted, its relays were faster than
    its appliers, meaning it first gathered 2 acks for the old
    transaction and wrote CONFIRM right away, and only later
    received CONFIRM from a remote instance

3) This instance was elected the leader once again. Once this
    happened, the other 2 instances started accepting rows from
    this instance

4) The first row the remote instances got was this CONFIRM, which
    the instance wrote after restart

5) The instance was considered outdated, because while it was an
    elected leader, it hadn't yet sent PROMOTE to the other
    instances (PROMOTE comes right after that notorious CONFIRM)

6) Like any row from an outdated instance, CONFIRM was replaced
    with a NOP

7) The other instances try to insert that NOP into their limbos,
    which aren't empty due to the nature of the test (and would get
    emptied by PROMOTE). Insertion fails with
    ER_UNCOMMITTED_FOREIGN_SYNC_TXNS

8) ER_UNCOMMITTED_FOREIGN_SYNC_TXNS is replaced with ER_WAL_IO by
    the applier's on_rollback trigger. This is an unrecoverable error,
    so the appliers of both remote instances break the connection to
    the leader.

9) Now there's an infinite loop of elections. This node never
    votes for any of the remote nodes, because they are behind it.
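
Steps 5-7 above can be replayed with a small Python model. This is a simplified sketch, not the actual txn_limbo code. Row, Limbo, filter_row, promote_terms and greatest_term are hypothetical names, and the "outdated" check (the peer's last known PROMOTE term is lower than the greatest PROMOTE term seen) is my simplification of the term-based filtering.

```python
from dataclasses import dataclass


class LimboError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


@dataclass(frozen=True)
class Row:
    type: str        # "CONFIRM", "PROMOTE", "NOP", ...
    replica_id: int  # id of the instance that wrote the row


def filter_row(row, promote_terms, greatest_term):
    """Step 6: a row from an outdated peer is replaced with a NOP."""
    if promote_terms.get(row.replica_id, 0) < greatest_term:
        return Row("NOP", row.replica_id)
    return row


class Limbo:
    def __init__(self, owner_id, pending):
        self.owner_id = owner_id
        self.pending = list(pending)

    def submit(self, row):
        """Step 7: a non-empty limbo rejects rows from a foreign instance."""
        if self.pending and row.replica_id != self.owner_id:
            raise LimboError("ER_UNCOMMITTED_FOREIGN_SYNC_TXNS")
        return "applied"


def replay_hang_scenario():
    # Instance 1 has just been re-elected but has not sent PROMOTE for the
    # new term yet, so the receiver still considers it outdated (step 5).
    promote_terms = {1: 1, 2: 2}
    greatest_term = 2
    limbo = Limbo(owner_id=2, pending=["txn-from-2"])
    row = filter_row(Row("CONFIRM", 1), promote_terms, greatest_term)
    try:
        limbo.submit(row)
    except LimboError as e:
        return row.type, e.code
    return row.type, None
```

In this model replay_hang_scenario() returns the NOP type together with the limbo error code, which is the point where the applier gives up in step 8.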


To fix this, I've allowed transactions that consist solely of NOPs
to pass through the limbo without waiting, even when it's
non-empty.
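
The rule can be sketched in Python as follows. This is only an illustration of the decision, not the patch itself. The function name waits_in_limbo and its parameters are hypothetical, and the "before" behaviour (any transaction waits when the limbo is non-empty) is inferred from fact 3 above.

```python
def waits_in_limbo(row_types, is_sync, limbo_empty, nop_bypass):
    """Decide whether a transaction has to go through the limbo.

    row_types:  list of row type names in the transaction, e.g. ["NOP"]
    nop_bypass: False models the old behaviour, True models the fix
    """
    if nop_bypass and row_types and all(t == "NOP" for t in row_types):
        # NOP-only transactions carry no data, so they never need to wait.
        return False
    # Synchronous transactions always wait. Asynchronous ones wait only
    # when something is already queued in the limbo.
    return is_sync or not limbo_empty


if __name__ == "__main__":
    # Old behaviour: the NOP from step 7 is pushed into a non-empty limbo.
    print(waits_in_limbo(["NOP"], False, False, nop_bypass=False))
    # With the fix it passes straight through.
    print(waits_in_limbo(["NOP"], False, False, nop_bypass=True))
```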

The test's now rock-solid on my machine. 0 failures in 100 runs.
(with 1 worker, to be honest, but that's still better than a couple of
failures in 20 runs with 1 worker).
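
As a rough sanity check on these numbers, the snippet below estimates how likely 100 clean runs would be if the old failure rate still applied. It assumes independent runs and a per-run failure rate of 1/20 to 2/20, which is only an estimate from a small sample. prob_no_failures is a helper made up for this example.

```python
def prob_no_failures(p_fail, runs):
    """Probability of seeing zero failures in `runs` independent runs."""
    return (1.0 - p_fail) ** runs


if __name__ == "__main__":
    for failures_per_20 in (1, 2):
        p = failures_per_20 / 20
        print(failures_per_20, prob_no_failures(p, 100))
```

Under these assumptions the chance of 100 clean runs with the old failure rate is below 1%, so the result is unlikely to be luck.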

I've sent the new patch as [PATCH v4 14/12] in reply to this series.
Please take a look.

-- 
Serge Petrenko

