Tarantool development patches archive
From: Serge Petrenko <sergepetrenko@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
Date: Tue, 20 Oct 2020 11:18:38 +0300
Message-ID: <c65b75a0-6686-5fb9-c596-45236749a51f@tarantool.org>
In-Reply-To: <508aa0bf-153c-ed0a-ea81-15720ef9c988@tarantool.org>


19.10.2020 23:26, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patch!
No problem =)
>
>> 17.10.2020 20:17, Vladislav Shpilevoy wrote:
>>> There were 2 issues with the relay restarting recovery cursor when the node is
>>> elected as a leader. Both are fixed in the last 2 commits. The first was about the
>>> local LSN not being set, the second about GC not being propagated.
>>>
>>> The first patch is not directly related to the bugs above. It was just found while
>>> working on this. In theory, without the first patch we can get flakiness in
>>> the tests changed in this commit, but only if a replication connection
>>> breaks without a reason.
>>>
>>> Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
>>> machine when I start tens of instances of it. All workers, after executing it several times,
>>> hang. But!!! not in something related to the raft - they hang in the first
>>> box.snapshot(), where the election is not even enabled yet. From some debug
>>> prints I see it hangs somewhere in engine_begin_checkpoint(), and consumes
>>> ~80% of the CPU. But it may be just a consequence of the corrupted memory on
>>> Mac, due to libeio being broken. Don't know what to do with that now.
>> Hi! Thanks for the patchset!
>>
>> Patches 2 and 3 LGTM.
>>
>> Patch 1 looks ok, but I have one question.
>> What happens when a user accidentally enables raft during a cluster upgrade, when
>> some of the instances support raft, and some don't?
>> Looks like it'll lead to even more inconvenience.
>>
>> In my opinion it's fine if the leader just disappears without further notice.
>> We have an election timeout set up for this anyway.
> Election timeout won't work if Raft is disabled or is in 'voter' mode on the other
> nodes. Moreover, even if we enable the timeout, they will see the leader alive!
> Even if it is not a leader. Because with Raft disabled, the node does not send
> Raft state, but keeps sending regular replication heartbeats.
Oh, I see
>
> Also then the box.info.election output will freeze. Even after a connection
> to the old master is established, the other nodes will show it as a leader
> in box.info.election.
>
> I don't see how to fix it except by sending Raft state always. This looks more
> confusing than accidental error problems. It may affect monitoring that depends on
> box.info.election, and routing, if someone makes it depend on box.info.election.
>
> Another option - we could add a new raft mode: 'mute'. Such a node can't vote,
> can't become a leader, but sends Raft state, and is read-only.
>
> 'off' means that you totally understand the consequences - the node won't
> send Raft state at all, and the others still may think it is a leader.
>
> This would help to protect from the accidental enabling of Raft. You would set it
> to 'off' and nothing will be sent. I will ask in chats.

Ok then, let's just send raft state always, if raft was ever used.
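
The rule agreed on here can be sketched roughly as follows. This is only a
toy model in Python, not the actual patch: RaftStatus, its fields and
should_send_raft_state() are hypothetical names made up for illustration,
and the real code may track "raft was ever used" differently.

```python
from dataclasses import dataclass


@dataclass
class RaftStatus:
    """Toy model of a node's election status (hypothetical fields)."""
    is_enabled: bool        # election mode is currently not 'off'
    was_ever_enabled: bool  # Raft has been enabled at some point before


def should_send_raft_state(status: RaftStatus) -> bool:
    """Decide whether a new subscriber gets the Raft state.

    Sending it even after Raft was turned off lets the other nodes learn
    that this node is no longer a leader, instead of trusting heartbeats.
    """
    return status.is_enabled or status.was_ever_enabled
```

With this rule a node that never touched Raft stays silent, so a mixed-version
cluster during an upgrade is not affected unless Raft was actually enabled.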

The new mode looks more like a crutch.

The whole patchset LGTM then.

-- 
Serge Petrenko

Thread overview: 12+ messages
2020-10-17 17:17 Vladislav Shpilevoy
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
2020-10-20 20:43   ` Vladislav Shpilevoy
2020-10-21 11:41     ` Serge Petrenko
2020-10-21 21:41       ` Vladislav Shpilevoy
2020-10-22  8:53         ` Alexander V. Tikhonov
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart Vladislav Shpilevoy
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 3/3] raft: don't drop GC when restart relay recovery Vladislav Shpilevoy
2020-10-19  9:36 ` [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Serge Petrenko
2020-10-19 20:26   ` Vladislav Shpilevoy
2020-10-20  8:18     ` Serge Petrenko [this message]
2020-10-22  8:55 ` Alexander V. Tikhonov