From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <v.shpilevoy@tarantool.org>
Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
 bits)) (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 41FED469719
 for <tarantool-patches@dev.tarantool.org>;
 Tue, 10 Nov 2020 01:36:44 +0300 (MSK)
References: <cover.1604706353.git.v.shpilevoy@tarantool.org>
 <a1a4bcb4745df8844ea31cf00df0a383252d941a.1604706353.git.v.shpilevoy@tarantool.org>
 <7bc1dd8d-2b2d-2808-5c98-db649937ef3f@tarantool.org>
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Message-ID: <70029618-854d-ddec-95f7-b7de0adcc057@tarantool.org>
Date: Mon, 9 Nov 2020 23:36:42 +0100
MIME-Version: 1.0
In-Reply-To: <7bc1dd8d-2b2d-2808-5c98-db649937ef3f@tarantool.org>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [Tarantool-patches] [PATCH 2/2] test: speed up election_qsync
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Serge Petrenko <sergepetrenko@tarantool.org>, tarantool-patches@dev.tarantool.org

Hi! Thanks for the review!

>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>>   -- Replication timeout is small to speed up a first election start.
> 
> I checked, and the test speeds up a lot with this patch. But I don't understand
> why. We have only two instances, only one of them is candidate.
> Thanks to the small replication_timeout, the election starts shortly after the
> old leader dies. election_timeout isn't involved here, AFAICS.
> Am I missing something?
> Even when the test is restarted, there shouldn't be any 're-elections'.

That is a very good question! Honestly, I didn't think of it. Strangely,
when I did the patch, I just "believed" this is what I should do.

I did some digging now, and there is a simple explanation. And a bug. So
this is really good you looked at this patch skeptically.

Here is what is happening in the test, in short:

	master:  create_replica()
	master:  set_mode('voter')

	replica: set_mode('candidate')
	replica: wait_leader()

When the instance is clear, master's term is 1 and this term does not have
votes yet. When replica is started, it votes for self, and gets elected
quite fast.

This is happening fast even without this patch, when you run this test
separately.

But when the test is running after some other Raft tests, on the leader the
term is not 1. It can be 3, 4, or more. When replica is attached, the
leader does not send its state to the replica, because election is disabled.

So the replica starts from term 1, tries to elect itself and is ignored
because its term is too old. Waits for election timeout and tries again. This
is repeated as many times as is the term value.

This is why the tests running in the end were always slower.

There is a bug. The first node should have sent its state when election mode
was set to the voter on it. But it didn't.

I reworked the patch so now the timeout, on the contrary, is set to 1000000
seconds. And the tests hangs infinitely unless we send Raft state when election
is enabled.

The new patch:

====================
    raft: send state when state machine is started
    
    Raft didn't broadcast its state when the state machine was
    started. It could lead to the state being never sent until some
    other node would generate a term number bigger that the local one.
    
    That happened when a node participated in some elections,
    accumulated a big term number, then the election was turned off,
    and a new replica was connected in a 'candidate' state. Then the
    first node was configured to be a 'voter'.
    
    The first node didn't send anything to the replica, because at
    the moment of its connection the election was off.
    
    So the replica started from term 1, tried to start elections in
    this term, but was ignored by the first node. It waited for
    election timeout, bumped the term to 2, and the process was
    repeated until the replica reached the first node's term + 1. It
    could take very long time.
    
    The patch fixes it so now Raft broadcasts its state when it is
    enabled. To cover the replicas connected while it was disabled.
    
    Closes #5499

diff --git a/src/box/raft.c b/src/box/raft.c
index 914b0d68f..28ca74cb5 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -877,6 +877,14 @@ raft_sm_start(void)
 		raft_sm_wait_leader_found();
 	}
 	box_update_ro_summary();
+	/*
+	 * Nothing changed. But when raft was stopped, its state wasn't sent to
+	 * replicas. At least this was happening at the moment of this being
+	 * written. On the other hand, this instance may have a term bigger than
+	 * any other term in the cluster. And if it wouldn't share the term, it
+	 * would ignore all the messages, including vote requests.
+	 */
+	raft_schedule_broadcast();
 }
 
 static void
diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
index 086b17686..cb349efcc 100644
--- a/test/replication/election_qsync.result
+++ b/test/replication/election_qsync.result
@@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
 old_election_mode = box.cfg.election_mode
  | ---
  | ...
+old_election_timeout = box.cfg.election_timeout
+ | ---
+ | ...
 old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
  | ---
  | ...
@@ -60,8 +63,11 @@ fiber = require('fiber')
  | ---
  | ...
 -- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{                                                                        \
     election_mode = 'candidate',                                                \
+    election_timeout = 1000000,                                                 \
     replication_synchro_quorum = 3,                                             \
     replication_synchro_timeout = 1000000,                                      \
     replication_timeout = 0.1,                                                  \
@@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
 -- Configure separately from synchro timeout not to depend on the order of
 -- synchro and election options appliance. Replication timeout is tiny to speed
 -- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{                                                                        \
     election_mode = 'candidate',                                                \
+    election_timeout = 1000000,                                                 \
     replication_timeout = 0.01,                                                 \
 }
  | ---
@@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
  | ...
 box.cfg{                                                                        \
     election_mode = old_election_mode,                                          \
+    election_timeout = old_election_timeout,                                    \
     replication_timeout = old_replication_timeout,                              \
     replication = old_replication,                                              \
     replication_synchro_timeout = old_replication_synchro_timeout,              \
diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
index 6a80f4859..eb89e5b79 100644
--- a/test/replication/election_qsync.test.lua
+++ b/test/replication/election_qsync.test.lua
@@ -2,6 +2,7 @@ test_run = require('test_run').new()
 box.schema.user.grant('guest', 'super')
 
 old_election_mode = box.cfg.election_mode
+old_election_timeout = box.cfg.election_timeout
 old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
 old_replication_timeout = box.cfg.replication_timeout
 old_replication = box.cfg.replication
@@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
 test_run:switch('replica')
 fiber = require('fiber')
 -- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{                                                                        \
     election_mode = 'candidate',                                                \
+    election_timeout = 1000000,                                                 \
     replication_synchro_quorum = 3,                                             \
     replication_synchro_timeout = 1000000,                                      \
     replication_timeout = 0.1,                                                  \
@@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
 -- Configure separately from synchro timeout not to depend on the order of
 -- synchro and election options appliance. Replication timeout is tiny to speed
 -- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{                                                                        \
     election_mode = 'candidate',                                                \
+    election_timeout = 1000000,                                                 \
     replication_timeout = 0.01,                                                 \
 }
 
@@ -70,6 +77,7 @@ box.space.test:drop()
 test_run:cmd('delete server replica')
 box.cfg{                                                                        \
     election_mode = old_election_mode,                                          \
+    election_timeout = old_election_timeout,                                    \
     replication_timeout = old_replication_timeout,                              \
     replication = old_replication,                                              \
     replication_synchro_timeout = old_replication_synchro_timeout,              \