Tarantool development patches archive
* [Tarantool-patches] [PATCH 0/6] Raft auto-commit
@ 2020-10-13 23:28 Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 1/6] test: add '_stress' suffix to election_qsync test Vladislav Shpilevoy
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The patchset implements the Raft auto-commit feature: when a new
leader is elected, it clears the limbo queue left after the old
leader.

There are 4 preparatory patches for that, and 1 follow-up not
related to the auto-commit: the last commit fixes issue 5395
about a crash in the limbo, which started reproducing on every
test run.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5339-raft-auto-commit
Issue: https://github.com/tarantool/tarantool/issues/5339

@ChangeLog
* When election is enabled, a newly elected leader will automatically finish all the synchronous transactions created by the old leader (gh-5339).

Vladislav Shpilevoy (6):
  test: add '_stress' suffix to election_qsync test
  raft: factor out the code to wakeup worker fiber
  raft: new candidate should wait for leader death
  raft: introduce on_update trigger
  raft: auto-commit transactions of the old leader
  qsync: reset confirmed lsn in limbo on owner change

 src/box/box.cc                                |  42 ++++-
 src/box/box.h                                 |   3 +-
 src/box/lua/ctl.c                             |   2 +-
 src/box/raft.c                                |  58 ++++--
 src/box/raft.h                                |  13 ++
 src/box/txn_limbo.c                           |   1 +
 test/replication/election_qsync.result        | 177 ++++++++++--------
 test/replication/election_qsync.test.lua      | 129 ++++++-------
 test/replication/election_qsync_stress.result | 127 +++++++++++++
 .../election_qsync_stress.test.lua            |  72 +++++++
 test/replication/suite.ini                    |   2 +-
 11 files changed, 460 insertions(+), 166 deletions(-)
 create mode 100644 test/replication/election_qsync_stress.result
 create mode 100644 test/replication/election_qsync_stress.test.lua

-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH 1/6] test: add '_stress' suffix to election_qsync test
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber Vladislav Shpilevoy
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The test is long, about 10 seconds, but its name is too general,
and the name would better fit a simpler, more basic test. This is
going to happen in the next commits.

election_qsync.test.lua will then check that the election and
qsync work fine together without any stress cases.

Needed for #5339
---
 .../{election_qsync.result => election_qsync_stress.result}     | 0
 .../{election_qsync.test.lua => election_qsync_stress.test.lua} | 0
 test/replication/suite.ini                                      | 2 +-
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename test/replication/{election_qsync.result => election_qsync_stress.result} (100%)
 rename test/replication/{election_qsync.test.lua => election_qsync_stress.test.lua} (100%)

diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync_stress.result
similarity index 100%
rename from test/replication/election_qsync.result
rename to test/replication/election_qsync_stress.result
diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync_stress.test.lua
similarity index 100%
rename from test/replication/election_qsync.test.lua
rename to test/replication/election_qsync_stress.test.lua
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index bc1d3a8bf..0347e7600 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -113,7 +113,7 @@ fragile = {
             "issues": [ "gh-5381" ],
             "checksums": [ "a73b46d27fc48d2d7016597eeadbed2c" ]
         },
-        "election_qsync.test.lua": {
+        "election_qsync_stress.test.lua": {
             "issues": [ "gh-5395" ],
             "checksums": [ "55ccd87be70d4cb818c8d4dd43a8a2f9", "cc93d7c69c6368217634718bdf3de16c", "3fb2e6cef4c8fa1d0edd8654fd2d8ef6", "2f03ab89040bf2435a9fab8a1cddc990", "2870483307db27542796a3b2cf76a9f1", "2c21702af2ed09ae0f0d7cf651eb5646", "ac2c562932ea624c6ffbf7f7671c9803", "1aec2786b758bf7dc408a83c2e83c37d", "37e671aea27e3396098f9d13c7fa7316", "ca38ff2cdfa65b3defb26607b24454c6", "eb43a3f7feaa629a474a144665b6d3d6", "6e1d222f2cf5d8c437e572784c748489a", "bcab5da1daa9feb85b75ade6285934ea", "57191f7c3b108d34a06b4172ef88d676", "dc89f0d78ee7a6321b95af96159bd6c4", "2cb9873226dbf765d539b0143174635b", "7ccb7455c4c58a8cc64fff02ce8e9b37", "1e31ad3a916708a77da75485b6febdcd", "553c91d230489534d500fc441b5708da", "142a059788b39c5f252d83357b1e59a3", "6d0e40a1556254cdacc7afa35f298633", "eb1b45595ffabb01c2f7599b74531111", "259bde2222cc8368db35a348c74146b4", "ce44ea01d3621ef9796f5769a061edd6", "5be4e1cc9120e60dd433547745979024", "16b9325575cc12f7f9630c581caa425a", "3a6def04cf9624969883ee967c4cdf68", "3196fdcad898d99295b675668b00a7cf", "32797413e2d2a61eab5e39e8eccf6d82", "7ec7ab20c83ff8d06d6c6911af7ed54b", "216012f2e376d3712825bbb0412a9920", "f68cb670318f744645ab82935ca91f8b", "a1e897328d7a4c84e8557a9c64858771", "114360fccdaa8408609d5536e2fc4fcb", "51421bd96547351b0695784310c7fc41", "ebb760957c9d30a5e2453df3891d991b", "4a82db713bca762cdfc04e0ed539e60b", "0375dee5560bd34fd83140a2a24437e5", "a0302ead43647f7489c83bb12b54853d", "4b1268d16c9826a61036ef36cb04cc80", "2363b71ed7b4a3cff9dbf5ae1d29d36e", "463c9bccef60915ea55d1875125eb6bb", "d459b6ab666f3b21214ef6efe3d0989f", "b0a0893162e8fe43cad023d3857ff3ba", "013f7a4a36c36603e8156be120f07e2c", "f749c06bcf4a2c8e7c55251df700acd1", "f0d248af7c764fda6d98f5f1b3d6310a", "fc835fcd850db2e1030af34dd97b68b5", "321b0c2f7343da3c14a4f7e00932ed1c", "796fb7996d6b2e5859b862887b3471f9", "ce8d19167542bd2b2eb6a9b66ff31f08", "fb39d69b94eea968781e71ddd9d30c63", "e242f5f0cff02af325351a9a01043828", "8a36ef30432cda293c689ff8524f6644", 
"ea9b9d01e95834cfd5cd332cb40461fe", "951106fbf2872bb935912110f11b7fa9" ]
         },
-- 
2.21.1 (Apple Git-122.3)


* [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 1/6] test: add '_stress' suffix to election_qsync test Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-10-14 13:29   ` Cyrill Gorcunov
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 3/6] raft: new candidate should wait for leader death Vladislav Shpilevoy
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Raft has a worker fiber to perform async tasks such as WAL writes
and state broadcasts.

The worker was created and woken up from 2 places, leading at the
very least to code duplication. The patch wraps this logic into a
new function raft_worker_wakeup() and uses it in both places.

The patch is not needed for anything functional, but was created
while working on #5339 and trying ideas. It appears to be a good
refactoring that makes the code simpler, so it is submitted.
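The refactoring can be sketched in plain C. This is a hypothetical, simplified model: `struct raft_model` and its boolean flags stand in for the real fiber machinery (`fiber_new()`, `fiber_wakeup()`) in src/box/raft.c.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified model of raft_worker_wakeup(). The real
 * code lazily creates a fiber with fiber_new() and wakes it with
 * fiber_wakeup(); booleans stand in for the fiber state here.
 */
struct raft_model {
	/* Stands for raft.worker != NULL. */
	bool worker_exists;
	/* Stands for fiber_wakeup(raft.worker) being called. */
	bool worker_woken;
	/* An asynchronous WAL write is in progress. */
	bool is_write_in_progress;
};

static void
raft_worker_wakeup(struct raft_model *raft)
{
	/* Create the worker lazily, whichever caller comes first. */
	if (!raft->worker_exists)
		raft->worker_exists = true;
	/*
	 * Don't wake the worker while it writes to WAL - that would
	 * be a spurious wakeup the WAL write path is not prepared for.
	 */
	if (!raft->is_write_in_progress)
		raft->worker_woken = true;
}
```

With this helper, both former call sites collapse into a single raft_worker_wakeup() call, which is the whole point of the patch.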
---
 src/box/raft.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index 0b6c373e8..f51e87fde 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -178,6 +178,13 @@ raft_election_quorum(void)
 	return MIN(replication_synchro_quorum, replicaset.registered_count);
 }
 
+/**
+ * Wakeup the Raft worker fiber in order to do some async work. If the fiber
+ * does not exist yet, it is created.
+ */
+static void
+raft_worker_wakeup(void);
+
 /** Schedule broadcast of the complete Raft state to all the followers. */
 static void
 raft_schedule_broadcast(void);
@@ -670,10 +677,8 @@ raft_sm_pause_and_dump(void)
 	if (raft.is_write_in_progress)
 		return;
 	ev_timer_stop(loop(), &raft.timer);
+	raft_worker_wakeup();
 	raft.is_write_in_progress = true;
-	if (raft.worker == NULL)
-		raft.worker = fiber_new("raft_worker", raft_worker_f);
-	fiber_wakeup(raft.worker);
 }
 
 static void
@@ -988,21 +993,28 @@ raft_new_term(void)
 }
 
 static void
-raft_schedule_broadcast(void)
+raft_worker_wakeup(void)
 {
-	raft.is_broadcast_scheduled = true;
+	if (raft.worker == NULL) {
+		raft.worker = fiber_new("raft_worker", raft_worker_f);
+		fiber_set_joinable(raft.worker, true);
+	}
 	/*
 	 * Don't wake the fiber if it writes something. Otherwise it would be a
-	 * spurious wakeup breaking the WAL write not adapted to this.
+	 * spurious wakeup breaking the WAL write not adapted to this. Also
+	 * don't wakeup the current fiber - it leads to undefined behaviour.
 	 */
-	if (raft.is_write_in_progress)
-		return;
-	if (raft.worker == NULL)
-		raft.worker = fiber_new("raft_worker", raft_worker_f);
-	if (raft.worker != fiber())
+	if (!raft.is_write_in_progress && fiber() != raft.worker)
 		fiber_wakeup(raft.worker);
 }
 
+static void
+raft_schedule_broadcast(void)
+{
+	raft.is_broadcast_scheduled = true;
+	raft_worker_wakeup();
+}
+
 void
 raft_init(void)
 {
-- 
2.21.1 (Apple Git-122.3)


* [Tarantool-patches] [PATCH 3/6] raft: new candidate should wait for leader death
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 1/6] test: add '_stress' suffix to election_qsync test Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 4/6] raft: introduce on_update trigger Vladislav Shpilevoy
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

When a Raft node was configured to be a candidate via
election_mode, it didn't do anything if there was an active
leader.

But it should have started monitoring the leader's health in
order to initiate a new election round when the leader dies.

The patch fixes this bug. It does not contain a test, because it
will be covered by a test for #5339.

Needed for #5339
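The decision the patch introduces can be modeled as a pure function (the names below are hypothetical; the real code calls raft_sm_wait_leader_dead() / raft_sm_wait_leader_found() on struct raft):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What a freshly configured candidate should do next. */
enum candidate_step {
	/* A leader is known - watch its health, re-elect when it dies. */
	CANDIDATE_WAIT_LEADER_DEAD,
	/* No leader and the state is persisted - look for a leader. */
	CANDIDATE_WAIT_LEADER_FOUND,
	/* A WAL write is in progress - wait until it is persisted. */
	CANDIDATE_WAIT_PERSISTENCE,
};

static enum candidate_step
candidate_next_step(uint32_t leader_id, bool is_fully_on_disk)
{
	if (leader_id != 0)
		return CANDIDATE_WAIT_LEADER_DEAD;
	if (is_fully_on_disk)
		return CANDIDATE_WAIT_LEADER_FOUND;
	return CANDIDATE_WAIT_PERSISTENCE;
}
```

Before the patch the first branch was missing: a candidate with a live leader did nothing at all.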
---
 src/box/raft.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index f51e87fde..ff28e7f08 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -926,12 +926,18 @@ raft_cfg_is_candidate(bool is_candidate)
 
 	if (raft.is_candidate) {
 		assert(raft.state == RAFT_STATE_FOLLOWER);
-		/*
-		 * If there is an on-going WAL write, it means there was some
-		 * node who sent newer data to this node.
-		 */
-		if (raft.leader == 0 && raft_is_fully_on_disk())
+		if (raft.leader != 0) {
+			raft_sm_wait_leader_dead();
+		} else if (raft_is_fully_on_disk()) {
 			raft_sm_wait_leader_found();
+		} else {
+			/*
+			 * If there is an on-going WAL write, it means there was
+			 * some node who sent newer data to this node. So it is
+			 * probably a better candidate. Anyway can't do anything
+			 * until the new state is fully persisted.
+			 */
+		}
 	} else if (raft.state != RAFT_STATE_FOLLOWER) {
 		if (raft.state == RAFT_STATE_LEADER)
 			raft.leader = 0;
-- 
2.21.1 (Apple Git-122.3)


* [Tarantool-patches] [PATCH 4/6] raft: introduce on_update trigger
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 3/6] raft: new candidate should wait for leader death Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 5/6] raft: auto-commit transactions of the old leader Vladislav Shpilevoy
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The Raft state machine now has a trigger invoked each time any of
the visible Raft attributes changes: state, term, vote.

The trigger is needed to commit synchronous transactions of an
old leader when a new leader is elected. This is done via a
trigger so that the raft code does not depend on box too much.
Such a dependency would make it harder to extract raft into a
separate module later.

The trigger is executed in the Raft worker fiber so as not to
block the state machine transitions, which currently don't
contain a single yield. Clearing the synchronous transaction
queue, however, requires a yield to write CONFIRM and ROLLBACK
records to the WAL.

Part of #5339
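The trigger mechanics can be sketched as follows. This is a hypothetical single-linked model; real Tarantool triggers use struct rlist and the trigger.h API (trigger_add(), trigger_run()).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified trigger: a callback linked into an on_update list. */
struct trigger_model {
	struct trigger_model *next;
	int (*run)(struct trigger_model *t, void *event);
};

static struct trigger_model *on_update_list = NULL;

/* Model of trigger_add(&raft.on_update, trigger). */
static void
trigger_model_add(struct trigger_model **list, struct trigger_model *t)
{
	t->next = *list;
	*list = t;
}

/* Model of trigger_run(&raft.on_update, NULL): fire all callbacks. */
static void
trigger_model_run(struct trigger_model *list, void *event)
{
	for (struct trigger_model *t = list; t != NULL; t = t->next)
		t->run(t, event);
}

static int update_count = 0;

/* A sample subscriber counting state updates. */
static int
count_updates(struct trigger_model *t, void *event)
{
	(void)t;
	(void)event;
	++update_count;
	return 0;
}
```

In the patch the subscriber is box (see the next commit), which reacts to the node becoming a leader.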
---
 src/box/raft.c |  8 ++++++++
 src/box/raft.h | 13 +++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/src/box/raft.c b/src/box/raft.c
index ff28e7f08..8c127f268 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -643,6 +643,7 @@ raft_worker_handle_broadcast(void)
 	}
 	replicaset_foreach(replica)
 		relay_push_raft(replica->relay, &req);
+	trigger_run(&raft.on_update, NULL);
 	raft.is_broadcast_scheduled = false;
 }
 
@@ -903,6 +904,12 @@ raft_serialize_for_disk(struct raft_request *req)
 	req->vote = raft.vote;
 }
 
+void
+raft_on_update(struct trigger *trigger)
+{
+	trigger_add(&raft.on_update, trigger);
+}
+
 void
 raft_cfg_is_enabled(bool is_enabled)
 {
@@ -1025,4 +1032,5 @@ void
 raft_init(void)
 {
 	ev_timer_init(&raft.timer, raft_sm_schedule_new_election_cb, 0, 0);
+	rlist_create(&raft.on_update);
 }
diff --git a/src/box/raft.h b/src/box/raft.h
index be77a5473..82d5aa442 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -32,6 +32,7 @@
 #include <stdint.h>
 #include <stdbool.h>
 #include "tarantool_ev.h"
+#include "trigger.h"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -150,6 +151,11 @@ struct raft {
 	struct fiber *worker;
 	/** Configured election timeout in seconds. */
 	double election_timeout;
+	/**
+	 * Trigger invoked each time any of the Raft node visible attributes are
+	 * changed.
+	 */
+	struct rlist on_update;
 };
 
 extern struct raft raft;
@@ -251,6 +257,13 @@ raft_serialize_for_network(struct raft_request *req, struct vclock *vclock);
 void
 raft_serialize_for_disk(struct raft_request *req);
 
+/**
+ * Add a trigger invoked each time any of the Raft node visible attributes are
+ * changed.
+ */
+void
+raft_on_update(struct trigger *trigger);
+
 /** Initialize Raft global data structures. */
 void
 raft_init(void);
-- 
2.21.1 (Apple Git-122.3)


* [Tarantool-patches] [PATCH 5/6] raft: auto-commit transactions of the old leader
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
                   ` (3 preceding siblings ...)
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 4/6] raft: introduce on_update trigger Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change Vladislav Shpilevoy
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

According to Raft, when a new leader is elected, it should finish
transactions of the old leader. In Raft this is done by adding a
new transaction originating from the new leader.

In Tarantool's case this can be done without a new transaction
due to WAL format specifics, and the function doing it is called
box_clear_synchro_queue().

Before the patch, when a node was elected as a leader, it didn't
finish the pending transactions. The queue clearance was expected
to be done by the user. There was no issue with that, just
technical debt. The patch fixes it.

Now when a node becomes a leader, it finishes the synchronous
transactions of the old leader. This is done a bit differently
than in the public box.ctl.clear_synchro_queue().

The public box.ctl.clear_synchro_queue() tries to wait for
CONFIRM messages, which may be late, for up to
replication_synchro_timeout * 2 time.

But when a new leader is elected, it will ignore all rows from
all the other nodes, as it thinks it is the only source of truth.
Therefore it makes no sense to wait for CONFIRMs here, and the
waiting is omitted.

Closes #5339
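The difference between the two call modes boils down to the wait budget. A hypothetical helper (the real box_clear_synchro_queue() polls txn_limbo with fiber_sleep(0.001) inside this budget):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * How long box_clear_synchro_queue() may wait for late CONFIRM
 * messages before force-finishing the limbo. The manual
 * box.ctl.clear_synchro_queue() path waits up to twice the
 * synchro timeout; a freshly elected leader passes
 * try_wait = false and does not wait at all, since it ignores
 * rows from other nodes anyway.
 */
static double
confirm_wait_budget(bool try_wait, double replication_synchro_timeout)
{
	return try_wait ? 2 * replication_synchro_timeout : 0;
}
```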
---
 src/box/box.cc                                |  42 ++++-
 src/box/box.h                                 |   3 +-
 src/box/lua/ctl.c                             |   2 +-
 test/replication/election_qsync.result        | 153 ++++++++++++++++++
 test/replication/election_qsync.test.lua      |  76 +++++++++
 test/replication/election_qsync_stress.result |   1 -
 .../election_qsync_stress.test.lua            |   1 -
 7 files changed, 266 insertions(+), 12 deletions(-)
 create mode 100644 test/replication/election_qsync.result
 create mode 100644 test/replication/election_qsync.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 566cb41ef..18568df3b 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -152,6 +152,11 @@ static struct fiber_pool tx_fiber_pool;
  * are too many messages in flight (gh-1892).
  */
 static struct cbus_endpoint tx_prio_endpoint;
+/**
+ * A trigger executed each time the Raft state machine updates any
+ * of its visible attributes.
+ */
+static struct trigger box_raft_on_update;
 
 void
 box_update_ro_summary(void)
@@ -1005,7 +1010,7 @@ box_set_replication_anon(void)
 }
 
 void
-box_clear_synchro_queue(void)
+box_clear_synchro_queue(bool try_wait)
 {
 	if (!is_box_configured || txn_limbo_is_empty(&txn_limbo))
 		return;
@@ -1014,13 +1019,15 @@ box_clear_synchro_queue(void)
 	if (former_leader_id == instance_id)
 		return;
 
-	/* Wait until pending confirmations/rollbacks reach us. */
-	double timeout = 2 * replication_synchro_timeout;
-	double start_tm = fiber_clock();
-	while (!txn_limbo_is_empty(&txn_limbo)) {
-		if (fiber_clock() - start_tm > timeout)
-			break;
-		fiber_sleep(0.001);
+	if (try_wait) {
+		/* Wait until pending confirmations/rollbacks reach us. */
+		double timeout = 2 * replication_synchro_timeout;
+		double start_tm = fiber_clock();
+		while (!txn_limbo_is_empty(&txn_limbo)) {
+			if (fiber_clock() - start_tm > timeout)
+				break;
+			fiber_sleep(0.001);
+		}
 	}
 
 	if (!txn_limbo_is_empty(&txn_limbo)) {
@@ -1053,6 +1060,22 @@ box_clear_synchro_queue(void)
 	}
 }
 
+static int
+box_raft_on_update_f(struct trigger *trigger, void *event)
+{
+	(void)trigger;
+	(void)event;
+	if (raft.state != RAFT_STATE_LEADER)
+		return 0;
+	/*
+	 * When the node became a leader, it means it will ignore all records
+	 * from all the other nodes, and won't get late CONFIRM messages anyway.
+	 * Can clear the queue without waiting for confirmations.
+	 */
+	box_clear_synchro_queue(false);
+	return 0;
+}
+
 void
 box_listen(void)
 {
@@ -2633,6 +2656,9 @@ box_init(void)
 	txn_limbo_init();
 	sequence_init();
 	raft_init();
+
+	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
+	raft_on_update(&box_raft_on_update);
 }
 
 bool
diff --git a/src/box/box.h b/src/box/box.h
index a151fb8f1..b47a220b7 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -266,7 +266,8 @@ extern "C" {
 
 typedef struct tuple box_tuple_t;
 
-void box_clear_synchro_queue(void);
+void
+box_clear_synchro_queue(bool try_wait);
 
 /* box_select is private and used only by FFI */
 API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index 2017ddc18..bf26465e6 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -82,7 +82,7 @@ static int
 lbox_ctl_clear_synchro_queue(struct lua_State *L)
 {
 	(void) L;
-	box_clear_synchro_queue();
+	box_clear_synchro_queue(true);
 	return 0;
 }
 
diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
new file mode 100644
index 000000000..5dc9d182a
--- /dev/null
+++ b/test/replication/election_qsync.result
@@ -0,0 +1,153 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+old_replication = box.cfg.replication
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+-- Any election activities require fullmesh.
+box.cfg{replication = test_run:eval('replica', 'box.cfg.listen')[1]}
+ | ---
+ | ...
+
+--
+-- gh-5339: leader election manages transaction limbo automatically.
+--
+-- Idea of the test is that there are 2 nodes. A leader and a
+-- follower. The leader creates a synchronous transaction, it gets
+-- replicated to the follower, the leader dies. Now when the
+-- follower is elected as a new leader, it should finish the
+-- pending transaction.
+--
+_ = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = _:create_index('pk')
+ | ---
+ | ...
+box.cfg{election_mode = 'voter'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+-- Replication timeout is small to speed up a first election start.
+box.cfg{                                                                        \
+    election_mode = 'candidate',                                                \
+    replication_synchro_quorum = 3,                                             \
+    replication_timeout = 0.1,                                                  \
+}
+ | ---
+ | ...
+
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+lsn = box.info.lsn
+ | ---
+ | ...
+_ = fiber.create(function()                                                     \
+    ok, err = pcall(box.space.test.replace, box.space.test, {1})                \
+end)
+ | ---
+ | ...
+-- Wait WAL write.
+test_run:wait_cond(function() return box.info.lsn > lsn end)
+ | ---
+ | - true
+ | ...
+-- Wait replication to the other instance.
+test_run:wait_lsn('default', 'replica')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+-- Will fail - the node is not a leader.
+box.space.test:replace{2}
+ | ---
+ | - error: Can't modify data because this instance is in read-only mode.
+ | ...
+
+-- Set synchro timeout to a huge value to ensure, that when a leader is elected,
+-- it won't wait for this timeout.
+box.cfg{replication_synchro_timeout = 1000000}
+ | ---
+ | ...
+
+-- Configure separately from synchro timeout not to depend on the order of
+-- synchro and election options appliance. Replication timeout is tiny to speed
+-- up notice of the old leader death.
+box.cfg{                                                                        \
+    election_mode = 'candidate',                                                \
+    replication_timeout = 0.01,                                                 \
+}
+ | ---
+ | ...
+
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+_ = box.space.test:replace{2}
+ | ---
+ | ...
+box.space.test:select{}
+ | ---
+ | - - [1]
+ |   - [2]
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+    replication = old_replication,                                              \
+    replication_synchro_timeout = old_replication_synchro_timeout,              \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
new file mode 100644
index 000000000..69ef33da2
--- /dev/null
+++ b/test/replication/election_qsync.test.lua
@@ -0,0 +1,76 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
+old_replication_timeout = box.cfg.replication_timeout
+old_replication = box.cfg.replication
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+-- Any election activities require fullmesh.
+box.cfg{replication = test_run:eval('replica', 'box.cfg.listen')[1]}
+
+--
+-- gh-5339: leader election manages transaction limbo automatically.
+--
+-- Idea of the test is that there are 2 nodes. A leader and a
+-- follower. The leader creates a synchronous transaction, it gets
+-- replicated to the follower, the leader dies. Now when the
+-- follower is elected as a new leader, it should finish the
+-- pending transaction.
+--
+_ = box.schema.create_space('test', {is_sync = true})
+_ = _:create_index('pk')
+box.cfg{election_mode = 'voter'}
+
+test_run:switch('replica')
+fiber = require('fiber')
+-- Replication timeout is small to speed up a first election start.
+box.cfg{                                                                        \
+    election_mode = 'candidate',                                                \
+    replication_synchro_quorum = 3,                                             \
+    replication_timeout = 0.1,                                                  \
+}
+
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+lsn = box.info.lsn
+_ = fiber.create(function()                                                     \
+    ok, err = pcall(box.space.test.replace, box.space.test, {1})                \
+end)
+-- Wait WAL write.
+test_run:wait_cond(function() return box.info.lsn > lsn end)
+-- Wait replication to the other instance.
+test_run:wait_lsn('default', 'replica')
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+-- Will fail - the node is not a leader.
+box.space.test:replace{2}
+
+-- Set synchro timeout to a huge value to ensure, that when a leader is elected,
+-- it won't wait for this timeout.
+box.cfg{replication_synchro_timeout = 1000000}
+
+-- Configure separately from synchro timeout not to depend on the order of
+-- synchro and election options appliance. Replication timeout is tiny to speed
+-- up notice of the old leader death.
+box.cfg{                                                                        \
+    election_mode = 'candidate',                                                \
+    replication_timeout = 0.01,                                                 \
+}
+
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+_ = box.space.test:replace{2}
+box.space.test:select{}
+box.space.test:drop()
+
+test_run:cmd('delete server replica')
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+    replication = old_replication,                                              \
+    replication_synchro_timeout = old_replication_synchro_timeout,              \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/election_qsync_stress.result b/test/replication/election_qsync_stress.result
index 9497b37bf..bb419c3bf 100644
--- a/test/replication/election_qsync_stress.result
+++ b/test/replication/election_qsync_stress.result
@@ -93,7 +93,6 @@ for i = 1,10 do
     new_leader = 'election_replica'..new_leader_nr
     leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
-    c:eval('box.ctl.clear_synchro_queue()')
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
     c.space.test:get{i}
diff --git a/test/replication/election_qsync_stress.test.lua b/test/replication/election_qsync_stress.test.lua
index bca1b20c7..2f379efdb 100644
--- a/test/replication/election_qsync_stress.test.lua
+++ b/test/replication/election_qsync_stress.test.lua
@@ -57,7 +57,6 @@ for i = 1,10 do
     new_leader = 'election_replica'..new_leader_nr
     leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
-    c:eval('box.ctl.clear_synchro_queue()')
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
     c.space.test:get{i}
-- 
2.21.1 (Apple Git-122.3)


* [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
                   ` (4 preceding siblings ...)
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 5/6] raft: auto-commit transactions of the old leader Vladislav Shpilevoy
@ 2020-10-13 23:28 ` Vladislav Shpilevoy
  2020-11-24 23:23   ` Vladislav Shpilevoy
  2020-10-14  7:34 ` [Tarantool-patches] [PATCH 0/6] Raft auto-commit Serge Petrenko
  2020-10-14 22:40 ` Vladislav Shpilevoy
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-13 23:28 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The order of LSNs from different instances can be arbitrary. It
means that if the synchronous transaction limbo changes its
owner, the old value of the maximal confirmed LSN does not make
sense. The new owner means new LSNs, possibly even smaller than
the previously confirmed LSN of the old owner.

The patch resets the confirmed LSN when the owner is changed.
There is no specific test for that, as this is a hotfix: tests
started failing on every run after a new election test was added
in the previous commit. A test for this bug will be added later.

Part of #5395
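The fix can be modeled with a hypothetical struct (the real one-line change lives in txn_limbo_append() in src/box/txn_limbo.c):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the limbo ownership state. */
struct limbo_model {
	/* Current owner's replica id, 0 when unset. */
	uint32_t instance_id;
	/* Maximal confirmed LSN of the current owner. */
	int64_t confirmed_lsn;
};

/*
 * Each instance numbers its rows with its own LSN sequence, so
 * the confirmed LSN of the old owner says nothing about the new
 * owner's rows. Reset it to 0 on ownership change.
 */
static void
limbo_model_set_owner(struct limbo_model *limbo, uint32_t id)
{
	if (limbo->instance_id != id) {
		limbo->instance_id = id;
		limbo->confirmed_lsn = 0;
	}
}
```

Without the reset, a stale confirmed_lsn from the old owner could be compared against the new owner's much smaller LSNs, which is what triggered the crash in gh-5395.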
---
 src/box/txn_limbo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 3655338d8..908a17fcc 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -76,6 +76,7 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn)
 		if (limbo->instance_id == REPLICA_ID_NIL ||
 		    rlist_empty(&limbo->queue)) {
 			limbo->instance_id = id;
+			limbo->confirmed_lsn = 0;
 		} else {
 			diag_set(ClientError, ER_UNCOMMITTED_FOREIGN_SYNC_TXNS,
 				 limbo->instance_id);
-- 
2.21.1 (Apple Git-122.3)


* Re: [Tarantool-patches] [PATCH 0/6] Raft auto-commit
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
                   ` (5 preceding siblings ...)
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change Vladislav Shpilevoy
@ 2020-10-14  7:34 ` Serge Petrenko
  2020-10-14 22:40 ` Vladislav Shpilevoy
  7 siblings, 0 replies; 13+ messages in thread
From: Serge Petrenko @ 2020-10-14  7:34 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


On 14.10.2020 02:28, Vladislav Shpilevoy wrote:
> The patchset implements Raft auto-commit feature, so when a new
> leader is elected, it clears the limbo queue after the old leader.
>
> There are 4 preparatory patches for that, and 1 follow-up not
> related to the auto-commit - the last commit fixes the issue 5395
> about a crash in the limbo, which started failing on every run of
> the tests now.
Hi! Thanks for the patchset!
LGTM.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5339-raft-auto-commit
> Issue: https://github.com/tarantool/tarantool/issues/5339
>
> @ChangeLog
> * When election is enabled, a newly elected leader will automatically finish all the synchronous transactions, created by the old leader (gh-5339).
>
> Vladislav Shpilevoy (6):
>    test: add '_stress' suffix to election_qsync test
>    raft: factor out the code to wakeup worker fiber
>    raft: new candidate should wait for leader death
>    raft: introduce on_update trigger
>    raft: auto-commit transactions of the old leader
>    qsync: reset confirmed lsn in limbo on owner change
>
>   src/box/box.cc                                |  42 ++++-
>   src/box/box.h                                 |   3 +-
>   src/box/lua/ctl.c                             |   2 +-
>   src/box/raft.c                                |  58 ++++--
>   src/box/raft.h                                |  13 ++
>   src/box/txn_limbo.c                           |   1 +
>   test/replication/election_qsync.result        | 177 ++++++++++--------
>   test/replication/election_qsync.test.lua      | 129 ++++++-------
>   test/replication/election_qsync_stress.result | 127 +++++++++++++
>   .../election_qsync_stress.test.lua            |  72 +++++++
>   test/replication/suite.ini                    |   2 +-
>   11 files changed, 460 insertions(+), 166 deletions(-)
>   create mode 100644 test/replication/election_qsync_stress.result
>   create mode 100644 test/replication/election_qsync_stress.test.lua
>
-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber Vladislav Shpilevoy
@ 2020-10-14 13:29   ` Cyrill Gorcunov
  2020-10-14 22:40     ` Vladislav Shpilevoy
  0 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-10-14 13:29 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Oct 14, 2020 at 01:28:28AM +0200, Vladislav Shpilevoy wrote:
> -raft_schedule_broadcast(void)
> +raft_worker_wakeup(void)
>  {
> -	raft.is_broadcast_scheduled = true;
> +	if (raft.worker == NULL) {
> +		raft.worker = fiber_new("raft_worker", raft_worker_f);
> +		fiber_set_joinable(raft.worker, true);
> +	}

When fiber_new returns NULL you'll get a nil dereference in fiber_set_joinable.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber
  2020-10-14 13:29   ` Cyrill Gorcunov
@ 2020-10-14 22:40     ` Vladislav Shpilevoy
  2020-10-15  6:50       ` Cyrill Gorcunov
  0 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-14 22:40 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tarantool-patches

Hi! Thanks for the review!

On 14.10.2020 15:29, Cyrill Gorcunov wrote:
> On Wed, Oct 14, 2020 at 01:28:28AM +0200, Vladislav Shpilevoy wrote:
>> -raft_schedule_broadcast(void)
>> +raft_worker_wakeup(void)
>>  {
>> -	raft.is_broadcast_scheduled = true;
>> +	if (raft.worker == NULL) {
>> +		raft.worker = fiber_new("raft_worker", raft_worker_f);
>> +		fiber_set_joinable(raft.worker, true);
>> +	}
> 
> When fiber_new returns NULL you'll get a nil dereference in fiber_set_joinable.

Indeed, you are right! It seems it was always there. There is no
beautiful way to fix it now, so I added a panic() here in case of OOM:

====================
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -1010,6 +1010,21 @@ raft_worker_wakeup(void)
 {
 	if (raft.worker == NULL) {
 		raft.worker = fiber_new("raft_worker", raft_worker_f);
+		if (raft.worker == NULL) {
+			/*
+			 * XXX: should be handled properly, no need to panic.
+			 * The issue though is that most of the Raft state
+			 * machine functions are not supposed to fail, and also
+			 * they usually wake up the fiber when their work is
+			 * finished. So it is too late to fail. On the other
+			 * hand it does not look good to create the fiber when
+			 * Raft is initialized, because then it would occupy
+			 * memory even if Raft is not used.
+			 */
+			diag_log();
+			panic("Couldn't create Raft worker fiber");
+			return;
+		}
 		fiber_set_joinable(raft.worker, true);
 	}
 	/*

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/6] Raft auto-commit
  2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
                   ` (6 preceding siblings ...)
  2020-10-14  7:34 ` [Tarantool-patches] [PATCH 0/6] Raft auto-commit Serge Petrenko
@ 2020-10-14 22:40 ` Vladislav Shpilevoy
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-14 22:40 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Pushed to master.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber
  2020-10-14 22:40     ` Vladislav Shpilevoy
@ 2020-10-15  6:50       ` Cyrill Gorcunov
  0 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-10-15  6:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Thu, Oct 15, 2020 at 12:40:36AM +0200, Vladislav Shpilevoy wrote:
> >> -	raft.is_broadcast_scheduled = true;
> >> +	if (raft.worker == NULL) {
> >> +		raft.worker = fiber_new("raft_worker", raft_worker_f);
> >> +		fiber_set_joinable(raft.worker, true);
> >> +	}
> > 
> > When fiber_new returns NULL you'll get a nil dereference in fiber_set_joinable.
> 
> Indeed, you are right! It seems it was always there. There is no
> beautiful way to fix it now, so I added a panic() here in case of OOM:

Sure, thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change
  2020-10-13 23:28 ` [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change Vladislav Shpilevoy
@ 2020-11-24 23:23   ` Vladislav Shpilevoy
  0 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-24 23:23 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

I forgot to push this to 2.5. Did now.
I know it lacks a test, but it is better to cherry-pick the hotfix
than not to fix it at all.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-11-24 23:23 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-13 23:28 [Tarantool-patches] [PATCH 0/6] Raft auto-commit Vladislav Shpilevoy
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 1/6] test: add '_stress' suffix to election_qsync test Vladislav Shpilevoy
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 2/6] raft: factor out the code to wakeup worker fiber Vladislav Shpilevoy
2020-10-14 13:29   ` Cyrill Gorcunov
2020-10-14 22:40     ` Vladislav Shpilevoy
2020-10-15  6:50       ` Cyrill Gorcunov
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 3/6] raft: new candidate should wait for leader death Vladislav Shpilevoy
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 4/6] raft: introduce on_update trigger Vladislav Shpilevoy
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 5/6] raft: auto-commit transactions of the old leader Vladislav Shpilevoy
2020-10-13 23:28 ` [Tarantool-patches] [PATCH 6/6] qsync: reset confirmed lsn in limbo on owner change Vladislav Shpilevoy
2020-11-24 23:23   ` Vladislav Shpilevoy
2020-10-14  7:34 ` [Tarantool-patches] [PATCH 0/6] Raft auto-commit Serge Petrenko
2020-10-14 22:40 ` Vladislav Shpilevoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox