Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft
@ 2020-11-17  0:02 Vladislav Shpilevoy
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

The patchset is a second part of Raft relocation to a new module for the sake of
unit testing. This part does the relocation itself.

It entirely consists of removal of box dependencies from raft code.

The third part will virtualize Raft event loop at compile-time, and will
introduce customizable implementations for network, disk, event loop, and time
to perform unit tests.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5303-p2-src-lib-raft
Issue: https://github.com/tarantool/tarantool/issues/5303

Vladislav Shpilevoy (12):
  raft: move sources to raftlib.h/.c
  raft: move box_raft_* to src/box/raft.h and .c
  raft: stop using replication_disconnect_timeout()
  raft: stop using replication_synchro_quorum
  raft: stop using instance_id
  raft: make raft_request.vclock constant
  raft: stop using replicaset.vclock
  raft: introduce vtab for disk and network
  raft: introduce raft_msg, drop xrow dependency
  raft: move box_update_ro_summary to update trigger
  raft: introduce RaftError
  raft: move algorithm code to src/lib/raft

 src/box/CMakeLists.txt      |    2 +-
 src/box/applier.cc          |    2 +-
 src/box/box.cc              |   44 +-
 src/box/memtx_engine.c      |    4 +-
 src/box/raft.c              | 1130 ++++-------------------------------
 src/box/raft.h              |  252 +-------
 src/box/replication.cc      |    3 +
 src/box/xrow.c              |    2 +-
 src/box/xrow.h              |    6 +-
 src/lib/CMakeLists.txt      |    1 +
 src/lib/core/diag.h         |    2 +
 src/lib/core/exception.cc   |   24 +
 src/lib/core/exception.h    |    7 +
 src/lib/raft/CMakeLists.txt |    7 +
 src/lib/raft/raft.c         | 1009 +++++++++++++++++++++++++++++++
 src/lib/raft/raft.h         |  350 +++++++++++
 16 files changed, 1577 insertions(+), 1268 deletions(-)
 create mode 100644 src/lib/raft/CMakeLists.txt
 create mode 100644 src/lib/raft/raft.c
 create mode 100644 src/lib/raft/raft.h

-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  8:14   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger Vladislav Shpilevoy
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

The commit renames raft.h and raft.c to raftlib.h and raftlib.c.
This is done to prepare to Raft split into src/box/ and
src/lib/raft.

The commit is not atomic, the build won't work here. Because if
raft is renamed to raftlib, and in the same commit new raft.c and
raft.h are added, git thinks the original file was changed, and
ruins all the git history.

By splitting move of raft to raftlib and introduction of box/raft
into 2 commits the git history is saved.

Part of #5303
---
 src/box/CMakeLists.txt        | 1 +
 src/box/{raft.c => raftlib.c} | 0
 src/box/{raft.h => raftlib.h} | 0
 3 files changed, 1 insertion(+)
 rename src/box/{raft.c => raftlib.c} (100%)
 rename src/box/{raft.h => raftlib.h} (100%)

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index d1667796a..fcf779379 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -169,6 +169,7 @@ add_library(box STATIC
     port.c
     txn.c
     txn_limbo.c
+    raftlib.c
     raft.c
     box.cc
     gc.c
diff --git a/src/box/raft.c b/src/box/raftlib.c
similarity index 100%
rename from src/box/raft.c
rename to src/box/raftlib.c
diff --git a/src/box/raft.h b/src/box/raftlib.h
similarity index 100%
rename from src/box/raft.h
rename to src/box/raftlib.h
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17 12:42   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError Vladislav Shpilevoy
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

box_update_ro_summary() was called from the basic Raft library,
making it depend on box. But it was called every time when Raft
state was changed and broadcasted. It means the same effect can be
achieved by updating RO summary from Raft state update trigger.

The patch does it, and now Raft code does not depend on box.h.

Part of #5303
---
 src/box/raft.c    | 2 ++
 src/box/raftlib.c | 8 --------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index f3652bbcb..db1a3f423 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -77,6 +77,8 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
 	(void)trigger;
 	struct raft *raft = (struct raft *)event;
 	assert(raft == box_raft());
+	/* State or enablence could be changed, affecting read-only state. */
+	box_update_ro_summary();
 	if (raft->state != RAFT_STATE_LEADER)
 		return 0;
 	/*
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 512dbd51f..2e09d5405 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -33,7 +33,6 @@
 #include "error.h"
 #include "fiber.h"
 #include "small/region.h"
-#include "box.h"
 #include "tt_static.h"
 
 /**
@@ -603,8 +602,6 @@ raft_sm_become_leader(struct raft *raft)
 	raft->state = RAFT_STATE_LEADER;
 	raft->leader = raft->self;
 	ev_timer_stop(loop(), &raft->timer);
-	/* Make read-write (if other subsystems allow that. */
-	box_update_ro_summary();
 	/* State is visible and it is changed - broadcast. */
 	raft_schedule_broadcast(raft);
 }
@@ -655,7 +652,6 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
 	raft->volatile_vote = 0;
 	raft->leader = 0;
 	raft->state = RAFT_STATE_FOLLOWER;
-	box_update_ro_summary();
 	raft_sm_pause_and_dump(raft);
 	/*
 	 * State is visible and it is changed - broadcast. Term is also visible,
@@ -686,7 +682,6 @@ raft_sm_schedule_new_election(struct raft *raft)
 	/* Everyone is a follower until its vote for self is persisted. */
 	raft_sm_schedule_new_term(raft, raft->term + 1);
 	raft_sm_schedule_new_vote(raft, raft->self);
-	box_update_ro_summary();
 }
 
 static void
@@ -771,7 +766,6 @@ raft_sm_start(struct raft *raft)
 		 */
 		raft_sm_wait_leader_found(raft);
 	}
-	box_update_ro_summary();
 	/*
 	 * Nothing changed. But when raft was stopped, its state wasn't sent to
 	 * replicas. At least this was happening at the moment of this being
@@ -793,7 +787,6 @@ raft_sm_stop(struct raft *raft)
 		raft->leader = 0;
 	raft->state = RAFT_STATE_FOLLOWER;
 	ev_timer_stop(loop(), &raft->timer);
-	box_update_ro_summary();
 	/* State is visible and changed - broadcast. */
 	raft_schedule_broadcast(raft);
 }
@@ -879,7 +872,6 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
 			raft_schedule_broadcast(raft);
 		}
 	}
-	box_update_ro_summary();
 }
 
 void
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17 15:13   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Last piece of src/box used in Raft code was error.h. It was added
to be able to raise ClientErrors. To get rid of it the libraries
usually introduce their own error type available from
src/lib/core. Such as CollationError, SwimError, CryptoError.

This patch adds RaftError and removes the last box dependency from
Raft code.

Part of #5303
---
 src/box/raftlib.c         |  9 ++++-----
 src/lib/core/diag.h       |  2 ++
 src/lib/core/exception.cc | 24 ++++++++++++++++++++++++
 src/lib/core/exception.h  |  7 +++++++
 4 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 2e09d5405..b669475f3 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -30,7 +30,7 @@
  */
 #include "raft.h"
 
-#include "error.h"
+#include "exception.h"
 #include "fiber.h"
 #include "small/region.h"
 #include "tt_static.h"
@@ -291,14 +291,13 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 	assert(source > 0);
 	assert(source != raft->self);
 	if (req->term == 0 || req->state == 0) {
-		diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
-			 "be zero");
+		diag_set(RaftError, "Raft term and state can't be zero");
 		return -1;
 	}
 	if (req->state == RAFT_STATE_CANDIDATE &&
 	    (req->vote != source || req->vclock == NULL)) {
-		diag_set(ClientError, ER_PROTOCOL, "Candidate should always "
-			 "vote for self and provide its vclock");
+		diag_set(RaftError, "Candidate should always vote for self and "
+			 "provide its vclock");
 		return -1;
 	}
 	/* Outdated request. */
diff --git a/src/lib/core/diag.h b/src/lib/core/diag.h
index 6bf0b139a..b07eea838 100644
--- a/src/lib/core/diag.h
+++ b/src/lib/core/diag.h
@@ -335,6 +335,8 @@ struct error *
 BuildSwimError(const char *file, unsigned line, const char *format, ...);
 struct error *
 BuildCryptoError(const char *file, unsigned line, const char *format, ...);
+struct error *
+BuildRaftError(const char *file, unsigned line, const char *format, ...);
 
 struct index_def;
 
diff --git a/src/lib/core/exception.cc b/src/lib/core/exception.cc
index 180cb0e97..395baff6f 100644
--- a/src/lib/core/exception.cc
+++ b/src/lib/core/exception.cc
@@ -288,6 +288,18 @@ CryptoError::CryptoError(const char *file, unsigned line,
 	va_end(ap);
 }
 
+const struct type_info type_RaftError =
+	make_type("RaftError", &type_Exception);
+
+RaftError::RaftError(const char *file, unsigned line, const char *format, ...)
+	: Exception(&type_RaftError, file, line)
+{
+	va_list ap;
+	va_start(ap, format);
+	error_vformat_msg(this, format, ap);
+	va_end(ap);
+}
+
 #define BuildAlloc(type)				\
 	void *p = malloc(sizeof(type));			\
 	if (p == NULL)					\
@@ -409,6 +421,18 @@ BuildSocketError(const char *file, unsigned line, const char *socketname,
 	return e;
 }
 
+struct error *
+BuildRaftError(const char *file, unsigned line, const char *format, ...)
+{
+	BuildAlloc(RaftError);
+	RaftError *e =  new (p) RaftError(file, line, "");
+	va_list ap;
+	va_start(ap, format);
+	error_vformat_msg(e, format, ap);
+	va_end(ap);
+	return e;
+}
+
 void
 exception_init()
 {
diff --git a/src/lib/core/exception.h b/src/lib/core/exception.h
index 1947b4f00..7277b2784 100644
--- a/src/lib/core/exception.h
+++ b/src/lib/core/exception.h
@@ -52,6 +52,7 @@ extern const struct type_info type_SystemError;
 extern const struct type_info type_CollationError;
 extern const struct type_info type_SwimError;
 extern const struct type_info type_CryptoError;
+extern const struct type_info type_RaftError;
 
 const char *
 exception_get_string(struct error *e, const struct method_info *method);
@@ -168,6 +169,12 @@ public:
 	virtual void raise() { throw this; }
 };
 
+class RaftError: public Exception {
+public:
+	RaftError(const char *file, unsigned line, const char *format, ...);
+	virtual void raise() { throw this; }
+};
+
 /**
  * Initialize the exception subsystem.
  */
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17 15:13   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft algorithm code does not depend on box anymore, and is moved
to src/lib/raft.

This is done to be able to unit test Raft similarly to Swim - with
virtual event loop, network, time, and disk. Using any number of
instances. That will allow to cover all crazy and rare cases
possible in Raft, but without problems of functional tests
stability and clumsiness.

Part of #5303
---
 src/box/CMakeLists.txt                 | 3 +--
 src/box/raft.h                         | 2 +-
 src/lib/CMakeLists.txt                 | 1 +
 src/lib/raft/CMakeLists.txt            | 7 +++++++
 src/{box/raftlib.c => lib/raft/raft.c} | 0
 src/{box/raftlib.h => lib/raft/raft.h} | 0
 6 files changed, 10 insertions(+), 3 deletions(-)
 create mode 100644 src/lib/raft/CMakeLists.txt
 rename src/{box/raftlib.c => lib/raft/raft.c} (100%)
 rename src/{box/raftlib.h => lib/raft/raft.h} (100%)

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index fcf779379..a7547c29f 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -169,7 +169,6 @@ add_library(box STATIC
     port.c
     txn.c
     txn_limbo.c
-    raftlib.c
     raft.c
     box.cc
     gc.c
@@ -263,6 +262,6 @@ add_custom_command(OUTPUT ${SQL_BIN_DIR}/opcodes.c
         ${SQL_BIN_DIR}/opcodes.h)
 
 target_link_libraries(box box_error tuple stat xrow xlog vclock crc32 scramble
-                      ${common_libraries})
+                      raft ${common_libraries})
 
 add_dependencies(box build_bundled_libs generate_sql_files)
diff --git a/src/box/raft.h b/src/box/raft.h
index 4dffce380..7e0768cd3 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -29,7 +29,7 @@
  * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
-#include "raftlib.h"
+#include "raft/raft.h"
 
 #if defined(__cplusplus)
 extern "C" {
diff --git a/src/lib/CMakeLists.txt b/src/lib/CMakeLists.txt
index de1b902c6..cabbe3d89 100644
--- a/src/lib/CMakeLists.txt
+++ b/src/lib/CMakeLists.txt
@@ -14,6 +14,7 @@ add_subdirectory(crypto)
 add_subdirectory(swim)
 add_subdirectory(mpstream)
 add_subdirectory(vclock)
+add_subdirectory(raft)
 if(ENABLE_BUNDLED_MSGPUCK)
     add_subdirectory(msgpuck EXCLUDE_FROM_ALL)
 endif()
diff --git a/src/lib/raft/CMakeLists.txt b/src/lib/raft/CMakeLists.txt
new file mode 100644
index 000000000..aef2bacf7
--- /dev/null
+++ b/src/lib/raft/CMakeLists.txt
@@ -0,0 +1,7 @@
+set(lib_sources
+    raft.c
+)
+
+set_source_files_compile_flags(${lib_sources})
+add_library(raft STATIC ${lib_sources})
+target_link_libraries(raft core)
diff --git a/src/box/raftlib.c b/src/lib/raft/raft.c
similarity index 100%
rename from src/box/raftlib.c
rename to src/lib/raft/raft.c
diff --git a/src/box/raftlib.h b/src/lib/raft/raft.h
similarity index 100%
rename from src/box/raftlib.h
rename to src/lib/raft/raft.h
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (3 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  8:14   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

The commit moves Raft functions and objects specific for box to
src/box/raft from src/box/box and src/box/raftlib.

The goal is to gradually eliminate all box dependencies from
src/box/raftlib and move it to src/lib/raft.

Part of #5303
---
 src/box/box.cc    | 25 --------------
 src/box/raft.c    | 86 +++++++++++++++++++++++++++++++++++++++++++++++
 src/box/raft.h    | 59 ++++++++++++++++++++++++++++++++
 src/box/raftlib.c | 29 ----------------
 src/box/raftlib.h | 19 -----------
 5 files changed, 145 insertions(+), 73 deletions(-)
 create mode 100644 src/box/raft.c
 create mode 100644 src/box/raft.h

diff --git a/src/box/box.cc b/src/box/box.cc
index 1f7dec362..8dd92a5f5 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -152,11 +152,6 @@ static struct fiber_pool tx_fiber_pool;
  * are too many messages in flight (gh-1892).
  */
 static struct cbus_endpoint tx_prio_endpoint;
-/**
- * A trigger executed each time the Raft state machine updates any
- * of its visible attributes.
- */
-static struct trigger box_raft_on_update;
 
 void
 box_update_ro_summary(void)
@@ -1060,23 +1055,6 @@ box_clear_synchro_queue(bool try_wait)
 	}
 }
 
-static int
-box_raft_on_update_f(struct trigger *trigger, void *event)
-{
-	(void)trigger;
-	struct raft *raft = (struct raft *)event;
-	assert(raft == box_raft());
-	if (raft->state != RAFT_STATE_LEADER)
-		return 0;
-	/*
-	 * When the node became a leader, it means it will ignore all records
-	 * from all the other nodes, and won't get late CONFIRM messages anyway.
-	 * Can clear the queue without waiting for confirmations.
-	 */
-	box_clear_synchro_queue(false);
-	return 0;
-}
-
 void
 box_listen(void)
 {
@@ -2658,9 +2636,6 @@ box_init(void)
 	txn_limbo_init();
 	sequence_init();
 	box_raft_init();
-
-	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
-	raft_on_update(box_raft(), &box_raft_on_update);
 }
 
 bool
diff --git a/src/box/raft.c b/src/box/raft.c
new file mode 100644
index 000000000..f289a6993
--- /dev/null
+++ b/src/box/raft.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "box.h"
+#include "raft.h"
+
+struct raft box_raft_global = {
+	/*
+	 * Set an invalid state to validate in runtime the global raft node is
+	 * not used before initialization.
+	 */
+	.state = 0,
+};
+
+/**
+ * A trigger executed each time the Raft state machine updates any
+ * of its visible attributes.
+ */
+static struct trigger box_raft_on_update;
+
+static int
+box_raft_on_update_f(struct trigger *trigger, void *event)
+{
+	(void)trigger;
+	struct raft *raft = (struct raft *)event;
+	assert(raft == box_raft());
+	if (raft->state != RAFT_STATE_LEADER)
+		return 0;
+	/*
+	 * When the node became a leader, it means it will ignore all records
+	 * from all the other nodes, and won't get late CONFIRM messages anyway.
+	 * Can clear the queue without waiting for confirmations.
+	 */
+	box_clear_synchro_queue(false);
+	return 0;
+}
+
+void
+box_raft_init(void)
+{
+	raft_create(&box_raft_global);
+	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
+	raft_on_update(box_raft(), &box_raft_on_update);
+}
+
+void
+box_raft_free(void)
+{
+	/*
+	 * Can't join the fiber, because the event loop is stopped already, and
+	 * yields are not allowed.
+	 */
+	box_raft_global.worker = NULL;
+	raft_destroy(&box_raft_global);
+	/*
+	 * Invalidate so as box_raft() would fail if any usage attempt happens.
+	 */
+	box_raft_global.state = 0;
+}
diff --git a/src/box/raft.h b/src/box/raft.h
new file mode 100644
index 000000000..fe0f073dc
--- /dev/null
+++ b/src/box/raft.h
@@ -0,0 +1,59 @@
+#pragma once
+/*
+ * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "raftlib.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/** Raft state of this instance. */
+static inline struct raft *
+box_raft(void)
+{
+	extern struct raft box_raft_global;
+	/**
+	 * Ensure the raft node can be used. I.e. that it is properly
+	 * initialized. Entirely for debug purposes.
+	 */
+	assert(box_raft_global.state != 0);
+	return &box_raft_global;
+}
+
+void
+box_raft_init(void);
+
+void
+box_raft_free(void);
+
+#if defined(__cplusplus)
+}
+#endif
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ff664a4d1..3867c63e0 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -44,14 +44,6 @@
  */
 #define RAFT_RANDOM_ELECTION_FACTOR 0.1
 
-struct raft box_raft_global = {
-	/*
-	 * Set an invalid state to validate in runtime the global raft node is
-	 * not used before initialization.
-	 */
-	.state = 0,
-};
-
 /**
  * When decoding we should never trust that there is
  * a valid data incomes.
@@ -1106,24 +1098,3 @@ raft_destroy(struct raft *raft)
 		raft->worker = NULL;
 	}
 }
-
-void
-box_raft_init(void)
-{
-	raft_create(&box_raft_global);
-}
-
-void
-box_raft_free(void)
-{
-	/*
-	 * Can't join the fiber, because the event loop is stopped already, and
-	 * yields are not allowed.
-	 */
-	box_raft_global.worker = NULL;
-	raft_destroy(&box_raft_global);
-	/*
-	 * Invalidate so as box_raft() would fail if any usage attempt happens.
-	 */
-	box_raft_global.state = 0;
-}
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 3088fba23..805f69d64 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -271,25 +271,6 @@ raft_create(struct raft *raft);
 void
 raft_destroy(struct raft *raft);
 
-/** Raft state of this instance. */
-static inline struct raft *
-box_raft(void)
-{
-	extern struct raft box_raft_global;
-	/**
-	 * Ensure the raft node can be used. I.e. that it is properly
-	 * initialized. Entirely for debug purposes.
-	 */
-	assert(box_raft_global.state != 0);
-	return &box_raft_global;
-}
-
-void
-box_raft_init(void);
-
-void
-box_raft_free(void);
-
 #if defined(__cplusplus)
 }
 #endif
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout()
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (4 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  8:15   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/, including global
replication parameters such as replication_timeout, and functions
like replication_disconnect_timeout().

The patch makes raft stop using replication_disconnect_timeout().
Instead, it stores death timeout in struct raft. It is configured
by box simultaneously with replication_timeout.

Part of #5303
---
 src/box/box.cc    |  2 +-
 src/box/raftlib.c | 13 ++++++-------
 src/box/raftlib.h | 10 +++++++---
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8dd92a5f5..25673ed42 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -890,7 +890,7 @@ void
 box_set_replication_timeout(void)
 {
 	replication_timeout = box_check_replication_timeout();
-	raft_cfg_death_timeout(box_raft());
+	raft_cfg_death_timeout(box_raft(), replication_disconnect_timeout());
 }
 
 void
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 3867c63e0..c156d6f46 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -804,8 +804,7 @@ raft_sm_wait_leader_dead(struct raft *raft)
 	assert(raft->is_candidate);
 	assert(raft->state == RAFT_STATE_FOLLOWER);
 	assert(raft->leader != 0);
-	double death_timeout = replication_disconnect_timeout();
-	ev_timer_set(&raft->timer, death_timeout, death_timeout);
+	ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
 	ev_timer_start(loop(), &raft->timer);
 }
 
@@ -817,8 +816,7 @@ raft_sm_wait_leader_found(struct raft *raft)
 	assert(raft->is_candidate);
 	assert(raft->state == RAFT_STATE_FOLLOWER);
 	assert(raft->leader == 0);
-	double death_timeout = replication_disconnect_timeout();
-	ev_timer_set(&raft->timer, death_timeout, death_timeout);
+	ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
 	ev_timer_start(loop(), &raft->timer);
 }
 
@@ -1012,14 +1010,14 @@ raft_cfg_election_quorum(struct raft *raft)
 }
 
 void
-raft_cfg_death_timeout(struct raft *raft)
+raft_cfg_death_timeout(struct raft *raft, double death_timeout)
 {
+	raft->death_timeout = death_timeout;
 	if (raft->state == RAFT_STATE_FOLLOWER && raft->is_candidate &&
 	    raft->leader != 0) {
 		assert(ev_is_active(&raft->timer));
-		double death_timeout = replication_disconnect_timeout();
 		double timeout = ev_timer_remaining(loop(), &raft->timer) -
-				 raft->timer.at + death_timeout;
+				 raft->timer.at + raft->death_timeout;
 		ev_timer_stop(loop(), &raft->timer);
 		ev_timer_set(&raft->timer, timeout, timeout);
 		ev_timer_start(loop(), &raft->timer);
@@ -1080,6 +1078,7 @@ raft_create(struct raft *raft)
 		.volatile_term = 1,
 		.term =	1,
 		.election_timeout = 5,
+		.death_timeout = 5,
 	};
 	ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
 	raft->timer.data = raft;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 805f69d64..b33a20326 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -156,6 +156,11 @@ struct raft {
 	struct fiber *worker;
 	/** Configured election timeout in seconds. */
 	double election_timeout;
+	/**
+	 * Leader death timeout, after which it is considered dead and new
+	 * elections can be started.
+	 */
+	double death_timeout;
 	/**
 	 * Trigger invoked each time any of the Raft node visible attributes are
 	 * changed.
@@ -229,11 +234,10 @@ raft_cfg_election_quorum(struct raft *raft);
 
 /**
  * Configure Raft leader death timeout. I.e. number of seconds without
- * heartbeats from the leader to consider it dead. There is no a separate
- * option. Raft uses replication timeout for that.
+ * heartbeats from the leader to consider it dead.
  */
 void
-raft_cfg_death_timeout(struct raft *raft);
+raft_cfg_death_timeout(struct raft *raft, double death_timeout);
 
 /**
  * Bump the term. When it is persisted, the node checks if there is a leader,
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (5 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  8:17   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id Vladislav Shpilevoy
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/, including global
replication parameters such as replication_synchro_quorum.

The patch makes raft stop using replication_synchro_quorum.

Instead, it has a new option 'election_quorum'. Note, that this is
just Raft API. Box API still uses replication_synchro_quorum. But
it is used to calculate the final quorum in src/box/raft, not
in src/box/raftlib. And to pass it to the base Raft
implementation.

Part of #5303
---
 src/box/box.cc         |  2 +-
 src/box/raft.c         | 52 +++++++++++++++++++++++++++++++
 src/box/raft.h         |  8 +++++
 src/box/raftlib.c      | 70 ++++++++----------------------------------
 src/box/raftlib.h      | 10 +++---
 src/box/replication.cc |  3 ++
 6 files changed, 83 insertions(+), 62 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 25673ed42..cc0d7b81d 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -921,7 +921,7 @@ box_set_replication_synchro_quorum(void)
 		return -1;
 	replication_synchro_quorum = value;
 	txn_limbo_on_parameters_change(&txn_limbo);
-	raft_cfg_election_quorum(box_raft());
+	box_raft_reconsider_election_quorum();
 	return 0;
 }
 
diff --git a/src/box/raft.c b/src/box/raft.c
index f289a6993..af6e71e0b 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -30,6 +30,7 @@
  */
 #include "box.h"
 #include "raft.h"
+#include "replication.h"
 
 struct raft box_raft_global = {
 	/*
@@ -62,6 +63,57 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
 	return 0;
 }
 
+void
+box_raft_reconsider_election_quorum(void)
+{
+	/*
+	 * When the instance is started first time, it does not have an ID, so
+	 * the registered count is 0. But the quorum can never be 0. At least
+	 * the current instance should participate in the quorum.
+	 */
+	int max = MAX(replicaset.registered_count, 1);
+	/**
+	 * Election quorum is not strictly equal to synchronous replication
+	 * quorum. Sometimes it can be lowered. That is about bootstrap.
+	 *
+	 * The problem with bootstrap is that when the replicaset boots, all the
+	 * instances can't write to WAL and can't recover from their initial
+	 * snapshot. They need one node which will boot first, and then they
+	 * will replicate from it.
+	 *
+	 * This one node should boot from its zero snapshot, create replicaset
+	 * UUID, register self with ID 1 in _cluster space, and then register
+	 * all the other instances here. To do that the node must be writable.
+	 * It should have read_only = false, connection quorum satisfied, and be
+	 * a Raft leader if Raft is enabled.
+	 *
+	 * To be elected a Raft leader it needs to perform election. But it
+	 * can't be done before at least synchronous quorum of the replicas is
+	 * bootstrapped. And they can't be bootstrapped because wait for a
+	 * leader to initialize _cluster. Cyclic dependency.
+	 *
+	 * This is resolved by truncation of the election quorum to the number
+	 * of registered replicas, if their count is less than synchronous
+	 * quorum. That helps to elect a first leader.
+	 *
+	 * It may seem that the first node could just declare itself a leader
+	 * and then strictly follow the protocol from now on, but that won't
+	 * work, because if the first node will restart after it is booted, but
+	 * before quorum of replicas is booted, the cluster will stuck again.
+	 *
+	 * The current solution is totally safe because
+	 *
+	 * - after all the cluster will have node count >= quorum, if user used
+	 *   a correct config (God help him if he didn't);
+	 *
+	 * - synchronous replication quorum is untouched - it is not truncated.
+	 *   Only leader election quorum is affected. So synchronous data won't
+	 *   be lost.
+	 */
+	int quorum = MIN(replication_synchro_quorum, max);
+	raft_cfg_election_quorum(box_raft(), quorum);
+}
+
 void
 box_raft_init(void)
 {
diff --git a/src/box/raft.h b/src/box/raft.h
index fe0f073dc..09297273f 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -48,6 +48,14 @@ box_raft(void)
 	return &box_raft_global;
 }
 
+/**
+ * Let the global raft know that the election quorum could change. It happens
+ * when configuration is updated, and when new nodes are added or old are
+ * deleted from the cluster.
+ */
+void
+box_raft_reconsider_election_quorum(void);
+
 void
 box_raft_init(void);
 
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index c156d6f46..0657fa85a 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -130,50 +130,6 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
 	return cmp == 0 || cmp == 1;
 }
 
-/**
- * Election quorum is not strictly equal to synchronous replication quorum.
- * Sometimes it can be lowered. That is about bootstrap.
- *
- * The problem with bootstrap is that when the replicaset boots, all the
- * instances can't write to WAL and can't recover from their initial snapshot.
- * They need one node which will boot first, and then they will replicate from
- * it.
- *
- * This one node should boot from its zero snapshot, create replicaset UUID,
- * register self with ID 1 in _cluster space, and then register all the other
- * instances here. To do that the node must be writable. It should have
- * read_only = false, connection quorum satisfied, and be a Raft leader if Raft
- * is enabled.
- *
- * To be elected a Raft leader it needs to perform election. But it can't be
- * done before at least synchronous quorum of the replicas is bootstrapped. And
- * they can't be bootstrapped because wait for a leader to initialize _cluster.
- * Cyclic dependency.
- *
- * This is resolved by truncation of the election quorum to the number of
- * registered replicas, if their count is less than synchronous quorum. That
- * helps to elect a first leader.
- *
- * It may seem that the first node could just declare itself a leader and then
- * strictly follow the protocol from now on, but that won't work, because if the
- * first node will restart after it is booted, but before quorum of replicas is
- * booted, the cluster will stuck again.
- *
- * The current solution is totally safe because
- *
- * - after all the cluster will have node count >= quorum, if user used a
- *   correct config (God help him if he didn't);
- *
- * - synchronous replication quorum is untouched - it is not truncated. Only
- *   leader election quorum is affected. So synchronous data won't be lost.
- */
-static inline int
-raft_election_quorum(const struct raft *raft)
-{
-	(void)raft;
-	return MIN(replication_synchro_quorum, replicaset.registered_count);
-}
-
 /**
  * Wakeup the Raft worker fiber in order to do some async work. If the fiber
  * does not exist yet, it is created.
@@ -427,13 +383,12 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
 			 * and now was answered by some other instance.
 			 */
 			assert(raft->volatile_vote == instance_id);
-			int quorum = raft_election_quorum(raft);
 			bool was_set = bit_set(&raft->vote_mask, source);
 			raft->vote_count += !was_set;
-			if (raft->vote_count < quorum) {
+			if (raft->vote_count < raft->election_quorum) {
 				say_info("RAFT: accepted vote for self, vote "
 					 "count is %d/%d", raft->vote_count,
-					 quorum);
+					 raft->election_quorum);
 				break;
 			}
 			raft_sm_become_leader(raft);
@@ -594,7 +549,7 @@ end_dump:
 			raft_sm_wait_leader_dead(raft);
 		} else if (raft->vote == instance_id) {
 			/* Just wrote own vote. */
-			if (raft_election_quorum(raft) == 1)
+			if (raft->election_quorum == 1)
 				raft_sm_become_leader(raft);
 			else
 				raft_sm_become_candidate(raft);
@@ -692,7 +647,7 @@ raft_sm_become_leader(struct raft *raft)
 {
 	assert(raft->state != RAFT_STATE_LEADER);
 	say_info("RAFT: enter leader state with quorum %d",
-		 raft_election_quorum(raft));
+		 raft->election_quorum);
 	assert(raft->leader == 0);
 	assert(raft->is_candidate);
 	assert(!raft->is_write_in_progress);
@@ -730,7 +685,7 @@ raft_sm_become_candidate(struct raft *raft)
 	assert(raft->vote == instance_id);
 	assert(raft->is_candidate);
 	assert(!raft->is_write_in_progress);
-	assert(raft_election_quorum(raft) > 1);
+	assert(raft->election_quorum > 1);
 	raft->state = RAFT_STATE_CANDIDATE;
 	raft->vote_count = 1;
 	raft->vote_mask = 0;
@@ -999,14 +954,14 @@ raft_cfg_election_timeout(struct raft *raft, double timeout)
 }
 
 void
-raft_cfg_election_quorum(struct raft *raft)
+raft_cfg_election_quorum(struct raft *raft, int election_quorum)
 {
-	if (raft->state != RAFT_STATE_CANDIDATE ||
-	    raft->state == RAFT_STATE_LEADER)
-		return;
-	if (raft->vote_count < raft_election_quorum(raft))
-		return;
-	raft_sm_become_leader(raft);
+	/* At least self is always a part of the quorum. */
+	assert(election_quorum > 0);
+	raft->election_quorum = election_quorum;
+	if (raft->state == RAFT_STATE_CANDIDATE &&
+	    raft->vote_count >= raft->election_quorum)
+		raft_sm_become_leader(raft);
 }
 
 void
@@ -1077,6 +1032,7 @@ raft_create(struct raft *raft)
 		.state = RAFT_STATE_FOLLOWER,
 		.volatile_term = 1,
 		.term =	1,
+		.election_quorum = 1,
 		.election_timeout = 5,
 		.death_timeout = 5,
 	};
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index b33a20326..c9c13136e 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -150,6 +150,8 @@ struct raft {
 	vclock_map_t vote_mask;
 	/** Number of votes for this instance. Valid only in candidate state. */
 	int vote_count;
+	/** Number of votes necessary for successful election. */
+	int election_quorum;
 	/** State machine timed event trigger. */
 	struct ev_timer timer;
 	/** Worker fiber to execute blocking tasks like IO. */
@@ -225,12 +227,12 @@ void
 raft_cfg_election_timeout(struct raft *raft, double timeout);
 
 /**
- * Configure Raft leader election quorum. There is no a separate option.
- * Instead, synchronous replication quorum is used. Since Raft is tightly bound
- * with synchronous replication.
+ * Configure Raft leader election quorum. That may trigger immediate election,
+ * if the quorum is lowered, and this instance is a candidate having enough
+ * votes for the new quorum.
  */
 void
-raft_cfg_election_quorum(struct raft *raft);
+raft_cfg_election_quorum(struct raft *raft, int election_quorum);
 
 /**
  * Configure Raft leader death timeout. I.e. number of seconds without
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 65512cf0f..19d7f6beb 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -39,6 +39,7 @@
 #include "box.h"
 #include "gc.h"
 #include "error.h"
+#include "raft.h"
 #include "relay.h"
 #include "sio.h"
 
@@ -250,6 +251,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
 	replica->anon = false;
+	box_raft_reconsider_election_quorum();
 }
 
 void
@@ -298,6 +300,7 @@ replica_clear_id(struct replica *replica)
 		assert(!replica->anon);
 		replica_delete(replica);
 	}
+	box_raft_reconsider_election_quorum();
 }
 
 void
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (6 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  8:59   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant Vladislav Shpilevoy
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.

The patch makes raft stop using instance_id.

Instead, it has a new option 'instance_id'. It is stored inside
struct raft as 'self', and should be configured using
raft_cfg_instance_id().

The configuration is done when bootstrap ends and the instance_id
is either recovered successfully, or the instance is anonymous.

While working on this, I also considered introducing a new
function raft_boot() instead of raft_cfg_instance_id(). Which I
would also use to configure vclock later. Raft_boot() would be
meant to be called only one time with non-dynamic parameters
instance_id and vclock.

But then I decided to keep adding new raft_cfg_*() functions.
Because:

- It is more consistent with the existing options;

- Does not require to think about too many different functions
  like raft_create(), raft_boot(), raft_cfg_*() and in which order
  to call them;

Also I was thinking to introduce a single raft_cfg() like I did
in swim with swim_cfg(), to reduce number of raft_cfg_*()
functions, but decided it would be even worse with so many
options.

Part of #5303
---
 src/box/box.cc    | 10 +++++-----
 src/box/raftlib.c | 32 ++++++++++++++++++++------------
 src/box/raftlib.h |  9 +++++++++
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index cc0d7b81d..8f5f3558e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2763,12 +2763,12 @@ box_cfg_xc(void)
 	 * Fill in leader election parameters after bootstrap. Before it is not
 	 * possible - there may be relevant data to recover from WAL and
 	 * snapshot. Also until recovery is done, it is not possible to write
-	 * new records into WAL. It is also totally safe, because relaying is
-	 * not started until the box is configured. So it can't happen, that
-	 * this election-enabled node will try to relay to another
-	 * election-enabled node without election actually enabled leading to
-	 * disconnect.
+	 * new records into WAL. Another reason - before recovery is done,
+	 * instance_id is not known, so Raft simply can't work.
 	 */
+	if (!replication_anon)
+		raft_cfg_instance_id(box_raft(), instance_id);
+
 	if (box_set_election_timeout() != 0)
 		diag_raise();
 	/*
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 0657fa85a..ca1940ba6 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -296,7 +296,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
 	say_info("RAFT: message %s from %u", raft_request_to_string(req),
 		 source);
 	assert(source > 0);
-	assert(source != instance_id);
+	assert(source != raft->self);
 	if (req->term == 0 || req->state == 0) {
 		diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
 			 "be zero");
@@ -337,7 +337,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
 					 raft->leader);
 				break;
 			}
-			if (req->vote == instance_id) {
+			if (req->vote == raft->self) {
 				/*
 				 * This is entirely valid. This instance could
 				 * request a vote, then become a follower or
@@ -373,7 +373,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
 			break;
 		case RAFT_STATE_CANDIDATE:
 			/* Check if this is a vote for a competing candidate. */
-			if (req->vote != instance_id) {
+			if (req->vote != raft->self) {
 				say_info("RAFT: vote request is skipped - "
 					 "competing candidate");
 				break;
@@ -382,7 +382,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
 			 * Vote for self was requested earlier in this round,
 			 * and now was answered by some other instance.
 			 */
-			assert(raft->volatile_vote == instance_id);
+			assert(raft->volatile_vote == raft->self);
 			bool was_set = bit_set(&raft->vote_mask, source);
 			raft->vote_count += !was_set;
 			if (raft->vote_count < raft->election_quorum) {
@@ -547,7 +547,7 @@ end_dump:
 		} else if (raft->leader != 0) {
 			/* There is a known leader. Wait until it is dead. */
 			raft_sm_wait_leader_dead(raft);
-		} else if (raft->vote == instance_id) {
+		} else if (raft->vote == raft->self) {
 			/* Just wrote own vote. */
 			if (raft->election_quorum == 1)
 				raft_sm_become_leader(raft);
@@ -561,7 +561,7 @@ end_dump:
 			raft_sm_wait_election_end(raft);
 		} else {
 			/* No leaders, no votes. */
-			raft_sm_schedule_new_vote(raft, instance_id);
+			raft_sm_schedule_new_vote(raft, raft->self);
 		}
 	} else {
 		memset(&req, 0, sizeof(req));
@@ -596,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
 	req.vote = raft->vote;
 	req.state = raft->state;
 	if (req.state == RAFT_STATE_CANDIDATE) {
-		assert(raft->vote == instance_id);
+		assert(raft->vote == raft->self);
 		req.vclock = &replicaset.vclock;
 	}
 	replicaset_foreach(replica)
@@ -652,7 +652,7 @@ raft_sm_become_leader(struct raft *raft)
 	assert(raft->is_candidate);
 	assert(!raft->is_write_in_progress);
 	raft->state = RAFT_STATE_LEADER;
-	raft->leader = instance_id;
+	raft->leader = raft->self;
 	ev_timer_stop(loop(), &raft->timer);
 	/* Make read-write (if other subsystems allow that. */
 	box_update_ro_summary();
@@ -682,14 +682,14 @@ raft_sm_become_candidate(struct raft *raft)
 	say_info("RAFT: enter candidate state with 1 self vote");
 	assert(raft->state == RAFT_STATE_FOLLOWER);
 	assert(raft->leader == 0);
-	assert(raft->vote == instance_id);
+	assert(raft->vote == raft->self);
 	assert(raft->is_candidate);
 	assert(!raft->is_write_in_progress);
 	assert(raft->election_quorum > 1);
 	raft->state = RAFT_STATE_CANDIDATE;
 	raft->vote_count = 1;
 	raft->vote_mask = 0;
-	bit_set(&raft->vote_mask, instance_id);
+	bit_set(&raft->vote_mask, raft->self);
 	raft_sm_wait_election_end(raft);
 	/* State is visible and it is changed - broadcast. */
 	raft_schedule_broadcast(raft);
@@ -736,7 +736,7 @@ raft_sm_schedule_new_election(struct raft *raft)
 	assert(raft->is_candidate);
 	/* Everyone is a follower until its vote for self is persisted. */
 	raft_sm_schedule_new_term(raft, raft->term + 1);
-	raft_sm_schedule_new_vote(raft, instance_id);
+	raft_sm_schedule_new_vote(raft, raft->self);
 	box_update_ro_summary();
 }
 
@@ -783,7 +783,7 @@ raft_sm_wait_election_end(struct raft *raft)
 	assert(raft->is_candidate);
 	assert(raft->state == RAFT_STATE_FOLLOWER ||
 	       (raft->state == RAFT_STATE_CANDIDATE &&
-		raft->volatile_vote == instance_id));
+		raft->volatile_vote == raft->self));
 	assert(raft->leader == 0);
 	double election_timeout = raft->election_timeout +
 				  raft_new_random_election_shift(raft);
@@ -979,6 +979,14 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout)
 	}
 }
 
+void
+raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
+{
+	assert(raft->self == 0);
+	assert(instance_id != 0);
+	raft->self = instance_id;
+}
+
 void
 raft_new_term(struct raft *raft)
 {
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index c9c13136e..f75ed2567 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -95,6 +95,8 @@ const char *
 raft_state_str(uint32_t state);
 
 struct raft {
+	/** Instance ID of this node. */
+	uint32_t self;
 	/** Instance ID of leader of the current term. */
 	uint32_t leader;
 	/** State of the instance. */
@@ -241,6 +243,13 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum);
 void
 raft_cfg_death_timeout(struct raft *raft, double death_timeout);
 
+/**
+ * Configure ID of the given Raft instance. The ID can't be changed after it is
+ * assigned first time.
+ */
+void
+raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
+
 /**
  * Bump the term. When it is persisted, the node checks if there is a leader,
  * and if there is not, a new election is started. That said, this function can
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (7 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  9:17   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock Vladislav Shpilevoy
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is never supposed to change vclock. Not the stored one, nor
the received ones. The patch makes it checked during compilation.

The patch is mostly motivated by a next patch making Raft use an
externally configured vclock which can't be changed. Since Raft
uses raft_request to carry the vclock in a few places, the
request's vclock also must become const.

Part of #5303
---
 src/box/box.cc    | 2 +-
 src/box/raftlib.c | 9 +++------
 src/box/raftlib.h | 3 +--
 src/box/xrow.c    | 2 +-
 src/box/xrow.h    | 2 +-
 5 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8f5f3558e..78fca928e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2142,7 +2142,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		 * should be 0.
 		 */
 		struct raft_request req;
-		raft_serialize_for_network(box_raft(), &req, &vclock);
+		raft_serialize_for_network(box_raft(), &req);
 		xrow_encode_raft(&row, &fiber()->gc, &req);
 		coio_write_xrow(io, &row);
 	}
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ca1940ba6..78164bf91 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -850,8 +850,7 @@ raft_sm_stop(struct raft *raft)
 }
 
 void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
-			   struct vclock *vclock)
+raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
 {
 	memset(req, 0, sizeof(*req));
 	/*
@@ -865,10 +864,8 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
 	 * Raft does not own vclock, so it always expects it passed externally.
 	 * Vclock is sent out only by candidate instances.
 	 */
-	if (req->state == RAFT_STATE_CANDIDATE) {
-		req->vclock = vclock;
-		vclock_copy(vclock, &replicaset.vclock);
-	}
+	if (req->state == RAFT_STATE_CANDIDATE)
+		req->vclock = &replicaset.vclock;
 }
 
 void
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index f75ed2567..2da3cec86 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -263,8 +263,7 @@ raft_new_term(struct raft *raft);
  * cluster. It is allowed to save anything here, not only persistent state.
  */
 void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
-			   struct vclock *vclock);
+raft_serialize_for_network(const struct raft *raft, struct raft_request *req);
 
 /**
  * Save complete Raft state into a request to be persisted on disk. Only term
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 165a00a16..bc06738ad 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -1057,7 +1057,7 @@ xrow_decode_raft(const struct xrow_header *row, struct raft_request *r,
 			r->vclock = vclock;
 			if (r->vclock == NULL)
 				mp_next(&pos);
-			else if (mp_decode_vclock_ignore0(&pos, r->vclock) != 0)
+			else if (mp_decode_vclock_ignore0(&pos, vclock) != 0)
 				goto bad_msgpack;
 			break;
 		default:
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 095911239..3d68c1268 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -268,7 +268,7 @@ struct raft_request {
 	uint64_t term;
 	uint32_t vote;
 	uint32_t state;
-	struct vclock *vclock;
+	const struct vclock *vclock;
 };
 
 int
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (8 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  9:23   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network Vladislav Shpilevoy
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.

The patch makes raft stop using replicaset.vclock.

Instead, it has a new option 'vclock'. It is stored inside struct
raft by pointer and should be configured using raft_cfg_vclock().

Box configures it to point at replicaset.vclock like before. But
now raftlib code does not depend on it explicitly.

Vclock is stored in Raft by pointer instead of by value so as not
to update it for each transaction. It would be too high price to
pay for Raft independence from box.

Part of #5303
---
 src/box/box.cc    |  1 +
 src/box/raftlib.c | 15 +++++++++++----
 src/box/raftlib.h | 16 ++++++++++++++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 78fca928e..ff80e45a4 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2768,6 +2768,7 @@ box_cfg_xc(void)
 	 */
 	if (!replication_anon)
 		raft_cfg_instance_id(box_raft(), instance_id);
+	raft_cfg_vclock(box_raft(), &replicaset.vclock);
 
 	if (box_set_election_timeout() != 0)
 		diag_raise();
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 78164bf91..ab2e27fd8 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -125,8 +125,7 @@ raft_new_random_election_shift(const struct raft *raft)
 static inline bool
 raft_can_vote_for(const struct raft *raft, const struct vclock *v)
 {
-	(void)raft;
-	int cmp = vclock_compare_ignore0(v, &replicaset.vclock);
+	int cmp = vclock_compare_ignore0(v, raft->vclock);
 	return cmp == 0 || cmp == 1;
 }
 
@@ -597,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
 	req.state = raft->state;
 	if (req.state == RAFT_STATE_CANDIDATE) {
 		assert(raft->vote == raft->self);
-		req.vclock = &replicaset.vclock;
+		req.vclock = raft->vclock;
 	}
 	replicaset_foreach(replica)
 		relay_push_raft(replica->relay, &req);
@@ -865,7 +864,7 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
 	 * Vclock is sent out only by candidate instances.
 	 */
 	if (req->state == RAFT_STATE_CANDIDATE)
-		req->vclock = &replicaset.vclock;
+		req->vclock = raft->vclock;
 }
 
 void
@@ -984,6 +983,14 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
 	raft->self = instance_id;
 }
 
+void
+raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
+{
+	assert(raft->vclock == NULL);
+	assert(vclock != NULL);
+	raft->vclock = vclock;
+}
+
 void
 raft_new_term(struct raft *raft)
 {
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 2da3cec86..8d0d03da0 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -154,6 +154,15 @@ struct raft {
 	int vote_count;
 	/** Number of votes necessary for successful election. */
 	int election_quorum;
+	/**
+	 * Vclock of the Raft node owner. Raft never changes it, only watches,
+	 * and makes decisions based on it. The value is not stored by copy so
+	 * as to avoid frequent updates. If every transaction would need to
+	 * update several vclocks in different places, it would be too
+	 * expensive. So they update only one vclock, which is shared between
+	 * subsystems, such as Raft.
+	 */
+	const struct vclock *vclock;
 	/** State machine timed event trigger. */
 	struct ev_timer timer;
 	/** Worker fiber to execute blocking tasks like IO. */
@@ -250,6 +259,13 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout);
 void
 raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
 
+/**
+ * Configure vclock of the given Raft instance. The vclock is not copied, so the
+ * caller must keep it valid.
+ */
+void
+raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
+
 /**
  * Bump the term. When it is persisted, the node checks if there is a leader,
  * and if there is not, a new election is started. That said, this function can
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (9 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17  9:35   ` Serge Petrenko
  2020-11-17 10:00   ` Serge Petrenko
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
  11 siblings, 2 replies; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.

The patch makes raft stop using replicaset and journal objects.
They were used to broadcast messages to all the other nodes, and
to persist updates.

Now Raft does the same through vtab, which is configured by box.
Broadcast still sends messages via relays, and disk write still
uses the journal. But Raft does not depend on any specific journal
or network API.

Part of #5303
---
 src/box/raft.c    | 63 ++++++++++++++++++++++++++++++++++++++-
 src/box/raftlib.c | 75 ++++++++++-------------------------------------
 src/box/raftlib.h | 24 ++++++++++++++-
 3 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index af6e71e0b..845525660 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -29,7 +29,10 @@
  * SUCH DAMAGE.
  */
 #include "box.h"
+#include "error.h"
+#include "journal.h"
 #include "raft.h"
+#include "relay.h"
 #include "replication.h"
 
 struct raft box_raft_global = {
@@ -114,10 +117,68 @@ box_raft_reconsider_election_quorum(void)
 	raft_cfg_election_quorum(box_raft(), quorum);
 }
 
+static void
+box_raft_broadcast(struct raft *raft, const struct raft_request *req)
+{
+	(void)raft;
+	assert(raft == box_raft());
+	replicaset_foreach(replica)
+		relay_push_raft(replica->relay, req);
+}
+
+/** Wakeup Raft state writer fiber waiting for WAL write end. */
+static void
+box_raft_write_cb(struct journal_entry *entry)
+{
+	fiber_wakeup(entry->complete_data);
+}
+
+static void
+box_raft_write(struct raft *raft, const struct raft_request *req)
+{
+	(void)raft;
+	assert(raft == box_raft());
+	/* See Raft implementation why these fields are never written. */
+	assert(req->vclock == NULL);
+	assert(req->state == 0);
+
+	struct region *region = &fiber()->gc;
+	uint32_t svp = region_used(region);
+	struct xrow_header row;
+	char buf[sizeof(struct journal_entry) +
+		 sizeof(struct xrow_header *)];
+	struct journal_entry *entry = (struct journal_entry *)buf;
+	entry->rows[0] = &row;
+
+	if (xrow_encode_raft(&row, region, req) != 0)
+		goto fail;
+	journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
+			     fiber());
+
+	if (journal_write(entry) != 0 || entry->res < 0) {
+		diag_set(ClientError, ER_WAL_IO);
+		diag_log();
+		goto fail;
+	}
+
+	region_truncate(region, svp);
+	return;
+fail:
+	/*
+	 * XXX: the stub is supposed to be removed once it is defined what to do
+	 * when a raft request WAL write fails.
+	 */
+	panic("Could not write a raft request to WAL\n");
+}
+
 void
 box_raft_init(void)
 {
-	raft_create(&box_raft_global);
+	static const struct raft_vtab box_raft_vtab = {
+		.broadcast = box_raft_broadcast,
+		.write = box_raft_write,
+	};
+	raft_create(&box_raft_global, &box_raft_vtab);
 	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
 	raft_on_update(box_raft(), &box_raft_on_update);
 }
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ab2e27fd8..cc9139a5b 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -31,11 +31,9 @@
 #include "raft.h"
 
 #include "error.h"
-#include "journal.h"
+#include "fiber.h"
 #include "xrow.h"
 #include "small/region.h"
-#include "replication.h"
-#include "relay.h"
 #include "box.h"
 #include "tt_static.h"
 
@@ -469,58 +467,6 @@ raft_process_heartbeat(struct raft *raft, uint32_t source)
 	raft_sm_wait_leader_dead(raft);
 }
 
-/** Wakeup Raft state writer fiber waiting for WAL write end. */
-static void
-raft_write_cb(struct journal_entry *entry)
-{
-	fiber_wakeup(entry->complete_data);
-}
-
-/** Synchronously write a Raft request into WAL. */
-static void
-raft_write_request(const struct raft_request *req)
-{
-	/*
-	 * Vclock is never persisted by Raft. It is used only to
-	 * be sent to network when vote for self.
-	 */
-	assert(req->vclock == NULL);
-	/*
-	 * State is not persisted. That would be strictly against Raft protocol.
-	 * The reason is that it does not make much sense - even if the node is
-	 * a leader now, after the node is restarted, there will be another
-	 * leader elected by that time likely.
-	 */
-	assert(req->state == 0);
-	struct region *region = &fiber()->gc;
-	uint32_t svp = region_used(region);
-	struct xrow_header row;
-	char buf[sizeof(struct journal_entry) +
-		 sizeof(struct xrow_header *)];
-	struct journal_entry *entry = (struct journal_entry *)buf;
-	entry->rows[0] = &row;
-
-	if (xrow_encode_raft(&row, region, req) != 0)
-		goto fail;
-	journal_entry_create(entry, 1, xrow_approx_len(&row), raft_write_cb,
-			     fiber());
-
-	if (journal_write(entry) != 0 || entry->res < 0) {
-		diag_set(ClientError, ER_WAL_IO);
-		diag_log();
-		goto fail;
-	}
-
-	region_truncate(region, svp);
-	return;
-fail:
-	/*
-	 * XXX: the stub is supposed to be removed once it is defined what to do
-	 * when a raft request WAL write fails.
-	 */
-	panic("Could not write a raft request to WAL\n");
-}
-
 /* Dump Raft state to WAL in a blocking way. */
 static void
 raft_worker_handle_io(struct raft *raft)
@@ -567,8 +513,17 @@ end_dump:
 		assert(raft->volatile_term >= raft->term);
 		req.term = raft->volatile_term;
 		req.vote = raft->volatile_vote;
-
-		raft_write_request(&req);
+		/*
+		 * Skip vclock. It is used only to be sent to network when vote
+		 * for self. It is a job of the vclock owner to persist it
+		 * anyhow.
+		 *
+		 * Skip state. That would be strictly against Raft protocol. The
+		 * reason is that it does not make much sense - even if the node
+		 * is a leader now, after the node is restarted, there will be
+		 * another leader elected by that time likely.
+		 */
+		raft->vtab->write(raft, &req);
 		say_info("RAFT: persisted state %s",
 			 raft_request_to_string(&req));
 
@@ -598,8 +553,7 @@ raft_worker_handle_broadcast(struct raft *raft)
 		assert(raft->vote == raft->self);
 		req.vclock = raft->vclock;
 	}
-	replicaset_foreach(replica)
-		relay_push_raft(replica->relay, &req);
+	raft->vtab->broadcast(raft, &req);
 	trigger_run(&raft->on_update, raft);
 	raft->is_broadcast_scheduled = false;
 }
@@ -1038,7 +992,7 @@ raft_schedule_broadcast(struct raft *raft)
 }
 
 void
-raft_create(struct raft *raft)
+raft_create(struct raft *raft, const struct raft_vtab *vtab)
 {
 	*raft = (struct raft) {
 		.state = RAFT_STATE_FOLLOWER,
@@ -1047,6 +1001,7 @@ raft_create(struct raft *raft)
 		.election_quorum = 1,
 		.election_timeout = 5,
 		.death_timeout = 5,
+		.vtab = vtab,
 	};
 	ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
 	raft->timer.data = raft;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 8d0d03da0..6181d9d49 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -69,6 +69,7 @@ extern "C" {
  */
 
 struct fiber;
+struct raft;
 struct raft_request;
 
 enum raft_state {
@@ -94,6 +95,21 @@ enum raft_state {
 const char *
 raft_state_str(uint32_t state);
 
+typedef void (*raft_broadcast_f)(struct raft *raft,
+				 const struct raft_request *req);
+typedef void (*raft_write_f)(struct raft *raft, const struct raft_request *req);
+
+/**
+ * Raft connection to the environment, via which it talks to other nodes and
+ * saves something to disk.
+ */
+struct raft_vtab {
+	/** Send a message to all nodes in the cluster. */
+	raft_broadcast_f broadcast;
+	/** Save a message to disk. */
+	raft_write_f write;
+};
+
 struct raft {
 	/** Instance ID of this node. */
 	uint32_t self;
@@ -174,6 +190,8 @@ struct raft {
 	 * elections can be started.
 	 */
 	double death_timeout;
+	/** Virtual table to perform application-specific actions. */
+	const struct raft_vtab *vtab;
 	/**
 	 * Trigger invoked each time any of the Raft node visible attributes are
 	 * changed.
@@ -295,8 +313,12 @@ raft_serialize_for_disk(const struct raft *raft, struct raft_request *req);
 void
 raft_on_update(struct raft *raft, struct trigger *trigger);
 
+/**
+ * Create a Raft node. The vtab is not copied. Its memory should stay valid even
+ * after the creation.
+ */
 void
-raft_create(struct raft *raft);
+raft_create(struct raft *raft, const struct raft_vtab *vtab);
 
 void
 raft_destroy(struct raft *raft);
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency
  2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
                   ` (10 preceding siblings ...)
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network Vladislav Shpilevoy
@ 2020-11-17  0:02 ` Vladislav Shpilevoy
  2020-11-17 10:22   ` Serge Petrenko
  11 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-17  0:02 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Raft used to depend on xrow, because it used raft_request as a
communication and persistence unit. Xrow is a part of src/box
library set, so it blocked Raft extraction into src/lib/raft.

This patch makes Raft not depend on xrow. For that Raft introduces
a new communication and persistence unit - struct raft_msg.
Interestingly, throughout its source code Raft already uses term
'message' to describe requests, so this patch also restores the
consistency. This is because raft_request name was used to be
consistent with other *_request structs in xrow.h. Now Raft does
not depend on this, and can use its own name.

Struct raft_msg repeats raft_request literally, but it actually
makes sense. Because when Raft is extracted to a new library, it
may start evolving independently. Its raft_msg may be populated
with new members, or their behaviour may change depending on how
the algorithm will evolve.

But inside box it will be possible to tweak and extend raft_msg
whenever it is necessary, via struct raft_request, and without
changing the basic library.

For instance, in future we may want to make nodes forward the
messages to each other during voting to speed the process up, and
for that we may want to add an explicit 'source' field to
raft_request, while it won't be necessary on the level of
raft_msg.

There is a new compatibility layer in src/box/raft.h which hides
raft_msg details from other box code, and does the msg <-> request
conversions.

Part of #5303
---
 src/box/applier.cc     |  2 +-
 src/box/box.cc         |  4 +--
 src/box/memtx_engine.c |  4 +--
 src/box/raft.c         | 70 ++++++++++++++++++++++++++++++++++++++----
 src/box/raft.h         | 24 +++++++++++++++
 src/box/raftlib.c      | 24 ++++++---------
 src/box/raftlib.h      | 38 ++++++++++++++++++-----
 src/box/xrow.h         |  4 +++
 8 files changed, 137 insertions(+), 33 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index fbde0eccd..fb2f5d130 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -893,7 +893,7 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
 	struct vclock candidate_clock;
 	if (xrow_decode_raft(row, &req, &candidate_clock) != 0)
 		return -1;
-	return raft_process_msg(box_raft(), &req, applier->instance_id);
+	return box_raft_process(&req, applier->instance_id);
 }
 
 /**
diff --git a/src/box/box.cc b/src/box/box.cc
index ff80e45a4..7d23de95c 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -394,7 +394,7 @@ apply_wal_row(struct xstream *stream, struct xrow_header *row)
 		/* Vclock is never persisted in WAL by Raft. */
 		if (xrow_decode_raft(row, &raft_req, NULL) != 0)
 			diag_raise();
-		raft_process_recovery(box_raft(), &raft_req);
+		box_raft_recover(&raft_req);
 		return;
 	}
 	xrow_decode_dml_xc(row, &request, dml_request_key_map(row->type));
@@ -2142,7 +2142,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		 * should be 0.
 		 */
 		struct raft_request req;
-		raft_serialize_for_network(box_raft(), &req);
+		box_raft_checkpoint_remote(&req);
 		xrow_encode_raft(&row, &fiber()->gc, &req);
 		coio_write_xrow(io, &row);
 	}
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index 39d3ffa15..db2bb2333 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -210,7 +210,7 @@ memtx_engine_recover_raft(const struct xrow_header *row)
 	/* Vclock is never persisted in WAL by Raft. */
 	if (xrow_decode_raft(row, &req, NULL) != 0)
 		return -1;
-	raft_process_recovery(box_raft(), &req);
+	box_raft_recover(&req);
 	return 0;
 }
 
@@ -554,7 +554,7 @@ checkpoint_new(const char *snap_dirname, uint64_t snap_io_rate_limit)
 	opts.free_cache = true;
 	xdir_create(&ckpt->dir, snap_dirname, SNAP, &INSTANCE_UUID, &opts);
 	vclock_create(&ckpt->vclock);
-	raft_serialize_for_disk(box_raft(), &ckpt->raft);
+	box_raft_checkpoint_local(&ckpt->raft);
 	ckpt->touch = false;
 	return ckpt;
 }
diff --git a/src/box/raft.c b/src/box/raft.c
index 845525660..f3652bbcb 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -49,6 +49,28 @@ struct raft box_raft_global = {
  */
 static struct trigger box_raft_on_update;
 
+static void
+box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
+{
+	*req = (struct raft_request) {
+		.term = msg->term,
+		.vote = msg->vote,
+		.state = msg->state,
+		.vclock = msg->vclock,
+	};
+}
+
+static void
+box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
+{
+	*msg = (struct raft_msg) {
+		.term = req->term,
+		.vote = req->vote,
+		.state = req->state,
+		.vclock = req->vclock,
+	};
+}
+
 static int
 box_raft_on_update_f(struct trigger *trigger, void *event)
 {
@@ -117,13 +139,47 @@ box_raft_reconsider_election_quorum(void)
 	raft_cfg_election_quorum(box_raft(), quorum);
 }
 
+void
+box_raft_recover(const struct raft_request *req)
+{
+	struct raft_msg msg;
+	box_raft_request_to_msg(req, &msg);
+	raft_process_recovery(box_raft(), &msg);
+}
+
+void
+box_raft_checkpoint_local(struct raft_request *req)
+{
+	struct raft_msg msg;
+	raft_checkpoint_local(box_raft(), &msg);
+	box_raft_msg_to_request(&msg, req);
+}
+
+void
+box_raft_checkpoint_remote(struct raft_request *req)
+{
+	struct raft_msg msg;
+	raft_checkpoint_remote(box_raft(), &msg);
+	box_raft_msg_to_request(&msg, req);
+}
+
+int
+box_raft_process(struct raft_request *req, uint32_t source)
+{
+	struct raft_msg msg;
+	box_raft_request_to_msg(req, &msg);
+	return raft_process_msg(box_raft(), &msg, source);
+}
+
 static void
-box_raft_broadcast(struct raft *raft, const struct raft_request *req)
+box_raft_broadcast(struct raft *raft, const struct raft_msg *msg)
 {
 	(void)raft;
 	assert(raft == box_raft());
+	struct raft_request req;
+	box_raft_msg_to_request(msg, &req);
 	replicaset_foreach(replica)
-		relay_push_raft(replica->relay, req);
+		relay_push_raft(replica->relay, &req);
 }
 
 /** Wakeup Raft state writer fiber waiting for WAL write end. */
@@ -134,14 +190,16 @@ box_raft_write_cb(struct journal_entry *entry)
 }
 
 static void
-box_raft_write(struct raft *raft, const struct raft_request *req)
+box_raft_write(struct raft *raft, const struct raft_msg *msg)
 {
 	(void)raft;
 	assert(raft == box_raft());
 	/* See Raft implementation why these fields are never written. */
-	assert(req->vclock == NULL);
-	assert(req->state == 0);
+	assert(msg->vclock == NULL);
+	assert(msg->state == 0);
 
+	struct raft_request req;
+	box_raft_msg_to_request(msg, &req);
 	struct region *region = &fiber()->gc;
 	uint32_t svp = region_used(region);
 	struct xrow_header row;
@@ -150,7 +208,7 @@ box_raft_write(struct raft *raft, const struct raft_request *req)
 	struct journal_entry *entry = (struct journal_entry *)buf;
 	entry->rows[0] = &row;
 
-	if (xrow_encode_raft(&row, region, req) != 0)
+	if (xrow_encode_raft(&row, region, &req) != 0)
 		goto fail;
 	journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
 			     fiber());
diff --git a/src/box/raft.h b/src/box/raft.h
index 09297273f..4dffce380 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,6 +35,8 @@
 extern "C" {
 #endif
 
+struct raft_request;
+
 /** Raft state of this instance. */
 static inline struct raft *
 box_raft(void)
@@ -56,6 +58,28 @@ box_raft(void)
 void
 box_raft_reconsider_election_quorum(void);
 
+/**
+ * Recovery a single Raft request. Raft state machine is not turned on yet, this
+ * works only during instance recovery from the journal.
+ */
+void
+box_raft_recover(const struct raft_request *req);
+
+/** Save complete Raft state into a request to be persisted on disk locally. */
+void
+box_raft_checkpoint_local(struct raft_request *req);
+
+/**
+ * Save complete Raft state into a request to be sent to other instances of the
+ * cluster.
+ */
+void
+box_raft_checkpoint_remote(struct raft_request *req);
+
+/** Handle a single Raft request from a node with instance id @a source. */
+int
+box_raft_process(struct raft_request *req, uint32_t source);
+
 void
 box_raft_init(void);
 
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index cc9139a5b..512dbd51f 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -32,7 +32,6 @@
 
 #include "error.h"
 #include "fiber.h"
-#include "xrow.h"
 #include "small/region.h"
 #include "box.h"
 #include "tt_static.h"
@@ -221,7 +220,7 @@ static void
 raft_sm_become_candidate(struct raft *raft);
 
 static const char *
-raft_request_to_string(const struct raft_request *req)
+raft_msg_to_string(const struct raft_msg *req)
 {
 	assert(req->term != 0);
 	char buf[1024];
@@ -259,9 +258,9 @@ raft_request_to_string(const struct raft_request *req)
 }
 
 void
-raft_process_recovery(struct raft *raft, const struct raft_request *req)
+raft_process_recovery(struct raft *raft, const struct raft_msg *req)
 {
-	say_verbose("RAFT: recover %s", raft_request_to_string(req));
+	say_verbose("RAFT: recover %s", raft_msg_to_string(req));
 	if (req->term != 0) {
 		raft->term = req->term;
 		raft->volatile_term = req->term;
@@ -287,11 +286,9 @@ raft_process_recovery(struct raft *raft, const struct raft_request *req)
 }
 
 int
-raft_process_msg(struct raft *raft, const struct raft_request *req,
-		 uint32_t source)
+raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 {
-	say_info("RAFT: message %s from %u", raft_request_to_string(req),
-		 source);
+	say_info("RAFT: message %s from %u", raft_msg_to_string(req), source);
 	assert(source > 0);
 	assert(source != raft->self);
 	if (req->term == 0 || req->state == 0) {
@@ -474,7 +471,7 @@ raft_worker_handle_io(struct raft *raft)
 	assert(raft->is_write_in_progress);
 	/* During write Raft can't be anything but a follower. */
 	assert(raft->state == RAFT_STATE_FOLLOWER);
-	struct raft_request req;
+	struct raft_msg req;
 
 	if (raft_is_fully_on_disk(raft)) {
 end_dump:
@@ -524,8 +521,7 @@ end_dump:
 		 * another leader elected by that time likely.
 		 */
 		raft->vtab->write(raft, &req);
-		say_info("RAFT: persisted state %s",
-			 raft_request_to_string(&req));
+		say_info("RAFT: persisted state %s", raft_msg_to_string(&req));
 
 		assert(req.term >= raft->term);
 		raft->term = req.term;
@@ -544,7 +540,7 @@ static void
 raft_worker_handle_broadcast(struct raft *raft)
 {
 	assert(raft->is_broadcast_scheduled);
-	struct raft_request req;
+	struct raft_msg req;
 	memset(&req, 0, sizeof(req));
 	req.term = raft->term;
 	req.vote = raft->vote;
@@ -803,7 +799,7 @@ raft_sm_stop(struct raft *raft)
 }
 
 void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
+raft_checkpoint_remote(const struct raft *raft, struct raft_msg *req)
 {
 	memset(req, 0, sizeof(*req));
 	/*
@@ -822,7 +818,7 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
 }
 
 void
-raft_serialize_for_disk(const struct raft *raft, struct raft_request *req)
+raft_checkpoint_local(const struct raft *raft, struct raft_msg *req)
 {
 	memset(req, 0, sizeof(*req));
 	req->term = raft->term;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 6181d9d49..4f4d24ca8 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -70,7 +70,6 @@ extern "C" {
 
 struct fiber;
 struct raft;
-struct raft_request;
 
 enum raft_state {
 	/**
@@ -95,9 +94,32 @@ enum raft_state {
 const char *
 raft_state_str(uint32_t state);
 
-typedef void (*raft_broadcast_f)(struct raft *raft,
-				 const struct raft_request *req);
-typedef void (*raft_write_f)(struct raft *raft, const struct raft_request *req);
+/**
+ * Basic Raft communication unit for talking to other nodes, and even to other
+ * subsystems such as disk storage.
+ */
+struct raft_msg {
+	/** Term of the instance. */
+	uint64_t term;
+	/**
+	 * Instance ID of the instance this node voted for in the current term.
+	 * 0 means the node didn't vote in this term.
+	 */
+	uint32_t vote;
+	/**
+	 * State of the instance. Can be 0 if the state does not matter for the
+	 * message. For instance, when the message is sent to disk.
+	 */
+	enum raft_state state;
+	/**
+	 * Vclock of the instance. Can be NULL, if the node is not a candidate.
+	 * Also is omitted when does not matter (when the message is for disk).
+	 */
+	const struct vclock *vclock;
+};
+
+typedef void (*raft_broadcast_f)(struct raft *raft, const struct raft_msg *req);
+typedef void (*raft_write_f)(struct raft *raft, const struct raft_msg *req);
 
 /**
  * Raft connection to the environment, via which it talks to other nodes and
@@ -226,11 +248,11 @@ raft_is_enabled(const struct raft *raft)
 
 /** Process a raft entry stored in WAL/snapshot. */
 void
-raft_process_recovery(struct raft *raft, const struct raft_request *req);
+raft_process_recovery(struct raft *raft, const struct raft_msg *req);
 
 /** Process a raft status message coming from the network. */
 int
-raft_process_msg(struct raft *raft, const struct raft_request *req,
+raft_process_msg(struct raft *raft, const struct raft_msg *req,
 		 uint32_t source);
 
 /**
@@ -297,14 +319,14 @@ raft_new_term(struct raft *raft);
  * cluster. It is allowed to save anything here, not only persistent state.
  */
 void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req);
+raft_checkpoint_remote(const struct raft *raft, struct raft_msg *req);
 
 /**
  * Save complete Raft state into a request to be persisted on disk. Only term
  * and vote are being persisted.
  */
 void
-raft_serialize_for_disk(const struct raft *raft, struct raft_request *req);
+raft_checkpoint_local(const struct raft *raft, struct raft_msg *req);
 
 /**
  * Add a trigger invoked each time any of the Raft node visible attributes are
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 3d68c1268..fde8f9474 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -264,6 +264,10 @@ xrow_encode_synchro(struct xrow_header *row,
 int
 xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req);
 
+/**
+ * Raft request. It repeats Raft message to the letter, but can be extended in
+ * future not depending on the Raft library.
+ */
 struct raft_request {
 	uint64_t term;
 	uint32_t vote;
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
@ 2020-11-17  8:14   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  8:14 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov

Hi! Thanks for the patch!

LGTM.

17.11.2020 03:02, Vladislav Shpilevoy пишет:
> The commit renames raft.h and raft.c to raftlib.h and raftlib.c.
> This is done to prepare to Raft split into src/box/ and
> src/lib/raft.
>
> The commit is not atomic, the build won't work here. Because if
> raft is renamed to raftlib, and in the same commit new raft.c and
> raft.h are added, git thinks the original file was changed, and
> ruins all the git history.
>
> By splitting move of raft to raftlib and introduction of box/raft
> into 2 commits the git history is saved.
>
> Part of #5303
> ---
>   src/box/CMakeLists.txt        | 1 +
>   src/box/{raft.c => raftlib.c} | 0
>   src/box/{raft.h => raftlib.h} | 0
>   3 files changed, 1 insertion(+)
>   rename src/box/{raft.c => raftlib.c} (100%)
>   rename src/box/{raft.h => raftlib.h} (100%)
>
> diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
> index d1667796a..fcf779379 100644
> --- a/src/box/CMakeLists.txt
> +++ b/src/box/CMakeLists.txt
> @@ -169,6 +169,7 @@ add_library(box STATIC
>       port.c
>       txn.c
>       txn_limbo.c
> +    raftlib.c
>       raft.c
>       box.cc
>       gc.c
> diff --git a/src/box/raft.c b/src/box/raftlib.c
> similarity index 100%
> rename from src/box/raft.c
> rename to src/box/raftlib.c
> diff --git a/src/box/raft.h b/src/box/raftlib.h
> similarity index 100%
> rename from src/box/raft.h
> rename to src/box/raftlib.h

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
@ 2020-11-17  8:14   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  8:14 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov

LGTM

17.11.2020 03:02, Vladislav Shpilevoy пишет:
> The commit moves Raft functions and objects specific for box to
> src/box/raft from src/box/box and src/box/raftlib.
>
> The goal is to gradually eliminate all box dependencies from
> src/box/raftlib and move it to src/lib/raft.
>
> Part of #5303
> ---
>   src/box/box.cc    | 25 --------------
>   src/box/raft.c    | 86 +++++++++++++++++++++++++++++++++++++++++++++++
>   src/box/raft.h    | 59 ++++++++++++++++++++++++++++++++
>   src/box/raftlib.c | 29 ----------------
>   src/box/raftlib.h | 19 -----------
>   5 files changed, 145 insertions(+), 73 deletions(-)
>   create mode 100644 src/box/raft.c
>   create mode 100644 src/box/raft.h
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 1f7dec362..8dd92a5f5 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -152,11 +152,6 @@ static struct fiber_pool tx_fiber_pool;
>    * are too many messages in flight (gh-1892).
>    */
>   static struct cbus_endpoint tx_prio_endpoint;
> -/**
> - * A trigger executed each time the Raft state machine updates any
> - * of its visible attributes.
> - */
> -static struct trigger box_raft_on_update;
>   
>   void
>   box_update_ro_summary(void)
> @@ -1060,23 +1055,6 @@ box_clear_synchro_queue(bool try_wait)
>   	}
>   }
>   
> -static int
> -box_raft_on_update_f(struct trigger *trigger, void *event)
> -{
> -	(void)trigger;
> -	struct raft *raft = (struct raft *)event;
> -	assert(raft == box_raft());
> -	if (raft->state != RAFT_STATE_LEADER)
> -		return 0;
> -	/*
> -	 * When the node became a leader, it means it will ignore all records
> -	 * from all the other nodes, and won't get late CONFIRM messages anyway.
> -	 * Can clear the queue without waiting for confirmations.
> -	 */
> -	box_clear_synchro_queue(false);
> -	return 0;
> -}
> -
>   void
>   box_listen(void)
>   {
> @@ -2658,9 +2636,6 @@ box_init(void)
>   	txn_limbo_init();
>   	sequence_init();
>   	box_raft_init();
> -
> -	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
> -	raft_on_update(box_raft(), &box_raft_on_update);
>   }
>   
>   bool
> diff --git a/src/box/raft.c b/src/box/raft.c
> new file mode 100644
> index 000000000..f289a6993
> --- /dev/null
> +++ b/src/box/raft.c
> @@ -0,0 +1,86 @@
> +/*
> + * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * 1. Redistributions of source code must retain the above
> + *    copyright notice, this list of conditions and the
> + *    following disclaimer.
> + *
> + * 2. Redistributions in binary form must reproduce the above
> + *    copyright notice, this list of conditions and the following
> + *    disclaimer in the documentation and/or other materials
> + *    provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
> + * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#include "box.h"
> +#include "raft.h"
> +
> +struct raft box_raft_global = {
> +	/*
> +	 * Set an invalid state to validate in runtime the global raft node is
> +	 * not used before initialization.
> +	 */
> +	.state = 0,
> +};
> +
> +/**
> + * A trigger executed each time the Raft state machine updates any
> + * of its visible attributes.
> + */
> +static struct trigger box_raft_on_update;
> +
> +static int
> +box_raft_on_update_f(struct trigger *trigger, void *event)
> +{
> +	(void)trigger;
> +	struct raft *raft = (struct raft *)event;
> +	assert(raft == box_raft());
> +	if (raft->state != RAFT_STATE_LEADER)
> +		return 0;
> +	/*
> +	 * When the node became a leader, it means it will ignore all records
> +	 * from all the other nodes, and won't get late CONFIRM messages anyway.
> +	 * Can clear the queue without waiting for confirmations.
> +	 */
> +	box_clear_synchro_queue(false);
> +	return 0;
> +}
> +
> +void
> +box_raft_init(void)
> +{
> +	raft_create(&box_raft_global);
> +	trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
> +	raft_on_update(box_raft(), &box_raft_on_update);
> +}
> +
> +void
> +box_raft_free(void)
> +{
> +	/*
> +	 * Can't join the fiber, because the event loop is stopped already, and
> +	 * yields are not allowed.
> +	 */
> +	box_raft_global.worker = NULL;
> +	raft_destroy(&box_raft_global);
> +	/*
> +	 * Invalidate so as box_raft() would fail if any usage attempt happens.
> +	 */
> +	box_raft_global.state = 0;
> +}
> diff --git a/src/box/raft.h b/src/box/raft.h
> new file mode 100644
> index 000000000..fe0f073dc
> --- /dev/null
> +++ b/src/box/raft.h
> @@ -0,0 +1,59 @@
> +#pragma once
> +/*
> + * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * 1. Redistributions of source code must retain the above
> + *    copyright notice, this list of conditions and the
> + *    following disclaimer.
> + *
> + * 2. Redistributions in binary form must reproduce the above
> + *    copyright notice, this list of conditions and the following
> + *    disclaimer in the documentation and/or other materials
> + *    provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
> + * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#include "raftlib.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/** Raft state of this instance. */
> +static inline struct raft *
> +box_raft(void)
> +{
> +	extern struct raft box_raft_global;
> +	/**
> +	 * Ensure the raft node can be used. I.e. that it is properly
> +	 * initialized. Entirely for debug purposes.
> +	 */
> +	assert(box_raft_global.state != 0);
> +	return &box_raft_global;
> +}
> +
> +void
> +box_raft_init(void);
> +
> +void
> +box_raft_free(void);
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index ff664a4d1..3867c63e0 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -44,14 +44,6 @@
>    */
>   #define RAFT_RANDOM_ELECTION_FACTOR 0.1
>   
> -struct raft box_raft_global = {
> -	/*
> -	 * Set an invalid state to validate in runtime the global raft node is
> -	 * not used before initialization.
> -	 */
> -	.state = 0,
> -};
> -
>   /**
>    * When decoding we should never trust that there is
>    * a valid data incomes.
> @@ -1106,24 +1098,3 @@ raft_destroy(struct raft *raft)
>   		raft->worker = NULL;
>   	}
>   }
> -
> -void
> -box_raft_init(void)
> -{
> -	raft_create(&box_raft_global);
> -}
> -
> -void
> -box_raft_free(void)
> -{
> -	/*
> -	 * Can't join the fiber, because the event loop is stopped already, and
> -	 * yields are not allowed.
> -	 */
> -	box_raft_global.worker = NULL;
> -	raft_destroy(&box_raft_global);
> -	/*
> -	 * Invalidate so as box_raft() would fail if any usage attempt happens.
> -	 */
> -	box_raft_global.state = 0;
> -}
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index 3088fba23..805f69d64 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -271,25 +271,6 @@ raft_create(struct raft *raft);
>   void
>   raft_destroy(struct raft *raft);
>   
> -/** Raft state of this instance. */
> -static inline struct raft *
> -box_raft(void)
> -{
> -	extern struct raft box_raft_global;
> -	/**
> -	 * Ensure the raft node can be used. I.e. that it is properly
> -	 * initialized. Entirely for debug purposes.
> -	 */
> -	assert(box_raft_global.state != 0);
> -	return &box_raft_global;
> -}
> -
> -void
> -box_raft_init(void);
> -
> -void
> -box_raft_free(void);
> -
>   #if defined(__cplusplus)
>   }
>   #endif

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout()
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
@ 2020-11-17  8:15   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  8:15 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov

LGTM.

17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/, including global
> replication parameters such as replication_timeout, and functions
> like replication_disconnect_timeout().
>
> The patch makes raft stop using replication_disconnect_timeout().
> Instead, it stores death timeout in struct raft. It is configured
> by box simultaneously with replication_timeout.
>
> Part of #5303
> ---
>   src/box/box.cc    |  2 +-
>   src/box/raftlib.c | 13 ++++++-------
>   src/box/raftlib.h | 10 +++++++---
>   3 files changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 8dd92a5f5..25673ed42 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -890,7 +890,7 @@ void
>   box_set_replication_timeout(void)
>   {
>   	replication_timeout = box_check_replication_timeout();
> -	raft_cfg_death_timeout(box_raft());
> +	raft_cfg_death_timeout(box_raft(), replication_disconnect_timeout());
>   }
>   
>   void
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 3867c63e0..c156d6f46 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -804,8 +804,7 @@ raft_sm_wait_leader_dead(struct raft *raft)
>   	assert(raft->is_candidate);
>   	assert(raft->state == RAFT_STATE_FOLLOWER);
>   	assert(raft->leader != 0);
> -	double death_timeout = replication_disconnect_timeout();
> -	ev_timer_set(&raft->timer, death_timeout, death_timeout);
> +	ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
>   	ev_timer_start(loop(), &raft->timer);
>   }
>   
> @@ -817,8 +816,7 @@ raft_sm_wait_leader_found(struct raft *raft)
>   	assert(raft->is_candidate);
>   	assert(raft->state == RAFT_STATE_FOLLOWER);
>   	assert(raft->leader == 0);
> -	double death_timeout = replication_disconnect_timeout();
> -	ev_timer_set(&raft->timer, death_timeout, death_timeout);
> +	ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
>   	ev_timer_start(loop(), &raft->timer);
>   }
>   
> @@ -1012,14 +1010,14 @@ raft_cfg_election_quorum(struct raft *raft)
>   }
>   
>   void
> -raft_cfg_death_timeout(struct raft *raft)
> +raft_cfg_death_timeout(struct raft *raft, double death_timeout)
>   {
> +	raft->death_timeout = death_timeout;
>   	if (raft->state == RAFT_STATE_FOLLOWER && raft->is_candidate &&
>   	    raft->leader != 0) {
>   		assert(ev_is_active(&raft->timer));
> -		double death_timeout = replication_disconnect_timeout();
>   		double timeout = ev_timer_remaining(loop(), &raft->timer) -
> -				 raft->timer.at + death_timeout;
> +				 raft->timer.at + raft->death_timeout;
>   		ev_timer_stop(loop(), &raft->timer);
>   		ev_timer_set(&raft->timer, timeout, timeout);
>   		ev_timer_start(loop(), &raft->timer);
> @@ -1080,6 +1078,7 @@ raft_create(struct raft *raft)
>   		.volatile_term = 1,
>   		.term =	1,
>   		.election_timeout = 5,
> +		.death_timeout = 5,
>   	};
>   	ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
>   	raft->timer.data = raft;
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index 805f69d64..b33a20326 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -156,6 +156,11 @@ struct raft {
>   	struct fiber *worker;
>   	/** Configured election timeout in seconds. */
>   	double election_timeout;
> +	/**
> +	 * Leader death timeout, after which it is considered dead and new
> +	 * elections can be started.
> +	 */
> +	double death_timeout;
>   	/**
>   	 * Trigger invoked each time any of the Raft node visible attributes are
>   	 * changed.
> @@ -229,11 +234,10 @@ raft_cfg_election_quorum(struct raft *raft);
>   
>   /**
>    * Configure Raft leader death timeout. I.e. number of seconds without
> - * heartbeats from the leader to consider it dead. There is no a separate
> - * option. Raft uses replication timeout for that.
> + * heartbeats from the leader to consider it dead.
>    */
>   void
> -raft_cfg_death_timeout(struct raft *raft);
> +raft_cfg_death_timeout(struct raft *raft, double death_timeout);
>   
>   /**
>    * Bump the term. When it is persisted, the node checks if there is a leader,

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
@ 2020-11-17  8:17   ` Serge Petrenko
  2020-11-19 23:42     ` Vladislav Shpilevoy
  0 siblings, 1 reply; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  8:17 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov

[-- Attachment #1: Type: text/plain, Size: 11900 bytes --]


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/, including global
> replication parameters such as replication_synchro_quorum.
>
> The patch makes raft stop using replication_synchro_quorum.
>
> Instead, it has a new option 'election_quorum'. Note, that this is
> just Raft API. Box API still uses replication_synchro_quorum. But
> it is used to calculate the final quorum in src/box/raft, not
> in src/box/raftlib. And to pass it to the base Raft
> implementation.
>
> Part of #5303
> ---
>   src/box/box.cc         |  2 +-
>   src/box/raft.c         | 52 +++++++++++++++++++++++++++++++
>   src/box/raft.h         |  8 +++++
>   src/box/raftlib.c      | 70 ++++++++----------------------------------
>   src/box/raftlib.h      | 10 +++---
>   src/box/replication.cc |  3 ++
>   6 files changed, 83 insertions(+), 62 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 25673ed42..cc0d7b81d 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -921,7 +921,7 @@ box_set_replication_synchro_quorum(void)
>   		return -1;
>   	replication_synchro_quorum = value;
>   	txn_limbo_on_parameters_change(&txn_limbo);
> -	raft_cfg_election_quorum(box_raft());
> +	box_raft_reconsider_election_quorum();
>   	return 0;
>   }
>   
> diff --git a/src/box/raft.c b/src/box/raft.c
> index f289a6993..af6e71e0b 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -30,6 +30,7 @@
>    */
>   #include "box.h"
>   #include "raft.h"
> +#include "replication.h"
>   
>   struct raft box_raft_global = {
>   	/*
> @@ -62,6 +63,57 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
>   	return 0;
>   }
>   
> +void
> +box_raft_reconsider_election_quorum(void)

Suggestion: maybe use "rewrite"/"reset" instead of "reconsider"?
Or plain "update"?

Other than that, LGTM.

> +{
> +	/*
> +	 * When the instance is started first time, it does not have an ID, so
> +	 * the registered count is 0. But the quorum can never be 0. At least
> +	 * the current instance should participate in the quorum.
> +	 */
> +	int max = MAX(replicaset.registered_count, 1);
> +	/**
> +	 * Election quorum is not strictly equal to synchronous replication
> +	 * quorum. Sometimes it can be lowered. That is about bootstrap.
> +	 *
> +	 * The problem with bootstrap is that when the replicaset boots, all the
> +	 * instances can't write to WAL and can't recover from their initial
> +	 * snapshot. They need one node which will boot first, and then they
> +	 * will replicate from it.
> +	 *
> +	 * This one node should boot from its zero snapshot, create replicaset
> +	 * UUID, register self with ID 1 in _cluster space, and then register
> +	 * all the other instances here. To do that the node must be writable.
> +	 * It should have read_only = false, connection quorum satisfied, and be
> +	 * a Raft leader if Raft is enabled.
> +	 *
> +	 * To be elected a Raft leader it needs to perform election. But it
> +	 * can't be done before at least synchronous quorum of the replicas is
> +	 * bootstrapped. And they can't be bootstrapped because wait for a
> +	 * leader to initialize _cluster. Cyclic dependency.
> +	 *
> +	 * This is resolved by truncation of the election quorum to the number
> +	 * of registered replicas, if their count is less than synchronous
> +	 * quorum. That helps to elect a first leader.
> +	 *
> +	 * It may seem that the first node could just declare itself a leader
> +	 * and then strictly follow the protocol from now on, but that won't
> +	 * work, because if the first node will restart after it is booted, but
> +	 * before quorum of replicas is booted, the cluster will stuck again.
> +	 *
> +	 * The current solution is totally safe because
> +	 *
> +	 * - after all the cluster will have node count >= quorum, if user used
> +	 *   a correct config (God help him if he didn't);
> +	 *
> +	 * - synchronous replication quorum is untouched - it is not truncated.
> +	 *   Only leader election quorum is affected. So synchronous data won't
> +	 *   be lost.
> +	 */
> +	int quorum = MIN(replication_synchro_quorum, max);
> +	raft_cfg_election_quorum(box_raft(), quorum);
> +}
> +
>   void
>   box_raft_init(void)
>   {
> diff --git a/src/box/raft.h b/src/box/raft.h
> index fe0f073dc..09297273f 100644
> --- a/src/box/raft.h
> +++ b/src/box/raft.h
> @@ -48,6 +48,14 @@ box_raft(void)
>   	return &box_raft_global;
>   }
>   
> +/**
> + * Let the global raft know that the election quorum could change. It happens
> + * when configuration is updated, and when new nodes are added or old are
> + * deleted from the cluster.
> + */
> +void
> +box_raft_reconsider_election_quorum(void);
> +
>   void
>   box_raft_init(void);
>   
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index c156d6f46..0657fa85a 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -130,50 +130,6 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
>   	return cmp == 0 || cmp == 1;
>   }
>   
> -/**
> - * Election quorum is not strictly equal to synchronous replication quorum.
> - * Sometimes it can be lowered. That is about bootstrap.
> - *
> - * The problem with bootstrap is that when the replicaset boots, all the
> - * instances can't write to WAL and can't recover from their initial snapshot.
> - * They need one node which will boot first, and then they will replicate from
> - * it.
> - *
> - * This one node should boot from its zero snapshot, create replicaset UUID,
> - * register self with ID 1 in _cluster space, and then register all the other
> - * instances here. To do that the node must be writable. It should have
> - * read_only = false, connection quorum satisfied, and be a Raft leader if Raft
> - * is enabled.
> - *
> - * To be elected a Raft leader it needs to perform election. But it can't be
> - * done before at least synchronous quorum of the replicas is bootstrapped. And
> - * they can't be bootstrapped because wait for a leader to initialize _cluster.
> - * Cyclic dependency.
> - *
> - * This is resolved by truncation of the election quorum to the number of
> - * registered replicas, if their count is less than synchronous quorum. That
> - * helps to elect a first leader.
> - *
> - * It may seem that the first node could just declare itself a leader and then
> - * strictly follow the protocol from now on, but that won't work, because if the
> - * first node will restart after it is booted, but before quorum of replicas is
> - * booted, the cluster will stuck again.
> - *
> - * The current solution is totally safe because
> - *
> - * - after all the cluster will have node count >= quorum, if user used a
> - *   correct config (God help him if he didn't);
> - *
> - * - synchronous replication quorum is untouched - it is not truncated. Only
> - *   leader election quorum is affected. So synchronous data won't be lost.
> - */
> -static inline int
> -raft_election_quorum(const struct raft *raft)
> -{
> -	(void)raft;
> -	return MIN(replication_synchro_quorum, replicaset.registered_count);
> -}
> -
>   /**
>    * Wakeup the Raft worker fiber in order to do some async work. If the fiber
>    * does not exist yet, it is created.
> @@ -427,13 +383,12 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
>   			 * and now was answered by some other instance.
>   			 */
>   			assert(raft->volatile_vote == instance_id);
> -			int quorum = raft_election_quorum(raft);
>   			bool was_set = bit_set(&raft->vote_mask, source);
>   			raft->vote_count += !was_set;
> -			if (raft->vote_count < quorum) {
> +			if (raft->vote_count < raft->election_quorum) {
>   				say_info("RAFT: accepted vote for self, vote "
>   					 "count is %d/%d", raft->vote_count,
> -					 quorum);
> +					 raft->election_quorum);
>   				break;
>   			}
>   			raft_sm_become_leader(raft);
> @@ -594,7 +549,7 @@ end_dump:
>   			raft_sm_wait_leader_dead(raft);
>   		} else if (raft->vote == instance_id) {
>   			/* Just wrote own vote. */
> -			if (raft_election_quorum(raft) == 1)
> +			if (raft->election_quorum == 1)
>   				raft_sm_become_leader(raft);
>   			else
>   				raft_sm_become_candidate(raft);
> @@ -692,7 +647,7 @@ raft_sm_become_leader(struct raft *raft)
>   {
>   	assert(raft->state != RAFT_STATE_LEADER);
>   	say_info("RAFT: enter leader state with quorum %d",
> -		 raft_election_quorum(raft));
> +		 raft->election_quorum);
>   	assert(raft->leader == 0);
>   	assert(raft->is_candidate);
>   	assert(!raft->is_write_in_progress);
> @@ -730,7 +685,7 @@ raft_sm_become_candidate(struct raft *raft)
>   	assert(raft->vote == instance_id);
>   	assert(raft->is_candidate);
>   	assert(!raft->is_write_in_progress);
> -	assert(raft_election_quorum(raft) > 1);
> +	assert(raft->election_quorum > 1);
>   	raft->state = RAFT_STATE_CANDIDATE;
>   	raft->vote_count = 1;
>   	raft->vote_mask = 0;
> @@ -999,14 +954,14 @@ raft_cfg_election_timeout(struct raft *raft, double timeout)
>   }
>   
>   void
> -raft_cfg_election_quorum(struct raft *raft)
> +raft_cfg_election_quorum(struct raft *raft, int election_quorum)
>   {
> -	if (raft->state != RAFT_STATE_CANDIDATE ||
> -	    raft->state == RAFT_STATE_LEADER)
> -		return;
> -	if (raft->vote_count < raft_election_quorum(raft))
> -		return;
> -	raft_sm_become_leader(raft);
> +	/* At least self is always a part of the quorum. */
> +	assert(election_quorum > 0);
> +	raft->election_quorum = election_quorum;
> +	if (raft->state == RAFT_STATE_CANDIDATE &&
> +	    raft->vote_count >= raft->election_quorum)
> +		raft_sm_become_leader(raft);
>   }
>   
>   void
> @@ -1077,6 +1032,7 @@ raft_create(struct raft *raft)
>   		.state = RAFT_STATE_FOLLOWER,
>   		.volatile_term = 1,
>   		.term =	1,
> +		.election_quorum = 1,
>   		.election_timeout = 5,
>   		.death_timeout = 5,
>   	};
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index b33a20326..c9c13136e 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -150,6 +150,8 @@ struct raft {
>   	vclock_map_t vote_mask;
>   	/** Number of votes for this instance. Valid only in candidate state. */
>   	int vote_count;
> +	/** Number of votes necessary for successful election. */
> +	int election_quorum;
>   	/** State machine timed event trigger. */
>   	struct ev_timer timer;
>   	/** Worker fiber to execute blocking tasks like IO. */
> @@ -225,12 +227,12 @@ void
>   raft_cfg_election_timeout(struct raft *raft, double timeout);
>   
>   /**
> - * Configure Raft leader election quorum. There is no a separate option.
> - * Instead, synchronous replication quorum is used. Since Raft is tightly bound
> - * with synchronous replication.
> + * Configure Raft leader election quorum. That may trigger immediate election,
> + * if the quorum is lowered, and this instance is a candidate having enough
> + * votes for the new quorum.
>    */
>   void
> -raft_cfg_election_quorum(struct raft *raft);
> +raft_cfg_election_quorum(struct raft *raft, int election_quorum);
>   
>   /**
>    * Configure Raft leader death timeout. I.e. number of seconds without
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 65512cf0f..19d7f6beb 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -39,6 +39,7 @@
>   #include "box.h"
>   #include "gc.h"
>   #include "error.h"
> +#include "raft.h"
>   #include "relay.h"
>   #include "sio.h"
>   
> @@ -250,6 +251,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
>   	say_info("assigned id %d to replica %s",
>   		 replica->id, tt_uuid_str(&replica->uuid));
>   	replica->anon = false;
> +	box_raft_reconsider_election_quorum();
>   }
>   
>   void
> @@ -298,6 +300,7 @@ replica_clear_id(struct replica *replica)
>   		assert(!replica->anon);
>   		replica_delete(replica);
>   	}
> +	box_raft_reconsider_election_quorum();
>   }
>   
>   void

-- 
Serge Petrenko


[-- Attachment #2: Type: text/html, Size: 12143 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id Vladislav Shpilevoy
@ 2020-11-17  8:59   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  8:59 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/.
>
> The patch makes raft stop using instance_id.
>
> Instead, it has a new option 'instance_id'. It is stored inside
> struct raft as 'self', and should be configured using
> raft_cfg_instance_id().
>
> The configuration is done when bootstrap ends and the instance_id
> is either recovered successfully, or the instance is anonymous.
>
> While working on this, I also considered introducing a new
> function raft_boot() instead of raft_cfg_instance_id(). Which I
> would also use to configure vclock later. Raft_boot() would be
> meant to be called only one time with non-dynamic parameters
> instance_id and vclock.
>
> But then I decided to keep adding new raft_cfg_*() functions.
> Because:
>
> - It is more consistent with the existing options;
>
> - Does not require to think about too many different functions
>    like raft_create(), raft_boot(), raft_cfg_*() and in which order
>    to call them;
>
> Also I was thinking to introduce a single raft_cfg() like I did
> in swim with swim_cfg(), to reduce number of raft_cfg_*()
> functions, but decided it would be even worse with so many
> options.
>
> Part of #5303
> ---
LGTM.
>   src/box/box.cc    | 10 +++++-----
>   src/box/raftlib.c | 32 ++++++++++++++++++++------------
>   src/box/raftlib.h |  9 +++++++++
>   3 files changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index cc0d7b81d..8f5f3558e 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2763,12 +2763,12 @@ box_cfg_xc(void)
>   	 * Fill in leader election parameters after bootstrap. Before it is not
>   	 * possible - there may be relevant data to recover from WAL and
>   	 * snapshot. Also until recovery is done, it is not possible to write
> -	 * new records into WAL. It is also totally safe, because relaying is
> -	 * not started until the box is configured. So it can't happen, that
> -	 * this election-enabled node will try to relay to another
> -	 * election-enabled node without election actually enabled leading to
> -	 * disconnect.
> +	 * new records into WAL. Another reason - before recovery is done,
> +	 * instance_id is not known, so Raft simply can't work.
>   	 */
> +	if (!replication_anon)
> +		raft_cfg_instance_id(box_raft(), instance_id);
> +
>   	if (box_set_election_timeout() != 0)
>   		diag_raise();
>   	/*
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 0657fa85a..ca1940ba6 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -296,7 +296,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
>   	say_info("RAFT: message %s from %u", raft_request_to_string(req),
>   		 source);
>   	assert(source > 0);
> -	assert(source != instance_id);
> +	assert(source != raft->self);
>   	if (req->term == 0 || req->state == 0) {
>   		diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
>   			 "be zero");
> @@ -337,7 +337,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
>   					 raft->leader);
>   				break;
>   			}
> -			if (req->vote == instance_id) {
> +			if (req->vote == raft->self) {
>   				/*
>   				 * This is entirely valid. This instance could
>   				 * request a vote, then become a follower or
> @@ -373,7 +373,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
>   			break;
>   		case RAFT_STATE_CANDIDATE:
>   			/* Check if this is a vote for a competing candidate. */
> -			if (req->vote != instance_id) {
> +			if (req->vote != raft->self) {
>   				say_info("RAFT: vote request is skipped - "
>   					 "competing candidate");
>   				break;
> @@ -382,7 +382,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
>   			 * Vote for self was requested earlier in this round,
>   			 * and now was answered by some other instance.
>   			 */
> -			assert(raft->volatile_vote == instance_id);
> +			assert(raft->volatile_vote == raft->self);
>   			bool was_set = bit_set(&raft->vote_mask, source);
>   			raft->vote_count += !was_set;
>   			if (raft->vote_count < raft->election_quorum) {
> @@ -547,7 +547,7 @@ end_dump:
>   		} else if (raft->leader != 0) {
>   			/* There is a known leader. Wait until it is dead. */
>   			raft_sm_wait_leader_dead(raft);
> -		} else if (raft->vote == instance_id) {
> +		} else if (raft->vote == raft->self) {
>   			/* Just wrote own vote. */
>   			if (raft->election_quorum == 1)
>   				raft_sm_become_leader(raft);
> @@ -561,7 +561,7 @@ end_dump:
>   			raft_sm_wait_election_end(raft);
>   		} else {
>   			/* No leaders, no votes. */
> -			raft_sm_schedule_new_vote(raft, instance_id);
> +			raft_sm_schedule_new_vote(raft, raft->self);
>   		}
>   	} else {
>   		memset(&req, 0, sizeof(req));
> @@ -596,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
>   	req.vote = raft->vote;
>   	req.state = raft->state;
>   	if (req.state == RAFT_STATE_CANDIDATE) {
> -		assert(raft->vote == instance_id);
> +		assert(raft->vote == raft->self);
>   		req.vclock = &replicaset.vclock;
>   	}
>   	replicaset_foreach(replica)
> @@ -652,7 +652,7 @@ raft_sm_become_leader(struct raft *raft)
>   	assert(raft->is_candidate);
>   	assert(!raft->is_write_in_progress);
>   	raft->state = RAFT_STATE_LEADER;
> -	raft->leader = instance_id;
> +	raft->leader = raft->self;
>   	ev_timer_stop(loop(), &raft->timer);
>   	/* Make read-write (if other subsystems allow that. */
>   	box_update_ro_summary();
> @@ -682,14 +682,14 @@ raft_sm_become_candidate(struct raft *raft)
>   	say_info("RAFT: enter candidate state with 1 self vote");
>   	assert(raft->state == RAFT_STATE_FOLLOWER);
>   	assert(raft->leader == 0);
> -	assert(raft->vote == instance_id);
> +	assert(raft->vote == raft->self);
>   	assert(raft->is_candidate);
>   	assert(!raft->is_write_in_progress);
>   	assert(raft->election_quorum > 1);
>   	raft->state = RAFT_STATE_CANDIDATE;
>   	raft->vote_count = 1;
>   	raft->vote_mask = 0;
> -	bit_set(&raft->vote_mask, instance_id);
> +	bit_set(&raft->vote_mask, raft->self);
>   	raft_sm_wait_election_end(raft);
>   	/* State is visible and it is changed - broadcast. */
>   	raft_schedule_broadcast(raft);
> @@ -736,7 +736,7 @@ raft_sm_schedule_new_election(struct raft *raft)
>   	assert(raft->is_candidate);
>   	/* Everyone is a follower until its vote for self is persisted. */
>   	raft_sm_schedule_new_term(raft, raft->term + 1);
> -	raft_sm_schedule_new_vote(raft, instance_id);
> +	raft_sm_schedule_new_vote(raft, raft->self);
>   	box_update_ro_summary();
>   }
>   
> @@ -783,7 +783,7 @@ raft_sm_wait_election_end(struct raft *raft)
>   	assert(raft->is_candidate);
>   	assert(raft->state == RAFT_STATE_FOLLOWER ||
>   	       (raft->state == RAFT_STATE_CANDIDATE &&
> -		raft->volatile_vote == instance_id));
> +		raft->volatile_vote == raft->self));
>   	assert(raft->leader == 0);
>   	double election_timeout = raft->election_timeout +
>   				  raft_new_random_election_shift(raft);
> @@ -979,6 +979,14 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout)
>   	}
>   }
>   
> +void
> +raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
> +{
> +	assert(raft->self == 0);
> +	assert(instance_id != 0);
> +	raft->self = instance_id;
> +}
> +
>   void
>   raft_new_term(struct raft *raft)
>   {
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index c9c13136e..f75ed2567 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -95,6 +95,8 @@ const char *
>   raft_state_str(uint32_t state);
>   
>   struct raft {
> +	/** Instance ID of this node. */
> +	uint32_t self;
>   	/** Instance ID of leader of the current term. */
>   	uint32_t leader;
>   	/** State of the instance. */
> @@ -241,6 +243,13 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum);
>   void
>   raft_cfg_death_timeout(struct raft *raft, double death_timeout);
>   
> +/**
> + * Configure ID of the given Raft instance. The ID can't be changed after it is
> + * assigned first time.
> + */
> +void
> +raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
> +
>   /**
>    * Bump the term. When it is persisted, the node checks if there is a leader,
>    * and if there is not, a new election is started. That said, this function can

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant Vladislav Shpilevoy
@ 2020-11-17  9:17   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  9:17 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is never supposed to change vclock. Not the stored one, nor
> the received ones. The patch makes it checked during compilation.
>
> The patch is mostly motivated by a next patch making Raft use an
> externally configured vclock which can't be changed. Since Raft
> uses raft_request to carry the vclock in a few places, the
> request's vclock also must become const.
>
> Part of #5303


LGTM.


> ---
>   src/box/box.cc    | 2 +-
>   src/box/raftlib.c | 9 +++------
>   src/box/raftlib.h | 3 +--
>   src/box/xrow.c    | 2 +-
>   src/box/xrow.h    | 2 +-
>   5 files changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 8f5f3558e..78fca928e 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2142,7 +2142,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>   		 * should be 0.
>   		 */
>   		struct raft_request req;
> -		raft_serialize_for_network(box_raft(), &req, &vclock);
> +		raft_serialize_for_network(box_raft(), &req);
>   		xrow_encode_raft(&row, &fiber()->gc, &req);
>   		coio_write_xrow(io, &row);
>   	}
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index ca1940ba6..78164bf91 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -850,8 +850,7 @@ raft_sm_stop(struct raft *raft)
>   }
>   
>   void
> -raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
> -			   struct vclock *vclock)
> +raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
>   {
>   	memset(req, 0, sizeof(*req));
>   	/*
> @@ -865,10 +864,8 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
>   	 * Raft does not own vclock, so it always expects it passed externally.
>   	 * Vclock is sent out only by candidate instances.
>   	 */
> -	if (req->state == RAFT_STATE_CANDIDATE) {
> -		req->vclock = vclock;
> -		vclock_copy(vclock, &replicaset.vclock);
> -	}
> +	if (req->state == RAFT_STATE_CANDIDATE)
> +		req->vclock = &replicaset.vclock;
>   }
>   
>   void
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index f75ed2567..2da3cec86 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -263,8 +263,7 @@ raft_new_term(struct raft *raft);
>    * cluster. It is allowed to save anything here, not only persistent state.
>    */
>   void
> -raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
> -			   struct vclock *vclock);
> +raft_serialize_for_network(const struct raft *raft, struct raft_request *req);
>   
>   /**
>    * Save complete Raft state into a request to be persisted on disk. Only term
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index 165a00a16..bc06738ad 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -1057,7 +1057,7 @@ xrow_decode_raft(const struct xrow_header *row, struct raft_request *r,
>   			r->vclock = vclock;
>   			if (r->vclock == NULL)
>   				mp_next(&pos);
> -			else if (mp_decode_vclock_ignore0(&pos, r->vclock) != 0)
> +			else if (mp_decode_vclock_ignore0(&pos, vclock) != 0)
>   				goto bad_msgpack;
>   			break;
>   		default:
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index 095911239..3d68c1268 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -268,7 +268,7 @@ struct raft_request {
>   	uint64_t term;
>   	uint32_t vote;
>   	uint32_t state;
> -	struct vclock *vclock;
> +	const struct vclock *vclock;
>   };
>   
>   int

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock Vladislav Shpilevoy
@ 2020-11-17  9:23   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  9:23 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/.
>
> The patch makes raft stop using replicaset.vclock.
>
> Instead, it has a new option 'vclock'. It is stored inside struct
> raft by pointer and should be configured using raft_cfg_vclock().
>
> Box configures it to point at replicaset.vclock like before. But
> now raftlib code does not depend on it explicitly.
>
> Vclock is stored in Raft by pointer instead of by value so as not
> to update it for each transaction. It would be too high price to
> pay for Raft independence from box.
>
> Part of #5303
> ---


LGTM.


>   src/box/box.cc    |  1 +
>   src/box/raftlib.c | 15 +++++++++++----
>   src/box/raftlib.h | 16 ++++++++++++++++
>   3 files changed, 28 insertions(+), 4 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 78fca928e..ff80e45a4 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2768,6 +2768,7 @@ box_cfg_xc(void)
>   	 */
>   	if (!replication_anon)
>   		raft_cfg_instance_id(box_raft(), instance_id);
> +	raft_cfg_vclock(box_raft(), &replicaset.vclock);
>   
>   	if (box_set_election_timeout() != 0)
>   		diag_raise();
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 78164bf91..ab2e27fd8 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -125,8 +125,7 @@ raft_new_random_election_shift(const struct raft *raft)
>   static inline bool
>   raft_can_vote_for(const struct raft *raft, const struct vclock *v)
>   {
> -	(void)raft;
> -	int cmp = vclock_compare_ignore0(v, &replicaset.vclock);
> +	int cmp = vclock_compare_ignore0(v, raft->vclock);
>   	return cmp == 0 || cmp == 1;
>   }
>   
> @@ -597,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
>   	req.state = raft->state;
>   	if (req.state == RAFT_STATE_CANDIDATE) {
>   		assert(raft->vote == raft->self);
> -		req.vclock = &replicaset.vclock;
> +		req.vclock = raft->vclock;
>   	}
>   	replicaset_foreach(replica)
>   		relay_push_raft(replica->relay, &req);
> @@ -865,7 +864,7 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
>   	 * Vclock is sent out only by candidate instances.
>   	 */
>   	if (req->state == RAFT_STATE_CANDIDATE)
> -		req->vclock = &replicaset.vclock;
> +		req->vclock = raft->vclock;
>   }
>   
>   void
> @@ -984,6 +983,14 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
>   	raft->self = instance_id;
>   }
>   
> +void
> +raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
> +{
> +	assert(raft->vclock == NULL);
> +	assert(vclock != NULL);
> +	raft->vclock = vclock;
> +}
> +
>   void
>   raft_new_term(struct raft *raft)
>   {
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index 2da3cec86..8d0d03da0 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -154,6 +154,15 @@ struct raft {
>   	int vote_count;
>   	/** Number of votes necessary for successful election. */
>   	int election_quorum;
> +	/**
> +	 * Vclock of the Raft node owner. Raft never changes it, only watches,
> +	 * and makes decisions based on it. The value is not stored by copy so
> +	 * as to avoid frequent updates. If every transaction would need to
> +	 * update several vclocks in different places, it would be too
> +	 * expensive. So they update only one vclock, which is shared between
> +	 * subsystems, such as Raft.
> +	 */
> +	const struct vclock *vclock;
>   	/** State machine timed event trigger. */
>   	struct ev_timer timer;
>   	/** Worker fiber to execute blocking tasks like IO. */
> @@ -250,6 +259,13 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout);
>   void
>   raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
>   
> +/**
> + * Configure vclock of the given Raft instance. The vclock is not copied, so the
> + * caller must keep it valid.
> + */
> +void
> +raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
> +
>   /**
>    * Bump the term. When it is persisted, the node checks if there is a leader,
>    * and if there is not, a new election is started. That said, this function can

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network Vladislav Shpilevoy
@ 2020-11-17  9:35   ` Serge Petrenko
  2020-11-19 23:43     ` Vladislav Shpilevoy
  2020-11-17 10:00   ` Serge Petrenko
  1 sibling, 1 reply; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17  9:35 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/.
>
> The patch makes raft stop using replicaset and journal objects.
> They were used to broadcast messages to all the other nodes, and
> to persist updates.
>
> Now Raft does the same through vtab, which is configured by box.
> Broadcast still sends messages via relays, and disk write still
> uses the journal. But Raft does not depend on any specific journal
> or network API.
>
> Part of #5303
> ---
>   src/box/raft.c    | 63 ++++++++++++++++++++++++++++++++++++++-
>   src/box/raftlib.c | 75 ++++++++++-------------------------------------
>   src/box/raftlib.h | 24 ++++++++++++++-
>   3 files changed, 100 insertions(+), 62 deletions(-)
>
>
>   /* Dump Raft state to WAL in a blocking way. */
>   static void
>   raft_worker_handle_io(struct raft *raft)
> @@ -567,8 +513,17 @@ end_dump:
>   		assert(raft->volatile_term >= raft->term);
>   		req.term = raft->volatile_term;
>   		req.vote = raft->volatile_vote;
> -
> -		raft_write_request(&req);
> +		/*
> +		 * Skip vclock. It is used only to be sent to network when vote
> +		 * for self. It is a job of the vclock owner to persist it
> +		 * anyhow.
> +		 *
> +		 * Skip state. That would be strictly against Raft protocol. The
> +		 * reason is that it does not make much sense - even if the node
> +		 * is a leader now, after the node is restarted, there will be
> +		 * another leader elected by that time likely.
> +		 */
> +		raft->vtab->write(raft, &req);
>   		say_info("RAFT: persisted state %s",
>   			 raft_request_to_string(&req));
>   
> @@ -598,8 +553,7 @@ raft_worker_handle_broadcast(struct raft *raft)
>   		assert(raft->vote == raft->self);
>   		req.vclock = raft->vclock;
>   	}
> -	replicaset_foreach(replica)
> -		relay_push_raft(replica->relay, &req);
> +	raft->vtab->broadcast(raft, &req);


I'd introduce helpers, like raft_write() and raft_broadcast(),
to hide raft->vtab->... calls. Up to you, though.

Other than that LGTM.


>   	trigger_run(&raft->on_update, raft);
>   	raft->is_broadcast_scheduled = false;
>   }
> @@ -1038,7 +992,7 @@ raft_schedule_broadcast(struct raft *raft)
>   }
>   
>   void
> -raft_create(struct raft *raft)
> +raft_create(struct raft *raft, const struct raft_vtab *vtab)
>   {
>   	*raft = (struct raft) {
>   		.state = RAFT_STATE_FOLLOWER,
> @@ -1047,6 +1001,7 @@ raft_create(struct raft *raft)
>   		.election_quorum = 1,
>   		.election_timeout = 5,
>   		.death_timeout = 5,
> +		.vtab = vtab,
>   	};
>   	ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
>   	raft->timer.data = raft;
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index 8d0d03da0..6181d9d49 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -295,8 +313,12 @@ raft_serialize_for_disk(const struct raft *raft, struct raft_request *req);
>   void
>   raft_on_update(struct raft *raft, struct trigger *trigger);
>   
> +/**
> + * Create a Raft node. The vtab is not copied. Its memory should stay valid even
> + * after the creation.
> + */
>   void
> -raft_create(struct raft *raft);
> +raft_create(struct raft *raft, const struct raft_vtab *vtab);
>   
>   void
>   raft_destroy(struct raft *raft);

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network Vladislav Shpilevoy
  2020-11-17  9:35   ` Serge Petrenko
@ 2020-11-17 10:00   ` Serge Petrenko
  2020-11-19 23:43     ` Vladislav Shpilevoy
  1 sibling, 1 reply; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 10:00 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft is being moved to a separate library in src/lib. It means,
> it can't depend on anything from box/.
>
> The patch makes raft stop using replicaset and journal objects.
> They were used to broadcast messages to all the other nodes, and
> to persist updates.
>
> Now Raft does the same through vtab, which is configured by box.
> Broadcast still sends messages via relays, and disk write still
> uses the journal. But Raft does not depend on any specific journal
> or network API.
>
> Part of #5303
> ---
>   src/box/raft.c    | 63 ++++++++++++++++++++++++++++++++++++++-
>   src/box/raftlib.c | 75 ++++++++++-------------------------------------
>   src/box/raftlib.h | 24 ++++++++++++++-
>   3 files changed, 100 insertions(+), 62 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index af6e71e0b..845525660 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -29,7 +29,10 @@
>    * SUCH DAMAGE.
>    */
>   #include "box.h"
> +#include "error.h"
> +#include "journal.h"
>   #include "raft.h"
> +#include "relay.h"
>   #include "replication.h"

Just noticed: raft.c starts depending on xrow.h in this patch.

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
@ 2020-11-17 10:22   ` Serge Petrenko
  2020-11-19 23:43     ` Vladislav Shpilevoy
  0 siblings, 1 reply; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 10:22 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft used to depend on xrow, because it used raft_request as a
> communication and persistence unit. Xrow is a part of src/box
> library set, so it blocked Raft extraction into src/lib/raft.
>
> This patch makes Raft not depend on xrow. For that Raft introduces
> a new communication and persistence unit - struct raft_msg.
> Interestingly, throughout its source code Raft already uses term
> 'message' to describe requests, so this patch also restores the
> consistency. This is because raft_request name was used to be
> consistent with other *_request structs in xrow.h. Now Raft does
> not depend on this, and can use its own name.
>
> Struct raft_msg repeats raft_request literally, but it actually
> makes sense. Because when Raft is extracted to a new library, it
> may start evolving independently. Its raft_msg may be populated
> with new members, or their behaviour may change depending on how
> the algorithm will evolve.
>
> But inside box it will be possible to tweak and extend raft_msg
> whenever it is necessary, via struct raft_request, and without
> changing the basic library.
>
> For instance, in future we may want to make nodes forward the
> messages to each other during voting to speed the process up, and
> for that we may want to add an explicit 'source' field to
> raft_request, while it won't be necessary on the level of
> raft_msg.
>
> There is a new compatibility layer in src/box/raft.h which hides
> raft_msg details from other box code, and does the msg <-> request
> conversions.
>
> Part of #5303
> ---
>   src/box/applier.cc     |  2 +-
>   src/box/box.cc         |  4 +--
>   src/box/memtx_engine.c |  4 +--
>   src/box/raft.c         | 70 ++++++++++++++++++++++++++++++++++++++----
>   src/box/raft.h         | 24 +++++++++++++++
>   src/box/raftlib.c      | 24 ++++++---------
>   src/box/raftlib.h      | 38 ++++++++++++++++++-----
>   src/box/xrow.h         |  4 +++
>   8 files changed, 137 insertions(+), 33 deletions(-)
>
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 845525660..f3652bbcb 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -49,6 +49,28 @@ struct raft box_raft_global = {
>    */
>   static struct trigger box_raft_on_update;
>   
> +static void
> +box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
> +{
> +	*req = (struct raft_request) {
> +		.term = msg->term,
> +		.vote = msg->vote,
> +		.state = msg->state,
> +		.vclock = msg->vclock,
> +	};
> +}
> +
> +static void
> +box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
> +{
> +	*msg = (struct raft_msg) {
> +		.term = req->term,
> +		.vote = req->vote,
> +		.state = req->state,
> +		.vclock = req->vclock,
> +	};
> +}
> +
>   static int
>   box_raft_on_update_f(struct trigger *trigger, void *event)
>   {

Have you considered making `struct raft_msg` a member of `struct 
raft_request`?
This way you'll avoid copying.
Yes, xrow will start depending on raftlib, but is this too bad?

Never mind, it'll only work on raft_request -> raft_msg transition,
but not vice versa. So, just an idea.

> diff --git a/src/box/raft.h b/src/box/raft.h
> index 09297273f..4dffce380 100644
> --- a/src/box/raft.h
> +++ b/src/box/raft.h
> @@ -35,6 +35,8 @@
>   extern "C" {
>   #endif
>   
> +struct raft_request;
> +
>   /** Raft state of this instance. */
>   static inline struct raft *
>   box_raft(void)
> @@ -56,6 +58,28 @@ box_raft(void)
>   void
>   box_raft_reconsider_election_quorum(void);
>   
> +/**
> + * Recovery a single Raft request. Raft state machine is not turned on yet, this
> + * works only during instance recovery from the journal.
> + */


Typo: Recovery -> Recover


> +void
> +box_raft_recover(const struct raft_request *req);
> +
> +/** Save complete Raft state into a request to be persisted on disk locally. */
> +void
> +box_raft_checkpoint_local(struct raft_request *req);
> +
> +/**
> + * Save complete Raft state into a request to be sent to other instances of the
> + * cluster.
> + */
> +void
> +box_raft_checkpoint_remote(struct raft_request *req);
> +
> +/** Handle a single Raft request from a node with instance id @a source. */
> +int
> +box_raft_process(struct raft_request *req, uint32_t source);
> +
>   void
>   box_raft_init(void);
>   

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger Vladislav Shpilevoy
@ 2020-11-17 12:42   ` Serge Petrenko
  2020-11-17 15:17     ` Serge Petrenko
  2020-11-18 23:21     ` Vladislav Shpilevoy
  0 siblings, 2 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 12:42 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> box_update_ro_summary() was called from the basic Raft library,
> making it depend on box. But it was called every time when Raft
> state was changed and broadcasted. It means the same effect can be
> achieved by updating RO summary from Raft state update trigger.
>
> The patch does it, and now Raft code does not depend on box.h.
>
> Part of #5303
> ---
>   src/box/raft.c    | 2 ++
>   src/box/raftlib.c | 8 --------
>   2 files changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index f3652bbcb..db1a3f423 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -77,6 +77,8 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
>   	(void)trigger;
>   	struct raft *raft = (struct raft *)event;
>   	assert(raft == box_raft());
> +	/* State or enablence could be changed, affecting read-only state. */
> +	box_update_ro_summary();
>   	if (raft->state != RAFT_STATE_LEADER)
>   		return 0;
>   	/*


Raft uses synchronous WAL write, corect?

So there's a yield in raft_worker_handle_io(). Now there's a period of 
time when
an instance is a follower, but it isn't read-only.

When you reconfigure a leader to become voter, everything's fine, since no
writing to disk is involved.

However, if an existing leader receives a message with term greater, 
than its own,
it'll first persist this term, and thus yield, and proceed to broadcast 
and switching
to ro later.

So now it's possible that a follower is writeable for some period of time.

Maybe run on_update triggers before the wal write? Even better, run the 
triggers
on the actual state transition. After each  `raft->state = ...`.

On the bright side, your patch seems to fix
https://github.com/tarantool/tarantool/issues/5440

> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 512dbd51f..2e09d5405 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -33,7 +33,6 @@
>   #include "error.h"
>   #include "fiber.h"
>   #include "small/region.h"
> -#include "box.h"
>   #include "tt_static.h"
>   
>   /**
> @@ -603,8 +602,6 @@ raft_sm_become_leader(struct raft *raft)
>   	raft->state = RAFT_STATE_LEADER;
>   	raft->leader = raft->self;
>   	ev_timer_stop(loop(), &raft->timer);
> -	/* Make read-write (if other subsystems allow that. */
> -	box_update_ro_summary();
>   	/* State is visible and it is changed - broadcast. */
>   	raft_schedule_broadcast(raft);
>   }
> @@ -655,7 +652,6 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
>   	raft->volatile_vote = 0;
>   	raft->leader = 0;
>   	raft->state = RAFT_STATE_FOLLOWER;
> -	box_update_ro_summary();
>   	raft_sm_pause_and_dump(raft);
>   	/*
>   	 * State is visible and it is changed - broadcast. Term is also visible,
> @@ -686,7 +682,6 @@ raft_sm_schedule_new_election(struct raft *raft)
>   	/* Everyone is a follower until its vote for self is persisted. */
>   	raft_sm_schedule_new_term(raft, raft->term + 1);
>   	raft_sm_schedule_new_vote(raft, raft->self);
> -	box_update_ro_summary();
>   }
>   
>   static void
> @@ -771,7 +766,6 @@ raft_sm_start(struct raft *raft)
>   		 */
>   		raft_sm_wait_leader_found(raft);
>   	}
> -	box_update_ro_summary();
>   	/*
>   	 * Nothing changed. But when raft was stopped, its state wasn't sent to
>   	 * replicas. At least this was happening at the moment of this being
> @@ -793,7 +787,6 @@ raft_sm_stop(struct raft *raft)
>   		raft->leader = 0;
>   	raft->state = RAFT_STATE_FOLLOWER;
>   	ev_timer_stop(loop(), &raft->timer);
> -	box_update_ro_summary();
>   	/* State is visible and changed - broadcast. */
>   	raft_schedule_broadcast(raft);
>   }
> @@ -879,7 +872,6 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
>   			raft_schedule_broadcast(raft);
>   		}
>   	}
> -	box_update_ro_summary();
>   }
>   
>   void

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError Vladislav Shpilevoy
@ 2020-11-17 15:13   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 15:13 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Last piece of src/box used in Raft code was error.h. It was added
> to be able to raise ClientErrors. To get rid of it the libraries
> usually introduce their own error type available from
> src/lib/core. Such as CollationError, SwimError, CryptoError.
>
> This patch adds RaftError and removes the last box dependency from
> Raft code.
>
> Part of #5303
LGTM.
> ---
>   src/box/raftlib.c         |  9 ++++-----
>   src/lib/core/diag.h       |  2 ++
>   src/lib/core/exception.cc | 24 ++++++++++++++++++++++++
>   src/lib/core/exception.h  |  7 +++++++
>   4 files changed, 37 insertions(+), 5 deletions(-)
>
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 2e09d5405..b669475f3 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -30,7 +30,7 @@
>    */
>   #include "raft.h"
>   
> -#include "error.h"
> +#include "exception.h"
>   #include "fiber.h"
>   #include "small/region.h"
>   #include "tt_static.h"
> @@ -291,14 +291,13 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
>   	assert(source > 0);
>   	assert(source != raft->self);
>   	if (req->term == 0 || req->state == 0) {
> -		diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
> -			 "be zero");
> +		diag_set(RaftError, "Raft term and state can't be zero");
>   		return -1;
>   	}
>   	if (req->state == RAFT_STATE_CANDIDATE &&
>   	    (req->vote != source || req->vclock == NULL)) {
> -		diag_set(ClientError, ER_PROTOCOL, "Candidate should always "
> -			 "vote for self and provide its vclock");
> +		diag_set(RaftError, "Candidate should always vote for self and "
> +			 "provide its vclock");
>   		return -1;
>   	}
>   	/* Outdated request. */
> diff --git a/src/lib/core/diag.h b/src/lib/core/diag.h
> index 6bf0b139a..b07eea838 100644
> --- a/src/lib/core/diag.h
> +++ b/src/lib/core/diag.h
> @@ -335,6 +335,8 @@ struct error *
>   BuildSwimError(const char *file, unsigned line, const char *format, ...);
>   struct error *
>   BuildCryptoError(const char *file, unsigned line, const char *format, ...);
> +struct error *
> +BuildRaftError(const char *file, unsigned line, const char *format, ...);
>   
>   struct index_def;
>   
> diff --git a/src/lib/core/exception.cc b/src/lib/core/exception.cc
> index 180cb0e97..395baff6f 100644
> --- a/src/lib/core/exception.cc
> +++ b/src/lib/core/exception.cc
> @@ -288,6 +288,18 @@ CryptoError::CryptoError(const char *file, unsigned line,
>   	va_end(ap);
>   }
>   
> +const struct type_info type_RaftError =
> +	make_type("RaftError", &type_Exception);
> +
> +RaftError::RaftError(const char *file, unsigned line, const char *format, ...)
> +	: Exception(&type_RaftError, file, line)
> +{
> +	va_list ap;
> +	va_start(ap, format);
> +	error_vformat_msg(this, format, ap);
> +	va_end(ap);
> +}
> +
>   #define BuildAlloc(type)				\
>   	void *p = malloc(sizeof(type));			\
>   	if (p == NULL)					\
> @@ -409,6 +421,18 @@ BuildSocketError(const char *file, unsigned line, const char *socketname,
>   	return e;
>   }
>   
> +struct error *
> +BuildRaftError(const char *file, unsigned line, const char *format, ...)
> +{
> +	BuildAlloc(RaftError);
> +	RaftError *e =  new (p) RaftError(file, line, "");
> +	va_list ap;
> +	va_start(ap, format);
> +	error_vformat_msg(e, format, ap);
> +	va_end(ap);
> +	return e;
> +}
> +
>   void
>   exception_init()
>   {
> diff --git a/src/lib/core/exception.h b/src/lib/core/exception.h
> index 1947b4f00..7277b2784 100644
> --- a/src/lib/core/exception.h
> +++ b/src/lib/core/exception.h
> @@ -52,6 +52,7 @@ extern const struct type_info type_SystemError;
>   extern const struct type_info type_CollationError;
>   extern const struct type_info type_SwimError;
>   extern const struct type_info type_CryptoError;
> +extern const struct type_info type_RaftError;
>   
>   const char *
>   exception_get_string(struct error *e, const struct method_info *method);
> @@ -168,6 +169,12 @@ public:
>   	virtual void raise() { throw this; }
>   };
>   
> +class RaftError: public Exception {
> +public:
> +	RaftError(const char *file, unsigned line, const char *format, ...);
> +	virtual void raise() { throw this; }
> +};
> +
>   /**
>    * Initialize the exception subsystem.
>    */

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft
  2020-11-17  0:02 ` [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
@ 2020-11-17 15:13   ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 15:13 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 03:02, Vladislav Shpilevoy пишет:
> Raft algorithm code does not depend on box anymore, and is moved
> to src/lib/raft.
>
> This is done to be able to unit test Raft similarly to Swim - with
> virtual event loop, network, time, and disk. Using any number of
> instances. That will allow to cover all crazy and rare cases
> possible in Raft, but without problems of functional tests
> stability and clumsiness.
>
> Part of #5303


LGTM.


> ---
>   src/box/CMakeLists.txt                 | 3 +--
>   src/box/raft.h                         | 2 +-
>   src/lib/CMakeLists.txt                 | 1 +
>   src/lib/raft/CMakeLists.txt            | 7 +++++++
>   src/{box/raftlib.c => lib/raft/raft.c} | 0
>   src/{box/raftlib.h => lib/raft/raft.h} | 0
>   6 files changed, 10 insertions(+), 3 deletions(-)
>   create mode 100644 src/lib/raft/CMakeLists.txt
>   rename src/{box/raftlib.c => lib/raft/raft.c} (100%)
>   rename src/{box/raftlib.h => lib/raft/raft.h} (100%)
>
> diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
> index fcf779379..a7547c29f 100644
> --- a/src/box/CMakeLists.txt
> +++ b/src/box/CMakeLists.txt
> @@ -169,7 +169,6 @@ add_library(box STATIC
>       port.c
>       txn.c
>       txn_limbo.c
> -    raftlib.c
>       raft.c
>       box.cc
>       gc.c
> @@ -263,6 +262,6 @@ add_custom_command(OUTPUT ${SQL_BIN_DIR}/opcodes.c
>           ${SQL_BIN_DIR}/opcodes.h)
>   
>   target_link_libraries(box box_error tuple stat xrow xlog vclock crc32 scramble
> -                      ${common_libraries})
> +                      raft ${common_libraries})
>   
>   add_dependencies(box build_bundled_libs generate_sql_files)
> diff --git a/src/box/raft.h b/src/box/raft.h
> index 4dffce380..7e0768cd3 100644
> --- a/src/box/raft.h
> +++ b/src/box/raft.h
> @@ -29,7 +29,7 @@
>    * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>    * SUCH DAMAGE.
>    */
> -#include "raftlib.h"
> +#include "raft/raft.h"
>   
>   #if defined(__cplusplus)
>   extern "C" {
> diff --git a/src/lib/CMakeLists.txt b/src/lib/CMakeLists.txt
> index de1b902c6..cabbe3d89 100644
> --- a/src/lib/CMakeLists.txt
> +++ b/src/lib/CMakeLists.txt
> @@ -14,6 +14,7 @@ add_subdirectory(crypto)
>   add_subdirectory(swim)
>   add_subdirectory(mpstream)
>   add_subdirectory(vclock)
> +add_subdirectory(raft)
>   if(ENABLE_BUNDLED_MSGPUCK)
>       add_subdirectory(msgpuck EXCLUDE_FROM_ALL)
>   endif()
> diff --git a/src/lib/raft/CMakeLists.txt b/src/lib/raft/CMakeLists.txt
> new file mode 100644
> index 000000000..aef2bacf7
> --- /dev/null
> +++ b/src/lib/raft/CMakeLists.txt
> @@ -0,0 +1,7 @@
> +set(lib_sources
> +    raft.c
> +)
> +
> +set_source_files_compile_flags(${lib_sources})
> +add_library(raft STATIC ${lib_sources})
> +target_link_libraries(raft core)
> diff --git a/src/box/raftlib.c b/src/lib/raft/raft.c
> similarity index 100%
> rename from src/box/raftlib.c
> rename to src/lib/raft/raft.c
> diff --git a/src/box/raftlib.h b/src/lib/raft/raft.h
> similarity index 100%
> rename from src/box/raftlib.h
> rename to src/lib/raft/raft.h

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger
  2020-11-17 12:42   ` Serge Petrenko
@ 2020-11-17 15:17     ` Serge Petrenko
  2020-11-18 23:21     ` Vladislav Shpilevoy
  1 sibling, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-17 15:17 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


17.11.2020 15:42, Serge Petrenko пишет:
>
> 17.11.2020 03:02, Vladislav Shpilevoy пишет:
>> box_update_ro_summary() was called from the basic Raft library,
>> making it depend on box. But it was called every time when Raft
>> state was changed and broadcasted. It means the same effect can be
>> achieved by updating RO summary from Raft state update trigger.
>>
>> The patch does it, and now Raft code does not depend on box.h.
>>
>> Part of #5303
>> ---
>>   src/box/raft.c    | 2 ++
>>   src/box/raftlib.c | 8 --------
>>   2 files changed, 2 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index f3652bbcb..db1a3f423 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -77,6 +77,8 @@ box_raft_on_update_f(struct trigger *trigger, void 
>> *event)
>>       (void)trigger;
>>       struct raft *raft = (struct raft *)event;
>>       assert(raft == box_raft());
>> +    /* State or enablence could be changed, affecting read-only 
>> state. */
>> +    box_update_ro_summary();
>>       if (raft->state != RAFT_STATE_LEADER)
>>           return 0;
>>       /*
>
>
> Raft uses synchronous WAL write, corect?
>
> So there's a yield in raft_worker_handle_io(). Now there's a period of 
> time when
> an instance is a follower, but it isn't read-only.
>
> When you reconfigure a leader to become voter, everything's fine, 
> since no
> writing to disk is involved.
>
> However, if an existing leader receives a message with term greater, 
> than its own,
> it'll first persist this term, and thus yield, and proceed to 
> broadcast and switching
> to ro later.
>
> So now it's possible that a follower is writeable for some period of 
> time.
>
> Maybe run on_update triggers before the wal write? Even better, run 
> the triggers
> on the actual state transition. After each  `raft->state = ...`.
>
> On the bright side, your patch seems to fix
> https://github.com/tarantool/tarantool/issues/5440


I've addressed the issue in a patch based on your branch,
"[PATCH] raft: execute triggers exactly on state change"

Take a look, please.

>
>> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
>> index 512dbd51f..2e09d5405 100644
>> --- a/src/box/raftlib.c
>> +++ b/src/box/raftlib.c
>> @@ -33,7 +33,6 @@
>>   #include "error.h"
>>   #include "fiber.h"
>>   #include "small/region.h"
>> -#include "box.h"
>>   #include "tt_static.h"
>>     /**
>> @@ -603,8 +602,6 @@ raft_sm_become_leader(struct raft *raft)
>>       raft->state = RAFT_STATE_LEADER;
>>       raft->leader = raft->self;
>>       ev_timer_stop(loop(), &raft->timer);
>> -    /* Make read-write (if other subsystems allow that. */
>> -    box_update_ro_summary();
>>       /* State is visible and it is changed - broadcast. */
>>       raft_schedule_broadcast(raft);
>>   }
>> @@ -655,7 +652,6 @@ raft_sm_schedule_new_term(struct raft *raft, 
>> uint64_t new_term)
>>       raft->volatile_vote = 0;
>>       raft->leader = 0;
>>       raft->state = RAFT_STATE_FOLLOWER;
>> -    box_update_ro_summary();
>>       raft_sm_pause_and_dump(raft);
>>       /*
>>        * State is visible and it is changed - broadcast. Term is also 
>> visible,
>> @@ -686,7 +682,6 @@ raft_sm_schedule_new_election(struct raft *raft)
>>       /* Everyone is a follower until its vote for self is persisted. */
>>       raft_sm_schedule_new_term(raft, raft->term + 1);
>>       raft_sm_schedule_new_vote(raft, raft->self);
>> -    box_update_ro_summary();
>>   }
>>     static void
>> @@ -771,7 +766,6 @@ raft_sm_start(struct raft *raft)
>>            */
>>           raft_sm_wait_leader_found(raft);
>>       }
>> -    box_update_ro_summary();
>>       /*
>>        * Nothing changed. But when raft was stopped, its state wasn't 
>> sent to
>>        * replicas. At least this was happening at the moment of this 
>> being
>> @@ -793,7 +787,6 @@ raft_sm_stop(struct raft *raft)
>>           raft->leader = 0;
>>       raft->state = RAFT_STATE_FOLLOWER;
>>       ev_timer_stop(loop(), &raft->timer);
>> -    box_update_ro_summary();
>>       /* State is visible and changed - broadcast. */
>>       raft_schedule_broadcast(raft);
>>   }
>> @@ -879,7 +872,6 @@ raft_cfg_is_candidate(struct raft *raft, bool 
>> is_candidate)
>>               raft_schedule_broadcast(raft);
>>           }
>>       }
>> -    box_update_ro_summary();
>>   }
>>     void
>
-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger
  2020-11-17 12:42   ` Serge Petrenko
  2020-11-17 15:17     ` Serge Petrenko
@ 2020-11-18 23:21     ` Vladislav Shpilevoy
  2020-11-19 10:08       ` Serge Petrenko
  1 sibling, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-18 23:21 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Thanks for the review!

> Raft uses synchronous WAL write, corect?
> 
> So there's a yield in raft_worker_handle_io(). Now there's a period of time when
> an instance is a follower, but it isn't read-only.
> 
> When you reconfigure a leader to become voter, everything's fine, since no
> writing to disk is involved.
> 
> However, if an existing leader receives a message with term greater, than its own,
> it'll first persist this term, and thus yield, and proceed to broadcast and switching
> to ro later.
> 
> So now it's possible that a follower is writeable for some period of time.

You are a savior. Thanks for the deep review and for noticing this.

Indeed. The issue exists. And it is much deeper, it seems.

I also glanced at your patch about RO-update vs limbo-clear order.
Unfortunately, it does not work. As well as my idea about running on_update
triggers from raft_schedule_broadcast().

The reason is that currently our on_update trigger yields. In both our
solutions. Because it calls box_clear_synchro_queue(). And this is exactly
what I was trying to avoid by using a fiber in the first place. The state
machine must not yield.

Also there is another issue, not related to the patch, but which I spotted
while worked on it today. The issue is - we can't cancel limbo clearance.
The node could be demoted to a follower during waiting for confirms, but it
still will wait for confirms and we can't stop it.

I started thinking that we could resolve both these issues if box/raft.c could
run update triggers without yields right away, but schedule async work, like
limbo clear, into a separate fiber, cancellable.

And I realized that we already have this fiber - raft.worker.

The idea is that we can move the worker fiber to box/raft.c. And in libraft
instead of a fiber we will have 2 methods:

	- raft_vtab.async_f - virtual method called by Raft, when it wants
	  to schedule some async heavy work (network, disk). We will call it
	  instead of raft_worker_wakeup().

	- raft_process_async - normal method, which the Raft owner should call
	  to handle all async events. In a separate fiber. This is the same
	  as raft_worker_f(), but not depending on fiber, and finite.

In box/raft.c we have a worker fiber. To libraft we give async_f, which
creates the fiber on demand, and wakes it up. No yields. Like now, but in
box/raft.c.

The fiber in its body will call raft_process_async() and fiber_yield() until
it is cancelled.

Now how does it fix the update triggers? - we fire on_update triggers in
raft_schedule_broadcast(), but box/raft.c in the trigger will only update
RO summary. It won't yield. For the limbo clearance it will wakeup the worker
fiber, which now belongs to box/raft.c, so it is totally fine. The worker
will call raft_process_async() and will clear the limbo when it is time.

Also the worker fiber can be cancelled/interrupted somehow if we want to
stop limbo clear when the node is not a leader anymore.

I started working on this already, and it seems to be good. Raft is simplified
even more, and we delete the ugly hack in box_raft_free() about changing struct
raft with box_raft_global.worker = NULL. We still nullify the fiber, but we
don't change struct raft.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger
  2020-11-18 23:21     ` Vladislav Shpilevoy
@ 2020-11-19 10:08       ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-19 10:08 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


19.11.2020 02:21, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>> Raft uses synchronous WAL write, corect?
>>
>> So there's a yield in raft_worker_handle_io(). Now there's a period of time when
>> an instance is a follower, but it isn't read-only.
>>
>> When you reconfigure a leader to become voter, everything's fine, since no
>> writing to disk is involved.
>>
>> However, if an existing leader receives a message with term greater, than its own,
>> it'll first persist this term, and thus yield, and proceed to broadcast and switching
>> to ro later.
>>
>> So now it's possible that a follower is writeable for some period of time.
> You are a savior. Thanks for the deep review and for noticing this.
>
> Indeed. The issue exists. And it is much deeper, it seems.
>
> I also glanced at your patch about RO-update vs limbo-clear order.
> Unfortunately, it does not work. As well as my idea about running on_update
> triggers from raft_schedule_broadcast().
>
> The reason is that currently our on_update trigger yields. In both our
> solutions. Because it calls box_clear_synchro_queue(). And this is exactly
> what I was trying to avoid by using a fiber in the first place. The state
> machine must not yield.
>
> Also there is another issue, not related to the patch, but which I spotted
> while worked on it today. The issue is - we can't cancel limbo clearance.
> The node could be demoted to a follower during waiting for confirms, but it
> still will wait for confirms and we can't stop it.
>
> I started thinking that we could resolve both these issues if box/raft.c could
> run update triggers without yields right away, but schedule async work, like
> limbo clear, into a separate fiber, cancellable.
>
> And I realized that we already have this fiber - raft.worker.
>
> The idea is that we can move the worker fiber to box/raft.c. And in libraft
> instead of a fiber we will have 2 methods:
>
> 	- raft_vtab.async_f - virtual method called by Raft, when it wants
> 	  to schedule some async heavy work (network, disk). We will call it
> 	  instead of raft_worker_wakeup().
>
> 	- raft_process_async - normal method, which the Raft owner should call
> 	  to handle all async events. In a separate fiber. This is the same
> 	  as raft_worker_f(), but not depending on fiber, and finite.
>
> In box/raft.c we have a worker fiber. To libraft we give async_f, which
> creates the fiber on demand, and wakes it up. No yields. Like now, but in
> box/raft.c.
>
> The fiber in its body will call raft_process_async() and fiber_yield() until
> it is cancelled.
>
> Now how does it fix the update triggers? - we fire on_update triggers in
> raft_schedule_broadcast(), but box/raft.c in the trigger will only update
> RO summary. It won't yield. For the limbo clearance it will wakeup the worker
> fiber, which now belongs to box/raft.c, so it is totally fine. The worker
> will call raft_process_async() and will clear the limbo when it is time.
>
> Also the worker fiber can be cancelled/interrupted somehow if we want to
> stop limbo clear when the node is not a leader anymore.
>
> I started working on this already, and it seems to be good. Raft is simplified
> even more, and we delete the ugly hack in box_raft_free() about changing struct
> raft with box_raft_global.worker = NULL. We still nullify the fiber, but we
> don't change struct raft.


Thanks for the answer & investigation!

Looks good at first glance.

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum
  2020-11-17  8:17   ` Serge Petrenko
@ 2020-11-19 23:42     ` Vladislav Shpilevoy
  0 siblings, 0 replies; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:42 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Thanks for the review!

>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index f289a6993..af6e71e0b 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -62,6 +63,57 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
>>  	return 0;
>>  }
>>  
>> +void
>> +box_raft_reconsider_election_quorum(void)
> 
> Suggestion: maybe use "rewrite"/"reset" instead of "reconsider"?
> Or plain "update"?
> 
> Other than that, LGTM.

I took "update". Now it is box_raft_update_election_quorum().

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-17  9:35   ` Serge Petrenko
@ 2020-11-19 23:43     ` Vladislav Shpilevoy
  0 siblings, 0 replies; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:43 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Thanks for the review!

>>   @@ -598,8 +553,7 @@ raft_worker_handle_broadcast(struct raft *raft)
>>           assert(raft->vote == raft->self);
>>           req.vclock = raft->vclock;
>>       }
>> -    replicaset_foreach(replica)
>> -        relay_push_raft(replica->relay, &req);
>> +    raft->vtab->broadcast(raft, &req);
> 
> 
> I'd introduce helpers, like raft_write() and raft_broadcast(),
> to hide raft->vtab->... calls. Up to you, though.

Introduced raft_write and raft_broadcast.

====================
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index cc9139a5b..2f5e90f21 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -62,6 +62,20 @@ raft_state_str(uint32_t state)
 	return "invalid (x)";
 };
 
+/** Shortcut for vtab 'broadcast' method. */
+static inline void
+raft_broadcast(struct raft *raft, const struct raft_request *req)
+{
+	raft->vtab->broadcast(raft, req);
+}
+
+/** Shortcut for vtab 'write' method. */
+static inline void
+raft_write(struct raft *raft, const struct raft_request *req)
+{
+	raft->vtab->write(raft, req);
+}
+
 /**
  * Check if Raft is completely synced with disk. Meaning all its critical values
  * are in WAL. Only in that state the node can become a leader or a candidate.
@@ -523,7 +537,7 @@ end_dump:
 		 * is a leader now, after the node is restarted, there will be
 		 * another leader elected by that time likely.
 		 */
-		raft->vtab->write(raft, &req);
+		raft_write(raft, &req);
 		say_info("RAFT: persisted state %s",
 			 raft_request_to_string(&req));
 
@@ -553,7 +567,7 @@ raft_worker_handle_broadcast(struct raft *raft)
 		assert(raft->vote == raft->self);
 		req.vclock = raft->vclock;
 	}
-	raft->vtab->broadcast(raft, &req);
+	raft_broadcast(raft, &req);
 	trigger_run(&raft->on_update, raft);
 	raft->is_broadcast_scheduled = false;
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-17 10:00   ` Serge Petrenko
@ 2020-11-19 23:43     ` Vladislav Shpilevoy
  2020-11-20  7:56       ` Serge Petrenko
  0 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:43 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index af6e71e0b..845525660 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -29,7 +29,10 @@
>>    * SUCH DAMAGE.
>>    */
>>   #include "box.h"
>> +#include "error.h"
>> +#include "journal.h"
>>   #include "raft.h"
>> +#include "relay.h"
>>   #include "replication.h"
> 
> Just noticed: raft.c starts depending on xrow.h in this patch.

This is fine. box/raft.c can depend on anything. It is raftlib.c/.h,
who is not allowed to depend on anything except src/lib.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency
  2020-11-17 10:22   ` Serge Petrenko
@ 2020-11-19 23:43     ` Vladislav Shpilevoy
  2020-11-20  8:03       ` Serge Petrenko
  0 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:43 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Thanks for the review!

>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index 845525660..f3652bbcb 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -49,6 +49,28 @@ struct raft box_raft_global = {
>>    */
>>   static struct trigger box_raft_on_update;
>>   +static void
>> +box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
>> +{
>> +    *req = (struct raft_request) {
>> +        .term = msg->term,
>> +        .vote = msg->vote,
>> +        .state = msg->state,
>> +        .vclock = msg->vclock,
>> +    };
>> +}
>> +
>> +static void
>> +box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
>> +{
>> +    *msg = (struct raft_msg) {
>> +        .term = req->term,
>> +        .vote = req->vote,
>> +        .state = req->state,
>> +        .vclock = req->vclock,
>> +    };
>> +}
>> +
>>   static int
>>   box_raft_on_update_f(struct trigger *trigger, void *event)
>>   {
> 
> Have you considered making `struct raft_msg` a member of `struct raft_request`?
> This way you'll avoid copying.
> Yes, xrow will start depending on raftlib, but is this too bad?

This is what I was trying to avoid. Currently xrow does not depend on anything.
It is a pure protocol library. I am not saying it is good, but I don't want to
create a precedent here, allowing to link xrow with everything else in future.

If it was up to me, I would split xrow, so as each library and subsystem decides
how to encode its shit.

> Never mind, it'll only work on raft_request -> raft_msg transition,
> but not vice versa. So, just an idea.
> 
>> diff --git a/src/box/raft.h b/src/box/raft.h
>> index 09297273f..4dffce380 100644
>> --- a/src/box/raft.h
>> +++ b/src/box/raft.h
>> @@ -35,6 +35,8 @@
>>   extern "C" {
>>   #endif
>>   +struct raft_request;
>> +
>>   /** Raft state of this instance. */
>>   static inline struct raft *
>>   box_raft(void)
>> @@ -56,6 +58,28 @@ box_raft(void)
>>   void
>>   box_raft_reconsider_election_quorum(void);
>>   +/**
>> + * Recovery a single Raft request. Raft state machine is not turned on yet, this
>> + * works only during instance recovery from the journal.
>> + */
> 
> 
> Typo: Recovery -> Recover

Indeed, fixed.

====================
 /**
- * Recovery a single Raft request. Raft state machine is not turned on yet, this
+ * Recover a single Raft request. Raft state machine is not turned on yet, this
  * works only during instance recovery from the journal.
====================

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-19 23:43     ` Vladislav Shpilevoy
@ 2020-11-20  7:56       ` Serge Petrenko
  2020-11-20 19:40         ` Vladislav Shpilevoy
  0 siblings, 1 reply; 37+ messages in thread
From: Serge Petrenko @ 2020-11-20  7:56 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


20.11.2020 02:43, Vladislav Shpilevoy пишет:
>>> diff --git a/src/box/raft.c b/src/box/raft.c
>>> index af6e71e0b..845525660 100644
>>> --- a/src/box/raft.c
>>> +++ b/src/box/raft.c
>>> @@ -29,7 +29,10 @@
>>>     * SUCH DAMAGE.
>>>     */
>>>    #include "box.h"
>>> +#include "error.h"
>>> +#include "journal.h"
>>>    #include "raft.h"
>>> +#include "relay.h"
>>>    #include "replication.h"
>> Just noticed: raft.c starts depending on xrow.h in this patch.
> This is fine. box/raft.c can depend on anything. It is raftlib.c/.h,
> who is not allowed to depend on anything except src/lib.

I meant you should include xrow.h here directly.

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency
  2020-11-19 23:43     ` Vladislav Shpilevoy
@ 2020-11-20  8:03       ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-20  8:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


20.11.2020 02:43, Vladislav Shpilevoy пишет:
> Thanks for the review!
>
>>> diff --git a/src/box/raft.c b/src/box/raft.c
>>> index 845525660..f3652bbcb 100644
>>> --- a/src/box/raft.c
>>> +++ b/src/box/raft.c
>>> @@ -49,6 +49,28 @@ struct raft box_raft_global = {
>>>     */
>>>    static struct trigger box_raft_on_update;
>>>    +static void
>>> +box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
>>> +{
>>> +    *req = (struct raft_request) {
>>> +        .term = msg->term,
>>> +        .vote = msg->vote,
>>> +        .state = msg->state,
>>> +        .vclock = msg->vclock,
>>> +    };
>>> +}
>>> +
>>> +static void
>>> +box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
>>> +{
>>> +    *msg = (struct raft_msg) {
>>> +        .term = req->term,
>>> +        .vote = req->vote,
>>> +        .state = req->state,
>>> +        .vclock = req->vclock,
>>> +    };
>>> +}
>>> +
>>>    static int
>>>    box_raft_on_update_f(struct trigger *trigger, void *event)
>>>    {
>> Have you considered making `struct raft_msg` a member of `struct raft_request`?
>> This way you'll avoid copying.
>> Yes, xrow will start depending on raftlib, but is this too bad?
> This is what I was trying to avoid. Currently xrow does not depend on anything.
> It is a pure protocol library. I am not saying it is good, but I don't want to
> create a precedent here, allowing to link xrow with everything else in future.
>
> If it was up to me, I would split xrow, so as each library and subsystem decides
> how to encode its shit.


Haha, ok.


>
>> Never mind, it'll only work on raft_request -> raft_msg transition,
>> but not vice versa. So, just an idea.
>>
>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>> index 09297273f..4dffce380 100644
>>> --- a/src/box/raft.h
>>> +++ b/src/box/raft.h
>>> @@ -35,6 +35,8 @@
>>>    extern "C" {
>>>    #endif
>>>    +struct raft_request;
>>> +
>>>    /** Raft state of this instance. */
>>>    static inline struct raft *
>>>    box_raft(void)
>>> @@ -56,6 +58,28 @@ box_raft(void)
>>>    void
>>>    box_raft_reconsider_election_quorum(void);
>>>    +/**
>>> + * Recovery a single Raft request. Raft state machine is not turned on yet, this
>>> + * works only during instance recovery from the journal.
>>> + */
>>
>> Typo: Recovery -> Recover
> Indeed, fixed.
>
> ====================
>   /**
> - * Recovery a single Raft request. Raft state machine is not turned on yet, this
> + * Recover a single Raft request. Raft state machine is not turned on yet, this
>    * works only during instance recovery from the journal.
> ====================


Thanks for the fix!

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-20  7:56       ` Serge Petrenko
@ 2020-11-20 19:40         ` Vladislav Shpilevoy
  2020-11-23  8:09           ` Serge Petrenko
  0 siblings, 1 reply; 37+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-20 19:40 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

>>>> diff --git a/src/box/raft.c b/src/box/raft.c
>>>> index af6e71e0b..845525660 100644
>>>> --- a/src/box/raft.c
>>>> +++ b/src/box/raft.c
>>>> @@ -29,7 +29,10 @@
>>>>     * SUCH DAMAGE.
>>>>     */
>>>>    #include "box.h"
>>>> +#include "error.h"
>>>> +#include "journal.h"
>>>>    #include "raft.h"
>>>> +#include "relay.h"
>>>>    #include "replication.h"
>>> Just noticed: raft.c starts depending on xrow.h in this patch.
>> This is fine. box/raft.c can depend on anything. It is raftlib.c/.h,
>> who is not allowed to depend on anything except src/lib.
> 
> I meant you should include xrow.h here directly.

Aha, ok. I thought that it is enough to have it implicitly through other
files. But I don't mind adding it.

====================
@@ -34,6 +34,7 @@
 #include "raft.h"
 #include "relay.h"
 #include "replication.h"
+#include "xrow.h"

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network
  2020-11-20 19:40         ` Vladislav Shpilevoy
@ 2020-11-23  8:09           ` Serge Petrenko
  0 siblings, 0 replies; 37+ messages in thread
From: Serge Petrenko @ 2020-11-23  8:09 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov


20.11.2020 22:40, Vladislav Shpilevoy пишет:
>>>>> diff --git a/src/box/raft.c b/src/box/raft.c
>>>>> index af6e71e0b..845525660 100644
>>>>> --- a/src/box/raft.c
>>>>> +++ b/src/box/raft.c
>>>>> @@ -29,7 +29,10 @@
>>>>>      * SUCH DAMAGE.
>>>>>      */
>>>>>     #include "box.h"
>>>>> +#include "error.h"
>>>>> +#include "journal.h"
>>>>>     #include "raft.h"
>>>>> +#include "relay.h"
>>>>>     #include "replication.h"
>>>> Just noticed: raft.c starts depending on xrow.h in this patch.
>>> This is fine. box/raft.c can depend on anything. It is raftlib.c/.h,
>>> who is not allowed to depend on anything except src/lib.
>> I meant you should include xrow.h here directly.
> Aha, ok. I thought that it is enough to have it implicitly through other
> files. But I don't mind adding it.
>
> ====================
> @@ -34,6 +34,7 @@
>   #include "raft.h"
>   #include "relay.h"
>   #include "replication.h"
> +#include "xrow.h"


Thanks for the change!

Yes, it's enough to have it included in some other file, but AFAIR we always
include dependencies directly. Either in raft.h (if needed) or raft.c in 
this case.

I may be mistaken though.

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2020-11-23  8:09 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-17  0:02 [Tarantool-patches] [PATCH 00/12] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 01/12] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
2020-11-17  8:14   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger Vladislav Shpilevoy
2020-11-17 12:42   ` Serge Petrenko
2020-11-17 15:17     ` Serge Petrenko
2020-11-18 23:21     ` Vladislav Shpilevoy
2020-11-19 10:08       ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 11/12] raft: introduce RaftError Vladislav Shpilevoy
2020-11-17 15:13   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 12/12] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
2020-11-17 15:13   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 02/12] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
2020-11-17  8:14   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 03/12] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
2020-11-17  8:15   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 04/12] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
2020-11-17  8:17   ` Serge Petrenko
2020-11-19 23:42     ` Vladislav Shpilevoy
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 05/12] raft: stop using instance_id Vladislav Shpilevoy
2020-11-17  8:59   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 06/12] raft: make raft_request.vclock constant Vladislav Shpilevoy
2020-11-17  9:17   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 07/12] raft: stop using replicaset.vclock Vladislav Shpilevoy
2020-11-17  9:23   ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 08/12] raft: introduce vtab for disk and network Vladislav Shpilevoy
2020-11-17  9:35   ` Serge Petrenko
2020-11-19 23:43     ` Vladislav Shpilevoy
2020-11-17 10:00   ` Serge Petrenko
2020-11-19 23:43     ` Vladislav Shpilevoy
2020-11-20  7:56       ` Serge Petrenko
2020-11-20 19:40         ` Vladislav Shpilevoy
2020-11-23  8:09           ` Serge Petrenko
2020-11-17  0:02 ` [Tarantool-patches] [PATCH 09/12] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
2020-11-17 10:22   ` Serge Petrenko
2020-11-19 23:43     ` Vladislav Shpilevoy
2020-11-20  8:03       ` Serge Petrenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox