* [Tarantool-patches] [PATCH v2 01/16] raft: move sources to raftlib.h/.c
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
@ 2020-11-19 23:45 ` Vladislav Shpilevoy
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 10/16] raft: make worker non-cancellable during WAL write Vladislav Shpilevoy
` (17 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:45 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
The commit renames raft.h and raft.c to raftlib.h and raftlib.c.
This is done to prepare to Raft split into src/box/ and
src/lib/raft.
The commit is not atomic, the build won't work here. Because if
raft is renamed to raftlib, and in the same commit new raft.c and
raft.h are added, git thinks the original file was changed, and
ruins all the git history.
By splitting move of raft to raftlib and introduction of box/raft
into 2 commits the git history is saved.
Part of #5303
---
src/box/CMakeLists.txt | 1 +
src/box/{raft.c => raftlib.c} | 0
src/box/{raft.h => raftlib.h} | 0
3 files changed, 1 insertion(+)
rename src/box/{raft.c => raftlib.c} (100%)
rename src/box/{raft.h => raftlib.h} (100%)
diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index d1667796a..fcf779379 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -169,6 +169,7 @@ add_library(box STATIC
port.c
txn.c
txn_limbo.c
+ raftlib.c
raft.c
box.cc
gc.c
diff --git a/src/box/raft.c b/src/box/raftlib.c
similarity index 100%
rename from src/box/raft.c
rename to src/box/raftlib.c
diff --git a/src/box/raft.h b/src/box/raftlib.h
similarity index 100%
rename from src/box/raft.h
rename to src/box/raftlib.h
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 10/16] raft: make worker non-cancellable during WAL write
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 01/16] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
@ 2020-11-19 23:45 ` Vladislav Shpilevoy
2020-11-20 8:33 ` Serge Petrenko
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 11/16] raft: move worker fiber from Raft library to box Vladislav Shpilevoy
` (16 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:45 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
This is a general practice throughout the code. If a fiber is not
cancellable, it always means a system fiber which can't be woken
up or canceled until its work done. It is used in gc (about
xlogs), recovery, vinyl, WAL, at least.
Before raft used flag raft.is_write_in_progress. But it won't
work soon, because the worker fiber will move to box/raft.c,
where it would be incorrect to rely on deeply internal parts of
struct raft, such as is_write_in_progress.
Hence, this patch makes raft use a more traditional way of
spurious wakeup avoidance.
Part of #5303
---
src/box/raft.c | 9 ++++++++-
src/box/raftlib.c | 10 ++++++----
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/box/raft.c b/src/box/raft.c
index 2c9ed11b6..8a034687b 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -213,7 +213,14 @@ box_raft_write(struct raft *raft, const struct raft_msg *msg)
journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
fiber());
- if (journal_write(entry) != 0 || entry->res < 0) {
+ /*
+ * A non-cancelable fiber is considered non-wake-able, generally. Raft
+ * follows this pattern of 'protection'.
+ */
+ bool cancellable = fiber_set_cancellable(false);
+ bool ok = (journal_write(entry) == 0 && entry->res >= 0);
+ fiber_set_cancellable(cancellable);
+ if (!ok) {
diag_set(ClientError, ER_WAL_IO);
diag_log();
goto fail;
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index d28e51871..a1fca34cd 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -986,11 +986,13 @@ raft_worker_wakeup(struct raft *raft)
fiber_set_joinable(raft->worker, true);
}
/*
- * Don't wake the fiber if it writes something. Otherwise it would be a
- * spurious wakeup breaking the WAL write not adapted to this. Also
- * don't wakeup the current fiber - it leads to undefined behaviour.
+ * Don't wake the fiber if it writes something (not cancellable).
+ * Otherwise it would be a spurious wakeup breaking the WAL write not
+ * adapted to this. Also don't wakeup the current fiber - it leads to
+ * undefined behaviour.
*/
- if (!raft->is_write_in_progress && fiber() != raft->worker)
+ if ((raft->worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
+ fiber() != raft->worker)
fiber_wakeup(raft->worker);
}
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 10/16] raft: make worker non-cancellable during WAL write
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 10/16] raft: make worker non-cancellable during WAL write Vladislav Shpilevoy
@ 2020-11-20 8:33 ` Serge Petrenko
0 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 8:33 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:45, Vladislav Shpilevoy пишет:
> This is a general practice throughout the code. If a fiber is not
> cancellable, it always means a system fiber which can't be woken
> up or canceled until its work done. It is used in gc (about
> xlogs), recovery, vinyl, WAL, at least.
>
> Before raft used flag raft.is_write_in_progress. But it won't
> work soon, because the worker fiber will move to box/raft.c,
> where it would be incorrect to rely on deeply internal parts of
> struct raft, such as is_write_in_progress.
>
> Hence, this patch makes raft use a more traditional way of
> spurious wakeup avoidance.
>
> Part of #5303
Hi! Thanks for the patch!
> ---
> src/box/raft.c | 9 ++++++++-
> src/box/raftlib.c | 10 ++++++----
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 2c9ed11b6..8a034687b 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -213,7 +213,14 @@ box_raft_write(struct raft *raft, const struct raft_msg *msg)
> journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
> fiber());
>
> - if (journal_write(entry) != 0 || entry->res < 0) {
> + /*
> + * A non-cancelable fiber is considered non-wake-able, generally. Raft
> + * follows this pattern of 'protection'.
> + */
> + bool cancellable = fiber_set_cancellable(false);
> + bool ok = (journal_write(entry) == 0 && entry->res >= 0);
> + fiber_set_cancellable(cancellable);
> + if (!ok) {
> diag_set(ClientError, ER_WAL_IO);
> diag_log();
> goto fail;
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index d28e51871..a1fca34cd 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -986,11 +986,13 @@ raft_worker_wakeup(struct raft *raft)
> fiber_set_joinable(raft->worker, true);
> }
> /*
> - * Don't wake the fiber if it writes something. Otherwise it would be a
> - * spurious wakeup breaking the WAL write not adapted to this. Also
> - * don't wakeup the current fiber - it leads to undefined behaviour.
> + * Don't wake the fiber if it writes something (not cancellable).
> + * Otherwise it would be a spurious wakeup breaking the WAL write not
> + * adapted to this. Also don't wakeup the current fiber - it leads to
> + * undefined behaviour.
> */
> - if (!raft->is_write_in_progress && fiber() != raft->worker)
> + if ((raft->worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
> + fiber() != raft->worker)
> fiber_wakeup(raft->worker);
> }
>
LGTM.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 11/16] raft: move worker fiber from Raft library to box
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 01/16] raft: move sources to raftlib.h/.c Vladislav Shpilevoy
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 10/16] raft: make worker non-cancellable during WAL write Vladislav Shpilevoy
@ 2020-11-19 23:45 ` Vladislav Shpilevoy
2020-11-20 9:06 ` Serge Petrenko
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 12/16] raft: move synchro queue clear to the worker fiber Vladislav Shpilevoy
` (15 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:45 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Worker fiber is used by Raft library to perform yielding tasks
like WAL write, and simply long tasks like network broadcast. That
allows not to block the Raft state machine, and to collect
multiple updates during an event loop iteration to flush them all
at once.
While the worker fiber was inside Raft library, it wasn't possible
to use it for anything else. And that is exactly what is going to
be needed. The reason chain is quite long.
It all starts from that the elimination of all box appearances
from Raft library also includes relocation of
box_update_ro_summary().
The only place it can be moved to is box_raft_on_update trigger.
The trigger is currently called from the Raft worker fiber. It
means, that between Raft state update and trigger invocation there
is a yield. If box_update_ro_summary() would be blindly moved to
the trigger, users sometimes could observe miracles like instance
role being 'follower', but the node is still writable if it was a
leader before, because box_raft_on_update wasn't invoked yet, and
it didn't update RO summary.
Assume, the on_update triggers are invoked by Raft not in the
worker fiber, but right from the state machine. Then
box_update_ro_summary() would always follow a state change without
a yield.
However that creates another problem - the trigger also calls
box_clear_synchro_queue(), which yields. But on_update triggers
must not yield so as not to block the state machine.
This can be easily solved if it would be possible to schedule
box_clear_synchro_queue() from on_update trigger to be executed
later.
And after this patch it becomes possible, because the worker fiber
now can be used not only to handle Raft library async work, but
also for box-Raft async work, like the synchro queue clearance.
Part of #5303
---
src/box/raft.c | 65 ++++++++++++++++++++++++++++++++++-
src/box/raftlib.c | 86 +++++++++++++----------------------------------
src/box/raftlib.h | 13 +++++--
3 files changed, 98 insertions(+), 66 deletions(-)
diff --git a/src/box/raft.c b/src/box/raft.c
index 8a034687b..0027230da 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -49,6 +49,15 @@ struct raft box_raft_global = {
*/
static struct trigger box_raft_on_update;
+/**
+ * Worker fiber does all the asynchronous work, which may need yields and can be
+ * long. These are WAL writes, network broadcasts. That allows not to block the
+ * Raft state machine.
+ */
+static struct fiber *box_raft_worker = NULL;
+/** Flag installed each time when new work appears for the worker fiber. */
+static bool box_raft_has_work = false;
+
static void
box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
{
@@ -71,6 +80,59 @@ box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
};
}
+static int
+box_raft_worker_f(va_list args)
+{
+ (void)args;
+ struct raft *raft = fiber()->f_arg;
+ assert(raft == box_raft());
+ while (!fiber_is_cancelled()) {
+ box_raft_has_work = false;
+
+ raft_process_async(raft);
+
+ if (!box_raft_has_work)
+ fiber_yield();
+ }
+ return 0;
+}
+
+static void
+box_raft_schedule_async(struct raft *raft)
+{
+ assert(raft == box_raft());
+ if (box_raft_worker == NULL) {
+ box_raft_worker = fiber_new("raft_worker", box_raft_worker_f);
+ if (box_raft_worker == NULL) {
+ /*
+ * XXX: should be handled properly, no need to panic.
+ * The issue though is that most of the Raft state
+ * machine functions are not supposed to fail, and also
+ * they usually wakeup the fiber when their work is
+ * finished. So it is too late to fail. On the other
+ * hand it looks not so good to create the fiber when
+ * Raft is initialized. Because then it will occupy
+ * memory even if Raft is not used.
+ */
+ diag_log();
+ panic("Could't create Raft worker fiber");
+ return;
+ }
+ box_raft_worker->f_arg = raft;
+ fiber_set_joinable(box_raft_worker, true);
+ }
+ /*
+ * Don't wake the fiber if it writes something (not cancellable).
+ * Otherwise it would be a spurious wakeup breaking the WAL write not
+ * adapted to this. Also don't wakeup the current fiber - it leads to
+ * undefined behaviour.
+ */
+ if ((box_raft_worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
+ fiber() != box_raft_worker)
+ fiber_wakeup(box_raft_worker);
+ box_raft_has_work = true;
+}
+
static int
box_raft_on_update_f(struct trigger *trigger, void *event)
{
@@ -242,6 +304,7 @@ box_raft_init(void)
static const struct raft_vtab box_raft_vtab = {
.broadcast = box_raft_broadcast,
.write = box_raft_write,
+ .schedule_async = box_raft_schedule_async,
};
raft_create(&box_raft_global, &box_raft_vtab);
trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
@@ -255,7 +318,7 @@ box_raft_free(void)
* Can't join the fiber, because the event loop is stopped already, and
* yields are not allowed.
*/
- box_raft_global.worker = NULL;
+ box_raft_worker = NULL;
raft_destroy(&box_raft_global);
/*
* Invalidate so as box_raft() would fail if any usage attempt happens.
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index a1fca34cd..4457a784f 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -75,6 +75,23 @@ raft_write(struct raft *raft, const struct raft_msg *req)
raft->vtab->write(raft, req);
}
+/**
+ * Schedule async work. The Raft node owner should eventually process the async
+ * events.
+ */
+static inline void
+raft_schedule_async(struct raft *raft)
+{
+ /*
+ * The method is called from inside of the state machine, when yields
+ * are not allowed for its simplicity.
+ */
+ int csw = fiber()->csw;
+ raft->vtab->schedule_async(raft);
+ assert(csw == fiber()->csw);
+ (void)csw;
+}
+
/**
* Check if Raft is completely synced with disk. Meaning all its critical values
* are in WAL. Only in that state the node can become a leader or a candidate.
@@ -140,13 +157,6 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
return cmp == 0 || cmp == 1;
}
-/**
- * Wakeup the Raft worker fiber in order to do some async work. If the fiber
- * does not exist yet, it is created.
- */
-static void
-raft_worker_wakeup(struct raft *raft);
-
/** Schedule broadcast of the complete Raft state to all the followers. */
static void
raft_schedule_broadcast(struct raft *raft);
@@ -568,13 +578,11 @@ raft_worker_handle_broadcast(struct raft *raft)
raft->is_broadcast_scheduled = false;
}
-static int
-raft_worker_f(va_list args)
+void
+raft_process_async(struct raft *raft)
{
- (void)args;
- struct raft *raft = fiber()->f_arg;
bool is_idle;
- while (!fiber_is_cancelled()) {
+ do {
is_idle = true;
if (raft->is_write_in_progress) {
raft_worker_handle_io(raft);
@@ -584,14 +592,8 @@ raft_worker_f(va_list args)
raft_worker_handle_broadcast(raft);
is_idle = false;
}
- if (is_idle) {
- assert(raft_is_fully_on_disk(raft));
- fiber_yield();
- } else {
- fiber_sleep(0);
- }
- }
- return 0;
+ } while (!is_idle);
+ assert(raft_is_fully_on_disk(raft));
}
static void
@@ -601,7 +603,7 @@ raft_sm_pause_and_dump(struct raft *raft)
if (raft->is_write_in_progress)
return;
ev_timer_stop(loop(), &raft->timer);
- raft_worker_wakeup(raft);
+ raft_schedule_async(raft);
raft->is_write_in_progress = true;
}
@@ -962,45 +964,11 @@ raft_new_term(struct raft *raft)
raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
}
-static void
-raft_worker_wakeup(struct raft *raft)
-{
- if (raft->worker == NULL) {
- raft->worker = fiber_new("raft_worker", raft_worker_f);
- if (raft->worker == NULL) {
- /*
- * XXX: should be handled properly, no need to panic.
- * The issue though is that most of the Raft state
- * machine functions are not supposed to fail, and also
- * they usually wakeup the fiber when their work is
- * finished. So it is too late to fail. On the other
- * hand it looks not so good to create the fiber when
- * Raft is initialized. Because then it will occupy
- * memory even if Raft is not used.
- */
- diag_log();
- panic("Could't create Raft worker fiber");
- return;
- }
- raft->worker->f_arg = raft;
- fiber_set_joinable(raft->worker, true);
- }
- /*
- * Don't wake the fiber if it writes something (not cancellable).
- * Otherwise it would be a spurious wakeup breaking the WAL write not
- * adapted to this. Also don't wakeup the current fiber - it leads to
- * undefined behaviour.
- */
- if ((raft->worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
- fiber() != raft->worker)
- fiber_wakeup(raft->worker);
-}
-
static void
raft_schedule_broadcast(struct raft *raft)
{
raft->is_broadcast_scheduled = true;
- raft_worker_wakeup(raft);
+ raft_schedule_async(raft);
}
void
@@ -1025,10 +993,4 @@ raft_destroy(struct raft *raft)
{
ev_timer_stop(loop(), &raft->timer);
trigger_destroy(&raft->on_update);
- if (raft->worker != NULL) {
- raft_worker_wakeup(raft);
- fiber_cancel(raft->worker);
- fiber_join(raft->worker);
- raft->worker = NULL;
- }
}
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 4f4d24ca8..f545224a5 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -68,7 +68,6 @@ extern "C" {
* than the configured one. See more details in the code.
*/
-struct fiber;
struct raft;
enum raft_state {
@@ -120,6 +119,7 @@ struct raft_msg {
typedef void (*raft_broadcast_f)(struct raft *raft, const struct raft_msg *req);
typedef void (*raft_write_f)(struct raft *raft, const struct raft_msg *req);
+typedef void (*raft_schedule_async_f)(struct raft *raft);
/**
* Raft connection to the environment, via which it talks to other nodes and
@@ -130,6 +130,11 @@ struct raft_vtab {
raft_broadcast_f broadcast;
/** Save a message to disk. */
raft_write_f write;
+ /**
+ * Schedule asynchronous work which may yield, and it can't be done
+ * right now.
+ */
+ raft_schedule_async_f schedule_async;
};
struct raft {
@@ -203,8 +208,6 @@ struct raft {
const struct vclock *vclock;
/** State machine timed event trigger. */
struct ev_timer timer;
- /** Worker fiber to execute blocking tasks like IO. */
- struct fiber *worker;
/** Configured election timeout in seconds. */
double election_timeout;
/**
@@ -255,6 +258,10 @@ int
raft_process_msg(struct raft *raft, const struct raft_msg *req,
uint32_t source);
+/** Process all asynchronous events accumulated by Raft. */
+void
+raft_process_async(struct raft *raft);
+
/**
* Process a heartbeat message from an instance with the given ID. It is used to
* watch leader's health and start election when necessary.
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 11/16] raft: move worker fiber from Raft library to box
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 11/16] raft: move worker fiber from Raft library to box Vladislav Shpilevoy
@ 2020-11-20 9:06 ` Serge Petrenko
0 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 9:06 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:45, Vladislav Shpilevoy пишет:
> Worker fiber is used by Raft library to perform yielding tasks
> like WAL write, and simply long tasks like network broadcast. That
> allows not to block the Raft state machine, and to collect
> multiple updates during an event loop iteration to flush them all
> at once.
>
> While the worker fiber was inside Raft library, it wasn't possible
> to use it for anything else. And that is exactly what is going to
> be needed. The reason chain is quite long.
>
> It all starts from that the elimination of all box appearances
> from Raft library also includes relocation of
> box_update_ro_summary().
>
> The only place it can be moved to is box_raft_on_update trigger.
>
> The trigger is currently called from the Raft worker fiber. It
> means, that between Raft state update and trigger invocation there
> is a yield. If box_update_ro_summary() would be blindly moved to
> the trigger, users sometimes could observe miracles like instance
> role being 'follower', but the node is still writable if it was a
> leader before, because box_raft_on_update wasn't invoked yet, and
> it didn't update RO summary.
>
> Assume, the on_update triggers are invoked by Raft not in the
> worker fiber, but right from the state machine. Then
> box_update_ro_summary() would always follow a state change without
> a yield.
>
> However that creates another problem - the trigger also calls
> box_clear_synchro_queue(), which yields. But on_update triggers
> must not yield so as not to block the state machine.
>
> This can be easily solved if it would be possible to schedule
> box_clear_synchro_queue() from on_update trigger to be executed
> later.
>
> And after this patch it becomes possible, because the worker fiber
> now can be used not only to handle Raft library async work, but
> also for box-Raft async work, like the synchro queue clearance.
>
> Part of #5303
Thanks for the patch!
LGTM.
> ---
> src/box/raft.c | 65 ++++++++++++++++++++++++++++++++++-
> src/box/raftlib.c | 86 +++++++++++++----------------------------------
> src/box/raftlib.h | 13 +++++--
> 3 files changed, 98 insertions(+), 66 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 8a034687b..0027230da 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -49,6 +49,15 @@ struct raft box_raft_global = {
> */
> static struct trigger box_raft_on_update;
>
> +/**
> + * Worker fiber does all the asynchronous work, which may need yields and can be
> + * long. These are WAL writes, network broadcasts. That allows not to block the
> + * Raft state machine.
> + */
> +static struct fiber *box_raft_worker = NULL;
> +/** Flag installed each time when new work appears for the worker fiber. */
> +static bool box_raft_has_work = false;
> +
> static void
> box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
> {
> @@ -71,6 +80,59 @@ box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
> };
> }
>
> +static int
> +box_raft_worker_f(va_list args)
> +{
> + (void)args;
> + struct raft *raft = fiber()->f_arg;
> + assert(raft == box_raft());
> + while (!fiber_is_cancelled()) {
> + box_raft_has_work = false;
> +
> + raft_process_async(raft);
> +
> + if (!box_raft_has_work)
> + fiber_yield();
> + }
> + return 0;
> +}
> +
> +static void
> +box_raft_schedule_async(struct raft *raft)
> +{
> + assert(raft == box_raft());
> + if (box_raft_worker == NULL) {
> + box_raft_worker = fiber_new("raft_worker", box_raft_worker_f);
> + if (box_raft_worker == NULL) {
> + /*
> + * XXX: should be handled properly, no need to panic.
> + * The issue though is that most of the Raft state
> + * machine functions are not supposed to fail, and also
> + * they usually wakeup the fiber when their work is
> + * finished. So it is too late to fail. On the other
> + * hand it looks not so good to create the fiber when
> + * Raft is initialized. Because then it will occupy
> + * memory even if Raft is not used.
> + */
> + diag_log();
> + panic("Could't create Raft worker fiber");
> + return;
> + }
> + box_raft_worker->f_arg = raft;
> + fiber_set_joinable(box_raft_worker, true);
> + }
> + /*
> + * Don't wake the fiber if it writes something (not cancellable).
> + * Otherwise it would be a spurious wakeup breaking the WAL write not
> + * adapted to this. Also don't wakeup the current fiber - it leads to
> + * undefined behaviour.
> + */
> + if ((box_raft_worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
> + fiber() != box_raft_worker)
> + fiber_wakeup(box_raft_worker);
> + box_raft_has_work = true;
> +}
> +
> static int
> box_raft_on_update_f(struct trigger *trigger, void *event)
> {
> @@ -242,6 +304,7 @@ box_raft_init(void)
> static const struct raft_vtab box_raft_vtab = {
> .broadcast = box_raft_broadcast,
> .write = box_raft_write,
> + .schedule_async = box_raft_schedule_async,
> };
> raft_create(&box_raft_global, &box_raft_vtab);
> trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
> @@ -255,7 +318,7 @@ box_raft_free(void)
> * Can't join the fiber, because the event loop is stopped already, and
> * yields are not allowed.
> */
> - box_raft_global.worker = NULL;
> + box_raft_worker = NULL;
> raft_destroy(&box_raft_global);
> /*
> * Invalidate so as box_raft() would fail if any usage attempt happens.
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index a1fca34cd..4457a784f 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -75,6 +75,23 @@ raft_write(struct raft *raft, const struct raft_msg *req)
> raft->vtab->write(raft, req);
> }
>
> +/**
> + * Schedule async work. The Raft node owner should eventually process the async
> + * events.
> + */
> +static inline void
> +raft_schedule_async(struct raft *raft)
> +{
> + /*
> + * The method is called from inside of the state machine, when yields
> + * are not allowed for its simplicity.
> + */
> + int csw = fiber()->csw;
> + raft->vtab->schedule_async(raft);
> + assert(csw == fiber()->csw);
> + (void)csw;
> +}
> +
> /**
> * Check if Raft is completely synced with disk. Meaning all its critical values
> * are in WAL. Only in that state the node can become a leader or a candidate.
> @@ -140,13 +157,6 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
> return cmp == 0 || cmp == 1;
> }
>
> -/**
> - * Wakeup the Raft worker fiber in order to do some async work. If the fiber
> - * does not exist yet, it is created.
> - */
> -static void
> -raft_worker_wakeup(struct raft *raft);
> -
> /** Schedule broadcast of the complete Raft state to all the followers. */
> static void
> raft_schedule_broadcast(struct raft *raft);
> @@ -568,13 +578,11 @@ raft_worker_handle_broadcast(struct raft *raft)
> raft->is_broadcast_scheduled = false;
> }
>
> -static int
> -raft_worker_f(va_list args)
> +void
> +raft_process_async(struct raft *raft)
> {
> - (void)args;
> - struct raft *raft = fiber()->f_arg;
> bool is_idle;
> - while (!fiber_is_cancelled()) {
> + do {
> is_idle = true;
> if (raft->is_write_in_progress) {
> raft_worker_handle_io(raft);
> @@ -584,14 +592,8 @@ raft_worker_f(va_list args)
> raft_worker_handle_broadcast(raft);
> is_idle = false;
> }
> - if (is_idle) {
> - assert(raft_is_fully_on_disk(raft));
> - fiber_yield();
> - } else {
> - fiber_sleep(0);
> - }
> - }
> - return 0;
> + } while (!is_idle);
> + assert(raft_is_fully_on_disk(raft));
> }
>
> static void
> @@ -601,7 +603,7 @@ raft_sm_pause_and_dump(struct raft *raft)
> if (raft->is_write_in_progress)
> return;
> ev_timer_stop(loop(), &raft->timer);
> - raft_worker_wakeup(raft);
> + raft_schedule_async(raft);
> raft->is_write_in_progress = true;
> }
>
> @@ -962,45 +964,11 @@ raft_new_term(struct raft *raft)
> raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
> }
>
> -static void
> -raft_worker_wakeup(struct raft *raft)
> -{
> - if (raft->worker == NULL) {
> - raft->worker = fiber_new("raft_worker", raft_worker_f);
> - if (raft->worker == NULL) {
> - /*
> - * XXX: should be handled properly, no need to panic.
> - * The issue though is that most of the Raft state
> - * machine functions are not supposed to fail, and also
> - * they usually wakeup the fiber when their work is
> - * finished. So it is too late to fail. On the other
> - * hand it looks not so good to create the fiber when
> - * Raft is initialized. Because then it will occupy
> - * memory even if Raft is not used.
> - */
> - diag_log();
> - panic("Could't create Raft worker fiber");
> - return;
> - }
> - raft->worker->f_arg = raft;
> - fiber_set_joinable(raft->worker, true);
> - }
> - /*
> - * Don't wake the fiber if it writes something (not cancellable).
> - * Otherwise it would be a spurious wakeup breaking the WAL write not
> - * adapted to this. Also don't wakeup the current fiber - it leads to
> - * undefined behaviour.
> - */
> - if ((raft->worker->flags & FIBER_IS_CANCELLABLE) != 0 &&
> - fiber() != raft->worker)
> - fiber_wakeup(raft->worker);
> -}
> -
> static void
> raft_schedule_broadcast(struct raft *raft)
> {
> raft->is_broadcast_scheduled = true;
> - raft_worker_wakeup(raft);
> + raft_schedule_async(raft);
> }
>
> void
> @@ -1025,10 +993,4 @@ raft_destroy(struct raft *raft)
> {
> ev_timer_stop(loop(), &raft->timer);
> trigger_destroy(&raft->on_update);
> - if (raft->worker != NULL) {
> - raft_worker_wakeup(raft);
> - fiber_cancel(raft->worker);
> - fiber_join(raft->worker);
> - raft->worker = NULL;
> - }
> }
> diff --git a/src/box/raftlib.h b/src/box/raftlib.h
> index 4f4d24ca8..f545224a5 100644
> --- a/src/box/raftlib.h
> +++ b/src/box/raftlib.h
> @@ -68,7 +68,6 @@ extern "C" {
> * than the configured one. See more details in the code.
> */
>
> -struct fiber;
> struct raft;
>
> enum raft_state {
> @@ -120,6 +119,7 @@ struct raft_msg {
>
> typedef void (*raft_broadcast_f)(struct raft *raft, const struct raft_msg *req);
> typedef void (*raft_write_f)(struct raft *raft, const struct raft_msg *req);
> +typedef void (*raft_schedule_async_f)(struct raft *raft);
>
> /**
> * Raft connection to the environment, via which it talks to other nodes and
> @@ -130,6 +130,11 @@ struct raft_vtab {
> raft_broadcast_f broadcast;
> /** Save a message to disk. */
> raft_write_f write;
> + /**
> + * Schedule asynchronous work which may yield, and it can't be done
> + * right now.
> + */
> + raft_schedule_async_f schedule_async;
> };
>
> struct raft {
> @@ -203,8 +208,6 @@ struct raft {
> const struct vclock *vclock;
> /** State machine timed event trigger. */
> struct ev_timer timer;
> - /** Worker fiber to execute blocking tasks like IO. */
> - struct fiber *worker;
> /** Configured election timeout in seconds. */
> double election_timeout;
> /**
> @@ -255,6 +258,10 @@ int
> raft_process_msg(struct raft *raft, const struct raft_msg *req,
> uint32_t source);
>
> +/** Process all asynchronous events accumulated by Raft. */
> +void
> +raft_process_async(struct raft *raft);
> +
> /**
> * Process a heartbeat message from an instance with the given ID. It is used to
> * watch leader's health and start election when necessary.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 12/16] raft: move synchro queue clear to the worker fiber
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (2 preceding siblings ...)
2020-11-19 23:45 ` [Tarantool-patches] [PATCH v2 11/16] raft: move worker fiber from Raft library to box Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-20 9:07 ` Serge Petrenko
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 13/16] raft: invoke update triggers within state machine Vladislav Shpilevoy
` (14 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
The synchro queue was cleared from the Raft on_update trigger
installed by box. It was fine as long as the trigger is called
from the worker fiber, because it won't block the state machine,
while the synchro queue clearance yields.
But the trigger is going to be called from the Raft state machine
directly soon. Because it will need to call
box_update_ro_summary() right after Raft state is updated, without
a yield to switch to the worker fiber. This will be done in scope
of getting rid of box in the Raft library.
It means, the trigger can't call box_clear_synchro_queue(). But it
can schedule its execution for later, since the worker fiber now
belongs to box. The patch does it.
Part of #5303
---
src/box/raft.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/src/box/raft.c b/src/box/raft.c
index 0027230da..5ccfb3449 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -80,6 +80,19 @@ box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
};
}
+static void
+box_raft_update_synchro_queue(struct raft *raft)
+{
+ assert(raft == box_raft());
+ /*
+ * When the node became a leader, it means it will ignore all records
+ * from all the other nodes, and won't get late CONFIRM messages anyway.
+ * Can clear the queue without waiting for confirmations.
+ */
+ if (raft->state == RAFT_STATE_LEADER)
+ box_clear_synchro_queue(false);
+}
+
static int
box_raft_worker_f(va_list args)
{
@@ -90,6 +103,7 @@ box_raft_worker_f(va_list args)
box_raft_has_work = false;
raft_process_async(raft);
+ box_raft_update_synchro_queue(raft);
if (!box_raft_has_work)
fiber_yield();
@@ -142,11 +156,11 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
if (raft->state != RAFT_STATE_LEADER)
return 0;
/*
- * When the node became a leader, it means it will ignore all records
- * from all the other nodes, and won't get late CONFIRM messages anyway.
- * Can clear the queue without waiting for confirmations.
+ * If the node became a leader, time to clear the synchro queue. But it
+ * must be done in the worker fiber so as not to block the state
+ * machine, which called this trigger.
*/
- box_clear_synchro_queue(false);
+ box_raft_schedule_async(raft);
return 0;
}
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 12/16] raft: move synchro queue clear to the worker fiber
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 12/16] raft: move synchro queue clear to the worker fiber Vladislav Shpilevoy
@ 2020-11-20 9:07 ` Serge Petrenko
0 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 9:07 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:46, Vladislav Shpilevoy пишет:
> The synchro queue was cleared from the Raft on_update trigger
> installed by box. It was fine as long as the trigger is called
> from the worker fiber, because it won't block the state machine,
> while the synchro queue clearance yields.
>
> But the trigger is going to be called from the Raft state machine
> directly soon. Because it will need to call
> box_update_ro_summary() right after Raft state is updated, without
> a yield to switch to the worker fiber. This will be done in scope
> of getting rid of box in the Raft library.
>
> It means, the trigger can't call box_clear_synchro_queue(). But it
> can schedule its execution for later, since the worker fiber now
> belongs to box. The patch does it.
>
> Part of #5303
Thanks for the patch! LGTM.
> ---
> src/box/raft.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 0027230da..5ccfb3449 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -80,6 +80,19 @@ box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
> };
> }
>
> +static void
> +box_raft_update_synchro_queue(struct raft *raft)
> +{
> + assert(raft == box_raft());
> + /*
> + * When the node became a leader, it means it will ignore all records
> + * from all the other nodes, and won't get late CONFIRM messages anyway.
> + * Can clear the queue without waiting for confirmations.
> + */
> + if (raft->state == RAFT_STATE_LEADER)
> + box_clear_synchro_queue(false);
> +}
> +
> static int
> box_raft_worker_f(va_list args)
> {
> @@ -90,6 +103,7 @@ box_raft_worker_f(va_list args)
> box_raft_has_work = false;
>
> raft_process_async(raft);
> + box_raft_update_synchro_queue(raft);
>
> if (!box_raft_has_work)
> fiber_yield();
> @@ -142,11 +156,11 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
> if (raft->state != RAFT_STATE_LEADER)
> return 0;
> /*
> - * When the node became a leader, it means it will ignore all records
> - * from all the other nodes, and won't get late CONFIRM messages anyway.
> - * Can clear the queue without waiting for confirmations.
> + * If the node became a leader, time to clear the synchro queue. But it
> + * must be done in the worker fiber so as not to block the state
> + * machine, which called this trigger.
> */
> - box_clear_synchro_queue(false);
> + box_raft_schedule_async(raft);
> return 0;
> }
>
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 13/16] raft: invoke update triggers within state machine
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (3 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 12/16] raft: move synchro queue clear to the worker fiber Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-20 9:10 ` Serge Petrenko
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 14/16] raft: move RO summary update to box-Raft Vladislav Shpilevoy
` (13 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft used to call on_update trigger from the worker fiber. It was
done because it could yield. But it is not the case anymore. The
only yielding operation was box_clear_synchro_queue(), which is
not called from the trigger now.
That makes possible to call the trigger from within of the state
machine. And this removes the yield between the Raft state change
and the trigger invocation.
What, in turn, allows to move all box-related urgent updates to
the trigger. Such as box_update_ro_summary().
Part of #5303
---
src/box/raftlib.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 4457a784f..f64a66942 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -574,7 +574,6 @@ raft_worker_handle_broadcast(struct raft *raft)
req.vclock = raft->vclock;
}
raft_broadcast(raft, &req);
- trigger_run(&raft->on_update, raft);
raft->is_broadcast_scheduled = false;
}
@@ -967,6 +966,17 @@ raft_new_term(struct raft *raft)
static void
raft_schedule_broadcast(struct raft *raft)
{
+ /*
+ * Broadcast works not only for network, but also for other subsystems
+ * on the same node. The info is delivered to them via update triggers.
+ * But the broadcast happens from inside of the state machine, so it
+ * can't yield.
+ */
+ int csw = fiber()->csw;
+ trigger_run(&raft->on_update, raft);
+ assert(csw == fiber()->csw);
+ (void)csw;
+
raft->is_broadcast_scheduled = true;
raft_schedule_async(raft);
}
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 13/16] raft: invoke update triggers within state machine
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 13/16] raft: invoke update triggers within state machine Vladislav Shpilevoy
@ 2020-11-20 9:10 ` Serge Petrenko
0 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 9:10 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:46, Vladislav Shpilevoy пишет:
> Raft used to call on_update trigger from the worker fiber. It was
> done because it could yield. But it is not the case anymore. The
> only yielding operation was box_clear_synchro_queue(), which is
> not called from the trigger now.
>
> That makes possible to call the trigger from within of the state
> machine. And this removes the yield between the Raft state change
> and the trigger invocation.
>
> What, in turn, allows to move all box-related urgent updates to
> the trigger. Such as box_update_ro_summary().
>
> Part of #5303
LGTM.
> ---
> src/box/raftlib.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index 4457a784f..f64a66942 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -574,7 +574,6 @@ raft_worker_handle_broadcast(struct raft *raft)
> req.vclock = raft->vclock;
> }
> raft_broadcast(raft, &req);
> - trigger_run(&raft->on_update, raft);
> raft->is_broadcast_scheduled = false;
> }
>
> @@ -967,6 +966,17 @@ raft_new_term(struct raft *raft)
> static void
> raft_schedule_broadcast(struct raft *raft)
> {
> + /*
> + * Broadcast works not only for network, but also for other subsystems
> + * on the same node. The info is delivered to them via update triggers.
> + * But the broadcast happens from inside of the state machine, so it
> + * can't yield.
> + */
> + int csw = fiber()->csw;
> + trigger_run(&raft->on_update, raft);
> + assert(csw == fiber()->csw);
> + (void)csw;
> +
> raft->is_broadcast_scheduled = true;
> raft_schedule_async(raft);
> }
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 14/16] raft: move RO summary update to box-Raft
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (4 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 13/16] raft: invoke update triggers within state machine Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-20 9:13 ` Serge Petrenko
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 15/16] raft: introduce RaftError Vladislav Shpilevoy
` (12 subsequent siblings)
18 siblings, 1 reply; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
RO summary update is supposed to make the instance read-only, when
it becomes a follower, and read-write when becomes a leader.
But it makes Raft depend on box, and prevents Raft move to a
separate library.
The patch moves the RO update to box-Raft.
This became possible after some preparatory work was done to make
Raft update triggers non-yielding, and invoked right after state
change (without a yield between the change and the triggers).
Part of #5303
---
src/box/raft.c | 8 ++++++++
src/box/raftlib.c | 8 --------
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/src/box/raft.c b/src/box/raft.c
index 5ccfb3449..c5a27cdec 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -153,6 +153,14 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
(void)trigger;
struct raft *raft = (struct raft *)event;
assert(raft == box_raft());
+ /*
+ * XXX: in case the instance became a leader, RO must be updated only
+ * after clearing the synchro queue.
+ *
+ * When the instance became a follower, then on the contrary - make it
+ * read-only ASAP, this is good.
+ */
+ box_update_ro_summary();
if (raft->state != RAFT_STATE_LEADER)
return 0;
/*
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index f64a66942..6e6bc658f 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -33,7 +33,6 @@
#include "error.h"
#include "fiber.h"
#include "small/region.h"
-#include "box.h"
#include "tt_static.h"
/**
@@ -618,8 +617,6 @@ raft_sm_become_leader(struct raft *raft)
raft->state = RAFT_STATE_LEADER;
raft->leader = raft->self;
ev_timer_stop(loop(), &raft->timer);
- /* Make read-write (if other subsystems allow that. */
- box_update_ro_summary();
/* State is visible and it is changed - broadcast. */
raft_schedule_broadcast(raft);
}
@@ -670,7 +667,6 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
raft->volatile_vote = 0;
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
- box_update_ro_summary();
raft_sm_pause_and_dump(raft);
/*
* State is visible and it is changed - broadcast. Term is also visible,
@@ -701,7 +697,6 @@ raft_sm_schedule_new_election(struct raft *raft)
/* Everyone is a follower until its vote for self is persisted. */
raft_sm_schedule_new_term(raft, raft->term + 1);
raft_sm_schedule_new_vote(raft, raft->self);
- box_update_ro_summary();
}
static void
@@ -786,7 +781,6 @@ raft_sm_start(struct raft *raft)
*/
raft_sm_wait_leader_found(raft);
}
- box_update_ro_summary();
/*
* Nothing changed. But when raft was stopped, its state wasn't sent to
* replicas. At least this was happening at the moment of this being
@@ -808,7 +802,6 @@ raft_sm_stop(struct raft *raft)
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
ev_timer_stop(loop(), &raft->timer);
- box_update_ro_summary();
/* State is visible and changed - broadcast. */
raft_schedule_broadcast(raft);
}
@@ -894,7 +887,6 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
raft_schedule_broadcast(raft);
}
}
- box_update_ro_summary();
}
void
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 14/16] raft: move RO summary update to box-Raft
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 14/16] raft: move RO summary update to box-Raft Vladislav Shpilevoy
@ 2020-11-20 9:13 ` Serge Petrenko
0 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 9:13 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:46, Vladislav Shpilevoy пишет:
> RO summary update is supposed to make the instance read-only, when
> it becomes a follower, and read-write when becomes a leader.
>
> But it makes Raft depend on box, and prevents Raft move to a
> separate library.
>
> The patch moves the RO update to box-Raft.
>
> This became possible after some preparatory work was done to make
> Raft update triggers non-yielding, and invoked right after state
> change (without a yield between the change and the triggers).
>
> Part of #5303
LGTM
> ---
> src/box/raft.c | 8 ++++++++
> src/box/raftlib.c | 8 --------
> 2 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 5ccfb3449..c5a27cdec 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -153,6 +153,14 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
> (void)trigger;
> struct raft *raft = (struct raft *)event;
> assert(raft == box_raft());
> + /*
> + * XXX: in case the instance became a leader, RO must be updated only
> + * after clearing the synchro queue.
> + *
> + * When the instance became a follower, then on the contrary - make it
> + * read-only ASAP, this is good.
> + */
> + box_update_ro_summary();
> if (raft->state != RAFT_STATE_LEADER)
> return 0;
> /*
> diff --git a/src/box/raftlib.c b/src/box/raftlib.c
> index f64a66942..6e6bc658f 100644
> --- a/src/box/raftlib.c
> +++ b/src/box/raftlib.c
> @@ -33,7 +33,6 @@
> #include "error.h"
> #include "fiber.h"
> #include "small/region.h"
> -#include "box.h"
> #include "tt_static.h"
>
> /**
> @@ -618,8 +617,6 @@ raft_sm_become_leader(struct raft *raft)
> raft->state = RAFT_STATE_LEADER;
> raft->leader = raft->self;
> ev_timer_stop(loop(), &raft->timer);
> - /* Make read-write (if other subsystems allow that. */
> - box_update_ro_summary();
> /* State is visible and it is changed - broadcast. */
> raft_schedule_broadcast(raft);
> }
> @@ -670,7 +667,6 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
> raft->volatile_vote = 0;
> raft->leader = 0;
> raft->state = RAFT_STATE_FOLLOWER;
> - box_update_ro_summary();
> raft_sm_pause_and_dump(raft);
> /*
> * State is visible and it is changed - broadcast. Term is also visible,
> @@ -701,7 +697,6 @@ raft_sm_schedule_new_election(struct raft *raft)
> /* Everyone is a follower until its vote for self is persisted. */
> raft_sm_schedule_new_term(raft, raft->term + 1);
> raft_sm_schedule_new_vote(raft, raft->self);
> - box_update_ro_summary();
> }
>
> static void
> @@ -786,7 +781,6 @@ raft_sm_start(struct raft *raft)
> */
> raft_sm_wait_leader_found(raft);
> }
> - box_update_ro_summary();
> /*
> * Nothing changed. But when raft was stopped, its state wasn't sent to
> * replicas. At least this was happening at the moment of this being
> @@ -808,7 +802,6 @@ raft_sm_stop(struct raft *raft)
> raft->leader = 0;
> raft->state = RAFT_STATE_FOLLOWER;
> ev_timer_stop(loop(), &raft->timer);
> - box_update_ro_summary();
> /* State is visible and changed - broadcast. */
> raft_schedule_broadcast(raft);
> }
> @@ -894,7 +887,6 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
> raft_schedule_broadcast(raft);
> }
> }
> - box_update_ro_summary();
> }
>
> void
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 15/16] raft: introduce RaftError
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (5 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 14/16] raft: move RO summary update to box-Raft Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 16/16] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
` (11 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Last piece of src/box used in Raft code was error.h. It was added
to be able to raise ClientErrors. To get rid of it the libraries
usually introduce their own error type available from
src/lib/core. Such as CollationError, SwimError, CryptoError.
This patch adds RaftError and removes the last box dependency from
Raft code.
Part of #5303
---
src/box/raftlib.c | 9 ++++-----
src/lib/core/diag.h | 2 ++
src/lib/core/exception.cc | 24 ++++++++++++++++++++++++
src/lib/core/exception.h | 7 +++++++
4 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 6e6bc658f..909d61a0a 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -30,7 +30,7 @@
*/
#include "raft.h"
-#include "error.h"
+#include "exception.h"
#include "fiber.h"
#include "small/region.h"
#include "tt_static.h"
@@ -315,14 +315,13 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
assert(source > 0);
assert(source != raft->self);
if (req->term == 0 || req->state == 0) {
- diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
- "be zero");
+ diag_set(RaftError, "Raft term and state can't be zero");
return -1;
}
if (req->state == RAFT_STATE_CANDIDATE &&
(req->vote != source || req->vclock == NULL)) {
- diag_set(ClientError, ER_PROTOCOL, "Candidate should always "
- "vote for self and provide its vclock");
+ diag_set(RaftError, "Candidate should always vote for self and "
+ "provide its vclock");
return -1;
}
/* Outdated request. */
diff --git a/src/lib/core/diag.h b/src/lib/core/diag.h
index 6bf0b139a..b07eea838 100644
--- a/src/lib/core/diag.h
+++ b/src/lib/core/diag.h
@@ -335,6 +335,8 @@ struct error *
BuildSwimError(const char *file, unsigned line, const char *format, ...);
struct error *
BuildCryptoError(const char *file, unsigned line, const char *format, ...);
+struct error *
+BuildRaftError(const char *file, unsigned line, const char *format, ...);
struct index_def;
diff --git a/src/lib/core/exception.cc b/src/lib/core/exception.cc
index 180cb0e97..395baff6f 100644
--- a/src/lib/core/exception.cc
+++ b/src/lib/core/exception.cc
@@ -288,6 +288,18 @@ CryptoError::CryptoError(const char *file, unsigned line,
va_end(ap);
}
+const struct type_info type_RaftError =
+ make_type("RaftError", &type_Exception);
+
+RaftError::RaftError(const char *file, unsigned line, const char *format, ...)
+ : Exception(&type_RaftError, file, line)
+{
+ va_list ap;
+ va_start(ap, format);
+ error_vformat_msg(this, format, ap);
+ va_end(ap);
+}
+
#define BuildAlloc(type) \
void *p = malloc(sizeof(type)); \
if (p == NULL) \
@@ -409,6 +421,18 @@ BuildSocketError(const char *file, unsigned line, const char *socketname,
return e;
}
+struct error *
+BuildRaftError(const char *file, unsigned line, const char *format, ...)
+{
+ BuildAlloc(RaftError);
+ RaftError *e = new (p) RaftError(file, line, "");
+ va_list ap;
+ va_start(ap, format);
+ error_vformat_msg(e, format, ap);
+ va_end(ap);
+ return e;
+}
+
void
exception_init()
{
diff --git a/src/lib/core/exception.h b/src/lib/core/exception.h
index 1947b4f00..7277b2784 100644
--- a/src/lib/core/exception.h
+++ b/src/lib/core/exception.h
@@ -52,6 +52,7 @@ extern const struct type_info type_SystemError;
extern const struct type_info type_CollationError;
extern const struct type_info type_SwimError;
extern const struct type_info type_CryptoError;
+extern const struct type_info type_RaftError;
const char *
exception_get_string(struct error *e, const struct method_info *method);
@@ -168,6 +169,12 @@ public:
virtual void raise() { throw this; }
};
+class RaftError: public Exception {
+public:
+ RaftError(const char *file, unsigned line, const char *format, ...);
+ virtual void raise() { throw this; }
+};
+
/**
* Initialize the exception subsystem.
*/
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 16/16] raft: move algorithm code to src/lib/raft
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (6 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 15/16] raft: introduce RaftError Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 02/16] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
` (10 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft algorithm code does not depend on box anymore, and is moved
to src/lib/raft.
This is done to be able to unit test Raft similarly to Swim - with
virtual event loop, network, time, and disk. Using any number of
instances. That will allow to cover all crazy and rare cases
possible in Raft, but without problems of functional tests
stability and clumsiness.
Part of #5303
---
src/box/CMakeLists.txt | 3 +--
src/box/raft.h | 2 +-
src/lib/CMakeLists.txt | 1 +
src/lib/raft/CMakeLists.txt | 7 +++++++
src/{box/raftlib.c => lib/raft/raft.c} | 0
src/{box/raftlib.h => lib/raft/raft.h} | 0
6 files changed, 10 insertions(+), 3 deletions(-)
create mode 100644 src/lib/raft/CMakeLists.txt
rename src/{box/raftlib.c => lib/raft/raft.c} (100%)
rename src/{box/raftlib.h => lib/raft/raft.h} (100%)
diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index fcf779379..a7547c29f 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -169,7 +169,6 @@ add_library(box STATIC
port.c
txn.c
txn_limbo.c
- raftlib.c
raft.c
box.cc
gc.c
@@ -263,6 +262,6 @@ add_custom_command(OUTPUT ${SQL_BIN_DIR}/opcodes.c
${SQL_BIN_DIR}/opcodes.h)
target_link_libraries(box box_error tuple stat xrow xlog vclock crc32 scramble
- ${common_libraries})
+ raft ${common_libraries})
add_dependencies(box build_bundled_libs generate_sql_files)
diff --git a/src/box/raft.h b/src/box/raft.h
index c80faf811..1c59f17e6 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -29,7 +29,7 @@
* THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
-#include "raftlib.h"
+#include "raft/raft.h"
#if defined(__cplusplus)
extern "C" {
diff --git a/src/lib/CMakeLists.txt b/src/lib/CMakeLists.txt
index de1b902c6..cabbe3d89 100644
--- a/src/lib/CMakeLists.txt
+++ b/src/lib/CMakeLists.txt
@@ -14,6 +14,7 @@ add_subdirectory(crypto)
add_subdirectory(swim)
add_subdirectory(mpstream)
add_subdirectory(vclock)
+add_subdirectory(raft)
if(ENABLE_BUNDLED_MSGPUCK)
add_subdirectory(msgpuck EXCLUDE_FROM_ALL)
endif()
diff --git a/src/lib/raft/CMakeLists.txt b/src/lib/raft/CMakeLists.txt
new file mode 100644
index 000000000..aef2bacf7
--- /dev/null
+++ b/src/lib/raft/CMakeLists.txt
@@ -0,0 +1,7 @@
+set(lib_sources
+ raft.c
+)
+
+set_source_files_compile_flags(${lib_sources})
+add_library(raft STATIC ${lib_sources})
+target_link_libraries(raft core)
diff --git a/src/box/raftlib.c b/src/lib/raft/raft.c
similarity index 100%
rename from src/box/raftlib.c
rename to src/lib/raft/raft.c
diff --git a/src/box/raftlib.h b/src/lib/raft/raft.h
similarity index 100%
rename from src/box/raftlib.h
rename to src/lib/raft/raft.h
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 02/16] raft: move box_raft_* to src/box/raft.h and .c
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (7 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 16/16] raft: move algorithm code to src/lib/raft Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 03/16] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
` (9 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
The commit moves Raft functions and objects specific for box to
src/box/raft from src/box/box and src/box/raftlib.
The goal is to gradually eliminate all box dependencies from
src/box/raftlib and move it to src/lib/raft.
It makes the compilation work again after the previous commit
broke it.
Part of #5303
---
src/box/box.cc | 25 --------------
src/box/raft.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++
src/box/raft.h | 59 ++++++++++++++++++++++++++++++++
src/box/raftlib.c | 29 ----------------
src/box/raftlib.h | 19 -----------
5 files changed, 145 insertions(+), 73 deletions(-)
create mode 100644 src/box/raft.c
create mode 100644 src/box/raft.h
diff --git a/src/box/box.cc b/src/box/box.cc
index 1f7dec362..8dd92a5f5 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -152,11 +152,6 @@ static struct fiber_pool tx_fiber_pool;
* are too many messages in flight (gh-1892).
*/
static struct cbus_endpoint tx_prio_endpoint;
-/**
- * A trigger executed each time the Raft state machine updates any
- * of its visible attributes.
- */
-static struct trigger box_raft_on_update;
void
box_update_ro_summary(void)
@@ -1060,23 +1055,6 @@ box_clear_synchro_queue(bool try_wait)
}
}
-static int
-box_raft_on_update_f(struct trigger *trigger, void *event)
-{
- (void)trigger;
- struct raft *raft = (struct raft *)event;
- assert(raft == box_raft());
- if (raft->state != RAFT_STATE_LEADER)
- return 0;
- /*
- * When the node became a leader, it means it will ignore all records
- * from all the other nodes, and won't get late CONFIRM messages anyway.
- * Can clear the queue without waiting for confirmations.
- */
- box_clear_synchro_queue(false);
- return 0;
-}
-
void
box_listen(void)
{
@@ -2658,9 +2636,6 @@ box_init(void)
txn_limbo_init();
sequence_init();
box_raft_init();
-
- trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
- raft_on_update(box_raft(), &box_raft_on_update);
}
bool
diff --git a/src/box/raft.c b/src/box/raft.c
new file mode 100644
index 000000000..f289a6993
--- /dev/null
+++ b/src/box/raft.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the
+ * following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "box.h"
+#include "raft.h"
+
+struct raft box_raft_global = {
+ /*
+ * Set an invalid state to validate in runtime the global raft node is
+ * not used before initialization.
+ */
+ .state = 0,
+};
+
+/**
+ * A trigger executed each time the Raft state machine updates any
+ * of its visible attributes.
+ */
+static struct trigger box_raft_on_update;
+
+static int
+box_raft_on_update_f(struct trigger *trigger, void *event)
+{
+ (void)trigger;
+ struct raft *raft = (struct raft *)event;
+ assert(raft == box_raft());
+ if (raft->state != RAFT_STATE_LEADER)
+ return 0;
+ /*
+ * When the node became a leader, it means it will ignore all records
+ * from all the other nodes, and won't get late CONFIRM messages anyway.
+ * Can clear the queue without waiting for confirmations.
+ */
+ box_clear_synchro_queue(false);
+ return 0;
+}
+
+void
+box_raft_init(void)
+{
+ raft_create(&box_raft_global);
+ trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
+ raft_on_update(box_raft(), &box_raft_on_update);
+}
+
+void
+box_raft_free(void)
+{
+ /*
+ * Can't join the fiber, because the event loop is stopped already, and
+ * yields are not allowed.
+ */
+ box_raft_global.worker = NULL;
+ raft_destroy(&box_raft_global);
+ /*
+ * Invalidate so as box_raft() would fail if any usage attempt happens.
+ */
+ box_raft_global.state = 0;
+}
diff --git a/src/box/raft.h b/src/box/raft.h
new file mode 100644
index 000000000..fe0f073dc
--- /dev/null
+++ b/src/box/raft.h
@@ -0,0 +1,59 @@
+#pragma once
+/*
+ * Copyright 2010-2020, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the
+ * following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "raftlib.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/** Raft state of this instance. */
+static inline struct raft *
+box_raft(void)
+{
+ extern struct raft box_raft_global;
+ /**
+ * Ensure the raft node can be used. I.e. that it is properly
+ * initialized. Entirely for debug purposes.
+ */
+ assert(box_raft_global.state != 0);
+ return &box_raft_global;
+}
+
+void
+box_raft_init(void);
+
+void
+box_raft_free(void);
+
+#if defined(__cplusplus)
+}
+#endif
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ff664a4d1..3867c63e0 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -44,14 +44,6 @@
*/
#define RAFT_RANDOM_ELECTION_FACTOR 0.1
-struct raft box_raft_global = {
- /*
- * Set an invalid state to validate in runtime the global raft node is
- * not used before initialization.
- */
- .state = 0,
-};
-
/**
* When decoding we should never trust that there is
* a valid data incomes.
@@ -1106,24 +1098,3 @@ raft_destroy(struct raft *raft)
raft->worker = NULL;
}
}
-
-void
-box_raft_init(void)
-{
- raft_create(&box_raft_global);
-}
-
-void
-box_raft_free(void)
-{
- /*
- * Can't join the fiber, because the event loop is stopped already, and
- * yields are not allowed.
- */
- box_raft_global.worker = NULL;
- raft_destroy(&box_raft_global);
- /*
- * Invalidate so as box_raft() would fail if any usage attempt happens.
- */
- box_raft_global.state = 0;
-}
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 3088fba23..805f69d64 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -271,25 +271,6 @@ raft_create(struct raft *raft);
void
raft_destroy(struct raft *raft);
-/** Raft state of this instance. */
-static inline struct raft *
-box_raft(void)
-{
- extern struct raft box_raft_global;
- /**
- * Ensure the raft node can be used. I.e. that it is properly
- * initialized. Entirely for debug purposes.
- */
- assert(box_raft_global.state != 0);
- return &box_raft_global;
-}
-
-void
-box_raft_init(void);
-
-void
-box_raft_free(void);
-
#if defined(__cplusplus)
}
#endif
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 03/16] raft: stop using replication_disconnect_timeout()
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (8 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 02/16] raft: move box_raft_* to src/box/raft.h and .c Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 04/16] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
` (8 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/, including global
replication parameters such as replication_timeout, and functions
like replication_disconnect_timeout().
The patch makes raft stop using replication_disconnect_timeout().
Instead, it stores death timeout in struct raft. It is configured
by box simultaneously with replication_timeout.
Part of #5303
---
src/box/box.cc | 2 +-
src/box/raftlib.c | 13 ++++++-------
src/box/raftlib.h | 10 +++++++---
3 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 8dd92a5f5..25673ed42 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -890,7 +890,7 @@ void
box_set_replication_timeout(void)
{
replication_timeout = box_check_replication_timeout();
- raft_cfg_death_timeout(box_raft());
+ raft_cfg_death_timeout(box_raft(), replication_disconnect_timeout());
}
void
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 3867c63e0..c156d6f46 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -804,8 +804,7 @@ raft_sm_wait_leader_dead(struct raft *raft)
assert(raft->is_candidate);
assert(raft->state == RAFT_STATE_FOLLOWER);
assert(raft->leader != 0);
- double death_timeout = replication_disconnect_timeout();
- ev_timer_set(&raft->timer, death_timeout, death_timeout);
+ ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
ev_timer_start(loop(), &raft->timer);
}
@@ -817,8 +816,7 @@ raft_sm_wait_leader_found(struct raft *raft)
assert(raft->is_candidate);
assert(raft->state == RAFT_STATE_FOLLOWER);
assert(raft->leader == 0);
- double death_timeout = replication_disconnect_timeout();
- ev_timer_set(&raft->timer, death_timeout, death_timeout);
+ ev_timer_set(&raft->timer, raft->death_timeout, raft->death_timeout);
ev_timer_start(loop(), &raft->timer);
}
@@ -1012,14 +1010,14 @@ raft_cfg_election_quorum(struct raft *raft)
}
void
-raft_cfg_death_timeout(struct raft *raft)
+raft_cfg_death_timeout(struct raft *raft, double death_timeout)
{
+ raft->death_timeout = death_timeout;
if (raft->state == RAFT_STATE_FOLLOWER && raft->is_candidate &&
raft->leader != 0) {
assert(ev_is_active(&raft->timer));
- double death_timeout = replication_disconnect_timeout();
double timeout = ev_timer_remaining(loop(), &raft->timer) -
- raft->timer.at + death_timeout;
+ raft->timer.at + raft->death_timeout;
ev_timer_stop(loop(), &raft->timer);
ev_timer_set(&raft->timer, timeout, timeout);
ev_timer_start(loop(), &raft->timer);
@@ -1080,6 +1078,7 @@ raft_create(struct raft *raft)
.volatile_term = 1,
.term = 1,
.election_timeout = 5,
+ .death_timeout = 5,
};
ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
raft->timer.data = raft;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 805f69d64..b33a20326 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -156,6 +156,11 @@ struct raft {
struct fiber *worker;
/** Configured election timeout in seconds. */
double election_timeout;
+ /**
+ * Leader death timeout, after which it is considered dead and new
+ * elections can be started.
+ */
+ double death_timeout;
/**
* Trigger invoked each time any of the Raft node visible attributes are
* changed.
@@ -229,11 +234,10 @@ raft_cfg_election_quorum(struct raft *raft);
/**
* Configure Raft leader death timeout. I.e. number of seconds without
- * heartbeats from the leader to consider it dead. There is no a separate
- * option. Raft uses replication timeout for that.
+ * heartbeats from the leader to consider it dead.
*/
void
-raft_cfg_death_timeout(struct raft *raft);
+raft_cfg_death_timeout(struct raft *raft, double death_timeout);
/**
* Bump the term. When it is persisted, the node checks if there is a leader,
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 04/16] raft: stop using replication_synchro_quorum
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (9 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 03/16] raft: stop using replication_disconnect_timeout() Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 05/16] raft: stop using instance_id Vladislav Shpilevoy
` (7 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/, including global
replication parameters such as replication_synchro_quorum.
The patch makes raft stop using replication_synchro_quorum.
Instead, it has a new option 'election_quorum'. Note, that this is
just Raft API. Box API still uses replication_synchro_quorum. But
it is used to calculate the final quorum in src/box/raft, not
in src/box/raftlib. And to pass it to the base Raft
implementation.
Part of #5303
---
src/box/box.cc | 2 +-
src/box/raft.c | 52 +++++++++++++++++++++++++++++++
src/box/raft.h | 8 +++++
src/box/raftlib.c | 70 ++++++++----------------------------------
src/box/raftlib.h | 10 +++---
src/box/replication.cc | 3 ++
6 files changed, 83 insertions(+), 62 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 25673ed42..4652e5c49 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -921,7 +921,7 @@ box_set_replication_synchro_quorum(void)
return -1;
replication_synchro_quorum = value;
txn_limbo_on_parameters_change(&txn_limbo);
- raft_cfg_election_quorum(box_raft());
+ box_raft_update_election_quorum();
return 0;
}
diff --git a/src/box/raft.c b/src/box/raft.c
index f289a6993..dae5a559c 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -30,6 +30,7 @@
*/
#include "box.h"
#include "raft.h"
+#include "replication.h"
struct raft box_raft_global = {
/*
@@ -62,6 +63,57 @@ box_raft_on_update_f(struct trigger *trigger, void *event)
return 0;
}
+void
+box_raft_update_election_quorum(void)
+{
+ /*
+ * When the instance is started first time, it does not have an ID, so
+ * the registered count is 0. But the quorum can never be 0. At least
+ * the current instance should participate in the quorum.
+ */
+ int max = MAX(replicaset.registered_count, 1);
+ /**
+ * Election quorum is not strictly equal to synchronous replication
+ * quorum. Sometimes it can be lowered. That is about bootstrap.
+ *
+ * The problem with bootstrap is that when the replicaset boots, all the
+ * instances can't write to WAL and can't recover from their initial
+ * snapshot. They need one node which will boot first, and then they
+ * will replicate from it.
+ *
+ * This one node should boot from its zero snapshot, create replicaset
+ * UUID, register self with ID 1 in _cluster space, and then register
+ * all the other instances here. To do that the node must be writable.
+ * It should have read_only = false, connection quorum satisfied, and be
+ * a Raft leader if Raft is enabled.
+ *
+ * To be elected a Raft leader it needs to perform election. But it
+ * can't be done before at least synchronous quorum of the replicas is
+ * bootstrapped. And they can't be bootstrapped because wait for a
+ * leader to initialize _cluster. Cyclic dependency.
+ *
+ * This is resolved by truncation of the election quorum to the number
+ * of registered replicas, if their count is less than synchronous
+ * quorum. That helps to elect a first leader.
+ *
+ * It may seem that the first node could just declare itself a leader
+ * and then strictly follow the protocol from now on, but that won't
+ * work, because if the first node will restart after it is booted, but
+ * before quorum of replicas is booted, the cluster will stuck again.
+ *
+ * The current solution is totally safe because
+ *
+ * - after all the cluster will have node count >= quorum, if user used
+ * a correct config (God help him if he didn't);
+ *
+ * - synchronous replication quorum is untouched - it is not truncated.
+ * Only leader election quorum is affected. So synchronous data won't
+ * be lost.
+ */
+ int quorum = MIN(replication_synchro_quorum, max);
+ raft_cfg_election_quorum(box_raft(), quorum);
+}
+
void
box_raft_init(void)
{
diff --git a/src/box/raft.h b/src/box/raft.h
index fe0f073dc..d21c25e01 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -48,6 +48,14 @@ box_raft(void)
return &box_raft_global;
}
+/**
+ * Let the global raft know that the election quorum could change. It happens
+ * when configuration is updated, and when new nodes are added or old are
+ * deleted from the cluster.
+ */
+void
+box_raft_update_election_quorum(void);
+
void
box_raft_init(void);
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index c156d6f46..0657fa85a 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -130,50 +130,6 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
return cmp == 0 || cmp == 1;
}
-/**
- * Election quorum is not strictly equal to synchronous replication quorum.
- * Sometimes it can be lowered. That is about bootstrap.
- *
- * The problem with bootstrap is that when the replicaset boots, all the
- * instances can't write to WAL and can't recover from their initial snapshot.
- * They need one node which will boot first, and then they will replicate from
- * it.
- *
- * This one node should boot from its zero snapshot, create replicaset UUID,
- * register self with ID 1 in _cluster space, and then register all the other
- * instances here. To do that the node must be writable. It should have
- * read_only = false, connection quorum satisfied, and be a Raft leader if Raft
- * is enabled.
- *
- * To be elected a Raft leader it needs to perform election. But it can't be
- * done before at least synchronous quorum of the replicas is bootstrapped. And
- * they can't be bootstrapped because wait for a leader to initialize _cluster.
- * Cyclic dependency.
- *
- * This is resolved by truncation of the election quorum to the number of
- * registered replicas, if their count is less than synchronous quorum. That
- * helps to elect a first leader.
- *
- * It may seem that the first node could just declare itself a leader and then
- * strictly follow the protocol from now on, but that won't work, because if the
- * first node will restart after it is booted, but before quorum of replicas is
- * booted, the cluster will stuck again.
- *
- * The current solution is totally safe because
- *
- * - after all the cluster will have node count >= quorum, if user used a
- * correct config (God help him if he didn't);
- *
- * - synchronous replication quorum is untouched - it is not truncated. Only
- * leader election quorum is affected. So synchronous data won't be lost.
- */
-static inline int
-raft_election_quorum(const struct raft *raft)
-{
- (void)raft;
- return MIN(replication_synchro_quorum, replicaset.registered_count);
-}
-
/**
* Wakeup the Raft worker fiber in order to do some async work. If the fiber
* does not exist yet, it is created.
@@ -427,13 +383,12 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
* and now was answered by some other instance.
*/
assert(raft->volatile_vote == instance_id);
- int quorum = raft_election_quorum(raft);
bool was_set = bit_set(&raft->vote_mask, source);
raft->vote_count += !was_set;
- if (raft->vote_count < quorum) {
+ if (raft->vote_count < raft->election_quorum) {
say_info("RAFT: accepted vote for self, vote "
"count is %d/%d", raft->vote_count,
- quorum);
+ raft->election_quorum);
break;
}
raft_sm_become_leader(raft);
@@ -594,7 +549,7 @@ end_dump:
raft_sm_wait_leader_dead(raft);
} else if (raft->vote == instance_id) {
/* Just wrote own vote. */
- if (raft_election_quorum(raft) == 1)
+ if (raft->election_quorum == 1)
raft_sm_become_leader(raft);
else
raft_sm_become_candidate(raft);
@@ -692,7 +647,7 @@ raft_sm_become_leader(struct raft *raft)
{
assert(raft->state != RAFT_STATE_LEADER);
say_info("RAFT: enter leader state with quorum %d",
- raft_election_quorum(raft));
+ raft->election_quorum);
assert(raft->leader == 0);
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
@@ -730,7 +685,7 @@ raft_sm_become_candidate(struct raft *raft)
assert(raft->vote == instance_id);
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
- assert(raft_election_quorum(raft) > 1);
+ assert(raft->election_quorum > 1);
raft->state = RAFT_STATE_CANDIDATE;
raft->vote_count = 1;
raft->vote_mask = 0;
@@ -999,14 +954,14 @@ raft_cfg_election_timeout(struct raft *raft, double timeout)
}
void
-raft_cfg_election_quorum(struct raft *raft)
+raft_cfg_election_quorum(struct raft *raft, int election_quorum)
{
- if (raft->state != RAFT_STATE_CANDIDATE ||
- raft->state == RAFT_STATE_LEADER)
- return;
- if (raft->vote_count < raft_election_quorum(raft))
- return;
- raft_sm_become_leader(raft);
+ /* At least self is always a part of the quorum. */
+ assert(election_quorum > 0);
+ raft->election_quorum = election_quorum;
+ if (raft->state == RAFT_STATE_CANDIDATE &&
+ raft->vote_count >= raft->election_quorum)
+ raft_sm_become_leader(raft);
}
void
@@ -1077,6 +1032,7 @@ raft_create(struct raft *raft)
.state = RAFT_STATE_FOLLOWER,
.volatile_term = 1,
.term = 1,
+ .election_quorum = 1,
.election_timeout = 5,
.death_timeout = 5,
};
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index b33a20326..c9c13136e 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -150,6 +150,8 @@ struct raft {
vclock_map_t vote_mask;
/** Number of votes for this instance. Valid only in candidate state. */
int vote_count;
+ /** Number of votes necessary for successful election. */
+ int election_quorum;
/** State machine timed event trigger. */
struct ev_timer timer;
/** Worker fiber to execute blocking tasks like IO. */
@@ -225,12 +227,12 @@ void
raft_cfg_election_timeout(struct raft *raft, double timeout);
/**
- * Configure Raft leader election quorum. There is no a separate option.
- * Instead, synchronous replication quorum is used. Since Raft is tightly bound
- * with synchronous replication.
+ * Configure Raft leader election quorum. That may trigger immediate election,
+ * if the quorum is lowered, and this instance is a candidate having enough
+ * votes for the new quorum.
*/
void
-raft_cfg_election_quorum(struct raft *raft);
+raft_cfg_election_quorum(struct raft *raft, int election_quorum);
/**
* Configure Raft leader death timeout. I.e. number of seconds without
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 65512cf0f..931c73a37 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -39,6 +39,7 @@
#include "box.h"
#include "gc.h"
#include "error.h"
+#include "raft.h"
#include "relay.h"
#include "sio.h"
@@ -250,6 +251,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
say_info("assigned id %d to replica %s",
replica->id, tt_uuid_str(&replica->uuid));
replica->anon = false;
+ box_raft_update_election_quorum();
}
void
@@ -298,6 +300,7 @@ replica_clear_id(struct replica *replica)
assert(!replica->anon);
replica_delete(replica);
}
+ box_raft_update_election_quorum();
}
void
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 05/16] raft: stop using instance_id
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (10 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 04/16] raft: stop using replication_synchro_quorum Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 06/16] raft: make raft_request.vclock constant Vladislav Shpilevoy
` (6 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.
The patch makes raft stop using instance_id.
Instead, it has a new option 'instance_id'. It is stored inside
struct raft as 'self', and should be configured using
raft_cfg_instance_id().
The configuration is done when bootstrap ends and the instance_id
is either recovered successfully, or the instance is anonymous.
While working on this, I also considered introducing a new
function raft_boot() instead of raft_cfg_instance_id(). Which I
would also use to configure vclock later. Raft_boot() would be
meant to be called only one time with non-dynamic parameters
instance_id and vclock.
But then I decided to keep adding new raft_cfg_*() functions.
Because:
- It is more consistent with the existing options;
- Does not require to think about too many different functions
like raft_create(), raft_boot(), raft_cfg_*() and in which order
to call them;
Also I was thinking to introduce a single raft_cfg() like I did
in swim with swim_cfg(), to reduce number of raft_cfg_*()
functions, but decided it would be even worse with so many
options.
Part of #5303
---
src/box/box.cc | 10 +++++-----
src/box/raftlib.c | 32 ++++++++++++++++++++------------
src/box/raftlib.h | 9 +++++++++
3 files changed, 34 insertions(+), 17 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 4652e5c49..e8e232126 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2763,12 +2763,12 @@ box_cfg_xc(void)
* Fill in leader election parameters after bootstrap. Before it is not
* possible - there may be relevant data to recover from WAL and
* snapshot. Also until recovery is done, it is not possible to write
- * new records into WAL. It is also totally safe, because relaying is
- * not started until the box is configured. So it can't happen, that
- * this election-enabled node will try to relay to another
- * election-enabled node without election actually enabled leading to
- * disconnect.
+ * new records into WAL. Another reason - before recovery is done,
+ * instance_id is not known, so Raft simply can't work.
*/
+ if (!replication_anon)
+ raft_cfg_instance_id(box_raft(), instance_id);
+
if (box_set_election_timeout() != 0)
diag_raise();
/*
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 0657fa85a..ca1940ba6 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -296,7 +296,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
say_info("RAFT: message %s from %u", raft_request_to_string(req),
source);
assert(source > 0);
- assert(source != instance_id);
+ assert(source != raft->self);
if (req->term == 0 || req->state == 0) {
diag_set(ClientError, ER_PROTOCOL, "Raft term and state can't "
"be zero");
@@ -337,7 +337,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
raft->leader);
break;
}
- if (req->vote == instance_id) {
+ if (req->vote == raft->self) {
/*
* This is entirely valid. This instance could
* request a vote, then become a follower or
@@ -373,7 +373,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
break;
case RAFT_STATE_CANDIDATE:
/* Check if this is a vote for a competing candidate. */
- if (req->vote != instance_id) {
+ if (req->vote != raft->self) {
say_info("RAFT: vote request is skipped - "
"competing candidate");
break;
@@ -382,7 +382,7 @@ raft_process_msg(struct raft *raft, const struct raft_request *req,
* Vote for self was requested earlier in this round,
* and now was answered by some other instance.
*/
- assert(raft->volatile_vote == instance_id);
+ assert(raft->volatile_vote == raft->self);
bool was_set = bit_set(&raft->vote_mask, source);
raft->vote_count += !was_set;
if (raft->vote_count < raft->election_quorum) {
@@ -547,7 +547,7 @@ end_dump:
} else if (raft->leader != 0) {
/* There is a known leader. Wait until it is dead. */
raft_sm_wait_leader_dead(raft);
- } else if (raft->vote == instance_id) {
+ } else if (raft->vote == raft->self) {
/* Just wrote own vote. */
if (raft->election_quorum == 1)
raft_sm_become_leader(raft);
@@ -561,7 +561,7 @@ end_dump:
raft_sm_wait_election_end(raft);
} else {
/* No leaders, no votes. */
- raft_sm_schedule_new_vote(raft, instance_id);
+ raft_sm_schedule_new_vote(raft, raft->self);
}
} else {
memset(&req, 0, sizeof(req));
@@ -596,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
req.vote = raft->vote;
req.state = raft->state;
if (req.state == RAFT_STATE_CANDIDATE) {
- assert(raft->vote == instance_id);
+ assert(raft->vote == raft->self);
req.vclock = &replicaset.vclock;
}
replicaset_foreach(replica)
@@ -652,7 +652,7 @@ raft_sm_become_leader(struct raft *raft)
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
raft->state = RAFT_STATE_LEADER;
- raft->leader = instance_id;
+ raft->leader = raft->self;
ev_timer_stop(loop(), &raft->timer);
/* Make read-write (if other subsystems allow that. */
box_update_ro_summary();
@@ -682,14 +682,14 @@ raft_sm_become_candidate(struct raft *raft)
say_info("RAFT: enter candidate state with 1 self vote");
assert(raft->state == RAFT_STATE_FOLLOWER);
assert(raft->leader == 0);
- assert(raft->vote == instance_id);
+ assert(raft->vote == raft->self);
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
assert(raft->election_quorum > 1);
raft->state = RAFT_STATE_CANDIDATE;
raft->vote_count = 1;
raft->vote_mask = 0;
- bit_set(&raft->vote_mask, instance_id);
+ bit_set(&raft->vote_mask, raft->self);
raft_sm_wait_election_end(raft);
/* State is visible and it is changed - broadcast. */
raft_schedule_broadcast(raft);
@@ -736,7 +736,7 @@ raft_sm_schedule_new_election(struct raft *raft)
assert(raft->is_candidate);
/* Everyone is a follower until its vote for self is persisted. */
raft_sm_schedule_new_term(raft, raft->term + 1);
- raft_sm_schedule_new_vote(raft, instance_id);
+ raft_sm_schedule_new_vote(raft, raft->self);
box_update_ro_summary();
}
@@ -783,7 +783,7 @@ raft_sm_wait_election_end(struct raft *raft)
assert(raft->is_candidate);
assert(raft->state == RAFT_STATE_FOLLOWER ||
(raft->state == RAFT_STATE_CANDIDATE &&
- raft->volatile_vote == instance_id));
+ raft->volatile_vote == raft->self));
assert(raft->leader == 0);
double election_timeout = raft->election_timeout +
raft_new_random_election_shift(raft);
@@ -979,6 +979,14 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout)
}
}
+void
+raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
+{
+ assert(raft->self == 0);
+ assert(instance_id != 0);
+ raft->self = instance_id;
+}
+
void
raft_new_term(struct raft *raft)
{
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index c9c13136e..f75ed2567 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -95,6 +95,8 @@ const char *
raft_state_str(uint32_t state);
struct raft {
+ /** Instance ID of this node. */
+ uint32_t self;
/** Instance ID of leader of the current term. */
uint32_t leader;
/** State of the instance. */
@@ -241,6 +243,13 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum);
void
raft_cfg_death_timeout(struct raft *raft, double death_timeout);
+/**
+ * Configure ID of the given Raft instance. The ID can't be changed after it is
+ * assigned first time.
+ */
+void
+raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
+
/**
* Bump the term. When it is persisted, the node checks if there is a leader,
* and if there is not, a new election is started. That said, this function can
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 06/16] raft: make raft_request.vclock constant
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (11 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 05/16] raft: stop using instance_id Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 07/16] raft: stop using replicaset.vclock Vladislav Shpilevoy
` (5 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is never supposed to change vclock. Not the stored one, nor
the received ones. The patch makes it checked during compilation.
The patch is mostly motivated by a next patch making Raft use an
externally configured vclock which can't be changed. Since Raft
uses raft_request to carry the vclock in a few places, the
request's vclock also must become const.
Part of #5303
---
src/box/box.cc | 2 +-
src/box/raftlib.c | 9 +++------
src/box/raftlib.h | 3 +--
src/box/xrow.c | 2 +-
src/box/xrow.h | 2 +-
5 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index e8e232126..043a37658 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2142,7 +2142,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
* should be 0.
*/
struct raft_request req;
- raft_serialize_for_network(box_raft(), &req, &vclock);
+ raft_serialize_for_network(box_raft(), &req);
xrow_encode_raft(&row, &fiber()->gc, &req);
coio_write_xrow(io, &row);
}
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ca1940ba6..78164bf91 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -850,8 +850,7 @@ raft_sm_stop(struct raft *raft)
}
void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
- struct vclock *vclock)
+raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
{
memset(req, 0, sizeof(*req));
/*
@@ -865,10 +864,8 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
* Raft does not own vclock, so it always expects it passed externally.
* Vclock is sent out only by candidate instances.
*/
- if (req->state == RAFT_STATE_CANDIDATE) {
- req->vclock = vclock;
- vclock_copy(vclock, &replicaset.vclock);
- }
+ if (req->state == RAFT_STATE_CANDIDATE)
+ req->vclock = &replicaset.vclock;
}
void
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index f75ed2567..2da3cec86 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -263,8 +263,7 @@ raft_new_term(struct raft *raft);
* cluster. It is allowed to save anything here, not only persistent state.
*/
void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req,
- struct vclock *vclock);
+raft_serialize_for_network(const struct raft *raft, struct raft_request *req);
/**
* Save complete Raft state into a request to be persisted on disk. Only term
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 165a00a16..bc06738ad 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -1057,7 +1057,7 @@ xrow_decode_raft(const struct xrow_header *row, struct raft_request *r,
r->vclock = vclock;
if (r->vclock == NULL)
mp_next(&pos);
- else if (mp_decode_vclock_ignore0(&pos, r->vclock) != 0)
+ else if (mp_decode_vclock_ignore0(&pos, vclock) != 0)
goto bad_msgpack;
break;
default:
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 095911239..3d68c1268 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -268,7 +268,7 @@ struct raft_request {
uint64_t term;
uint32_t vote;
uint32_t state;
- struct vclock *vclock;
+ const struct vclock *vclock;
};
int
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 07/16] raft: stop using replicaset.vclock
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (12 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 06/16] raft: make raft_request.vclock constant Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 08/16] raft: introduce vtab for disk and network Vladislav Shpilevoy
` (4 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.
The patch makes raft stop using replicaset.vclock.
Instead, it has a new option 'vclock'. It is stored inside struct
raft by pointer and should be configured using raft_cfg_vclock().
Box configures it to point at replicaset.vclock like before. But
now raftlib code does not depend on it explicitly.
Vclock is stored in Raft by pointer instead of by value so as not
to update it for each transaction. It would be too high price to
pay for Raft independence from box.
Part of #5303
---
src/box/box.cc | 1 +
src/box/raftlib.c | 15 +++++++++++----
src/box/raftlib.h | 16 ++++++++++++++++
3 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 043a37658..837fbd2e5 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2768,6 +2768,7 @@ box_cfg_xc(void)
*/
if (!replication_anon)
raft_cfg_instance_id(box_raft(), instance_id);
+ raft_cfg_vclock(box_raft(), &replicaset.vclock);
if (box_set_election_timeout() != 0)
diag_raise();
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 78164bf91..ab2e27fd8 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -125,8 +125,7 @@ raft_new_random_election_shift(const struct raft *raft)
static inline bool
raft_can_vote_for(const struct raft *raft, const struct vclock *v)
{
- (void)raft;
- int cmp = vclock_compare_ignore0(v, &replicaset.vclock);
+ int cmp = vclock_compare_ignore0(v, raft->vclock);
return cmp == 0 || cmp == 1;
}
@@ -597,7 +596,7 @@ raft_worker_handle_broadcast(struct raft *raft)
req.state = raft->state;
if (req.state == RAFT_STATE_CANDIDATE) {
assert(raft->vote == raft->self);
- req.vclock = &replicaset.vclock;
+ req.vclock = raft->vclock;
}
replicaset_foreach(replica)
relay_push_raft(replica->relay, &req);
@@ -865,7 +864,7 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
* Vclock is sent out only by candidate instances.
*/
if (req->state == RAFT_STATE_CANDIDATE)
- req->vclock = &replicaset.vclock;
+ req->vclock = raft->vclock;
}
void
@@ -984,6 +983,14 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
raft->self = instance_id;
}
+void
+raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
+{
+ assert(raft->vclock == NULL);
+ assert(vclock != NULL);
+ raft->vclock = vclock;
+}
+
void
raft_new_term(struct raft *raft)
{
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 2da3cec86..8d0d03da0 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -154,6 +154,15 @@ struct raft {
int vote_count;
/** Number of votes necessary for successful election. */
int election_quorum;
+ /**
+ * Vclock of the Raft node owner. Raft never changes it, only watches,
+ * and makes decisions based on it. The value is not stored by copy so
+ * as to avoid frequent updates. If every transaction would need to
+ * update several vclocks in different places, it would be too
+ * expensive. So they update only one vclock, which is shared between
+ * subsystems, such as Raft.
+ */
+ const struct vclock *vclock;
/** State machine timed event trigger. */
struct ev_timer timer;
/** Worker fiber to execute blocking tasks like IO. */
@@ -250,6 +259,13 @@ raft_cfg_death_timeout(struct raft *raft, double death_timeout);
void
raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
+/**
+ * Configure vclock of the given Raft instance. The vclock is not copied, so the
+ * caller must keep it valid.
+ */
+void
+raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
+
/**
* Bump the term. When it is persisted, the node checks if there is a leader,
* and if there is not, a new election is started. That said, this function can
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 08/16] raft: introduce vtab for disk and network
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (13 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 07/16] raft: stop using replicaset.vclock Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 09/16] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
` (3 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft is being moved to a separate library in src/lib. It means,
it can't depend on anything from box/.
The patch makes raft stop using replicaset and journal objects.
They were used to broadcast messages to all the other nodes, and
to persist updates.
Now Raft does the same through vtab, which is configured by box.
Broadcast still sends messages via relays, and disk write still
uses the journal. But Raft does not depend on any specific journal
or network API.
Part of #5303
---
src/box/raft.c | 63 ++++++++++++++++++++++++++++++++-
src/box/raftlib.c | 89 +++++++++++++++--------------------------------
src/box/raftlib.h | 24 ++++++++++++-
3 files changed, 114 insertions(+), 62 deletions(-)
diff --git a/src/box/raft.c b/src/box/raft.c
index dae5a559c..5efff80a0 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -29,7 +29,10 @@
* SUCH DAMAGE.
*/
#include "box.h"
+#include "error.h"
+#include "journal.h"
#include "raft.h"
+#include "relay.h"
#include "replication.h"
struct raft box_raft_global = {
@@ -114,10 +117,68 @@ box_raft_update_election_quorum(void)
raft_cfg_election_quorum(box_raft(), quorum);
}
+static void
+box_raft_broadcast(struct raft *raft, const struct raft_request *req)
+{
+ (void)raft;
+ assert(raft == box_raft());
+ replicaset_foreach(replica)
+ relay_push_raft(replica->relay, req);
+}
+
+/** Wakeup Raft state writer fiber waiting for WAL write end. */
+static void
+box_raft_write_cb(struct journal_entry *entry)
+{
+ fiber_wakeup(entry->complete_data);
+}
+
+static void
+box_raft_write(struct raft *raft, const struct raft_request *req)
+{
+ (void)raft;
+ assert(raft == box_raft());
+ /* See Raft implementation why these fields are never written. */
+ assert(req->vclock == NULL);
+ assert(req->state == 0);
+
+ struct region *region = &fiber()->gc;
+ uint32_t svp = region_used(region);
+ struct xrow_header row;
+ char buf[sizeof(struct journal_entry) +
+ sizeof(struct xrow_header *)];
+ struct journal_entry *entry = (struct journal_entry *)buf;
+ entry->rows[0] = &row;
+
+ if (xrow_encode_raft(&row, region, req) != 0)
+ goto fail;
+ journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
+ fiber());
+
+ if (journal_write(entry) != 0 || entry->res < 0) {
+ diag_set(ClientError, ER_WAL_IO);
+ diag_log();
+ goto fail;
+ }
+
+ region_truncate(region, svp);
+ return;
+fail:
+ /*
+ * XXX: the stub is supposed to be removed once it is defined what to do
+ * when a raft request WAL write fails.
+ */
+ panic("Could not write a raft request to WAL\n");
+}
+
void
box_raft_init(void)
{
- raft_create(&box_raft_global);
+ static const struct raft_vtab box_raft_vtab = {
+ .broadcast = box_raft_broadcast,
+ .write = box_raft_write,
+ };
+ raft_create(&box_raft_global, &box_raft_vtab);
trigger_create(&box_raft_on_update, box_raft_on_update_f, NULL, NULL);
raft_on_update(box_raft(), &box_raft_on_update);
}
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index ab2e27fd8..2f5e90f21 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -31,11 +31,9 @@
#include "raft.h"
#include "error.h"
-#include "journal.h"
+#include "fiber.h"
#include "xrow.h"
#include "small/region.h"
-#include "replication.h"
-#include "relay.h"
#include "box.h"
#include "tt_static.h"
@@ -64,6 +62,20 @@ raft_state_str(uint32_t state)
return "invalid (x)";
};
+/** Shortcut for vtab 'broadcast' method. */
+static inline void
+raft_broadcast(struct raft *raft, const struct raft_request *req)
+{
+ raft->vtab->broadcast(raft, req);
+}
+
+/** Shortcut for vtab 'write' method. */
+static inline void
+raft_write(struct raft *raft, const struct raft_request *req)
+{
+ raft->vtab->write(raft, req);
+}
+
/**
* Check if Raft is completely synced with disk. Meaning all its critical values
* are in WAL. Only in that state the node can become a leader or a candidate.
@@ -469,58 +481,6 @@ raft_process_heartbeat(struct raft *raft, uint32_t source)
raft_sm_wait_leader_dead(raft);
}
-/** Wakeup Raft state writer fiber waiting for WAL write end. */
-static void
-raft_write_cb(struct journal_entry *entry)
-{
- fiber_wakeup(entry->complete_data);
-}
-
-/** Synchronously write a Raft request into WAL. */
-static void
-raft_write_request(const struct raft_request *req)
-{
- /*
- * Vclock is never persisted by Raft. It is used only to
- * be sent to network when vote for self.
- */
- assert(req->vclock == NULL);
- /*
- * State is not persisted. That would be strictly against Raft protocol.
- * The reason is that it does not make much sense - even if the node is
- * a leader now, after the node is restarted, there will be another
- * leader elected by that time likely.
- */
- assert(req->state == 0);
- struct region *region = &fiber()->gc;
- uint32_t svp = region_used(region);
- struct xrow_header row;
- char buf[sizeof(struct journal_entry) +
- sizeof(struct xrow_header *)];
- struct journal_entry *entry = (struct journal_entry *)buf;
- entry->rows[0] = &row;
-
- if (xrow_encode_raft(&row, region, req) != 0)
- goto fail;
- journal_entry_create(entry, 1, xrow_approx_len(&row), raft_write_cb,
- fiber());
-
- if (journal_write(entry) != 0 || entry->res < 0) {
- diag_set(ClientError, ER_WAL_IO);
- diag_log();
- goto fail;
- }
-
- region_truncate(region, svp);
- return;
-fail:
- /*
- * XXX: the stub is supposed to be removed once it is defined what to do
- * when a raft request WAL write fails.
- */
- panic("Could not write a raft request to WAL\n");
-}
-
/* Dump Raft state to WAL in a blocking way. */
static void
raft_worker_handle_io(struct raft *raft)
@@ -567,8 +527,17 @@ end_dump:
assert(raft->volatile_term >= raft->term);
req.term = raft->volatile_term;
req.vote = raft->volatile_vote;
-
- raft_write_request(&req);
+ /*
+ * Skip vclock. It is used only to be sent to network when vote
+ * for self. It is a job of the vclock owner to persist it
+ * anyhow.
+ *
+ * Skip state. That would be strictly against Raft protocol. The
+ * reason is that it does not make much sense - even if the node
+ * is a leader now, after the node is restarted, there will be
+ * another leader elected by that time likely.
+ */
+ raft_write(raft, &req);
say_info("RAFT: persisted state %s",
raft_request_to_string(&req));
@@ -598,8 +567,7 @@ raft_worker_handle_broadcast(struct raft *raft)
assert(raft->vote == raft->self);
req.vclock = raft->vclock;
}
- replicaset_foreach(replica)
- relay_push_raft(replica->relay, &req);
+ raft_broadcast(raft, &req);
trigger_run(&raft->on_update, raft);
raft->is_broadcast_scheduled = false;
}
@@ -1038,7 +1006,7 @@ raft_schedule_broadcast(struct raft *raft)
}
void
-raft_create(struct raft *raft)
+raft_create(struct raft *raft, const struct raft_vtab *vtab)
{
*raft = (struct raft) {
.state = RAFT_STATE_FOLLOWER,
@@ -1047,6 +1015,7 @@ raft_create(struct raft *raft)
.election_quorum = 1,
.election_timeout = 5,
.death_timeout = 5,
+ .vtab = vtab,
};
ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, 0, 0);
raft->timer.data = raft;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 8d0d03da0..6181d9d49 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -69,6 +69,7 @@ extern "C" {
*/
struct fiber;
+struct raft;
struct raft_request;
enum raft_state {
@@ -94,6 +95,21 @@ enum raft_state {
const char *
raft_state_str(uint32_t state);
+typedef void (*raft_broadcast_f)(struct raft *raft,
+ const struct raft_request *req);
+typedef void (*raft_write_f)(struct raft *raft, const struct raft_request *req);
+
+/**
+ * Raft connection to the environment, via which it talks to other nodes and
+ * saves something to disk.
+ */
+struct raft_vtab {
+ /** Send a message to all nodes in the cluster. */
+ raft_broadcast_f broadcast;
+ /** Save a message to disk. */
+ raft_write_f write;
+};
+
struct raft {
/** Instance ID of this node. */
uint32_t self;
@@ -174,6 +190,8 @@ struct raft {
* elections can be started.
*/
double death_timeout;
+ /** Virtual table to perform application-specific actions. */
+ const struct raft_vtab *vtab;
/**
* Trigger invoked each time any of the Raft node visible attributes are
* changed.
@@ -295,8 +313,12 @@ raft_serialize_for_disk(const struct raft *raft, struct raft_request *req);
void
raft_on_update(struct raft *raft, struct trigger *trigger);
+/**
+ * Create a Raft node. The vtab is not copied. Its memory should stay valid even
+ * after the creation.
+ */
void
-raft_create(struct raft *raft);
+raft_create(struct raft *raft, const struct raft_vtab *vtab);
void
raft_destroy(struct raft *raft);
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Tarantool-patches] [PATCH v2 09/16] raft: introduce raft_msg, drop xrow dependency
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (14 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 08/16] raft: introduce vtab for disk and network Vladislav Shpilevoy
@ 2020-11-19 23:46 ` Vladislav Shpilevoy
2020-11-20 9:14 ` [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Serge Petrenko
` (2 subsequent siblings)
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-19 23:46 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft used to depend on xrow, because it used raft_request as a
communication and persistence unit. Xrow is a part of src/box
library set, so it blocked Raft extraction into src/lib/raft.
This patch makes Raft not depend on xrow. For that Raft introduces
a new communication and persistence unit - struct raft_msg.
Interestingly, throughout its source code Raft already uses term
'message' to describe requests, so this patch also restores the
consistency. This is because raft_request name was used to be
consistent with other *_request structs in xrow.h. Now Raft does
not depend on this, and can use its own name.
Struct raft_msg repeats raft_request literally, but it actually
makes sense. Because when Raft is extracted to a new library, it
may start evolving independently. Its raft_msg may be populated
with new members, or their behaviour may change depending on how
the algorithm will evolve.
But inside box it will be possible to tweak and extend raft_msg
whenever it is necessary, via struct raft_request, and without
changing the basic library.
For instance, in future we may want to make nodes forward the
messages to each other during voting to speed the process up, and
for that we may want to add an explicit 'source' field to
raft_request, while it won't be necessary on the level of
raft_msg.
There is a new compatibility layer in src/box/raft.h which hides
raft_msg details from other box code, and does the msg <-> request
conversions.
Part of #5303
---
src/box/applier.cc | 2 +-
src/box/box.cc | 4 +--
src/box/memtx_engine.c | 4 +--
src/box/raft.c | 70 ++++++++++++++++++++++++++++++++++++++----
src/box/raft.h | 24 +++++++++++++++
src/box/raftlib.c | 28 ++++++++---------
src/box/raftlib.h | 38 ++++++++++++++++++-----
src/box/xrow.h | 4 +++
8 files changed, 139 insertions(+), 35 deletions(-)
diff --git a/src/box/applier.cc b/src/box/applier.cc
index fbde0eccd..fb2f5d130 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -893,7 +893,7 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
struct vclock candidate_clock;
if (xrow_decode_raft(row, &req, &candidate_clock) != 0)
return -1;
- return raft_process_msg(box_raft(), &req, applier->instance_id);
+ return box_raft_process(&req, applier->instance_id);
}
/**
diff --git a/src/box/box.cc b/src/box/box.cc
index 837fbd2e5..1a4f8d377 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -394,7 +394,7 @@ apply_wal_row(struct xstream *stream, struct xrow_header *row)
/* Vclock is never persisted in WAL by Raft. */
if (xrow_decode_raft(row, &raft_req, NULL) != 0)
diag_raise();
- raft_process_recovery(box_raft(), &raft_req);
+ box_raft_recover(&raft_req);
return;
}
xrow_decode_dml_xc(row, &request, dml_request_key_map(row->type));
@@ -2142,7 +2142,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
* should be 0.
*/
struct raft_request req;
- raft_serialize_for_network(box_raft(), &req);
+ box_raft_checkpoint_remote(&req);
xrow_encode_raft(&row, &fiber()->gc, &req);
coio_write_xrow(io, &row);
}
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index 39d3ffa15..db2bb2333 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -210,7 +210,7 @@ memtx_engine_recover_raft(const struct xrow_header *row)
/* Vclock is never persisted in WAL by Raft. */
if (xrow_decode_raft(row, &req, NULL) != 0)
return -1;
- raft_process_recovery(box_raft(), &req);
+ box_raft_recover(&req);
return 0;
}
@@ -554,7 +554,7 @@ checkpoint_new(const char *snap_dirname, uint64_t snap_io_rate_limit)
opts.free_cache = true;
xdir_create(&ckpt->dir, snap_dirname, SNAP, &INSTANCE_UUID, &opts);
vclock_create(&ckpt->vclock);
- raft_serialize_for_disk(box_raft(), &ckpt->raft);
+ box_raft_checkpoint_local(&ckpt->raft);
ckpt->touch = false;
return ckpt;
}
diff --git a/src/box/raft.c b/src/box/raft.c
index 5efff80a0..2c9ed11b6 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -49,6 +49,28 @@ struct raft box_raft_global = {
*/
static struct trigger box_raft_on_update;
+static void
+box_raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
+{
+ *req = (struct raft_request) {
+ .term = msg->term,
+ .vote = msg->vote,
+ .state = msg->state,
+ .vclock = msg->vclock,
+ };
+}
+
+static void
+box_raft_request_to_msg(const struct raft_request *req, struct raft_msg *msg)
+{
+ *msg = (struct raft_msg) {
+ .term = req->term,
+ .vote = req->vote,
+ .state = req->state,
+ .vclock = req->vclock,
+ };
+}
+
static int
box_raft_on_update_f(struct trigger *trigger, void *event)
{
@@ -117,13 +139,47 @@ box_raft_update_election_quorum(void)
raft_cfg_election_quorum(box_raft(), quorum);
}
+void
+box_raft_recover(const struct raft_request *req)
+{
+ struct raft_msg msg;
+ box_raft_request_to_msg(req, &msg);
+ raft_process_recovery(box_raft(), &msg);
+}
+
+void
+box_raft_checkpoint_local(struct raft_request *req)
+{
+ struct raft_msg msg;
+ raft_checkpoint_local(box_raft(), &msg);
+ box_raft_msg_to_request(&msg, req);
+}
+
+void
+box_raft_checkpoint_remote(struct raft_request *req)
+{
+ struct raft_msg msg;
+ raft_checkpoint_remote(box_raft(), &msg);
+ box_raft_msg_to_request(&msg, req);
+}
+
+int
+box_raft_process(struct raft_request *req, uint32_t source)
+{
+ struct raft_msg msg;
+ box_raft_request_to_msg(req, &msg);
+ return raft_process_msg(box_raft(), &msg, source);
+}
+
static void
-box_raft_broadcast(struct raft *raft, const struct raft_request *req)
+box_raft_broadcast(struct raft *raft, const struct raft_msg *msg)
{
(void)raft;
assert(raft == box_raft());
+ struct raft_request req;
+ box_raft_msg_to_request(msg, &req);
replicaset_foreach(replica)
- relay_push_raft(replica->relay, req);
+ relay_push_raft(replica->relay, &req);
}
/** Wakeup Raft state writer fiber waiting for WAL write end. */
@@ -134,14 +190,16 @@ box_raft_write_cb(struct journal_entry *entry)
}
static void
-box_raft_write(struct raft *raft, const struct raft_request *req)
+box_raft_write(struct raft *raft, const struct raft_msg *msg)
{
(void)raft;
assert(raft == box_raft());
/* See Raft implementation why these fields are never written. */
- assert(req->vclock == NULL);
- assert(req->state == 0);
+ assert(msg->vclock == NULL);
+ assert(msg->state == 0);
+ struct raft_request req;
+ box_raft_msg_to_request(msg, &req);
struct region *region = &fiber()->gc;
uint32_t svp = region_used(region);
struct xrow_header row;
@@ -150,7 +208,7 @@ box_raft_write(struct raft *raft, const struct raft_request *req)
struct journal_entry *entry = (struct journal_entry *)buf;
entry->rows[0] = &row;
- if (xrow_encode_raft(&row, region, req) != 0)
+ if (xrow_encode_raft(&row, region, &req) != 0)
goto fail;
journal_entry_create(entry, 1, xrow_approx_len(&row), box_raft_write_cb,
fiber());
diff --git a/src/box/raft.h b/src/box/raft.h
index d21c25e01..c80faf811 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,6 +35,8 @@
extern "C" {
#endif
+struct raft_request;
+
/** Raft state of this instance. */
static inline struct raft *
box_raft(void)
@@ -56,6 +58,28 @@ box_raft(void)
void
box_raft_update_election_quorum(void);
+/**
+ * Recover a single Raft request. Raft state machine is not turned on yet, this
+ * works only during instance recovery from the journal.
+ */
+void
+box_raft_recover(const struct raft_request *req);
+
+/** Save complete Raft state into a request to be persisted on disk locally. */
+void
+box_raft_checkpoint_local(struct raft_request *req);
+
+/**
+ * Save complete Raft state into a request to be sent to other instances of the
+ * cluster.
+ */
+void
+box_raft_checkpoint_remote(struct raft_request *req);
+
+/** Handle a single Raft request from a node with instance id @a source. */
+int
+box_raft_process(struct raft_request *req, uint32_t source);
+
void
box_raft_init(void);
diff --git a/src/box/raftlib.c b/src/box/raftlib.c
index 2f5e90f21..d28e51871 100644
--- a/src/box/raftlib.c
+++ b/src/box/raftlib.c
@@ -32,7 +32,6 @@
#include "error.h"
#include "fiber.h"
-#include "xrow.h"
#include "small/region.h"
#include "box.h"
#include "tt_static.h"
@@ -64,14 +63,14 @@ raft_state_str(uint32_t state)
/** Shortcut for vtab 'broadcast' method. */
static inline void
-raft_broadcast(struct raft *raft, const struct raft_request *req)
+raft_broadcast(struct raft *raft, const struct raft_msg *req)
{
raft->vtab->broadcast(raft, req);
}
/** Shortcut for vtab 'write' method. */
static inline void
-raft_write(struct raft *raft, const struct raft_request *req)
+raft_write(struct raft *raft, const struct raft_msg *req)
{
raft->vtab->write(raft, req);
}
@@ -235,7 +234,7 @@ static void
raft_sm_become_candidate(struct raft *raft);
static const char *
-raft_request_to_string(const struct raft_request *req)
+raft_msg_to_string(const struct raft_msg *req)
{
assert(req->term != 0);
char buf[1024];
@@ -273,9 +272,9 @@ raft_request_to_string(const struct raft_request *req)
}
void
-raft_process_recovery(struct raft *raft, const struct raft_request *req)
+raft_process_recovery(struct raft *raft, const struct raft_msg *req)
{
- say_verbose("RAFT: recover %s", raft_request_to_string(req));
+ say_verbose("RAFT: recover %s", raft_msg_to_string(req));
if (req->term != 0) {
raft->term = req->term;
raft->volatile_term = req->term;
@@ -301,11 +300,9 @@ raft_process_recovery(struct raft *raft, const struct raft_request *req)
}
int
-raft_process_msg(struct raft *raft, const struct raft_request *req,
- uint32_t source)
+raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
{
- say_info("RAFT: message %s from %u", raft_request_to_string(req),
- source);
+ say_info("RAFT: message %s from %u", raft_msg_to_string(req), source);
assert(source > 0);
assert(source != raft->self);
if (req->term == 0 || req->state == 0) {
@@ -488,7 +485,7 @@ raft_worker_handle_io(struct raft *raft)
assert(raft->is_write_in_progress);
/* During write Raft can't be anything but a follower. */
assert(raft->state == RAFT_STATE_FOLLOWER);
- struct raft_request req;
+ struct raft_msg req;
if (raft_is_fully_on_disk(raft)) {
end_dump:
@@ -538,8 +535,7 @@ end_dump:
* another leader elected by that time likely.
*/
raft_write(raft, &req);
- say_info("RAFT: persisted state %s",
- raft_request_to_string(&req));
+ say_info("RAFT: persisted state %s", raft_msg_to_string(&req));
assert(req.term >= raft->term);
raft->term = req.term;
@@ -558,7 +554,7 @@ static void
raft_worker_handle_broadcast(struct raft *raft)
{
assert(raft->is_broadcast_scheduled);
- struct raft_request req;
+ struct raft_msg req;
memset(&req, 0, sizeof(req));
req.term = raft->term;
req.vote = raft->vote;
@@ -817,7 +813,7 @@ raft_sm_stop(struct raft *raft)
}
void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
+raft_checkpoint_remote(const struct raft *raft, struct raft_msg *req)
{
memset(req, 0, sizeof(*req));
/*
@@ -836,7 +832,7 @@ raft_serialize_for_network(const struct raft *raft, struct raft_request *req)
}
void
-raft_serialize_for_disk(const struct raft *raft, struct raft_request *req)
+raft_checkpoint_local(const struct raft *raft, struct raft_msg *req)
{
memset(req, 0, sizeof(*req));
req->term = raft->term;
diff --git a/src/box/raftlib.h b/src/box/raftlib.h
index 6181d9d49..4f4d24ca8 100644
--- a/src/box/raftlib.h
+++ b/src/box/raftlib.h
@@ -70,7 +70,6 @@ extern "C" {
struct fiber;
struct raft;
-struct raft_request;
enum raft_state {
/**
@@ -95,9 +94,32 @@ enum raft_state {
const char *
raft_state_str(uint32_t state);
-typedef void (*raft_broadcast_f)(struct raft *raft,
- const struct raft_request *req);
-typedef void (*raft_write_f)(struct raft *raft, const struct raft_request *req);
+/**
+ * Basic Raft communication unit for talking to other nodes, and even to other
+ * subsystems such as disk storage.
+ */
+struct raft_msg {
+ /** Term of the instance. */
+ uint64_t term;
+ /**
+ * Instance ID of the instance this node voted for in the current term.
+ * 0 means the node didn't vote in this term.
+ */
+ uint32_t vote;
+ /**
+ * State of the instance. Can be 0 if the state does not matter for the
+ * message. For instance, when the message is sent to disk.
+ */
+ enum raft_state state;
+ /**
+ * Vclock of the instance. Can be NULL, if the node is not a candidate.
+ * Also is omitted when does not matter (when the message is for disk).
+ */
+ const struct vclock *vclock;
+};
+
+typedef void (*raft_broadcast_f)(struct raft *raft, const struct raft_msg *req);
+typedef void (*raft_write_f)(struct raft *raft, const struct raft_msg *req);
/**
* Raft connection to the environment, via which it talks to other nodes and
@@ -226,11 +248,11 @@ raft_is_enabled(const struct raft *raft)
/** Process a raft entry stored in WAL/snapshot. */
void
-raft_process_recovery(struct raft *raft, const struct raft_request *req);
+raft_process_recovery(struct raft *raft, const struct raft_msg *req);
/** Process a raft status message coming from the network. */
int
-raft_process_msg(struct raft *raft, const struct raft_request *req,
+raft_process_msg(struct raft *raft, const struct raft_msg *req,
uint32_t source);
/**
@@ -297,14 +319,14 @@ raft_new_term(struct raft *raft);
* cluster. It is allowed to save anything here, not only persistent state.
*/
void
-raft_serialize_for_network(const struct raft *raft, struct raft_request *req);
+raft_checkpoint_remote(const struct raft *raft, struct raft_msg *req);
/**
* Save complete Raft state into a request to be persisted on disk. Only term
* and vote are being persisted.
*/
void
-raft_serialize_for_disk(const struct raft *raft, struct raft_request *req);
+raft_checkpoint_local(const struct raft *raft, struct raft_msg *req);
/**
* Add a trigger invoked each time any of the Raft node visible attributes are
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 3d68c1268..fde8f9474 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -264,6 +264,10 @@ xrow_encode_synchro(struct xrow_header *row,
int
xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req);
+/**
+ * Raft request. It repeats Raft message to the letter, but can be extended in
+ * future not depending on the Raft library.
+ */
struct raft_request {
uint64_t term;
uint32_t vote;
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (15 preceding siblings ...)
2020-11-19 23:46 ` [Tarantool-patches] [PATCH v2 09/16] raft: introduce raft_msg, drop xrow dependency Vladislav Shpilevoy
@ 2020-11-20 9:14 ` Serge Petrenko
2020-11-20 19:42 ` Vladislav Shpilevoy
2020-11-23 5:30 ` Alexander V. Tikhonov
18 siblings, 0 replies; 26+ messages in thread
From: Serge Petrenko @ 2020-11-20 9:14 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.11.2020 02:45, Vladislav Shpilevoy пишет:
> The patchset is a second part of Raft relocation to a new module for the sake of
> unit testing. This part does the relocation itself.
>
> It entirely consists of removal of box dependencies from raft code.
>
> The third part will virtualize Raft event loop at compile-time, and will
> introduce customizable implementations for network, disk, event loop, and time
> to perform unit tests.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5303-p2-src-lib-raft
> Issue: https://github.com/tarantool/tarantool/issues/5303
>
> Changes in v2:
> - Renames and comment fixes;
> - Fixed the bug when a follower could be writable for some time. But don't know
> how to test it in a sane way. Should be easy to cover in a unit test though.
> The fix is done in the commits from "raft: make worker non-cancellable during
> WAL write" to "move RO summary update to box-Raft".
>
> Vladislav Shpilevoy (16):
> raft: move sources to raftlib.h/.c
> raft: move box_raft_* to src/box/raft.h and .c
> raft: stop using replication_disconnect_timeout()
> raft: stop using replication_synchro_quorum
> raft: stop using instance_id
> raft: make raft_request.vclock constant
> raft: stop using replicaset.vclock
> raft: introduce vtab for disk and network
> raft: introduce raft_msg, drop xrow dependency
> raft: make worker non-cancellable during WAL write
> raft: move worker fiber from Raft library to box
> raft: move synchro queue clear to the worker fiber
> raft: invoke update triggers within state machine
> raft: move RO summary update to box-Raft
> raft: introduce RaftError
> raft: move algorithm code to src/lib/raft
>
> src/box/CMakeLists.txt | 2 +-
> src/box/applier.cc | 2 +-
> src/box/box.cc | 44 +-
> src/box/memtx_engine.c | 4 +-
> src/box/raft.c | 1200 ++++++-----------------------------
> src/box/raft.h | 252 +-------
> src/box/replication.cc | 3 +
> src/box/xrow.c | 2 +-
> src/box/xrow.h | 6 +-
> src/lib/CMakeLists.txt | 1 +
> src/lib/core/diag.h | 2 +
> src/lib/core/exception.cc | 24 +
> src/lib/core/exception.h | 7 +
> src/lib/raft/CMakeLists.txt | 7 +
> src/lib/raft/raft.c | 997 +++++++++++++++++++++++++++++
> src/lib/raft/raft.h | 357 +++++++++++
> 16 files changed, 1652 insertions(+), 1258 deletions(-)
> create mode 100644 src/lib/raft/CMakeLists.txt
> create mode 100644 src/lib/raft/raft.c
> create mode 100644 src/lib/raft/raft.h
>
Hi! Thanks for the patchset & fixes!
Patches 1-9, 15, 16 LGTM as reviewed in previous patchset version.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (16 preceding siblings ...)
2020-11-20 9:14 ` [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Serge Petrenko
@ 2020-11-20 19:42 ` Vladislav Shpilevoy
2020-11-23 5:30 ` Alexander V. Tikhonov
18 siblings, 0 replies; 26+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-20 19:42 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko, Alexander V. Tikhonov
Hi! Sasha, could you please take a look if it passes all the tests
with no new degradations?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft
2020-11-19 23:45 [Tarantool-patches] [PATCH v2 00/16] Raft module, part 2 - relocation to src/lib/raft Vladislav Shpilevoy
` (17 preceding siblings ...)
2020-11-20 19:42 ` Vladislav Shpilevoy
@ 2020-11-23 5:30 ` Alexander V. Tikhonov
2020-11-23 23:26 ` Vladislav Shpilevoy
18 siblings, 1 reply; 26+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-23 5:30 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
Hi Vlad, thanks for the patch, as I see no new degradation found in
gitlab-ci testing commit criteria pipeline [1], patch LGTM.
[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/219744245
On Fri, Nov 20, 2020 at 12:45:56AM +0100, Vladislav Shpilevoy wrote:
> The patchset is a second part of Raft relocation to a new module for the sake of
> unit testing. This part does the relocation itself.
>
> It entirely consists of removal of box dependencies from raft code.
>
> The third part will virtualize Raft event loop at compile-time, and will
> introduce customizable implementations for network, disk, event loop, and time
> to perform unit tests.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5303-p2-src-lib-raft
> Issue: https://github.com/tarantool/tarantool/issues/5303
>
> Changes in v2:
> - Renames and comment fixes;
> - Fixed the bug when a follower could be writable for some time. But don't know
> how to test it in a sane way. Should be easy to cover in a unit test though.
> The fix is done in the commits from "raft: make worker non-cancellable during
> WAL write" to "move RO summary update to box-Raft".
>
> Vladislav Shpilevoy (16):
> raft: move sources to raftlib.h/.c
> raft: move box_raft_* to src/box/raft.h and .c
> raft: stop using replication_disconnect_timeout()
> raft: stop using replication_synchro_quorum
> raft: stop using instance_id
> raft: make raft_request.vclock constant
> raft: stop using replicaset.vclock
> raft: introduce vtab for disk and network
> raft: introduce raft_msg, drop xrow dependency
> raft: make worker non-cancellable during WAL write
> raft: move worker fiber from Raft library to box
> raft: move synchro queue clear to the worker fiber
> raft: invoke update triggers within state machine
> raft: move RO summary update to box-Raft
> raft: introduce RaftError
> raft: move algorithm code to src/lib/raft
>
> src/box/CMakeLists.txt | 2 +-
> src/box/applier.cc | 2 +-
> src/box/box.cc | 44 +-
> src/box/memtx_engine.c | 4 +-
> src/box/raft.c | 1200 ++++++-----------------------------
> src/box/raft.h | 252 +-------
> src/box/replication.cc | 3 +
> src/box/xrow.c | 2 +-
> src/box/xrow.h | 6 +-
> src/lib/CMakeLists.txt | 1 +
> src/lib/core/diag.h | 2 +
> src/lib/core/exception.cc | 24 +
> src/lib/core/exception.h | 7 +
> src/lib/raft/CMakeLists.txt | 7 +
> src/lib/raft/raft.c | 997 +++++++++++++++++++++++++++++
> src/lib/raft/raft.h | 357 +++++++++++
> 16 files changed, 1652 insertions(+), 1258 deletions(-)
> create mode 100644 src/lib/raft/CMakeLists.txt
> create mode 100644 src/lib/raft/raft.c
> create mode 100644 src/lib/raft/raft.h
>
> --
> 2.24.3 (Apple Git-128)
>
^ permalink raw reply [flat|nested] 26+ messages in thread