[Tarantool-patches] [PATCH 10/12] raft: move box_update_ro_summary to update trigger

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu Nov 19 02:21:12 MSK 2020


Hi! Thanks for the review!

> Raft uses synchronous WAL write, corect?
> 
> So there's a yield in raft_worker_handle_io(). Now there's a period of time when
> an instance is a follower, but it isn't read-only.
> 
> When you reconfigure a leader to become voter, everything's fine, since no
> writing to disk is involved.
> 
> However, if an existing leader receives a message with term greater, than its own,
> it'll first persist this term, and thus yield, and proceed to broadcast and switching
> to ro later.
> 
> So now it's possible that a follower is writeable for some period of time.

You are a savior. Thanks for the deep review and for noticing this.

Indeed. The issue exists. And it is much deeper, it seems.

I also glanced at your patch about RO-update vs limbo-clear order.
Unfortunately, it does not work. As well as my idea about running on_update
triggers from raft_schedule_broadcast().

The reason is that currently our on_update trigger yields. In both our
solutions. Because it calls box_clear_synchro_queue(). And this is exactly
what I was trying to avoid by using a fiber in the first place. The state
machine must not yield.

Also there is another issue, not related to the patch, but which I spotted
while worked on it today. The issue is - we can't cancel limbo clearance.
The node could be demoted to a follower during waiting for confirms, but it
still will wait for confirms and we can't stop it.

I started thinking that we could resolve both these issues if box/raft.c could
run update triggers without yields right away, but schedule async work, like
limbo clear, into a separate fiber, cancellable.

And I realized that we already have this fiber - raft.worker.

The idea is that we can move the worker fiber to box/raft.c. And in libraft
instead of a fiber we will have 2 methods:

	- raft_vtab.async_f - virtual method called by Raft, when it wants
	  to schedule some async heavy work (network, disk). We will call it
	  instead of raft_worker_wakeup().

	- raft_process_async - normal method, which the Raft owner should call
	  to handle all async events. In a separate fiber. This is the same
	  as raft_worker_f(), but not depending on fiber, and finite.

In box/raft.c we have a worker fiber. To libraft we give async_f, which
creates the fiber on demand, and wakes it up. No yields. Like now, but in
box/raft.c.

The fiber in its body will call raft_process_async() and fiber_yield() until
it is cancelled.

Now how does it fix the update triggers? - we fire on_update triggers in
raft_schedule_broadcast(), but box/raft.c in the trigger will only update
RO summary. It won't yield. For the limbo clearance it will wakeup the worker
fiber, which now belongs to box/raft.c, so it is totally fine. The worker
will call raft_process_async() and will clear the limbo when it is time.

Also the worker fiber can be cancelled/interrupted somehow if we want to
stop limbo clear when the node is not a leader anymore.

I started working on this already, and it seems to be good. Raft is simplified
even more, and we delete the ugly hack in box_raft_free() about changing struct
raft with box_raft_global.worker = NULL. We still nullify the fiber, but we
don't change struct raft.


More information about the Tarantool-patches mailing list