Hi! Thanks for your detailed answer!
I pushed v2 addressing all the comments.


Thursday, 13 February 2020, 9:47 +03:00, from Konstantin Osipov <kostja.osipov@gmail.com>:

* sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> Fix replicaset.applier.vclock initialization issues: it wasn't
> initialized at all previously.

In the next line you say that you remove the initialization. What
do you mean here?

I changed both the commit and the commit message.



> Moreover, there is no valid point in code
> to initialize it, since it may get stale right away if new entries are
> written to WAL.

Well, it reflects the state of the WAL *as seen by* the set of
appliers. This is stated in the comment. So it doesn't have to
reflect local changes.

I see.


> So, check for both applier and replicaset vclocks.
> The greater one protects the instance from applying the rows it has
> already applied or has already scheduled to write.
> Also remove an unnecessary applier vclock initialization from
> replication_init().

First of all, the race you describe applies to local changes
only. Yet you add the check for all replica ids, which further
obfuscates this piece of code.

Ok, fixed.


Second, the core of the issue is a "hole" in the vclock protection
enforced by latch_lock/latch_unlock. The assumption behind
latch_lock/latch_unlock is that while a latch is locked, no
source can apply a transaction under this replica id. This is
violated by the local WAL.

We used to skip all changes with the local replica id in the applier.

Later this was changed to allow a node to fetch its own logs back on
recovery, e.g. when some replica has them and the local node has
lost a piece of WAL.

It will take me a while to find them, but the commit and ticket
that made this change are what introduced the regression.

The proper fix is to only apply local changes received from
remotes in orphan mode, and begin skipping them when entering
read-write mode.

Thanks for the clarification.



> Closes #4739
> ---
> src/box/applier.cc | 14 ++++++++++++--
> src/box/replication.cc | 1 -
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index ae3d281a5..acb26b7e2 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
> struct latch *latch = (replica ? &replica->order_latch :
> &replicaset.applier.order_latch);
> latch_lock(latch);
> - if (vclock_get(&replicaset.applier.vclock,
> - first_row->replica_id) >= first_row->lsn) {
> + /*
> + * We cannot tell which vclock is greater. There is no
> + * proper place to initialize applier vclock, since it
> + * may get stale right away if we write something to WAL
> + * and it gets replicated and then arrives back from the
> + * replica. So check against both vclocks. Replicaset
> + * vclock will guard us from corner cases like the one
> + * above.
> + */
> + if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
> + vclock_get(&replicaset.vclock, first_row->replica_id)) >=
> + first_row->lsn) {
> latch_unlock(latch);
> return 0;
> }
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index e7bfa22ab..7b04573a4 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -93,7 +93,6 @@ replication_init(void)
> latch_create(&replicaset.applier.order_latch);
>
> vclock_create(&replicaset.applier.vclock);
> - vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
> rlist_create(&replicaset.applier.on_rollback);
> rlist_create(&replicaset.applier.on_commit);
>
> --
> 2.20.1 (Apple Git-117)

--
Konstantin Osipov, Moscow, Russia


--
Sergey Petrenko