Tarantool development patches archive
* [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance
@ 2020-02-12 23:50 sergepetrenko
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier sergepetrenko
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: sergepetrenko @ 2020-02-12 23:50 UTC (permalink / raw)
  To: alexander.turenko, v.shpilevoy; +Cc: tarantool-patches

This patchset fixes problems with applying self-issued rows arriving via replication.

https://github.com/tarantool/tarantool/issues/4739
https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert

sergepetrenko (2):
  replication: correctly check for rows to skip in applier
  wal: panic when trying to write a record with a broken lsn

 src/box/applier.cc     | 14 ++++++++++++--
 src/box/replication.cc |  1 -
 src/box/wal.c          | 15 ++++++++++++---
 3 files changed, 24 insertions(+), 6 deletions(-)

-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
  2020-02-12 23:50 [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance sergepetrenko
@ 2020-02-12 23:51 ` sergepetrenko
  2020-02-13  6:47   ` Konstantin Osipov
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn sergepetrenko
  2020-02-13  6:48 ` [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance Konstantin Osipov
  2 siblings, 1 reply; 11+ messages in thread
From: sergepetrenko @ 2020-02-12 23:51 UTC (permalink / raw)
  To: alexander.turenko, v.shpilevoy; +Cc: tarantool-patches

Fix replicaset.applier.vclock initialization issues: it wasn't
initialized at all previously. Moreover, there is no valid point in code
to initialize it, since it may get stale right away if new entries are
written to WAL. So, check for both applier and replicaset vclocks.
The greater one protects the instance from applying the rows it has
already applied or has already scheduled to write.
Also remove an unnecessary applier vclock initialization from
replication_init().

Closes #4739
---
 src/box/applier.cc     | 14 ++++++++++++--
 src/box/replication.cc |  1 -
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index ae3d281a5..acb26b7e2 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
 	struct latch *latch = (replica ? &replica->order_latch :
 			       &replicaset.applier.order_latch);
 	latch_lock(latch);
-	if (vclock_get(&replicaset.applier.vclock,
-		       first_row->replica_id) >= first_row->lsn) {
+	/*
+	 * We cannot tell which vclock is greater. There is no
+	 * proper place to initialize applier vclock, since it
+	 * may get stale right away if we write something to WAL
+	 * and it gets replicated and then arrives back from the
+	 * replica. So check against both vclocks. Replicaset
+	 * vclock will guard us from corner cases like the one
+	 * above.
+	 */
+	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
+		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
+	    first_row->lsn) {
 		latch_unlock(latch);
 		return 0;
 	}
diff --git a/src/box/replication.cc b/src/box/replication.cc
index e7bfa22ab..7b04573a4 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -93,7 +93,6 @@ replication_init(void)
 	latch_create(&replicaset.applier.order_latch);
 
 	vclock_create(&replicaset.applier.vclock);
-	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
 	rlist_create(&replicaset.applier.on_rollback);
 	rlist_create(&replicaset.applier.on_commit);
 
-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn
  2020-02-12 23:50 [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance sergepetrenko
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier sergepetrenko
@ 2020-02-12 23:51 ` sergepetrenko
  2020-02-13  6:48   ` Konstantin Osipov
  2020-02-13  6:48 ` [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance Konstantin Osipov
  2 siblings, 1 reply; 11+ messages in thread
From: sergepetrenko @ 2020-02-12 23:51 UTC (permalink / raw)
  To: alexander.turenko, v.shpilevoy; +Cc: tarantool-patches

There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. So we'd better panic on an attempt to
write a record with a duplicate or otherwise broken lsn.

Follow-up #4739
---
 src/box/wal.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index 0ae66ff32..3d4317f34 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -951,9 +951,18 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 			(*row)->tsn = tsn;
 			(*row)->is_commit = row == end - 1;
 		} else {
-			vclock_follow(vclock_diff, (*row)->replica_id,
-				      (*row)->lsn - vclock_get(base,
-							       (*row)->replica_id));
+			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
+			if (diff <= vclock_get(vclock_diff,
+					       (*row)->replica_id)) {
+				panic("Attempt to write a broken LSN to WAL:"
+				      " replica id: %d, committed lsn: %lld,"
+				      " new lsn: %lld", (*row)->replica_id,
+				      (long long) (vclock_get(base, (*row)->replica_id) +
+						   vclock_get(vclock_diff,
+							      (*row)->replica_id)),
+				      (long long) (*row)->lsn);
+			}
+			vclock_follow(vclock_diff, (*row)->replica_id, diff);
 		}
 	}
 }
-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier sergepetrenko
@ 2020-02-13  6:47   ` Konstantin Osipov
  2020-02-13  6:58     ` Konstantin Osipov
  2020-02-13 22:10     ` Sergey Petrenko
  0 siblings, 2 replies; 11+ messages in thread
From: Konstantin Osipov @ 2020-02-13  6:47 UTC (permalink / raw)
  To: sergepetrenko; +Cc: tarantool-patches, v.shpilevoy

* sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> Fix replicaset.applier.vclock initialization issues: it wasn't
> initialized at all previously.

In the next line you say that you remove the initialization. What
do you mean here?

> Moreover, there is no valid point in code
> to initialize it, since it may get stale right away if new entries are
> written to WAL.

Well, it reflects the state of the wal *as seen by* the set of
appliers. This is stated in the comment. So it doesn't have to
reflect local changes. 

> So, check for both applier and replicaset vclocks.
> The greater one protects the instance from applying the rows it has
> already applied or has already scheduled to write.
> Also remove an unnecessary applier vclock initialization from
> replication_init().

First of all, the race you describe applies to
local changes only. Yet you add the check for all replica ids. 
This further obliterates this piece of code.

Second, the core of the issue is a "hole" in vclock protection
enforced by latch_lock/latch_unlock. Basically the assumption that
latch_lock/latch_unlock has is that while a latch is locked, no
source can apply a transaction under this replica id. This is
violated by the local WAL.
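
Schematically (not real code, just the interleaving; the local commit
happens in tx/WAL and never takes this latch):

	latch_lock(latch);
	if (vclock_get(&replicaset.applier.vclock, first_row->replica_id) >=
	    first_row->lsn) {
		latch_unlock(latch);
		return 0;	/* not taken: applier vclock is stale */
	}
	/*
	 * Nothing stops tx from committing a *local* transaction under
	 * the same replica id right now -- it never takes this latch --
	 * so the row below may carry an LSN that is already written or
	 * queued in the local WAL.
	 */
	/* ... apply the row -> a duplicate/broken LSN reaches the WAL. */
	latch_unlock(latch);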

We used to skip all changes by local vclock id before in applier.

Later it was changed to be able to get-your-own logs on recovery,
e.g. if some replica has them, and the local node lost a piece of
wal.

It will take me a while to find this commit and ticket, but this
is the commit and ticket which introduced the regression.

The proper fix is to only apply local changes received from
remotes in orphan mode, and begin skipping them when entering
read-write mode.
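
Roughly something like this in applier_apply_tx() (a sketch only;
is_orphan() is a placeholder for whatever read-only/orphan check we
settle on, not an existing helper):

	if (first_row->replica_id == instance_id && !is_orphan()) {
		/*
		 * Once the node is read-write it is the only writer
		 * under its own replica id, so anything coming back
		 * from a remote is already in the local WAL or queued
		 * for it -- skip it instead of racing with tx.
		 */
		latch_unlock(latch);
		return 0;
	}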

> Closes #4739
> ---
>  src/box/applier.cc     | 14 ++++++++++++--
>  src/box/replication.cc |  1 -
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index ae3d281a5..acb26b7e2 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>  	struct latch *latch = (replica ? &replica->order_latch :
>  			       &replicaset.applier.order_latch);
>  	latch_lock(latch);
> -	if (vclock_get(&replicaset.applier.vclock,
> -		       first_row->replica_id) >= first_row->lsn) {
> +	/*
> +	 * We cannot tell which vclock is greater. There is no
> +	 * proper place to initialize applier vclock, since it
> +	 * may get stale right away if we write something to WAL
> +	 * and it gets replicated and then arrives back from the
> +	 * replica. So check against both vclocks. Replicaset
> +	 * vclock will guard us from corner cases like the one
> +	 * above.
> +	 */
> +	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
> +		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
> +	    first_row->lsn) {
>  		latch_unlock(latch);
>  		return 0;
>  	}
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index e7bfa22ab..7b04573a4 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -93,7 +93,6 @@ replication_init(void)
>  	latch_create(&replicaset.applier.order_latch);
>  
>  	vclock_create(&replicaset.applier.vclock);
> -	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>  	rlist_create(&replicaset.applier.on_rollback);
>  	rlist_create(&replicaset.applier.on_commit);
>  
> -- 
> 2.20.1 (Apple Git-117)

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn sergepetrenko
@ 2020-02-13  6:48   ` Konstantin Osipov
  2020-02-13 22:05     ` Sergey Petrenko
  0 siblings, 1 reply; 11+ messages in thread
From: Konstantin Osipov @ 2020-02-13  6:48 UTC (permalink / raw)
  To: sergepetrenko; +Cc: tarantool-patches, v.shpilevoy

* sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
> fire in release builds, of course. So we'd better panic on an attempt to
> write a record with a duplicate or otherwise broken lsn.
> 

It should be a warning in production. You can't crash production
deploy because of a server bug, and lsn inconsistency is always a
server bug.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance
  2020-02-12 23:50 [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance sergepetrenko
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier sergepetrenko
  2020-02-12 23:51 ` [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn sergepetrenko
@ 2020-02-13  6:48 ` Konstantin Osipov
  2 siblings, 0 replies; 11+ messages in thread
From: Konstantin Osipov @ 2020-02-13  6:48 UTC (permalink / raw)
  To: sergepetrenko; +Cc: tarantool-patches, v.shpilevoy

* sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> This patchset fixes problems with applying self-issued rows arriving via replication.
> 
> https://github.com/tarantool/tarantool/issues/4739
> https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert

Congratulations on nailing it down!


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
  2020-02-13  6:47   ` Konstantin Osipov
@ 2020-02-13  6:58     ` Konstantin Osipov
  2020-02-13 22:03       ` Sergey Petrenko
  2020-02-13 22:10     ` Sergey Petrenko
  1 sibling, 1 reply; 11+ messages in thread
From: Konstantin Osipov @ 2020-02-13  6:58 UTC (permalink / raw)
  To: sergepetrenko, alexander.turenko, v.shpilevoy, tarantool-patches

* Konstantin Osipov <kostja.osipov@gmail.com> [20/02/13 09:47]:

From relay.cc:

        /*                                                                      
         * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE      
         * request. If this is FINAL JOIN (i.e. relay->replica is NULL),        
         * we must relay all rows, even those originating from the replica      
         * itself (there may be such rows if this is rebootstrap). If this      
         * SUBSCRIBE, only send a row if it is not from the same replica        
         * (i.e. don't send replica's own rows back) or if this row is          
         * missing on the other side (i.e. in case of sudden power-loss,        
         * data was not written to WAL, so remote master can't recover          
         * it). In the latter case packet's LSN is less than or equal to        
         * local master's LSN at the moment it received 'SUBSCRIBE' request.    
         */                                                                     
        if (relay->replica == NULL ||                                           
            packet->replica_id != relay->replica->id ||                         
            packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,        
                                      packet->replica_id)) {                    
                struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,             
                                            ERRINJ_INT);                        
                if (inj != NULL && packet->lsn == inj->iparam) {                
                        packet->lsn = inj->iparam - 1;                          
                        say_warn("injected broken lsn: %lld",                   
                                 (long long) packet->lsn);                      
                }                                                               
                relay_send(relay, packet);                                      
        }                                                                       
}                 


As you can see we never send our own rows back, as long as 
they are greater than relay->local_vclock_at_subscribe.

So what exactly does go wrong here? 

Is the bug triggered during initial replication configuration, or
during a reconfiguration?

I suspect the issue is that at reconfiguration we send
local_vclock_at_subscribe, but keep changing it.
The fix then would be to make sure the local component in
local_vclock_at_subscribe is set to infinity during
reconfiguration.

> * sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> > Fix replicaset.applier.vclock initialization issues: it wasn't
> > initialized at all previously.
> 
> In the next line you say that you remove the initialization. What
> do you mean here?
> 
> > Moreover, there is no valid point in code
> > to initialize it, since it may get stale right away if new entries are
> > written to WAL.
> 
> Well, it reflects the state of the wal *as seen by* the set of
> appliers. This is stated in the comment. So it doesn't have to
> reflect local changes. 
> 
> > So, check for both applier and replicaset vclocks.
> > The greater one protects the instance from applying the rows it has
> > already applied or has already scheduled to write.
> > Also remove an unnecessary applier vclock initialization from
> > replication_init().
> 
> First of all, the race you describe applies to
> local changes only. Yet you add the check for all replica ids. 
> This further obliterates this piece of code.
> 
> Second, the core of the issue is a "hole" in vclock protection
> enforced by latch_lock/latch_unlock. Basically the assumption that
> latch_lock/latch_unlock has is that while a latch is locked, no
> source can apply a transaction under this replica id. This is
> violated by the local WAL.
> 
> We used to skip all changes by local vclock id before in applier.
> 
> Later it was changed to be able to get-your-own logs on recovery,
> e.g. if some replica has them, and the local node lost a piece of
> wal.
> 
> It will take me a while to find this commit and ticket, but this
> is the commit and ticket which introduced the regression.
> 
> The proper fix is to only apply local changes received from
> remotes in orphan mode, and begin skipping them when entering
> read-write mode.
> 
> > Closes #4739
> > ---
> >  src/box/applier.cc     | 14 ++++++++++++--
> >  src/box/replication.cc |  1 -
> >  2 files changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/src/box/applier.cc b/src/box/applier.cc
> > index ae3d281a5..acb26b7e2 100644
> > --- a/src/box/applier.cc
> > +++ b/src/box/applier.cc
> > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
> >  	struct latch *latch = (replica ? &replica->order_latch :
> >  			       &replicaset.applier.order_latch);
> >  	latch_lock(latch);
> > -	if (vclock_get(&replicaset.applier.vclock,
> > -		       first_row->replica_id) >= first_row->lsn) {
> > +	/*
> > +	 * We cannot tell which vclock is greater. There is no
> > +	 * proper place to initialize applier vclock, since it
> > +	 * may get stale right away if we write something to WAL
> > +	 * and it gets replicated and then arrives back from the
> > +	 * replica. So check against both vclocks. Replicaset
> > +	 * vclock will guard us from corner cases like the one
> > +	 * above.
> > +	 */
> > +	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
> > +		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
> > +	    first_row->lsn) {
> >  		latch_unlock(latch);
> >  		return 0;
> >  	}
> > diff --git a/src/box/replication.cc b/src/box/replication.cc
> > index e7bfa22ab..7b04573a4 100644
> > --- a/src/box/replication.cc
> > +++ b/src/box/replication.cc
> > @@ -93,7 +93,6 @@ replication_init(void)
> >  	latch_create(&replicaset.applier.order_latch);
> >  
> >  	vclock_create(&replicaset.applier.vclock);
> > -	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
> >  	rlist_create(&replicaset.applier.on_rollback);
> >  	rlist_create(&replicaset.applier.on_commit);
> >  
> > -- 
> > 2.20.1 (Apple Git-117)
> 
> -- 
> Konstantin Osipov, Moscow, Russia

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
  2020-02-13  6:58     ` Konstantin Osipov
@ 2020-02-13 22:03       ` Sergey Petrenko
  0 siblings, 0 replies; 11+ messages in thread
From: Sergey Petrenko @ 2020-02-13 22:03 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches, v.shpilevoy

>Thursday, February 13, 2020, 9:58 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:
>
>* Konstantin Osipov < kostja.osipov@gmail.com > [20/02/13 09:47]:
>
>From relay.cc:
>
>        /* 
>         * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE 
>         * request. If this is FINAL JOIN (i.e. relay->replica is NULL), 
>         * we must relay all rows, even those originating from the replica 
>         * itself (there may be such rows if this is rebootstrap). If this 
>         * SUBSCRIBE, only send a row if it is not from the same replica 
>         * (i.e. don't send replica's own rows back) or if this row is 
>         * missing on the other side (i.e. in case of sudden power-loss, 
>         * data was not written to WAL, so remote master can't recover 
>         * it). In the latter case packet's LSN is less than or equal to 
>         * local master's LSN at the moment it received 'SUBSCRIBE' request. 
>         */ 
>        if (relay->replica == NULL || 
>            packet->replica_id != relay->replica->id || 
>            packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe, 
>                                      packet->replica_id)) { 
>                struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN, 
>                                            ERRINJ_INT); 
>                if (inj != NULL && packet->lsn == inj->iparam) { 
>                        packet->lsn = inj->iparam - 1; 
>                        say_warn("injected broken lsn: %lld", 
>                                 (long long) packet->lsn); 
>                } 
>                relay_send(relay, packet); 
>        } 
>} 
>
>
>As you can see we never send our own rows back, as long as 
>they are greater than relay->local_vclock_at_subscribe.
True. First of all, local_vclock_at_subscribe was updated later than
it should've been. I describe it in more detail in one of the commits.

>
>
>So what exactly does go wrong here? 
>
>Is the bug triggered during initial replication configuration, or
>during a reconfiguration?
Both: during reconfiguration, and when resubscribing after a remote
instance dies and restarts.
>
>
>I suspect the issue is that at reconfiguration we send
>local_vclock_at_subscribe, but keep changing it. 
Yep

>
>The fix then would be to make sure the local component in
>local_vclock_at_subscribe is set to infinity during
>reconfiguration.
>
>
>> * sergepetrenko < sergepetrenko@tarantool.org > [20/02/13 09:34]:
>> > Fix replicaset.applier.vclock initialization issues: it wasn't
>> > initialized at all previously.
>> 
>> In the next line you say that you remove the initialization. What
>> do you mean here?
>> 
>> > Moreover, there is no valid point in code
>> > to initialize it, since it may get stale right away if new entries are
>> > written to WAL.
>> 
>> Well, it reflects the state of the wal *as seen by* the set of
>> appliers. This is stated in the comment. So it doesn't have to
>> reflect local changes. 
>> 
>> > So, check for both applier and replicaset vclocks.
>> > The greater one protects the instance from applying the rows it has
>> > already applied or has already scheduled to write.
>> > Also remove an unnecessary applier vclock initialization from
>> > replication_init().
>> 
>> First of all, the race you describe applies to
>> local changes only. Yet you add the check for all replica ids. 
>> This further obliterates this piece of code.
>> 
>> Second, the core of the issue is a "hole" in vclock protection
>> enforced by latch_lock/latch_unlock. Basically the assumption that
>> latch_lock/latch_unlock has is that while a latch is locked, no
>> source can apply a transaction under this replica id. This is
>> violated by the local WAL.
>> 
>> We used to skip all changes by local vclock id before in applier.
>> 
>> Later it was changed to be able to get-your-own logs on recovery,
>> e.g. if some replica has them, and the local node lost a piece of
>> wal.
>> 
>> It will take me a while to find this commit and ticket, but this
>> is the commit and ticket which introduced the regression.
>> 
>> The proper fix is to only apply local changes received from
>> remotes in orphan mode, and begin skipping them when entering
>> read-write mode.
>> 
>> > Closes #4739
>> > ---
>> >  src/box/applier.cc     | 14 ++++++++++++--
>> >  src/box/replication.cc |  1 -
>> >  2 files changed, 12 insertions(+), 3 deletions(-)
>> > 
>> > diff --git a/src/box/applier.cc b/src/box/applier.cc
>> > index ae3d281a5..acb26b7e2 100644
>> > --- a/src/box/applier.cc
>> > +++ b/src/box/applier.cc
>> > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>> >  	struct latch *latch = (replica ? &replica->order_latch :
>> >  			       &replicaset.applier.order_latch);
>> >  	latch_lock(latch);
>> > -	if (vclock_get(&replicaset.applier.vclock,
>> > -		       first_row->replica_id) >= first_row->lsn) {
>> > +	/*
>> > +	 * We cannot tell which vclock is greater. There is no
>> > +	 * proper place to initialize applier vclock, since it
>> > +	 * may get stale right away if we write something to WAL
>> > +	 * and it gets replicated and then arrives back from the
>> > +	 * replica. So check against both vclocks. Replicaset
>> > +	 * vclock will guard us from corner cases like the one
>> > +	 * above.
>> > +	 */
>> > +	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
>> > +		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
>> > +	    first_row->lsn) {
>> >  		latch_unlock(latch);
>> >  		return 0;
>> >  	}
>> > diff --git a/src/box/replication.cc b/src/box/replication.cc
>> > index e7bfa22ab..7b04573a4 100644
>> > --- a/src/box/replication.cc
>> > +++ b/src/box/replication.cc
>> > @@ -93,7 +93,6 @@ replication_init(void)
>> >  	latch_create(&replicaset.applier.order_latch);
>> > 
>> >  	vclock_create(&replicaset.applier.vclock);
>> > -	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>> >  	rlist_create(&replicaset.applier.on_rollback);
>> >  	rlist_create(&replicaset.applier.on_commit);
>> > 
>> > -- 
>> > 2.20.1 (Apple Git-117)
>> 
>> -- 
>> Konstantin Osipov, Moscow, Russia
>
>-- 
>Konstantin Osipov, Moscow, Russia


-- 
Sergey Petrenko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn
  2020-02-13  6:48   ` Konstantin Osipov
@ 2020-02-13 22:05     ` Sergey Petrenko
  2020-02-14  7:26       ` Konstantin Osipov
  0 siblings, 1 reply; 11+ messages in thread
From: Sergey Petrenko @ 2020-02-13 22:05 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches, v.shpilevoy

>Thursday, February 13, 2020, 9:48 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:
>
>* sergepetrenko < sergepetrenko@tarantool.org > [20/02/13 09:34]:
>> There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
>> fire in release builds, of course. So we'd better panic on an attempt to
>> write a record with a duplicate or otherwise broken lsn.
>> 
>
>It should be a warning in production. You can't crash production
>deploy because of a server bug, and lsn inconsistency is always a
>server bug.
Well, this inconsistency yields a WAL that cannot be recovered from.
But I don't have a strong opinion here. I changed it to a warning in v2.
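
To illustrate the shape of it (not the actual v2 diff): in
wal_assign_lsn(), warn and skip the vclock_follow() that would
otherwise assert, instead of panicking:

	int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
	if (diff <= vclock_get(vclock_diff, (*row)->replica_id)) {
		say_warn("attempt to write a broken LSN to WAL: "
			 "replica id %d, confirmed lsn %lld, new lsn %lld",
			 (*row)->replica_id,
			 (long long) (vclock_get(base, (*row)->replica_id) +
				      vclock_get(vclock_diff,
						 (*row)->replica_id)),
			 (long long) (*row)->lsn);
	} else {
		vclock_follow(vclock_diff, (*row)->replica_id, diff);
	}
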
>
>
>-- 
>Konstantin Osipov, Moscow, Russia


-- 
Sergey Petrenko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
  2020-02-13  6:47   ` Konstantin Osipov
  2020-02-13  6:58     ` Konstantin Osipov
@ 2020-02-13 22:10     ` Sergey Petrenko
  1 sibling, 0 replies; 11+ messages in thread
From: Sergey Petrenko @ 2020-02-13 22:10 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches, v.shpilevoy

Hi! Thanks for your detailed answer!
I pushed v2 addressing all the comments.


>Thursday, February 13, 2020, 9:47 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:
>
>* sergepetrenko < sergepetrenko@tarantool.org > [20/02/13 09:34]:
>> Fix replicaset.applier.vclock initialization issues: it wasn't
>> initialized at all previously.
>
>In the next line you say that you remove the initialization. What
>do you mean here?
I changed both the commit and the commit message.

>
>
>> Moreover, there is no valid point in code
>> to initialize it, since it may get stale right away if new entries are
>> written to WAL.
>
>Well, it reflects the state of the wal *as seen by* the set of
>appliers. This is stated in the comment. So it doesn't have to
>reflect local changes. 
I see.
>
>
>> So, check for both applier and replicaset vclocks.
>> The greater one protects the instance from applying the rows it has
>> already applied or has already scheduled to write.
>> Also remove an unnecessary applier vclock initialization from
>> replication_init().
>
>First of all, the race you describe applies to
>local changes only. Yet you add the check for all replica ids. 
>This further obliterates this piece of code.
Ok, fixed.
>
>
>Second, the core of the issue is a "hole" in vclock protection
>enforced by latch_lock/latch_unlock. Basically the assumption that
>latch_lock/latch_unlock has is that while a latch is locked, no
>> source can apply a transaction under this replica id. This is
>violated by the local WAL.
>
>We used to skip all changes by local vclock id before in applier.
>
>Later it was changed to be able to get-your-own logs on recovery,
>> e.g. if some replica has them, and the local node lost a piece of
>wal.
>
>It will take me a while to find this commit and ticket, but this
>is the commit and ticket which introduced the regression.
>
>The proper fix is to only apply local changes received from
>remotes in orphan mode, and begin skipping them when entering
>read-write mode.
Thanks for the clarification.

>
>
>> Closes #4739
>> ---
>>  src/box/applier.cc     | 14 ++++++++++++--
>>  src/box/replication.cc |  1 -
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>> 
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index ae3d281a5..acb26b7e2 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>>  	struct latch *latch = (replica ? &replica->order_latch :
>>  			       &replicaset.applier.order_latch);
>>  	latch_lock(latch);
>> -	if (vclock_get(&replicaset.applier.vclock,
>> -		       first_row->replica_id) >= first_row->lsn) {
>> +	/*
>> +	 * We cannot tell which vclock is greater. There is no
>> +	 * proper place to initialize applier vclock, since it
>> +	 * may get stale right away if we write something to WAL
>> +	 * and it gets replicated and then arrives back from the
>> +	 * replica. So check against both vclocks. Replicaset
>> +	 * vclock will guard us from corner cases like the one
>> +	 * above.
>> +	 */
>> +	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
>> +		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
>> +	    first_row->lsn) {
>>  		latch_unlock(latch);
>>  		return 0;
>>  	}
>> diff --git a/src/box/replication.cc b/src/box/replication.cc
>> index e7bfa22ab..7b04573a4 100644
>> --- a/src/box/replication.cc
>> +++ b/src/box/replication.cc
>> @@ -93,7 +93,6 @@ replication_init(void)
>>  	latch_create(&replicaset.applier.order_latch);
>> 
>>  	vclock_create(&replicaset.applier.vclock);
>> -	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>>  	rlist_create(&replicaset.applier.on_rollback);
>>  	rlist_create(&replicaset.applier.on_commit);
>> 
>> -- 
>> 2.20.1 (Apple Git-117)
>
>-- 
>Konstantin Osipov, Moscow, Russia


-- 
Sergey Petrenko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn
  2020-02-13 22:05     ` Sergey Petrenko
@ 2020-02-14  7:26       ` Konstantin Osipov
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Osipov @ 2020-02-14  7:26 UTC (permalink / raw)
  To: Sergey Petrenko; +Cc: tarantool-patches, v.shpilevoy

* Sergey Petrenko <sergepetrenko@tarantool.org> [20/02/14 09:46]:
> 
> 
> 
> >Thursday, February 13, 2020, 9:48 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:
> >
> >* sergepetrenko < sergepetrenko@tarantool.org > [20/02/13 09:34]:
> >> There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
> >> fire in release builds, of course. So we'd better panic on an attempt to
> >> write a record with a duplicate or otherwise broken lsn.
> >> 
> >
> >It should be a warning in production. You can't crash production
> >deploy because of a server bug, and lsn inconsistency is always a
> >server bug.
> Well, this inconsistency yields a WAL that cannot be recovered from.
> But I don't have a strong opinion here. I changed it to a warning in v2.

It can, with force_recovery. Besides, you can make a snapshot.


-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-02-14  7:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12 23:50 [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance sergepetrenko
2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier sergepetrenko
2020-02-13  6:47   ` Konstantin Osipov
2020-02-13  6:58     ` Konstantin Osipov
2020-02-13 22:03       ` Sergey Petrenko
2020-02-13 22:10     ` Sergey Petrenko
2020-02-12 23:51 ` [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn sergepetrenko
2020-02-13  6:48   ` Konstantin Osipov
2020-02-13 22:05     ` Sergey Petrenko
2020-02-14  7:26       ` Konstantin Osipov
2020-02-13  6:48 ` [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance Konstantin Osipov
