* [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance @ 2020-02-18 17:37 Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko ` (4 more replies) 0 siblings, 5 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches https://github.com/tarantool/tarantool/issues/4739 https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert-v3 Changes in v3: - fixes as per Vlad's review - instead of skipping rows on the replica side, do it on the master side, by patching recovery to silently follow rows coming from a certain instance. Changes in v2: - fixes as per Kostja's review Serge Petrenko (4): box: expose box_is_orphan method recovery: allow to ignore rows coming from a certain instance replication: do not relay rows coming from a remote instance back to it wal: warn when trying to write a record with a broken lsn src/box/applier.cc | 2 +- src/box/box.cc | 15 +++++++++++---- src/box/box.h | 3 +++ src/box/iproto_constants.h | 1 + src/box/recovery.cc | 12 +++++++++++- src/box/recovery.h | 7 ++++++- src/box/relay.cc | 14 +++++++++++--- src/box/relay.h | 3 ++- src/box/wal.c | 17 ++++++++++++++--- src/box/xrow.c | 18 +++++++++++++++--- src/box/xrow.h | 26 ++++++++++++++++---------- 11 files changed, 91 insertions(+), 27 deletions(-) -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko ` (3 subsequent siblings) 4 siblings, 0 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches is_orphan status check is needed by applier in order to tell relay whether to send the instance's own rows back or not. Prerequisite #4739 --- src/box/box.cc | 6 ++++++ src/box/box.h | 3 +++ 2 files changed, 9 insertions(+) diff --git a/src/box/box.cc b/src/box/box.cc index 1b2b27d61..0290578b2 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -247,6 +247,12 @@ box_is_ro(void) return is_ro || is_orphan; } +bool +box_is_orphan(void) +{ + return is_orphan; +} + int box_wait_ro(bool ro, double timeout) { diff --git a/src/box/box.h b/src/box/box.h index a212e6510..f37a945eb 100644 --- a/src/box/box.h +++ b/src/box/box.h @@ -105,6 +105,9 @@ box_set_ro(); bool box_is_ro(void); +bool +box_is_orphan(void); + /** * Wait until the instance switches to a desired mode. * \param ro wait read-only if set or read-write if unset -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 19:03 ` Konstantin Osipov 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko ` (2 subsequent siblings) 4 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches Prerequisite #4739 --- src/box/box.cc | 2 +- src/box/recovery.cc | 12 +++++++++++- src/box/recovery.h | 7 ++++++- src/box/relay.cc | 4 ++-- 4 files changed, 20 insertions(+), 5 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 0290578b2..a4d823df0 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2171,7 +2171,7 @@ local_recovery(const struct tt_uuid *instance_uuid, struct recovery *recovery; recovery = recovery_new(cfg_gets("wal_dir"), cfg_geti("force_recovery"), - checkpoint_vclock); + checkpoint_vclock, 0); /* * Make sure we report the actual recovery position diff --git a/src/box/recovery.cc b/src/box/recovery.cc index 64aa467b1..e6d6b5744 100644 --- a/src/box/recovery.cc +++ b/src/box/recovery.cc @@ -81,7 +81,7 @@ */ struct recovery * recovery_new(const char *wal_dirname, bool force_recovery, - const struct vclock *vclock) + const struct vclock *vclock, unsigned int id_ignore_map) { struct recovery *r = (struct recovery *) calloc(1, sizeof(*r)); @@ -101,6 +101,8 @@ recovery_new(const char *wal_dirname, bool force_recovery, vclock_copy(&r->vclock, vclock); + r->id_ignore_map = id_ignore_map; + /** * Avoid scanning WAL dir before we recovered * the snapshot and know instance UUID - this will @@ -275,6 +277,14 @@ recover_xlog(struct recovery *r, struct xstream *stream, * failed row anyway. */ vclock_follow_xrow(&r->vclock, &row); + /* + * Do not try to apply rows coming from an + * ignored instance, but still follow their lsns + * to make sure recovery vclock stays the same as + * the one emerging from local recovery. + */ + if (1 << row.replica_id & r->id_ignore_map) + continue; if (xstream_write(stream, &row) == 0) { ++row_count; if (row_count % 100000 == 0) diff --git a/src/box/recovery.h b/src/box/recovery.h index 6e68abc0b..2e10b7064 100644 --- a/src/box/recovery.h +++ b/src/box/recovery.h @@ -50,6 +50,11 @@ struct recovery { /** The WAL cursor we're currently reading/writing from/to. */ struct xlog_cursor cursor; struct xdir wal_dir; + /** + * A map of replica ids whose changes will be silently + * ignored on recovery. + */ + unsigned int id_ignore_map; /** * This fiber is used in local hot standby mode. 
* It looks for changes in the wal_dir and applies @@ -62,7 +67,7 @@ struct recovery { struct recovery * recovery_new(const char *wal_dirname, bool force_recovery, - const struct vclock *vclock); + const struct vclock *vclock, unsigned int id_ignore_map); void recovery_delete(struct recovery *r); diff --git a/src/box/relay.cc b/src/box/relay.cc index b89632273..741a09201 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -355,7 +355,7 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock, }); relay->r = recovery_new(cfg_gets("wal_dir"), false, - start_vclock); + start_vclock, 0); vclock_copy(&relay->stop_vclock, stop_vclock); int rc = cord_costart(&relay->cord, "final_join", @@ -701,7 +701,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock); relay->r = recovery_new(cfg_gets("wal_dir"), false, - replica_clock); + replica_clock, 0); vclock_copy(&relay->tx.vclock, replica_clock); relay->version_id = replica_version_id; -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
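A minimal standalone sketch (not part of the patch) of the filtering rule patch 2/4 introduces: one bit per replica id, checked for every recovered row while the recovery vclock still follows the row's LSN. The names mirror the patch; the harness around them is hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Decide whether a row originating from replica_id should be skipped,
 * mirroring the check added to recover_xlog(). The ignore map is a
 * bitmask with one bit per replica id; a skipped row is still followed
 * by the recovery vclock, just never written to the stream. */
static int
row_is_ignored(uint32_t replica_id, unsigned int id_ignore_map)
{
        return ((1U << replica_id) & id_ignore_map) != 0;
}

int
main(void)
{
        /* Relay serving replica id 2: mute the replica's own rows. */
        unsigned int ignore_map = 1U << 2;
        for (uint32_t id = 1; id <= 3; id++)
                printf("row from replica %u: %s\n", (unsigned) id,
                       row_is_ignored(id, ignore_map) ?
                       "skip (but follow vclock)" : "relay");
        return 0;
}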
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko @ 2020-02-18 19:03 ` Konstantin Osipov 2020-02-19 8:43 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-18 19:03 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: > Prerequisite #4739 This is a strange way to mute rows from self. Why not set the vclock component to infinity as I suggested multiple times? Why not respond to me with an objection if my suggestion cannot be done? > -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 19:03 ` Konstantin Osipov @ 2020-02-19 8:43 ` Serge Petrenko 2020-02-19 8:52 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 8:43 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 18 Feb 2020, at 22:03, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: >> Prerequisite #4739 > > This is a strange way to mute rows from self. Why not set the vclock > component to infinity as I suggested multiple times? Why not > respond to me with an objection if my suggestion cannot be done? I responded with a patch, so now we can discuss both your and my suggestions. If I understood you correctly, you suggested setting the replica's own lsn to infinity (on the master side), so that recovery on the master's side would skip the replica's rows. I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. This does allow you to skip the replica's rows, but it breaks the vclock signature, which will overflow immediately. vclock signatures are used to order gc consumers, and the gc consumer corresponding to a replica gets its vclock from relay recovery. Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where do you get this value from? You cannot do gc message vclock[replica_id] = replica ack vclock[replica_id], because the replica may have some rows you still don’t have. Then the replica ack vclock signature may get too big and you will delete other logs containing your changes. You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by the replica for too long. This is why I decided to implement the skipping mechanism from this patch. It allows tracking the exact vclock of the last recovered row, and it allows skipping the replica's rows, just like we wanted. -- Serge Petrenko sergepetrenko@tarantool.org > >> > > -- > Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
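A minimal sketch of the overflow Serge describes, under the assumption that a vclock signature is the plain sum of per-replica LSN components, as in Tarantool's vclock_sum(); the struct and harness here are hypothetical simplifications:

#include <stdint.h>
#include <stdio.h>

enum { VCLOCK_MAX_SKETCH = 32 };

struct vclock_sketch {
        int64_t lsn[VCLOCK_MAX_SKETCH];
};

/* Signature = sum of all components, the way vclock_sum() computes it
 * and the way gc consumers are ordered. */
static int64_t
vclock_sum_sketch(const struct vclock_sketch *v)
{
        int64_t sum = 0;
        for (int i = 0; i < VCLOCK_MAX_SKETCH; i++)
                sum += v->lsn[i];
        return sum;
}

int
main(void)
{
        struct vclock_sketch v = {{0}};
        v.lsn[1] = 100;       /* real progress on replica 1 */
        v.lsn[2] = INT64_MAX; /* "infinity" for the subscriber's id */
        /* Adding anything to INT64_MAX overflows (formally undefined
         * behavior for signed integers in C), so a gc consumer ordered
         * by this signature is compared nonsensically. */
        printf("signature: %lld\n", (long long) vclock_sum_sketch(&v));
        return 0;
}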
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:43 ` Serge Petrenko @ 2020-02-19 8:52 ` Konstantin Osipov 2020-02-19 8:57 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 8:52 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: > > This is a strange way to mute rows from self. Why not set the vclock > > component to infinity as I suggested multiple times? Why not > > respond to me with an objection if my suggestion cannot be done? > > I responded with a patch, so now we can discuss both your and my suggestions. No, we can't. You responded with a patch for your suggestion, but not for mine. So we compare apples and oranges here. Just like with Ilya's patch, which only halfway captured my suggestion in his "alternatives". In the end I'm not even sure you got it right. > If I understood you correctly, you suggested setting the replica's own lsn to infinity > (on the master side), so that recovery on the master's side would skip the replica's rows. > > I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. > This does allow you to skip the replica's rows, but it breaks the vclock signature, which will > overflow immediately. vclock signatures are used to order gc consumers, and the gc > consumer corresponding to a replica gets its vclock from relay recovery. > Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where > do you get this value from? You cannot do gc message vclock[replica_id] = > replica ack vclock[replica_id], because the replica may have some rows you still > don’t have. Then the replica ack vclock signature may get too big and you will delete > other logs containing your changes. > You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by > the replica for too long. Please send a patch and we'll be able to discuss where it went wrong for you. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:52 ` Konstantin Osipov @ 2020-02-19 8:57 ` Serge Petrenko 2020-02-19 9:02 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 8:57 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: >>> This is a strange way to mute rows from self. Why not set the vclock >>> component to infinity as I suggested multiple times? Why not >>> respond to me with an objection if my suggestion cannot be done? >> >> I responded with a patch, so now we can discuss both your and my suggestions. > > No, we can't. You responded with a patch for your suggestion, but > not for mine. So we compare apples and oranges here. Just like > with Ilya's patch, which only halfway captured my suggestion in > his "alternatives". > > In the end I'm not even sure you got it right. > >> If I understood you correctly, you suggested setting the replica's own lsn to infinity >> (on the master side), so that recovery on the master's side would skip the replica's rows. >> Does it look like I got it right from my answer? >> I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. >> This does allow you to skip the replica's rows, but it breaks the vclock signature, which will >> overflow immediately. vclock signatures are used to order gc consumers, and the gc >> consumer corresponding to a replica gets its vclock from relay recovery. >> Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where >> do you get this value from? You cannot do gc message vclock[replica_id] = >> replica ack vclock[replica_id], because the replica may have some rows you still >> don’t have. Then the replica ack vclock signature may get too big and you will delete >> other logs containing your changes. >> You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by >> the replica for too long. > Please send a patch and we'll be able to discuss where it went wrong for you. Ok > > > -- > Konstantin Osipov, Moscow, Russia -- Serge Petrenko sergepetrenko@tarantool.org ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:57 ` Serge Petrenko @ 2020-02-19 9:02 ` Konstantin Osipov 2020-02-19 9:35 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 9:02 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:01]: > > > > > 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: > >>> This is a strange way to mute rows from self. Why not set the vclock > >>> component to infinity as I suggested multiple times? Why not > >>> respond to me with an objection if my suggestion cannot be done? > >> > >> I responded with a patch, so now we can discuss both your and my suggestions. > > > > No, we can't. You responded with a patch for your suggestion, but > > not for mine. So we compare apples and oranges here. Just like > > with Ilya's patch, which only halfway captured my suggestion in > > his "alternatives". > > > > In the end I'm not even sure you got it right. > > > > > > >> If I understood you correctly, you suggested setting the replica's own lsn to infinity > >> (on the master side), so that recovery on the master's side would skip the replica's rows. > >> > > Does it look like I got it right from my answer? I don't understand. For example, what do you mean by setting it on the master? You were supposed to set it on the replica when sending a SUBSCRIBE request to the master. Honestly, your English isn't perfect and we don't share the same vocabulary/terms, so it's hard to see what went wrong without a patch. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 9:02 ` Konstantin Osipov @ 2020-02-19 9:35 ` Serge Petrenko 2020-02-19 10:11 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 9:35 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 2236 bytes --] > 19 Feb 2020, at 12:02, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:01]: >> >> >> >>> 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: >>> >>> * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: >>>>> This is a strange way to mute rows from self. Why not set the vclock >>>>> component to infinity as I suggested multiple times? Why not >>>>> respond to me with an objection if my suggestion cannot be done? >>>> >>>> I responded with a patch, so now we can discuss both your and my suggestions. >>> >>> No, we can't. You responded with a patch for your suggestion, but >>> not for mine. So we compare apples and oranges here. Just like >>> with Ilya's patch, which only halfway captured my suggestion in >>> his "alternatives". >>> >>> In the end I'm not even sure you got it right. >> >> >> >>> >>>> If I understood you correctly, you suggested setting the replica's own lsn to infinity >>>> (on the master side), so that recovery on the master's side would skip the replica's rows. >>>> >> >> Does it look like I got it right from my answer? > > I don't understand. For example, what do you mean by setting it on > the master? You were supposed to set it on the replica, when > sending a SUBSCRIBE request to the master. I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized with this vclock, making recovery skip rows coming from the replica itself. I’m starting to think that you want me to only manipulate replicaset.applier.vclock. Sorry if I missed it at the very beginning of our discussion. > Honestly, your English isn't perfect and we don't share the same > vocabulary/terms, so it's hard to see what went wrong without a > patch. Ok, I’ll send a patch soon. > > > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com -- Serge Petrenko sergepetrenko@tarantool.org [-- Attachment #2: Type: text/html, Size: 13067 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 9:35 ` Serge Petrenko @ 2020-02-19 10:11 ` Konstantin Osipov 2020-02-19 10:31 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 10:11 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: > I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. > This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized > with this vclock, making recovery skip rows coming from the replica itself. Why is this logic done on the master? Why can't you send the right vclock from the replica? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 10:11 ` Konstantin Osipov @ 2020-02-19 10:31 ` Serge Petrenko 2020-02-19 11:27 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 10:31 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 19 Feb 2020, at 13:11, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: >> I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. >> This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized >> with this vclock, making recovery skip rows coming from the replica itself. > > Why is this logic done on the master? Why can't you send the right > vclock from the replica? > Because this strange vclock will be used to initialize a gc consumer, and this gc consumer will be ordered by that vclock signature, which will overflow. Anyway, let me send the patch first. It is the same problem in both cases, whether the logic you propose works on the master or on the replica side. -- Serge Petrenko sergepetrenko@tarantool.org > > -- > Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 10:31 ` Serge Petrenko @ 2020-02-19 11:27 ` Konstantin Osipov 0 siblings, 0 replies; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 11:27 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 13:32]: > > > > 19 Feb 2020, at 13:11, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: > >> I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. > >> This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized > >> with this vclock, making recovery skip rows coming from the replica itself. > > > > Why is this logic done on the master? Why can't you send the right > > vclock from the replica? > > > > > Because this strange vclock will be used to initialize a gc consumer, > and this gc consumer will be ordered by that vclock signature, > which will overflow. I think there was a patch from Georgy that stops ordering GC consumers by their signature, and uses vclock_compare instead? Can it be pushed first? In any case, your "special flag" is extremely cumbersome. Better introduce a subscription mask to IPROTO_SUBSCRIBE; we need it for https://github.com/tarantool/tarantool/issues/3294 anyway. > Anyway, let me send the patch first. It is the same problem > in both cases, whether the logic you propose works on the master or > on the replica side. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 19:07 ` Konstantin Osipov 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn Serge Petrenko 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko 4 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches We have a mechanism for restoring rows originating from an instance that suffered a sudden power loss: remote masters resend the instance's rows received before a certain point in time, defined by the remote master's vclock at the moment of subscribe. However, this is useful only during initial replication configuration, when an instance has just recovered, so that it can receive what it has relayed but hasn't synced to disk. In other cases, when an instance is operating normally and master-master replication is configured, the mechanism described above may lead to an instance re-applying its own rows, coming from a master it has just subscribed to. To fix the problem, do not relay rows coming from a remote instance if the instance has already recovered. Closes #4739 --- src/box/applier.cc | 2 +- src/box/box.cc | 7 ++++--- src/box/iproto_constants.h | 1 + src/box/relay.cc | 12 ++++++++++-- src/box/relay.h | 3 ++- src/box/xrow.c | 18 +++++++++++++++--- src/box/xrow.h | 26 ++++++++++++++++---------- 7 files changed, 49 insertions(+), 20 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index ae3d281a5..542144d14 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -867,7 +867,7 @@ applier_subscribe(struct applier *applier) vclock_create(&vclock); vclock_copy(&vclock, &replicaset.vclock); xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID, - &vclock, replication_anon); + &vclock, replication_anon, box_is_orphan()); coio_write_xrow(coio, &row); /* Read SUBSCRIBE response */ diff --git a/src/box/box.cc b/src/box/box.cc index a4d823df0..d485845c7 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1787,8 +1787,9 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) uint32_t replica_version_id; vclock_create(&replica_clock); bool anon; - xrow_decode_subscribe_xc(header, NULL, &replica_uuid, - &replica_clock, &replica_version_id, &anon); + bool is_orphan = false; + xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock, + &replica_version_id, &anon, &is_orphan); /* Forbid connection to itself */ if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID)) @@ -1871,7 +1872,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) * indefinitely). */ relay_subscribe(replica, io->fd, header->sync, &replica_clock, - replica_version_id); + replica_version_id, is_orphan); } void diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index b66c05c06..9616cbcf0 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -125,6 +125,7 @@ enum iproto_key { IPROTO_STMT_ID = 0x43, /* Leave a gap between SQL keys and additional request keys */ IPROTO_REPLICA_ANON = 0x50, + IPROTO_REPLICA_IS_ORPHAN = 0x51, IPROTO_KEY_MAX }; diff --git a/src/box/relay.cc b/src/box/relay.cc index 741a09201..384773b05 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -676,7 +676,8 @@ relay_subscribe_f(va_list ap) /** Replication acceptor fiber handler. */ void relay_subscribe(struct replica *replica, int fd, uint64_t sync, - struct vclock *replica_clock, uint32_t replica_version_id) + struct vclock *replica_clock, uint32_t replica_version_id, + bool replica_is_orphan) { assert(replica->anon || replica->id != REPLICA_ID_NIL); struct relay *relay = replica->relay; @@ -700,8 +701,15 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, }); vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock); + /* + * If the remote instance is already synced, don't relay + * the instance's rows back to it. In order to do so, + * make recovery silently skip rows originating from the + * instance. + */ relay->r = recovery_new(cfg_gets("wal_dir"), false, - replica_clock, 0); + replica_clock, replica_is_orphan ? + 0 : 1 << replica->id); vclock_copy(&relay->tx.vclock, replica_clock); relay->version_id = replica_version_id; diff --git a/src/box/relay.h b/src/box/relay.h index e1782d78f..9ba878faf 100644 --- a/src/box/relay.h +++ b/src/box/relay.h @@ -124,6 +124,7 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock, */ void relay_subscribe(struct replica *replica, int fd, uint64_t sync, - struct vclock *replica_vclock, uint32_t replica_version_id); + struct vclock *replica_vclock, uint32_t replica_version_id, + bool replica_is_orphan); #endif /* TARANTOOL_REPLICATION_RELAY_H_INCLUDED */ diff --git a/src/box/xrow.c b/src/box/xrow.c index 968c3a202..6b8155859 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -1194,7 +1194,7 @@ int xrow_encode_subscribe(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon) + const struct vclock *vclock, bool anon, bool is_orphan) { memset(row, 0, sizeof(*row)); size_t size = XROW_BODY_LEN_MAX + mp_sizeof_vclock(vclock); @@ -1204,7 +1204,7 @@ xrow_encode_subscribe(struct xrow_header *row, return -1; } char *data = buf; - data = mp_encode_map(data, 5); + data = mp_encode_map(data, 6); data = mp_encode_uint(data, IPROTO_CLUSTER_UUID); data = xrow_encode_uuid(data, replicaset_uuid); data = mp_encode_uint(data, IPROTO_INSTANCE_UUID); @@ -1215,6 +1215,8 @@ xrow_encode_subscribe(struct xrow_header *row, data = mp_encode_uint(data, tarantool_version_id()); data = mp_encode_uint(data, IPROTO_REPLICA_ANON); data = mp_encode_bool(data, anon); + data = mp_encode_uint(data, IPROTO_REPLICA_IS_ORPHAN); + data = mp_encode_bool(data, is_orphan); assert(data <= buf + size); row->body[0].iov_base = buf; row->body[0].iov_len = (data - buf); @@ -1226,7 +1228,7 @@ xrow_encode_subscribe(struct xrow_header *row, int xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *version_id, bool *anon) + uint32_t *version_id, bool *anon, bool *is_orphan) { if (row->bodycnt == 0) { diag_set(ClientError, ER_INVALID_MSGPACK, "request body"); @@ -1301,6 +1303,16 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, } *anon = mp_decode_bool(&d); break; + case IPROTO_REPLICA_IS_ORPHAN: + if (is_orphan == NULL) + goto skip; + if (mp_typeof(*d) != MP_BOOL) { + xrow_on_decode_err(data, end, ER_INVALID_MSGPACK, + "invalid REPLICA_IS_ORPHAN flag"); + return -1; + } + *is_orphan = mp_decode_bool(&d); + break; default: skip: mp_next(&d); /* value */ } diff --git a/src/box/xrow.h b/src/box/xrow.h index 0973c497d..3f188870a 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -322,6 +322,7 @@ xrow_encode_register(struct xrow_header *row, * @param instance_uuid Instance uuid. * @param vclock Replication clock. * @param anon Whether it is an anonymous subscribe request or not. + * @param is_orphan Whether the instance is in orphan mode. * * @retval 0 Success. * @retval -1 Memory error. */ int xrow_encode_subscribe(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon); + const struct vclock *vclock, bool anon, bool is_orphan); /** * Decode SUBSCRIBE command. @@ -347,7 +348,7 @@ xrow_encode_subscribe(struct xrow_header *row, int xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *version_id, bool *anon); + uint32_t *version_id, bool *anon, bool *is_orphan); /** * Encode JOIN command. @@ -371,7 +372,8 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid); static inline int xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid) { - return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL, NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL, NULL, + NULL); } /** @@ -386,7 +388,8 @@ static inline int xrow_decode_register(struct xrow_header *row, struct tt_uuid *instance_uuid, struct vclock *vclock) { - return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, NULL, NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, NULL, + NULL, NULL); } /** @@ -411,7 +414,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock); static inline int xrow_decode_vclock(struct xrow_header *row, struct vclock *vclock) { - return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, NULL); + return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, NULL, NULL); } /** @@ -442,7 +445,8 @@ xrow_decode_subscribe_response(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct vclock *vclock) { - return xrow_decode_subscribe(row, replicaset_uuid, NULL, vclock, NULL, NULL); + return xrow_decode_subscribe(row, replicaset_uuid, NULL, vclock, NULL, + NULL, NULL); } /** @@ -817,10 +821,10 @@ static inline void xrow_encode_subscribe_xc(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon) + const struct vclock *vclock, bool anon, bool is_orphan) { if (xrow_encode_subscribe(row, replicaset_uuid, instance_uuid, - vclock, anon) != 0) + vclock, anon, is_orphan) != 0) diag_raise(); } @@ -829,10 +833,12 @@ static inline void xrow_decode_subscribe_xc(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *replica_version_id, bool *anon) + uint32_t *replica_version_id, bool *anon, + bool *is_orphan) { if (xrow_decode_subscribe(row, replicaset_uuid, instance_uuid, - vclock, replica_version_id, anon) != 0) + vclock, replica_version_id, anon, + is_orphan) != 0) diag_raise(); } -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
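A minimal sketch (not part of the patch) of the decision this patch adds to relay_subscribe(): the replica reports whether it is still an orphan, and the master builds the recovery ignore map from that flag. Names follow the patch; the standalone harness is hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An orphan replica may still need its own rows back: it could have
 * relayed them before a power loss without syncing them to disk. A
 * synced replica gets its own id masked out of the relayed stream. */
static unsigned int
subscribe_ignore_map(uint32_t replica_id, bool replica_is_orphan)
{
        return replica_is_orphan ? 0 : 1U << replica_id;
}

int
main(void)
{
        printf("orphan replica 2 -> ignore map 0x%x\n",
               subscribe_ignore_map(2, true));
        printf("synced replica 2 -> ignore map 0x%x\n",
               subscribe_ignore_map(2, false));
        return 0;
}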
* Re: [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko @ 2020-02-18 19:07 ` Konstantin Osipov 0 siblings, 0 replies; 21+ messages in thread From: Konstantin Osipov @ 2020-02-18 19:07 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: > vclock_create(&replica_clock); > bool anon; > - xrow_decode_subscribe_xc(header, NULL, &replica_uuid, > - &replica_clock, &replica_version_id, &anon); > + bool is_orphan = false; > + xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock, > + &replica_version_id, &anon, &is_orphan); Why did you make it so complicated? Could you please explain what went wrong with my suggestion? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko ` (2 preceding siblings ...) 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko 4 siblings, 0 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches There is an assertion in vclock_follow, `lsn > prev_lsn`, which doesn't fire in release builds, of course. Let's at least warn the user on an attempt to write a record with a duplicate or otherwise broken lsn, and not follow such an lsn. Follow-up #4739 --- src/box/wal.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/src/box/wal.c b/src/box/wal.c index 0ae66ff32..a87aedf1d 100644 --- a/src/box/wal.c +++ b/src/box/wal.c @@ -951,9 +951,20 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base, (*row)->tsn = tsn; (*row)->is_commit = row == end - 1; } else { - vclock_follow(vclock_diff, (*row)->replica_id, - (*row)->lsn - vclock_get(base, - (*row)->replica_id)); + int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id); + if (diff <= vclock_get(vclock_diff, + (*row)->replica_id)) { + say_crit("Attempt to write a broken LSN to WAL:" + " replica id: %u, confirmed lsn: %lld," + " new lsn: %lld", (*row)->replica_id, + (long long) (vclock_get(base, (*row)->replica_id) + + vclock_get(vclock_diff, + (*row)->replica_id)), + (long long) (*row)->lsn); + assert(0); + } else { + vclock_follow(vclock_diff, (*row)->replica_id, diff); + } } } } -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
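A minimal sketch (not part of the patch) of the check this patch adds to wal_assign_lsn(): a replicated row's LSN must advance past everything already confirmed for its replica id, that is, the base vclock component plus the in-flight diff. The function and harness here are hypothetical simplifications.

#include <stdint.h>
#include <stdio.h>

/* A row is broken if it does not move the vclock component forward:
 * row_lsn - base_lsn must exceed the diff accumulated so far. */
static int
lsn_is_broken(int64_t row_lsn, int64_t base_lsn, int64_t diff_lsn)
{
        return row_lsn - base_lsn <= diff_lsn;
}

int
main(void)
{
        int64_t base = 100, diff = 2; /* confirmed lsn = 102 */
        printf("lsn 103: %s\n", lsn_is_broken(103, base, diff) ?
               "broken" : "ok");
        printf("lsn 102: %s\n", lsn_is_broken(102, base, diff) ?
               "broken (duplicate)" : "ok");
        return 0;
}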
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko ` (3 preceding siblings ...) 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn Serge Petrenko @ 2020-02-22 20:21 ` Georgy Kirichenko 2020-02-22 20:49 ` Konstantin Osipov 4 siblings, 1 reply; 21+ messages in thread From: Georgy Kirichenko @ 2020-02-22 20:21 UTC (permalink / raw) To: tarantool-patches, Serge Petrenko; +Cc: v.shpilevoy [-- Attachment #1: Type: text/plain, Size: 1741 bytes --] On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: Hi! Thanks for the patch. I am pretty sure a replica should be able to apply all replication stream transactions in a proper way without any reliance on the master's correctness, or signal an error if this is impossible. I am suggesting such logic because a replica has the complete information about its own state at the moment. This includes the local vclock, cfg, all existing appliers' state and incoming streams. WBR > https://github.com/tarantool/tarantool/issues/4739 > https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert-v3 > > Changes in v3: > - fixes as per Vlad's review > - instead of skipping rows on the replica side, > do it on the master side, by patching recovery > to silently follow rows coming from a certain > instance. > > Changes in v2: > - fixes as per Kostja's review > > Serge Petrenko (4): > box: expose box_is_orphan method > recovery: allow to ignore rows coming from a certain instance > replication: do not relay rows coming from a remote instance back to > it > wal: warn when trying to write a record with a broken lsn > > src/box/applier.cc | 2 +- > src/box/box.cc | 15 +++++++++++---- > src/box/box.h | 3 +++ > src/box/iproto_constants.h | 1 + > src/box/recovery.cc | 12 +++++++++++- > src/box/recovery.h | 7 ++++++- > src/box/relay.cc | 14 +++++++++++--- > src/box/relay.h | 3 ++- > src/box/wal.c | 17 ++++++++++++++--- > src/box/xrow.c | 18 +++++++++++++++--- > src/box/xrow.h | 26 ++++++++++++++++---------- > 11 files changed, 91 insertions(+), 27 deletions(-) [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko @ 2020-02-22 20:49 ` Konstantin Osipov 2020-02-23 8:16 ` Georgy Kirichenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-22 20:49 UTC (permalink / raw) To: Georgy Kirichenko; +Cc: tarantool-patches, v.shpilevoy * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/22 23:22]: > On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: > Hi! Thanks for the patch. > I am pretty sure a replica should be able to apply all replication stream > transactions in a proper way without any reliance on the master's correctness, or > signal an error if this is impossible. I am suggesting such logic because a > replica has the complete information about its own state at the moment. This includes > the local vclock, cfg, all existing appliers' state and incoming streams. It's an impossible and useless pursuit. Please read up on Byzantine faults. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-22 20:49 ` Konstantin Osipov @ 2020-02-23 8:16 ` Georgy Kirichenko 2020-02-24 10:18 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Georgy Kirichenko @ 2020-02-23 8:16 UTC (permalink / raw) To: Konstantin Osipov, Georgy Kirichenko, tarantool-patches, Serge Petrenko, v.shpilevoy, alexander.turenko Please do not think you are the only person who knows about Byzantine faults. Also, there is little relevance between Byzantine faults and my suggestion to enforce replica-side checking. In any case, filtering on the master side is the worst thing we could do. In that case the master has only one peer and has no chance to make a proper decision if the replica is broken. And we have no chance to know about it (except asserts, which are excluded from release builds, or panic messages). For instance, if a master skipped some rows, there are no traces of the situation we could detect. In the opposite case a replica could connect to as many masters as it needs to filter out all invalid data or hacked masters. At least we could enforce replication stream meta checking. Two major points I would like to mention are: 1. A replica could consistently follow all vclock members and apply all transactions without gaps (I already got rid of them, I hope you remember) 2. A replica could protect itself against concurrent local writes (one was made locally, the second one is returned from the master) On Saturday, February 22, 2020 11:49:30 PM MSK Konstantin Osipov wrote: > * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/22 23:22]: > > On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: > > Hi! Thanks for the patch. > > I am pretty sure a replica should be able to apply all replication stream > > transactions in a proper way without any reliance on the master's correctness, or > > signal an error if this is impossible. I am suggesting such logic > > because a replica has the complete information about its own state at the > > moment. This includes the local vclock, cfg, all existing appliers' state and > > incoming streams. > It's an impossible and useless pursuit. > > Please read up on Byzantine faults. ^ permalink raw reply [flat|nested] 21+ messages in thread
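For point 2 above, a minimal sketch of the replica-side protection being argued for, assuming the applier compares each incoming row's LSN against the vclock component it already knows for the row's origin; the names and harness are hypothetical, not the patch's code:

#include <stdint.h>
#include <stdio.h>

/* Before applying a replicated row, compare its LSN with the highest
 * LSN already known locally for the row's origin (replica id). A row
 * at or below that point is a duplicate, e.g. a local write echoed
 * back by a master. */
static int
applier_should_apply_row(int64_t row_lsn, int64_t known_lsn)
{
        return row_lsn > known_lsn;
}

int
main(void)
{
        int64_t known = 10; /* local vclock component for our own id */
        printf("row lsn 10: %s\n",
               applier_should_apply_row(10, known) ? "apply" : "skip");
        printf("row lsn 11: %s\n",
               applier_should_apply_row(11, known) ? "apply" : "skip");
        return 0;
}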
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-23 8:16 ` Georgy Kirichenko @ 2020-02-24 10:18 ` Konstantin Osipov 2020-02-24 12:31 ` Георгий Кириченко 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-24 10:18 UTC (permalink / raw) To: Georgy Kirichenko; +Cc: tarantool-patches, v.shpilevoy * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/23 12:21]: > Please do not think you are the only person who knows about Byzantine faults. > Also, there is little relevance between Byzantine faults and my suggestion to > enforce replica-side checking. You've been suggesting that filtering on the master is safer. I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is malfunctioning. I merely pointed out that your safety argument has no merit. There are no other practical advantages of filtering on the replica either; there is a disadvantage: more traffic and more filtering work to do inside the tx thread (as opposed to the relay/wal thread if done on the master). It is also against the current responsibilities of IPROTO_SUBSCRIBE: the concept of a subscription is that the replica specifies what it is interested in. Specifically, it specifies the vclock components it's interested in. You suggest making the replica responsible for submitting its vclock, but letting the master decide what to do with it - this splits the decision-making logic between the two, making the whole thing harder to understand. The IPROTO_SUBSCRIBE responsibility layout today is typical for a request-response protocol: the master, being the server, executes the command as specified by the client (the replica), and the replica runs the logic to decide what command to issue. You suggest changing it because of some theoretical concerns you have. > In any case, filtering on the master side is the worst thing we could do. > In that case the master has only one peer and has no chance to make a proper > decision if the replica is broken. And we have no chance to know about it (except > asserts, which are excluded from release builds, or panic messages). For > instance, if a master skipped some rows, there are no traces of the > situation we could detect. The situation is symmetrical. Both peers do not have the whole picture. You can make either of the peers responsible for the decision; then the other peer will need to supply the missing bits. There is no way you can make it safer by changing who makes the decision, but you can certainly make it more messed up by splitting this logic or going against an established layout. If you have a specific example of why things will improve if done otherwise - in the number of packets, or traffic, or some other measurable way - you should point it out. > In the opposite case a replica could connect to as many masters as it needs > to filter out all invalid data or hacked masters. At least we could enforce > replication stream meta checking. I do not think the scope of this issue has ever been protecting against hacked masters. It has never been a goal of the protocol either. > Two major points I would like to mention are: > 1. A replica could consistently follow all vclock members and apply all > transactions without gaps (I already got rid of them, I hope you remember) > 2. A replica could protect itself against concurrent local writes (one was made > locally, the second one is returned from the master) This was added for specific reasons. There is no known reason why the master should send unnecessary data to the replica or why the replica's fast path should get slower. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-24 10:18 ` Konstantin Osipov @ 2020-02-24 12:31 ` Георгий Кириченко 2020-02-26 10:09 ` Sergey Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Георгий Кириченко @ 2020-02-24 12:31 UTC (permalink / raw) To: Konstantin Osipov, Georgy Kirichenko, tarantool-patches, Serge Petrenko, Vladislav Shpilevoy, alexander.turenko [-- Attachment #1: Type: text/plain, Size: 5816 bytes --] Please read messages before answering. I never said that: > You've been suggesting that filtering on the master is safer. I said it is safer to do it on the replica side and that the replica should not rely on master correctness. > I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is >malfunctioning. Excuse me, but this is demagogy: we are talking about what is safer, not about absolute safety. >The situation is symmetrical. Both peers do not have the whole >picture. You can make either of the peers responsible for the >decision; then the other peer will need to supply the missing >bits. No, you are wrong. A master has only one source of information about the stream it should send to a replica, whereas a replica could connect to many masters to fetch proper data (from one or many masters). And we already implemented similar logic - a voting protocol - and you should know about it. Additionally, my approach allows collecting all the corresponding logic (filtering of concurrent streams, vclock following, subscriptions and replication groups which are not implemented yet, registration and whatever else) in one module on the replica side. >I do not think the scope of this issue has ever been protecting >against hacked masters. It has never been a goal of the protocol >either. A hacked master could be a master with an implementation error, and we should be able to detect such an error as soon as possible. But if a replica does not check an incoming stream, there is no way to prevent fatal data loss. >This was added for specific reasons. There is no known reason why the >master should send unnecessary data to the replica or why the replica's fast >path should get slower. I am afraid you did not understand me. I never said that I am against any optimization which could make replication faster. I am completely against any attempts to rely on optimization logic. If a master is allowed to skip unneeded rows, the replica should not rely on that code's correctness. In other words, if some input stream could break the replica, the replica should protect itself against such data. This is not the responsibility of the replica's master. Mon, 24 Feb 2020 at 13:18, Konstantin Osipov <kostja.osipov@gmail.com>: > * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/23 12:21]: > > > Please do not think you are the only person who knows about Byzantine faults. > > Also, there is little relevance between Byzantine faults and my suggestion to > > enforce replica-side checking. > > You've been suggesting that filtering on the master is safer. I > pointed out it's not: there is no way to guarantee > (even in theory) the replica's correctness/safety if the master is > malfunctioning. > > I merely pointed out that your safety argument has no merit. > > There are no other practical advantages of filtering on the replica > either; there is a disadvantage: more traffic and more filtering work to > do > inside the tx thread (as opposed to the relay/wal thread if done on > the master). > > It is also against the current responsibilities of IPROTO_SUBSCRIBE: the > concept of a subscription is that the replica specifies what it is > interested in. Specifically, it specifies the vclock components it's interested in. > You suggest making the replica responsible for > submitting its vclock, but letting the master decide what to do with it - > this splits the decision-making logic between the two, making the > whole thing harder to understand. > > The IPROTO_SUBSCRIBE responsibility layout today is typical for a > request-response protocol: the master, being the server, executes > the command as specified by the client (the replica), and the > replica runs the logic to decide what command to issue. > > You suggest changing it because of some theoretical concerns you > have. > > > In any case, filtering on the master side is the worst thing we could do. > > In that case the master has only one peer and has no chance to make a proper > > decision if the replica is broken. And we have no chance to know about it (except > > asserts, which are excluded from release builds, or panic messages). For > > instance, if a master skipped some rows, there are no traces of the > > situation we could detect. > > The situation is symmetrical. Both peers do not have the whole > picture. You can make either of the peers responsible for the > decision; then the other peer will need to supply the missing > bits. There is no way you can make it safer by changing who makes > the decision, but you can certainly make it more messed up by > splitting this logic or going against an established layout. > > If you have a specific example of why things will improve if done > otherwise - in the number of packets, or traffic, or some other > measurable way - you should point it out. > > > In the opposite case a replica could connect to as many masters as it needs > > to filter out all invalid data or hacked masters. At least we could enforce > > replication stream meta checking. > > I do not think the scope of this issue has ever been protecting > against hacked masters. It has never been a goal of the protocol > either. > > > Two major points I would like to mention are: > > 1. A replica could consistently follow all vclock members and apply all > > transactions without gaps (I already got rid of them, I hope you remember) > > 2. A replica could protect itself against concurrent local writes (one was made > > locally, the second one is returned from the master) > > This was added for specific reasons. There is no known reason why the > master should send unnecessary data to the replica or why the replica's fast > path should get slower. > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com > [-- Attachment #2: Type: text/html, Size: 6676 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-24 12:31 ` Георгий Кириченко @ 2020-02-26 10:09 ` Sergey Petrenko 0 siblings, 0 replies; 21+ messages in thread From: Sergey Petrenko @ 2020-02-26 10:09 UTC (permalink / raw) To: Георгий Кириченко Cc: tarantool-patches, Vladislav Shpilevoy [-- Attachment #1: Type: text/plain, Size: 6207 bytes --] >Monday, 24 February 2020, 15:31 +03:00 from Георгий Кириченко <kirichenkoga@gmail.com>: > >Please read messages before answering. I never said that: >> You've been suggesting that filtering on the master is safer. >I said it is safer to do it on the replica side and that the replica should not rely on master correctness. > >> I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is >malfunctioning. >Excuse me, but this is demagogy: we are talking about what is safer, not about absolute safety. >>The situation is symmetrical. Both peers do not have the whole >picture. You can make either of the peers responsible for the >>decision; then the other peer will need to supply the missing >>bits. >No, you are wrong. A master has only one source of information about the stream it should send to a replica, whereas > a replica could connect to many masters to fetch proper data (from one or many masters). And we already implemented similar logic - >a voting protocol - and you should know about it. Additionally, my approach allows collecting all the corresponding logic (filtering > of concurrent streams, vclock following, subscriptions and replication groups which are not implemented yet, registration and whatever else) in one module on the replica side. >>I do not think the scope of this issue has ever been protecting >against hacked masters. It has never been a goal of the protocol >>either. >A hacked master could be a master with an implementation error, and we should be able to detect such an error as soon as possible. But if a replica does not >check an incoming stream, there is no way to prevent fatal data loss. >>This was added for specific reasons. There is no known reason why the >master should send unnecessary data to the replica or why the replica's fast >>path should get slower. >I am afraid you did not understand me. I never said that I am against any optimization which could make replication faster. >I am completely against any attempts to rely on optimization logic. If a master is allowed to skip unneeded rows, the replica should not rely on that code's correctness. > In other words, if some input stream could break the replica, the replica should protect itself against such data. This is not the responsibility of the replica's master. Hi! I’ve just sent v4, which is the closest to what Kostja wants to see, as far as I understand, at least. Please check it out and tell me what you think. Sorry, but I didn’t fully understand your proposal. It looks like you stand for v2 of the patch, where all the filtering is done on the replica side. Is that right? -- Serge Petrenko >Mon, 24 Feb 2020 at 13:18, Konstantin Osipov < kostja.osipov@gmail.com >: >>* Georgy Kirichenko < kirichenkoga@gmail.com > [20/02/23 12:21]: >> >>> Please do not think you are the only person who knows about Byzantine faults. >>> Also, there is little relevance between Byzantine faults and my suggestion to >>> enforce replica-side checking. >> >>You've been suggesting that filtering on the master is safer. I >>pointed out it's not: there is no way to guarantee >>(even in theory) the replica's correctness/safety if the master is >>malfunctioning. >> >>I merely pointed out that your safety argument has no merit. >> >>There are no other practical advantages of filtering on the replica >>either; there is a disadvantage: more traffic and more filtering work to do >>inside the tx thread (as opposed to the relay/wal thread if done on >>the master). >> >>It is also against the current responsibilities of IPROTO_SUBSCRIBE: the >>concept of a subscription is that the replica specifies what it is >>interested in. Specifically, it specifies the vclock components it's interested in. >>You suggest making the replica responsible for >>submitting its vclock, but letting the master decide what to do with it - >>this splits the decision-making logic between the two, making the >>whole thing harder to understand. >> >>The IPROTO_SUBSCRIBE responsibility layout today is typical for a >>request-response protocol: the master, being the server, executes >>the command as specified by the client (the replica), and the >>replica runs the logic to decide what command to issue. >> >>You suggest changing it because of some theoretical concerns you >>have. >> >>> In any case, filtering on the master side is the worst thing we could do. >>> In that case the master has only one peer and has no chance to make a proper >>> decision if the replica is broken. And we have no chance to know about it (except >>> asserts, which are excluded from release builds, or panic messages). For >>> instance, if a master skipped some rows, there are no traces of the >>> situation we could detect. >> >>The situation is symmetrical. Both peers do not have the whole >>picture. You can make either of the peers responsible for the >>decision; then the other peer will need to supply the missing >>bits. There is no way you can make it safer by changing who makes >>the decision, but you can certainly make it more messed up by >>splitting this logic or going against an established layout. >> >>If you have a specific example of why things will improve if done >>otherwise - in the number of packets, or traffic, or some other >>measurable way - you should point it out. >> >>> In the opposite case a replica could connect to as many masters as it needs >>> to filter out all invalid data or hacked masters. At least we could enforce >>> replication stream meta checking. >> >>I do not think the scope of this issue has ever been protecting >>against hacked masters. It has never been a goal of the protocol >>either. >> >>> Two major points I would like to mention are: >>> 1. A replica could consistently follow all vclock members and apply all >>> transactions without gaps (I already got rid of them, I hope you remember) >>> 2. A replica could protect itself against concurrent local writes (one was made >>> locally, the second one is returned from the master) >> >>This was added for specific reasons. There is no known reason why the >>master should send unnecessary data to the replica or why the replica's fast >>path should get slower. >> >>-- >>Konstantin Osipov, Moscow, Russia >>https://scylladb.com [-- Attachment #2: Type: text/html, Size: 8154 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread