* [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance @ 2020-02-18 17:37 Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko ` (4 more replies) 0 siblings, 5 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches https://github.com/tarantool/tarantool/issues/4739 https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert-v3 Changes in v3: - fixes as per Vlad's review - instead of skipping rows on the replica side, do it on the master side, by patching recovery to silently follow rows coming from a certain instance. Changes in v2: - fixes as per Kostja's review Serge Petrenko (4): box: expose box_is_orphan method recovery: allow to ignore rows coming from a certain instance replication: do not relay rows coming from a remote instance back to it wal: warn when trying to write a record with a broken lsn src/box/applier.cc | 2 +- src/box/box.cc | 15 +++++++++++---- src/box/box.h | 3 +++ src/box/iproto_constants.h | 1 + src/box/recovery.cc | 12 +++++++++++- src/box/recovery.h | 7 ++++++- src/box/relay.cc | 14 +++++++++++--- src/box/relay.h | 3 ++- src/box/wal.c | 17 ++++++++++++++--- src/box/xrow.c | 18 +++++++++++++++--- src/box/xrow.h | 26 ++++++++++++++++---------- 11 files changed, 91 insertions(+), 27 deletions(-) -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko ` (3 subsequent siblings) 4 siblings, 0 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches is_orphan status check is needed by applier in order to tell relay whether to send the instance's own rows back or not. Prerequisite #4739 --- src/box/box.cc | 6 ++++++ src/box/box.h | 3 +++ 2 files changed, 9 insertions(+) diff --git a/src/box/box.cc b/src/box/box.cc index 1b2b27d61..0290578b2 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -247,6 +247,12 @@ box_is_ro(void) return is_ro || is_orphan; } +bool +box_is_orphan(void) +{ + return is_orphan; +} + int box_wait_ro(bool ro, double timeout) { diff --git a/src/box/box.h b/src/box/box.h index a212e6510..f37a945eb 100644 --- a/src/box/box.h +++ b/src/box/box.h @@ -105,6 +105,9 @@ box_set_ro(); bool box_is_ro(void); +bool +box_is_orphan(void); + /** * Wait until the instance switches to a desired mode. * \param ro wait read-only if set or read-write if unset -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 19:03 ` Konstantin Osipov 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko ` (2 subsequent siblings) 4 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches Prerequisite #4739 --- src/box/box.cc | 2 +- src/box/recovery.cc | 12 +++++++++++- src/box/recovery.h | 7 ++++++- src/box/relay.cc | 4 ++-- 4 files changed, 20 insertions(+), 5 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 0290578b2..a4d823df0 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2171,7 +2171,7 @@ local_recovery(const struct tt_uuid *instance_uuid, struct recovery *recovery; recovery = recovery_new(cfg_gets("wal_dir"), cfg_geti("force_recovery"), - checkpoint_vclock); + checkpoint_vclock, 0); /* * Make sure we report the actual recovery position diff --git a/src/box/recovery.cc b/src/box/recovery.cc index 64aa467b1..e6d6b5744 100644 --- a/src/box/recovery.cc +++ b/src/box/recovery.cc @@ -81,7 +81,7 @@ */ struct recovery * recovery_new(const char *wal_dirname, bool force_recovery, - const struct vclock *vclock) + const struct vclock *vclock, unsigned int id_ignore_map) { struct recovery *r = (struct recovery *) calloc(1, sizeof(*r)); @@ -101,6 +101,8 @@ recovery_new(const char *wal_dirname, bool force_recovery, vclock_copy(&r->vclock, vclock); + r->id_ignore_map = id_ignore_map; + /** * Avoid scanning WAL dir before we recovered * the snapshot and know instance UUID - this will @@ -275,6 +277,14 @@ recover_xlog(struct recovery *r, struct xstream *stream, * failed row anyway. */ vclock_follow_xrow(&r->vclock, &row); + /* + * Do not try to apply rows coming from an + * ignored instance, but still follow their lsns + * to make sure recovery vclock stays the same as + * the one emerging from local recovery. + */ + if (1 << row.replica_id & r->id_ignore_map) + continue; if (xstream_write(stream, &row) == 0) { ++row_count; if (row_count % 100000 == 0) diff --git a/src/box/recovery.h b/src/box/recovery.h index 6e68abc0b..2e10b7064 100644 --- a/src/box/recovery.h +++ b/src/box/recovery.h @@ -50,6 +50,11 @@ struct recovery { /** The WAL cursor we're currently reading/writing from/to. */ struct xlog_cursor cursor; struct xdir wal_dir; + /** + * A map of replica ids whose changes will be silently + * ignored on recovery. + */ + unsigned int id_ignore_map; /** * This fiber is used in local hot standby mode. 
* It looks for changes in the wal_dir and applies @@ -62,7 +67,7 @@ struct recovery { struct recovery * recovery_new(const char *wal_dirname, bool force_recovery, - const struct vclock *vclock); + const struct vclock *vclock, unsigned int id_ignore_map); void recovery_delete(struct recovery *r); diff --git a/src/box/relay.cc b/src/box/relay.cc index b89632273..741a09201 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -355,7 +355,7 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock, }); relay->r = recovery_new(cfg_gets("wal_dir"), false, - start_vclock); + start_vclock, 0); vclock_copy(&relay->stop_vclock, stop_vclock); int rc = cord_costart(&relay->cord, "final_join", @@ -701,7 +701,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock); relay->r = recovery_new(cfg_gets("wal_dir"), false, - replica_clock); + replica_clock, 0); vclock_copy(&relay->tx.vclock, replica_clock); relay->version_id = replica_version_id; -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
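A minimal standalone sketch (not part of the patch) of the filtering rule patch 2/4 introduces: one bit per replica id, checked for every recovered row while the recovery vclock still follows the row's LSN. The names mirror the patch; the harness around them is hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Decide whether a row originating from replica_id should be skipped,
 * mirroring the check added to recover_xlog(). The ignore map is a
 * bitmask with one bit per replica id; a skipped row is still followed
 * by the recovery vclock, just never written to the stream. */
static int
row_is_ignored(uint32_t replica_id, unsigned int id_ignore_map)
{
        return ((1U << replica_id) & id_ignore_map) != 0;
}

int
main(void)
{
        /* Relay serving replica id 2: mute the replica's own rows. */
        unsigned int ignore_map = 1U << 2;
        for (uint32_t id = 1; id <= 3; id++)
                printf("row from replica %u: %s\n", (unsigned) id,
                       row_is_ignored(id, ignore_map) ?
                       "skip (but follow vclock)" : "relay");
        return 0;
}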
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko @ 2020-02-18 19:03 ` Konstantin Osipov 2020-02-19 8:43 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-18 19:03 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: > Prerequisite #4739 This is a strange way to mute rows from self. Why not set the vclock component to infinity as I suggested multiple times? Why not respond to me with an objection if my suggestion cannot be done? > -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-18 19:03 ` Konstantin Osipov @ 2020-02-19 8:43 ` Serge Petrenko 2020-02-19 8:52 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 8:43 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 18 Feb 2020, at 22:03, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: >> Prerequisite #4739 > > This is a strange way to mute rows from self. Why not set the vclock > component to infinity as I suggested multiple times? Why not > respond to me with an objection if my suggestion cannot be done? I responded with a patch, so now we can discuss both your and my suggestions. If I understood you correctly, you suggested setting the replica's own lsn to infinity (on the master side), so that recovery on the master's side would skip the replica's rows. I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. This does allow you to skip the replica's rows, but it breaks the vclock signature, which will overflow immediately. vclock signatures are used to order gc consumers, and the gc consumer corresponding to a replica gets its vclock from relay recovery. Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where do you get this value from? You cannot do gc message vclock[replica_id] = replica ack vclock[replica_id], because the replica may have some rows you still don’t have. Then the replica ack vclock signature may get too big and you will delete other logs containing your changes. You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by the replica for too long. This is why I decided to implement the skipping mechanism from this patch. It allows tracking the exact vclock of the last recovered row, and it allows skipping the replica's rows, just like we wanted. -- Serge Petrenko sergepetrenko@tarantool.org > >> > > -- > Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
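A minimal sketch of the overflow Serge describes, under the assumption that a vclock signature is the plain sum of per-replica LSN components, as in Tarantool's vclock_sum(); the struct and harness here are hypothetical simplifications:

#include <stdint.h>
#include <stdio.h>

enum { VCLOCK_MAX_SKETCH = 32 };

struct vclock_sketch {
        int64_t lsn[VCLOCK_MAX_SKETCH];
};

/* Signature = sum of all components, the way vclock_sum() computes it
 * and the way gc consumers are ordered. */
static int64_t
vclock_sum_sketch(const struct vclock_sketch *v)
{
        int64_t sum = 0;
        for (int i = 0; i < VCLOCK_MAX_SKETCH; i++)
                sum += v->lsn[i];
        return sum;
}

int
main(void)
{
        struct vclock_sketch v = {{0}};
        v.lsn[1] = 100;       /* real progress on replica 1 */
        v.lsn[2] = INT64_MAX; /* "infinity" for the subscriber's id */
        /* Adding anything to INT64_MAX overflows (formally undefined
         * behavior for signed integers in C), so a gc consumer ordered
         * by this signature is compared nonsensically. */
        printf("signature: %lld\n", (long long) vclock_sum_sketch(&v));
        return 0;
}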
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:43 ` Serge Petrenko @ 2020-02-19 8:52 ` Konstantin Osipov 2020-02-19 8:57 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 8:52 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: > > This is a strange way to mute rows from self. Why not set the vclock > > component to infinity as I suggested multiple times? Why not > > respond to me with an objection if my suggestion cannot be done? > > I responded with a patch, so now we can discuss both your and my suggestions. No, we can't. You responded with a patch for your suggestion, but not for mine. So we compare apples and oranges here. Just like with Ilya's patch, which only halfway captured my suggestion in his "alternatives". In the end I'm not even sure you got it right. > If I understood you correctly, you suggested setting the replica's own lsn to infinity > (on the master side), so that recovery on the master's side would skip the replica's rows. > > I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. > This does allow you to skip the replica's rows, but it breaks the vclock signature, which will > overflow immediately. vclock signatures are used to order gc consumers, and the gc > consumer corresponding to a replica gets its vclock from relay recovery. > Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where > do you get this value from? You cannot do gc message vclock[replica_id] = > replica ack vclock[replica_id], because the replica may have some rows you still > don’t have. Then the replica ack vclock signature may get too big and you will delete > other logs containing your changes. > You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by > the replica for too long. Please send a patch and we'll be able to discuss where it went wrong for you. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:52 ` Konstantin Osipov @ 2020-02-19 8:57 ` Serge Petrenko 2020-02-19 9:02 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 8:57 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: >>> This is a strange way to mute rows from self. Why not set the vclock >>> component to infinity as I suggested multiple times? Why not >>> respond to me with an objection if my suggestion cannot be done? >> >> I responded with a patch, so now we can discuss both your and my suggestions. > > No, we can't. You responded with a patch for your suggestion, but > not for mine. So we compare apples and oranges here. Just like > with Ilya's patch, which only halfway captured my suggestion in > his "alternatives". > > In the end I'm not even sure you got it right. > >> If I understood you correctly, you suggested setting the replica's own lsn to infinity >> (on the master side), so that recovery on the master's side would skip the replica's rows. >> Does it look like I got it right from my answer? >> I tried your approach and initialized recovery with vclock[replica_id] = INT64_MAX. >> This does allow you to skip the replica's rows, but it breaks the vclock signature, which will >> overflow immediately. vclock signatures are used to order gc consumers, and the gc >> consumer corresponding to a replica gets its vclock from relay recovery. >> Ok, you could suggest resetting vclock[replica_id] to some meaningful value, but where >> do you get this value from? You cannot do gc message vclock[replica_id] = >> replica ack vclock[replica_id], because the replica may have some rows you still >> don’t have. Then the replica ack vclock signature may get too big and you will delete >> other logs containing your changes. >> You also cannot set gc vclock[replica_id] to 0, because it will hold logs not needed by >> the replica for too long. > Please send a patch and we'll be able to discuss where it went wrong for you. Ok > > > -- > Konstantin Osipov, Moscow, Russia -- Serge Petrenko sergepetrenko@tarantool.org ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 8:57 ` Serge Petrenko @ 2020-02-19 9:02 ` Konstantin Osipov 2020-02-19 9:35 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 9:02 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:01]: > > > > > 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: > >>> This is a strange way to mute rows from self. Why not set the vclock > >>> component to infinity as I suggested multiple times? Why not > >>> respond to me with an objection if my suggestion cannot be done? > >> > >> I responded with a patch, so now we can discuss both your and my suggestions. > > > > No, we can't. You responded with a patch for your suggestion, but > > not for mine. So we compare apples and oranges here. Just like > > with Ilya's patch, which only halfway captured my suggestion in > > his "alternatives". > > > > In the end I'm not even sure you got it right. > > > > > > >> If I understood you correctly, you suggested setting the replica's own lsn to infinity > >> (on the master side), so that recovery on the master's side would skip the replica's rows. > >> > > Does it look like I got it right from my answer? I don't understand. For example, what do you mean by setting it on the master? You were supposed to set it on the replica when sending a SUBSCRIBE request to the master. Honestly, your English isn't perfect and we don't share the same vocabulary/terms, so it's hard to see what went wrong without a patch. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 9:02 ` Konstantin Osipov @ 2020-02-19 9:35 ` Serge Petrenko 2020-02-19 10:11 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 9:35 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 2236 bytes --] > 19 Feb 2020, at 12:02, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:01]: >> >> >> >>> 19 Feb 2020, at 11:52, Konstantin Osipov <kostja.osipov@gmail.com> wrote: >>> >>> * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 11:45]: >>>>> This is a strange way to mute rows from self. Why not set the vclock >>>>> component to infinity as I suggested multiple times? Why not >>>>> respond to me with an objection if my suggestion cannot be done? >>>> >>>> I responded with a patch, so now we can discuss both your and my suggestions. >>> >>> No, we can't. You responded with a patch for your suggestion, but >>> not for mine. So we compare apples and oranges here. Just like >>> with Ilya's patch, which only halfway captured my suggestion in >>> his "alternatives". >>> >>> In the end I'm not even sure you got it right. >> >> >> >>> >>>> If I understood you correctly, you suggested setting the replica's own lsn to infinity >>>> (on the master side), so that recovery on the master's side would skip the replica's rows. >>>> >> >> Does it look like I got it right from my answer? > > I don't understand. For example, what do you mean by setting it on > the master? You were supposed to set it on the replica, when > sending a SUBSCRIBE request to the master. I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized with this vclock, making recovery skip rows coming from the replica itself. I’m starting to think that you want me to only manipulate replicaset.applier.vclock. Sorry if I missed it at the very beginning of our discussion. > Honestly, your English isn't perfect and we don't share the same > vocabulary/terms, so it's hard to see what went wrong without a > patch. Ok, I’ll send a patch soon. > > > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com -- Serge Petrenko sergepetrenko@tarantool.org [-- Attachment #2: Type: text/html, Size: 13067 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 9:35 ` Serge Petrenko @ 2020-02-19 10:11 ` Konstantin Osipov 2020-02-19 10:31 ` Serge Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 10:11 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: > I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. > This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized > with this vclock, making recovery skip rows coming from the replica itself. Why is this logic done on the master? Why can't you send the right vclock from the replica? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 10:11 ` Konstantin Osipov @ 2020-02-19 10:31 ` Serge Petrenko 2020-02-19 11:27 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-19 10:31 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, Alexander Turenko Cc: tarantool-patches > 19 Feb 2020, at 13:11, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: >> I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. >> This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized >> with this vclock, making recovery skip rows coming from the replica itself. > > Why is this logic done on the master? Why can't you send the right > vclock from the replica? > Because this strange vclock will be used to initialize a gc consumer, and this gc consumer will be ordered by that vclock signature, which will overflow. Anyway, let me send the patch first. It is the same problem in both cases, whether the logic you propose works on the master or on the replica side. -- Serge Petrenko sergepetrenko@tarantool.org > > -- > Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance 2020-02-19 10:31 ` Serge Petrenko @ 2020-02-19 11:27 ` Konstantin Osipov 0 siblings, 0 replies; 21+ messages in thread From: Konstantin Osipov @ 2020-02-19 11:27 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, Vladislav Shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 13:32]: > > > > 19 Feb 2020, at 13:11, Konstantin Osipov <kostja.osipov@gmail.com> wrote: > > > > * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/19 12:36]: > >> I mean taking the replica clock from a subscribe request and setting replica_clock[replica->id] = INT64_MAX. > >> This is done on the master in relay_subscribe() called from box_process_subscribe(). Then relay recovery is initialized > >> with this vclock, making recovery skip rows coming from the replica itself. > > > > Why is this logic done on the master? Why can't you send the right > > vclock from the replica? > > > > > Because this strange vclock will be used to initialize a gc consumer, > and this gc consumer will be ordered by that vclock signature, > which will overflow. I think there was a patch from Georgy that stops ordering GC consumers by their signature, and uses vclock_compare instead? Can it be pushed first? In any case, your "special flag" is extremely cumbersome. Better introduce a subscription mask to IPROTO_SUBSCRIBE; we need it for https://github.com/tarantool/tarantool/issues/3294 anyway. > Anyway, let me send the patch first. It is the same problem > in both cases, whether the logic you propose works on the master or > on the replica side. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 1/4] box: expose box_is_orphan method Serge Petrenko 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 2/4] recovery: allow to ignore rows coming from a certain instance Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-18 19:07 ` Konstantin Osipov 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn Serge Petrenko 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko 4 siblings, 1 reply; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches We have a mechanism for restoring rows originating from an instance that suffered a sudden power loss: remote masters resend the instance's rows received before a certain point in time, defined by the remote master's vclock at the moment of subscribe. However, this is useful only during initial replication configuration, when an instance has just recovered, so that it can receive what it has relayed but hasn't synced to disk. In other cases, when an instance is operating normally and master-master replication is configured, the mechanism described above may lead to an instance re-applying its own rows, coming from a master it has just subscribed to. To fix the problem, do not relay rows coming from a remote instance if the instance has already recovered. Closes #4739 --- src/box/applier.cc | 2 +- src/box/box.cc | 7 ++++--- src/box/iproto_constants.h | 1 + src/box/relay.cc | 12 ++++++++++-- src/box/relay.h | 3 ++- src/box/xrow.c | 18 +++++++++++++++--- src/box/xrow.h | 26 ++++++++++++++++---------- 7 files changed, 49 insertions(+), 20 deletions(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index ae3d281a5..542144d14 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -867,7 +867,7 @@ applier_subscribe(struct applier *applier) vclock_create(&vclock); vclock_copy(&vclock, &replicaset.vclock); xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID, - &vclock, replication_anon); + &vclock, replication_anon, box_is_orphan()); coio_write_xrow(coio, &row); /* Read SUBSCRIBE response */ diff --git a/src/box/box.cc b/src/box/box.cc index a4d823df0..d485845c7 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1787,8 +1787,9 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) uint32_t replica_version_id; vclock_create(&replica_clock); bool anon; - xrow_decode_subscribe_xc(header, NULL, &replica_uuid, - &replica_clock, &replica_version_id, &anon); + bool is_orphan = false; + xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock, + &replica_version_id, &anon, &is_orphan); /* Forbid connection to itself */ if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID)) @@ -1871,7 +1872,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) * indefinitely). */ relay_subscribe(replica, io->fd, header->sync, &replica_clock, - replica_version_id); + replica_version_id, is_orphan); } void diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index b66c05c06..9616cbcf0 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -125,6 +125,7 @@ enum iproto_key { IPROTO_STMT_ID = 0x43, /* Leave a gap between SQL keys and additional request keys */ IPROTO_REPLICA_ANON = 0x50, + IPROTO_REPLICA_IS_ORPHAN = 0x51, IPROTO_KEY_MAX }; diff --git a/src/box/relay.cc b/src/box/relay.cc index 741a09201..384773b05 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -676,7 +676,8 @@ relay_subscribe_f(va_list ap) /** Replication acceptor fiber handler. */ void relay_subscribe(struct replica *replica, int fd, uint64_t sync, - struct vclock *replica_clock, uint32_t replica_version_id) + struct vclock *replica_clock, uint32_t replica_version_id, + bool replica_is_orphan) { assert(replica->anon || replica->id != REPLICA_ID_NIL); struct relay *relay = replica->relay; @@ -700,8 +701,15 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, }); vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock); + /* + * If the remote instance is already synced, don't relay + * the instance's rows back to it. In order to do so, + * make recovery silently skip rows originating from the + * instance. + */ relay->r = recovery_new(cfg_gets("wal_dir"), false, - replica_clock, 0); + replica_clock, replica_is_orphan ? + 0 : 1 << replica->id); vclock_copy(&relay->tx.vclock, replica_clock); relay->version_id = replica_version_id; diff --git a/src/box/relay.h b/src/box/relay.h index e1782d78f..9ba878faf 100644 --- a/src/box/relay.h +++ b/src/box/relay.h @@ -124,6 +124,7 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock, */ void relay_subscribe(struct replica *replica, int fd, uint64_t sync, - struct vclock *replica_vclock, uint32_t replica_version_id); + struct vclock *replica_vclock, uint32_t replica_version_id, + bool replica_is_orphan); #endif /* TARANTOOL_REPLICATION_RELAY_H_INCLUDED */ diff --git a/src/box/xrow.c b/src/box/xrow.c index 968c3a202..6b8155859 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -1194,7 +1194,7 @@ int xrow_encode_subscribe(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon) + const struct vclock *vclock, bool anon, bool is_orphan) { memset(row, 0, sizeof(*row)); size_t size = XROW_BODY_LEN_MAX + mp_sizeof_vclock(vclock); @@ -1204,7 +1204,7 @@ xrow_encode_subscribe(struct xrow_header *row, return -1; } char *data = buf; - data = mp_encode_map(data, 5); + data = mp_encode_map(data, 6); data = mp_encode_uint(data, IPROTO_CLUSTER_UUID); data = xrow_encode_uuid(data, replicaset_uuid); data = mp_encode_uint(data, IPROTO_INSTANCE_UUID); @@ -1215,6 +1215,8 @@ xrow_encode_subscribe(struct xrow_header *row, data = mp_encode_uint(data, tarantool_version_id()); data = mp_encode_uint(data, IPROTO_REPLICA_ANON); data = mp_encode_bool(data, anon); + data = mp_encode_uint(data, IPROTO_REPLICA_IS_ORPHAN); + data = mp_encode_bool(data, is_orphan); assert(data <= buf + size); row->body[0].iov_base = buf; row->body[0].iov_len = (data - buf); @@ -1226,7 +1228,7 @@ xrow_encode_subscribe(struct xrow_header *row, int xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *version_id, bool *anon) + uint32_t *version_id, bool *anon, bool *is_orphan) { if (row->bodycnt == 0) { diag_set(ClientError, ER_INVALID_MSGPACK, "request body"); @@ -1301,6 +1303,16 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, } *anon = mp_decode_bool(&d); break; + case IPROTO_REPLICA_IS_ORPHAN: + if (is_orphan == NULL) + goto skip; + if (mp_typeof(*d) != MP_BOOL) { + xrow_on_decode_err(data, end, ER_INVALID_MSGPACK, + "invalid REPLICA_IS_ORPHAN flag"); + return -1; + } + *is_orphan = mp_decode_bool(&d); + break; default: skip: mp_next(&d); /* value */ } diff --git a/src/box/xrow.h b/src/box/xrow.h index 0973c497d..3f188870a 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -322,6 +322,7 @@ xrow_encode_register(struct xrow_header *row, * @param instance_uuid Instance uuid. * @param vclock Replication clock. * @param anon Whether it is an anonymous subscribe request or not. + * @param is_orphan Whether the instance is in orphan mode. * * @retval 0 Success. * @retval -1 Memory error. */ int xrow_encode_subscribe(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon); + const struct vclock *vclock, bool anon, bool is_orphan); /** * Decode SUBSCRIBE command. @@ -347,7 +348,7 @@ xrow_encode_subscribe(struct xrow_header *row, int xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *version_id, bool *anon); + uint32_t *version_id, bool *anon, bool *is_orphan); /** * Encode JOIN command. @@ -371,7 +372,8 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid); static inline int xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid) { - return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL, NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL, NULL, + NULL); } /** @@ -386,7 +388,8 @@ static inline int xrow_decode_register(struct xrow_header *row, struct tt_uuid *instance_uuid, struct vclock *vclock) { - return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, NULL, NULL); + return xrow_decode_subscribe(row, NULL, instance_uuid, vclock, NULL, + NULL, NULL); } /** @@ -411,7 +414,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock); static inline int xrow_decode_vclock(struct xrow_header *row, struct vclock *vclock) { - return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, NULL); + return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, NULL, NULL); } /** @@ -442,7 +445,8 @@ xrow_decode_subscribe_response(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct vclock *vclock) { - return xrow_decode_subscribe(row, replicaset_uuid, NULL, vclock, NULL, NULL); + return xrow_decode_subscribe(row, replicaset_uuid, NULL, vclock, NULL, + NULL, NULL); } /** @@ -817,10 +821,10 @@ static inline void xrow_encode_subscribe_xc(struct xrow_header *row, const struct tt_uuid *replicaset_uuid, const struct tt_uuid *instance_uuid, - const struct vclock *vclock, bool anon) + const struct vclock *vclock, bool anon, bool is_orphan) { if (xrow_encode_subscribe(row, replicaset_uuid, instance_uuid, - vclock, anon) != 0) + vclock, anon, is_orphan) != 0) diag_raise(); } @@ -829,10 +833,12 @@ static inline void xrow_decode_subscribe_xc(struct xrow_header *row, struct tt_uuid *replicaset_uuid, struct tt_uuid *instance_uuid, struct vclock *vclock, - uint32_t *replica_version_id, bool *anon) + uint32_t *replica_version_id, bool *anon, + bool *is_orphan) { if (xrow_decode_subscribe(row, replicaset_uuid, instance_uuid, - vclock, replica_version_id, anon) != 0) + vclock, replica_version_id, anon, + is_orphan) != 0) diag_raise(); } -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
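A minimal sketch (not part of the patch) of the decision this patch adds to relay_subscribe(): the replica reports whether it is still an orphan, and the master builds the recovery ignore map from that flag. Names follow the patch; the standalone harness is hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An orphan replica may still need its own rows back: it could have
 * relayed them before a power loss without syncing them to disk. A
 * synced replica gets its own id masked out of the relayed stream. */
static unsigned int
subscribe_ignore_map(uint32_t replica_id, bool replica_is_orphan)
{
        return replica_is_orphan ? 0 : 1U << replica_id;
}

int
main(void)
{
        printf("orphan replica 2 -> ignore map 0x%x\n",
               subscribe_ignore_map(2, true));
        printf("synced replica 2 -> ignore map 0x%x\n",
               subscribe_ignore_map(2, false));
        return 0;
}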
* Re: [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko @ 2020-02-18 19:07 ` Konstantin Osipov 0 siblings, 0 replies; 21+ messages in thread From: Konstantin Osipov @ 2020-02-18 19:07 UTC (permalink / raw) To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy * Serge Petrenko <sergepetrenko@tarantool.org> [20/02/18 20:38]: > vclock_create(&replica_clock); > bool anon; > - xrow_decode_subscribe_xc(header, NULL, &replica_uuid, > - &replica_clock, &replica_version_id, &anon); > + bool is_orphan = false; > + xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock, > + &replica_version_id, &anon, &is_orphan); Why did you make it so complicated? Could you please explain what went wrong with my suggestion? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko ` (2 preceding siblings ...) 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 3/4] replication: do not relay rows coming from a remote instance back to it Serge Petrenko @ 2020-02-18 17:37 ` Serge Petrenko 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko 4 siblings, 0 replies; 21+ messages in thread From: Serge Petrenko @ 2020-02-18 17:37 UTC (permalink / raw) To: v.shpilevoy, alexander.turenko, kostja.osipov; +Cc: tarantool-patches There is an assertion in vclock_follow, `lsn > prev_lsn`, which doesn't fire in release builds, of course. Let's at least warn the user on an attempt to write a record with a duplicate or otherwise broken lsn, and not follow such an lsn. Follow-up #4739 --- src/box/wal.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/src/box/wal.c b/src/box/wal.c index 0ae66ff32..a87aedf1d 100644 --- a/src/box/wal.c +++ b/src/box/wal.c @@ -951,9 +951,20 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base, (*row)->tsn = tsn; (*row)->is_commit = row == end - 1; } else { - vclock_follow(vclock_diff, (*row)->replica_id, - (*row)->lsn - vclock_get(base, - (*row)->replica_id)); + int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id); + if (diff <= vclock_get(vclock_diff, + (*row)->replica_id)) { + say_crit("Attempt to write a broken LSN to WAL:" + " replica id: %u, confirmed lsn: %lld," + " new lsn: %lld", (*row)->replica_id, + (long long) (vclock_get(base, (*row)->replica_id) + + vclock_get(vclock_diff, + (*row)->replica_id)), + (long long) (*row)->lsn); + assert(0); + } else { + vclock_follow(vclock_diff, (*row)->replica_id, diff); + } } } } -- 2.21.1 (Apple Git-122.3) ^ permalink raw reply [flat|nested] 21+ messages in thread
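A minimal sketch (not part of the patch) of the check this patch adds to wal_assign_lsn(): a replicated row's LSN must advance past everything already confirmed for its replica id, that is, the base vclock component plus the in-flight diff. The function and harness here are hypothetical simplifications.

#include <stdint.h>
#include <stdio.h>

/* A row is broken if it does not move the vclock component forward:
 * row_lsn - base_lsn must exceed the diff accumulated so far. */
static int
lsn_is_broken(int64_t row_lsn, int64_t base_lsn, int64_t diff_lsn)
{
        return row_lsn - base_lsn <= diff_lsn;
}

int
main(void)
{
        int64_t base = 100, diff = 2; /* confirmed lsn = 102 */
        printf("lsn 103: %s\n", lsn_is_broken(103, base, diff) ?
               "broken" : "ok");
        printf("lsn 102: %s\n", lsn_is_broken(102, base, diff) ?
               "broken (duplicate)" : "ok");
        return 0;
}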
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-18 17:37 [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Serge Petrenko ` (3 preceding siblings ...) 2020-02-18 17:37 ` [Tarantool-patches] [PATCH v3 4/4] wal: warn when trying to write a record with a broken lsn Serge Petrenko @ 2020-02-22 20:21 ` Georgy Kirichenko 2020-02-22 20:49 ` Konstantin Osipov 4 siblings, 1 reply; 21+ messages in thread From: Georgy Kirichenko @ 2020-02-22 20:21 UTC (permalink / raw) To: tarantool-patches, Serge Petrenko; +Cc: v.shpilevoy [-- Attachment #1: Type: text/plain, Size: 1741 bytes --] On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: Hi! Thanks for the patch. I am pretty sure a replica should be able to apply all replication stream transactions in a proper way without any reliance on the master's correctness, or signal an error if this is impossible. I am suggesting such logic because a replica has the complete information about its own state at the moment. This includes the local vclock, cfg, all existing appliers' state and incoming streams. WBR > https://github.com/tarantool/tarantool/issues/4739 > https://github.com/tarantool/tarantool/tree/sp/gh-4739-vclock-assert-v3 > > Changes in v3: > - fixes as per Vlad's review > - instead of skipping rows on the replica side, > do it on the master side, by patching recovery > to silently follow rows coming from a certain > instance. > > Changes in v2: > - fixes as per Kostja's review > > Serge Petrenko (4): > box: expose box_is_orphan method > recovery: allow to ignore rows coming from a certain instance > replication: do not relay rows coming from a remote instance back to > it > wal: warn when trying to write a record with a broken lsn > > src/box/applier.cc | 2 +- > src/box/box.cc | 15 +++++++++++---- > src/box/box.h | 3 +++ > src/box/iproto_constants.h | 1 + > src/box/recovery.cc | 12 +++++++++++- > src/box/recovery.h | 7 ++++++- > src/box/relay.cc | 14 +++++++++++--- > src/box/relay.h | 3 ++- > src/box/wal.c | 17 ++++++++++++++--- > src/box/xrow.c | 18 +++++++++++++++--- > src/box/xrow.h | 26 ++++++++++++++++---------- > 11 files changed, 91 insertions(+), 27 deletions(-) [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-22 20:21 ` [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance Georgy Kirichenko @ 2020-02-22 20:49 ` Konstantin Osipov 2020-02-23 8:16 ` Georgy Kirichenko 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-22 20:49 UTC (permalink / raw) To: Georgy Kirichenko; +Cc: tarantool-patches, v.shpilevoy * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/22 23:22]: > On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: > Hi! Thanks for the patch. > I am pretty sure a replica should be able to apply all replication stream > transactions in a proper way without any reliance on the master's correctness, or > signal an error if this is impossible. I am suggesting such logic because a > replica has the complete information about its own state at the moment. This includes > the local vclock, cfg, all existing appliers' state and incoming streams. It's an impossible and useless pursuit. Please read up on Byzantine faults. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-22 20:49 ` Konstantin Osipov @ 2020-02-23 8:16 ` Georgy Kirichenko 2020-02-24 10:18 ` Konstantin Osipov 0 siblings, 1 reply; 21+ messages in thread From: Georgy Kirichenko @ 2020-02-23 8:16 UTC (permalink / raw) To: Konstantin Osipov, Georgy Kirichenko, tarantool-patches, Serge Petrenko, v.shpilevoy, alexander.turenko Please do not think you are the only person who knows about Byzantine faults. Also, there is little relevance between Byzantine faults and my suggestion to enforce replica-side checking. In any case, filtering on the master side is the worst thing we could do. In that case the master has only one peer and has no chance to make a proper decision if the replica is broken. And we have no chance to know about it (except asserts, which are excluded from release builds, or panic messages). For instance, if a master skipped some rows, there are no traces of the situation we could detect. In the opposite case a replica could connect to as many masters as it needs to filter out all invalid data or hacked masters. At least we could enforce replication stream meta checking. Two major points I would like to mention are: 1. A replica could consistently follow all vclock members and apply all transactions without gaps (I already got rid of them, I hope you remember) 2. A replica could protect itself against concurrent local writes (one was made locally, the second one is returned from the master) On Saturday, February 22, 2020 11:49:30 PM MSK Konstantin Osipov wrote: > * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/22 23:22]: > > On Tuesday, February 18, 2020 8:37:03 PM MSK Serge Petrenko wrote: > > Hi! Thanks for the patch. > > I am pretty sure a replica should be able to apply all replication stream > > transactions in a proper way without any reliance on the master's correctness, or > > signal an error if this is impossible. I am suggesting such logic > > because a replica has the complete information about its own state at the > > moment. This includes the local vclock, cfg, all existing appliers' state and > > incoming streams. > It's an impossible and useless pursuit. > > Please read up on Byzantine faults. ^ permalink raw reply [flat|nested] 21+ messages in thread
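For point 2 above, a minimal sketch of the replica-side protection being argued for, assuming the applier compares each incoming row's LSN against the vclock component it already knows for the row's origin; the names and harness are hypothetical, not the patch's code:

#include <stdint.h>
#include <stdio.h>

/* Before applying a replicated row, compare its LSN with the highest
 * LSN already known locally for the row's origin (replica id). A row
 * at or below that point is a duplicate, e.g. a local write echoed
 * back by a master. */
static int
applier_should_apply_row(int64_t row_lsn, int64_t known_lsn)
{
        return row_lsn > known_lsn;
}

int
main(void)
{
        int64_t known = 10; /* local vclock component for our own id */
        printf("row lsn 10: %s\n",
               applier_should_apply_row(10, known) ? "apply" : "skip");
        printf("row lsn 11: %s\n",
               applier_should_apply_row(11, known) ? "apply" : "skip");
        return 0;
}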
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-23 8:16 ` Georgy Kirichenko @ 2020-02-24 10:18 ` Konstantin Osipov 2020-02-24 12:31 ` Георгий Кириченко 0 siblings, 1 reply; 21+ messages in thread From: Konstantin Osipov @ 2020-02-24 10:18 UTC (permalink / raw) To: Georgy Kirichenko; +Cc: tarantool-patches, v.shpilevoy * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/23 12:21]: > Please do not think you are the only person who knows about Byzantine faults. > Also, there is little relevance between Byzantine faults and my suggestion to > enforce replica-side checking. You've been suggesting that filtering on the master is safer. I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is malfunctioning. I merely pointed out that your safety argument has no merit. There are no other practical advantages of filtering on the replica either; there is a disadvantage: more traffic and more filtering work to do inside the tx thread (as opposed to the relay/wal thread if done on the master). It is also against the current responsibilities of IPROTO_SUBSCRIBE: the concept of a subscription is that the replica specifies what it is interested in. Specifically, it specifies the vclock components it's interested in. You suggest making the replica responsible for submitting its vclock, but letting the master decide what to do with it - this splits the decision-making logic between the two, making the whole thing harder to understand. The IPROTO_SUBSCRIBE responsibility layout today is typical for a request-response protocol: the master, being the server, executes the command as specified by the client (the replica), and the replica runs the logic to decide what command to issue. You suggest changing it because of some theoretical concerns you have. > In any case, filtering on the master side is the worst thing we could do. > In that case the master has only one peer and has no chance to make a proper > decision if the replica is broken. And we have no chance to know about it (except > asserts, which are excluded from release builds, or panic messages). For > instance, if a master skipped some rows, there are no traces of the > situation we could detect. The situation is symmetrical. Both peers do not have the whole picture. You can make either of the peers responsible for the decision; then the other peer will need to supply the missing bits. There is no way you can make it safer by changing who makes the decision, but you can certainly make it more messed up by splitting this logic or going against an established layout. If you have a specific example of why things will improve if done otherwise - in the number of packets, or traffic, or some other measurable way - you should point it out. > In the opposite case a replica could connect to as many masters as it needs > to filter out all invalid data or hacked masters. At least we could enforce > replication stream meta checking. I do not think the scope of this issue has ever been protecting against hacked masters. It has never been a goal of the protocol either. > Two major points I would like to mention are: > 1. A replica could consistently follow all vclock members and apply all > transactions without gaps (I already got rid of them, I hope you remember) > 2. A replica could protect itself against concurrent local writes (one was made > locally, the second one is returned from the master) This was added for specific reasons. There is no known reason why the master should send unnecessary data to the replica or why the replica's fast path should get slower. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-24 10:18 ` Konstantin Osipov @ 2020-02-24 12:31 ` Георгий Кириченко 2020-02-26 10:09 ` Sergey Petrenko 0 siblings, 1 reply; 21+ messages in thread From: Георгий Кириченко @ 2020-02-24 12:31 UTC (permalink / raw) To: Konstantin Osipov, Georgy Kirichenko, tarantool-patches, Serge Petrenko, Vladislav Shpilevoy, alexander.turenko [-- Attachment #1: Type: text/plain, Size: 5816 bytes --] Please read messages before answering. I never said that: > You've been suggesting that filtering on the master is safer. I said it is safer to do it on the replica side and that the replica should not rely on master correctness. > I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is >malfunctioning. Excuse me, but this is demagogy: we are talking about what is safer, not about absolute safety. >The situation is symmetrical. Both peers do not have the whole >picture. You can make either of the peers responsible for the >decision; then the other peer will need to supply the missing >bits. No, you are wrong. A master has only one source of information about the stream it should send to a replica, whereas a replica could connect to many masters to fetch proper data (from one or many masters). And we already implemented similar logic - a voting protocol - and you should know about it. Additionally, my approach allows collecting all the corresponding logic (filtering of concurrent streams, vclock following, subscriptions and replication groups which are not implemented yet, registration and whatever else) in one module on the replica side. >I do not think the scope of this issue has ever been protecting >against hacked masters. It has never been a goal of the protocol >either. A hacked master could be a master with an implementation error, and we should be able to detect such an error as soon as possible. But if a replica does not check an incoming stream, there is no way to prevent fatal data loss. >This was added for specific reasons. There is no known reason why the >master should send unnecessary data to the replica or why the replica's fast >path should get slower. I am afraid you did not understand me. I never said that I am against any optimization which could make replication faster. I am completely against any attempts to rely on optimization logic. If a master is allowed to skip unneeded rows, the replica should not rely on that code's correctness. In other words, if some input stream could break the replica, the replica should protect itself against such data. This is not the responsibility of the replica's master. Mon, 24 Feb 2020 at 13:18, Konstantin Osipov <kostja.osipov@gmail.com>: > * Georgy Kirichenko <kirichenkoga@gmail.com> [20/02/23 12:21]: > > > Please do not think you are the only person who knows about Byzantine faults. > > Also, there is little relevance between Byzantine faults and my suggestion to > > enforce replica-side checking. > > You've been suggesting that filtering on the master is safer. I > pointed out it's not: there is no way to guarantee > (even in theory) the replica's correctness/safety if the master is > malfunctioning. > > I merely pointed out that your safety argument has no merit. > > There are no other practical advantages of filtering on the replica > either; there is a disadvantage: more traffic and more filtering work to > do > inside the tx thread (as opposed to the relay/wal thread if done on > the master). > > It is also against the current responsibilities of IPROTO_SUBSCRIBE: the > concept of a subscription is that the replica specifies what it is > interested in. Specifically, it specifies the vclock components it's interested in. > You suggest making the replica responsible for > submitting its vclock, but letting the master decide what to do with it - > this splits the decision-making logic between the two, making the > whole thing harder to understand. > > The IPROTO_SUBSCRIBE responsibility layout today is typical for a > request-response protocol: the master, being the server, executes > the command as specified by the client (the replica), and the > replica runs the logic to decide what command to issue. > > You suggest changing it because of some theoretical concerns you > have. > > > In any case, filtering on the master side is the worst thing we could do. > > In that case the master has only one peer and has no chance to make a proper > > decision if the replica is broken. And we have no chance to know about it (except > > asserts, which are excluded from release builds, or panic messages). For > > instance, if a master skipped some rows, there are no traces of the > > situation we could detect. > > The situation is symmetrical. Both peers do not have the whole > picture. You can make either of the peers responsible for the > decision; then the other peer will need to supply the missing > bits. There is no way you can make it safer by changing who makes > the decision, but you can certainly make it more messed up by > splitting this logic or going against an established layout. > > If you have a specific example of why things will improve if done > otherwise - in the number of packets, or traffic, or some other > measurable way - you should point it out. > > > In the opposite case a replica could connect to as many masters as it needs > > to filter out all invalid data or hacked masters. At least we could enforce > > replication stream meta checking. > > I do not think the scope of this issue has ever been protecting > against hacked masters. It has never been a goal of the protocol > either. > > > Two major points I would like to mention are: > > 1. A replica could consistently follow all vclock members and apply all > > transactions without gaps (I already got rid of them, I hope you remember) > > 2. A replica could protect itself against concurrent local writes (one was made > > locally, the second one is returned from the master) > > This was added for specific reasons. There is no known reason why the > master should send unnecessary data to the replica or why the replica's fast > path should get slower. > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com > [-- Attachment #2: Type: text/html, Size: 6676 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Tarantool-patches] [PATCH v3 0/4] replication: fix applying of rows originating from local instance 2020-02-24 12:31 ` Георгий Кириченко @ 2020-02-26 10:09 ` Sergey Petrenko 0 siblings, 0 replies; 21+ messages in thread From: Sergey Petrenko @ 2020-02-26 10:09 UTC (permalink / raw) To: Георгий Кириченко Cc: tarantool-patches, Vladislav Shpilevoy [-- Attachment #1: Type: text/plain, Size: 6207 bytes --] >Monday, 24 February 2020, 15:31 +03:00 from Георгий Кириченко <kirichenkoga@gmail.com>: > >Please read messages before answering. I never said that: >> You've been suggesting that filtering on the master is safer. >I said it is safer to do it on the replica side and that the replica should not rely on master correctness. > >> I pointed out it's not: there is no way to guarantee (even in theory) the replica's correctness/safety if the master is >malfunctioning. >Excuse me, but this is demagogy: we are talking about what is safer, not about absolute safety. >>The situation is symmetrical. Both peers do not have the whole >picture. You can make either of the peers responsible for the >>decision; then the other peer will need to supply the missing >>bits. >No, you are wrong. A master has only one source of information about the stream it should send to a replica, whereas > a replica could connect to many masters to fetch proper data (from one or many masters). And we already implemented similar logic - >a voting protocol - and you should know about it. Additionally, my approach allows collecting all the corresponding logic (filtering > of concurrent streams, vclock following, subscriptions and replication groups which are not implemented yet, registration and whatever else) in one module on the replica side. >>I do not think the scope of this issue has ever been protecting >against hacked masters. It has never been a goal of the protocol >>either. >A hacked master could be a master with an implementation error, and we should be able to detect such an error as soon as possible. But if a replica does not >check an incoming stream, there is no way to prevent fatal data loss. >>This was added for specific reasons. There is no known reason why the >master should send unnecessary data to the replica or why the replica's fast >>path should get slower. >I am afraid you did not understand me. I never said that I am against any optimization which could make replication faster. >I am completely against any attempts to rely on optimization logic. If a master is allowed to skip unneeded rows, the replica should not rely on that code's correctness. > In other words, if some input stream could break the replica, the replica should protect itself against such data. This is not the responsibility of the replica's master. Hi! I’ve just sent v4, which is the closest to what Kostja wants to see, as far as I understand, at least. Please check it out and tell me what you think. Sorry, but I didn’t fully understand your proposal. It looks like you stand for v2 of the patch, where all the filtering is done on the replica side. Is that right? -- Serge Petrenko >Mon, 24 Feb 2020 at 13:18, Konstantin Osipov < kostja.osipov@gmail.com >: >>* Georgy Kirichenko < kirichenkoga@gmail.com > [20/02/23 12:21]: >> >>> Please do not think you are the only person who knows about Byzantine faults. >>> Also, there is little relevance between Byzantine faults and my suggestion to >>> enforce replica-side checking. >> >>You've been suggesting that filtering on the master is safer. I >>pointed out it's not: there is no way to guarantee >>(even in theory) the replica's correctness/safety if the master is >>malfunctioning. >> >>I merely pointed out that your safety argument has no merit. >> >>There are no other practical advantages of filtering on the replica >>either; there is a disadvantage: more traffic and more filtering work to do >>inside the tx thread (as opposed to the relay/wal thread if done on >>the master). >> >>It is also against the current responsibilities of IPROTO_SUBSCRIBE: the >>concept of a subscription is that the replica specifies what it is >>interested in. Specifically, it specifies the vclock components it's interested in. >>You suggest making the replica responsible for >>submitting its vclock, but letting the master decide what to do with it - >>this splits the decision-making logic between the two, making the >>whole thing harder to understand. >> >>The IPROTO_SUBSCRIBE responsibility layout today is typical for a >>request-response protocol: the master, being the server, executes >>the command as specified by the client (the replica), and the >>replica runs the logic to decide what command to issue. >> >>You suggest changing it because of some theoretical concerns you >>have. >> >>> In any case, filtering on the master side is the worst thing we could do. >>> In that case the master has only one peer and has no chance to make a proper >>> decision if the replica is broken. And we have no chance to know about it (except >>> asserts, which are excluded from release builds, or panic messages). For >>> instance, if a master skipped some rows, there are no traces of the >>> situation we could detect. >> >>The situation is symmetrical. Both peers do not have the whole >>picture. You can make either of the peers responsible for the >>decision; then the other peer will need to supply the missing >>bits. There is no way you can make it safer by changing who makes >>the decision, but you can certainly make it more messed up by >>splitting this logic or going against an established layout. >> >>If you have a specific example of why things will improve if done >>otherwise - in the number of packets, or traffic, or some other >>measurable way - you should point it out. >> >>> In the opposite case a replica could connect to as many masters as it needs >>> to filter out all invalid data or hacked masters. At least we could enforce >>> replication stream meta checking. >> >>I do not think the scope of this issue has ever been protecting >>against hacked masters. It has never been a goal of the protocol >>either. >> >>> Two major points I would like to mention are: >>> 1. A replica could consistently follow all vclock members and apply all >>> transactions without gaps (I already got rid of them, I hope you remember) >>> 2. A replica could protect itself against concurrent local writes (one was made >>> locally, the second one is returned from the master) >> >>This was added for specific reasons. There is no known reason why the >>master should send unnecessary data to the replica or why the replica's fast >>path should get slower. >> >>-- >>Konstantin Osipov, Moscow, Russia >>https://scylladb.com [-- Attachment #2: Type: text/html, Size: 8154 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread