Tarantool development patches archive
* [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
@ 2021-05-28 20:35 Vladislav Shpilevoy via Tarantool-patches
  2021-05-28 22:06 ` Cyrill Gorcunov via Tarantool-patches
  2021-06-01  8:29 ` Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-28 20:35 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

A remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely,
because the node couldn't be found in _cluster: the master
assumed it must have joined via another node and that its ID
should arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the
same replicaset. Since the peer is doing a subscribe, it must
have joined already and should have a valid replicaset UUID,
regardless of whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately,
without retries.

Closes #6094
Part of #5613
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6094-replication-bad-error
Issue: https://github.com/tarantool/tarantool/issues/6094

The UUID ignorance on subscribe decode was introduced here:
https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e

And I don't understand why. Maybe I am missing something? The tests
have passed. Sergey, do you remember why it was really needed?

A replicaset UUID mismatch definitely means the node can't connect.
It is not related to whether the node is anonymous or not, because
the check has nothing to do with _cluster.

 .../unreleased/gh-6094-rs-uuid-mismatch.md    |  6 ++
 src/box/box.cc                                | 17 ++++-
 .../gh-6094-rs-uuid-mismatch.result           | 67 +++++++++++++++++++
 .../gh-6094-rs-uuid-mismatch.test.lua         | 25 +++++++
 test/replication/suite.cfg                    |  1 +
 5 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
 create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.result
 create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.test.lua

diff --git a/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
new file mode 100644
index 000000000..f4e47da3d
--- /dev/null
+++ b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fixed an error where a replica, on an attempt to subscribe to a foreign
+  cluster (with a different replicaset UUID), didn't notice that it is
+  impossible and instead got stuck in an infinite retry loop printing a "too
+  early subscribe" error (gh-6094).
diff --git a/src/box/box.cc b/src/box/box.cc
index c10e0d8bf..5672073d6 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2686,17 +2686,30 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		tnt_raise(ClientError, ER_LOADING);
 
 	struct tt_uuid replica_uuid = uuid_nil;
+	struct tt_uuid peer_replicaset_uuid = uuid_nil;
 	struct vclock replica_clock;
 	uint32_t replica_version_id;
 	vclock_create(&replica_clock);
 	bool anon;
 	uint32_t id_filter;
-	xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock,
-				 &replica_version_id, &anon, &id_filter);
+	xrow_decode_subscribe_xc(header, &peer_replicaset_uuid, &replica_uuid,
+				 &replica_clock, &replica_version_id, &anon,
+				 &id_filter);
 
 	/* Forbid connection to itself */
 	if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID))
 		tnt_raise(ClientError, ER_CONNECTION_TO_SELF);
+	/*
+	 * The peer should have bootstrapped from somebody since it already
+	 * tries to subscribe. If it belongs to a different replicaset, it
+	 * won't ever be found here and would keep reconnecting, thinking
+	 * its replica ID wasn't replicated here yet. Reject it right away.
+	 */
+	if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
+		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
+			  tt_uuid_str(&REPLICASET_UUID),
+			  tt_uuid_str(&peer_replicaset_uuid));
+	}
 
 	/*
 	 * Do not allow non-anonymous followers for anonymous
diff --git a/test/replication/gh-6094-rs-uuid-mismatch.result b/test/replication/gh-6094-rs-uuid-mismatch.result
new file mode 100644
index 000000000..1ba189a37
--- /dev/null
+++ b/test/replication/gh-6094-rs-uuid-mismatch.result
@@ -0,0 +1,67 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-6094: the master instance didn't check whether the subscribing instance
+-- has the same replicaset UUID as its own. As a result, if the peer is from a
+-- different replicaset, the master couldn't find its record in _cluster, and
+-- assumed it simply needed to wait a bit more, causing infinite re-subscribes.
+--
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+test_run:cmd('create server master2 with script="replication/master1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master2')
+ | ---
+ | - true
+ | ...
+test_run:switch('master2')
+ | ---
+ | - true
+ | ...
+replication = test_run:eval('default', 'return box.cfg.listen')[1]
+ | ---
+ | ...
+box.cfg{replication = {replication}}
+ | ---
+ | ...
+assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
+ | ---
+ | - ER_REPLICASET_UUID_MISMATCH
+ | ...
+info = box.info
+ | ---
+ | ...
+repl_info = info.replication[1]
+ | ---
+ | ...
+assert(not repl_info.downstream and not repl_info.upstream)
+ | ---
+ | - true
+ | ...
+assert(info.status == 'orphan')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master2')
+ | ---
+ | - true
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-6094-rs-uuid-mismatch.test.lua b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
new file mode 100644
index 000000000..d3928f52b
--- /dev/null
+++ b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
@@ -0,0 +1,25 @@
+test_run = require('test_run').new()
+
+--
+-- gh-6094: the master instance didn't check whether the subscribing instance
+-- has the same replicaset UUID as its own. As a result, if the peer is from a
+-- different replicaset, the master couldn't find its record in _cluster, and
+-- assumed it simply needed to wait a bit more, causing infinite re-subscribes.
+--
+box.schema.user.grant('guest', 'super')
+
+test_run:cmd('create server master2 with script="replication/master1.lua"')
+test_run:cmd('start server master2')
+test_run:switch('master2')
+replication = test_run:eval('default', 'return box.cfg.listen')[1]
+box.cfg{replication = {replication}}
+assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
+info = box.info
+repl_info = info.replication[1]
+assert(not repl_info.downstream and not repl_info.upstream)
+assert(info.status == 'orphan')
+
+test_run:switch('default')
+test_run:cmd('stop server master2')
+test_run:cmd('delete server master2')
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index dc39e2f74..5acb28fd4 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -45,6 +45,7 @@
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "gh-5536-wal-limit.test.lua": {},
     "gh-5566-final-join-synchro.test.lua": {},
+    "gh-6094-rs-uuid-mismatch.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
-- 
2.24.3 (Apple Git-128)



* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-05-28 20:35 [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-28 22:06 ` Cyrill Gorcunov via Tarantool-patches
  2021-05-28 22:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-01  8:29 ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 11+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-28 22:06 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Fri, May 28, 2021 at 10:35:42PM +0200, Vladislav Shpilevoy wrote:
> +	if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
> +		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
> +			  tt_uuid_str(&REPLICASET_UUID),
> +			  tt_uuid_str(&peer_replicaset_uuid));
> +	}

Vlad, I didn't dive into the patch context yet, but this
use of *two* sequential calls of tt_uuid_str() is somewhat
suspicious. tt_uuid_str() uses a single common buffer for
every call, which means one of the tt_uuid_str() results
will be overwritten, no?


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-05-28 22:06 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-05-28 22:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-28 22:30     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-01  7:52     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 2 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-28 22:20 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tarantool-patches

On 29.05.2021 00:06, Cyrill Gorcunov wrote:
> On Fri, May 28, 2021 at 10:35:42PM +0200, Vladislav Shpilevoy wrote:
>> +	if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
>> +		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
>> +			  tt_uuid_str(&REPLICASET_UUID),
>> +			  tt_uuid_str(&peer_replicaset_uuid));
>> +	}
> 
> Vlad, I didn't dive into the patch context yet, but this
> use of *two* sequential calls of tt_uuid_str() is somewhat
> suspicious. tt_uuid_str() uses a single common buffer for
> every call, which means one of the tt_uuid_str() results
> will be overwritten, no?

Nope, the static buffer is cyclic. Each tt_uuid_str()
call uses only a part of it. In total it is safe to
make ~12 tt_uuid_str() calls in a row, because the
static buffer size is 12KB and one tt_uuid_str() call
consumes 1KB (which is not necessary, it could be much
more compact).
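
For illustration, here is a minimal sketch of the cyclic-buffer idea
(simplified and with made-up names; the real Tarantool static buffer
code differs):

#include <stdio.h>

enum { STATIC_BUF_SIZE = 12 * 1024, STATIC_CHUNK = 1024 };

static char static_buf[STATIC_BUF_SIZE];
static size_t static_pos;

/* Return the next 1KB chunk, wrapping to the start when full. */
static char *
static_chunk(void)
{
	if (static_pos + STATIC_CHUNK > STATIC_BUF_SIZE)
		static_pos = 0;
	char *res = static_buf + static_pos;
	static_pos += STATIC_CHUNK;
	return res;
}

/*
 * Each call formats into its own chunk, so the results of up to ~12
 * consecutive calls stay valid at the same time - which is why two
 * sequential tt_uuid_str() calls in one tnt_raise() are safe.
 */
static const char *
format_into_static(const char *str)
{
	char *buf = static_chunk();
	snprintf(buf, STATIC_CHUNK, "%s", str);
	return buf;
}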


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-05-28 22:20   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-28 22:30     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-01  7:52     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 0 replies; 11+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-28 22:30 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Sat, May 29, 2021 at 12:20:46AM +0200, Vladislav Shpilevoy wrote:
> > 
> > Vlad, I didn't dive into the patch context yet, but this
> > use of *two* sequential calls of tt_uuid_str() is somewhat
> > suspicious. tt_uuid_str() uses a single common buffer for
> > every call, which means one of the tt_uuid_str() results
> > will be overwritten, no?
> 
> Nope, the static buffer is cyclic.

Aha! Thanks for the explanation!


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-05-28 22:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-28 22:30     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-01  7:52     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 0 replies; 11+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-01  7:52 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Sat, May 29, 2021 at 12:20:46AM +0200, Vladislav Shpilevoy wrote:
> > Vlad, I didn't dive into the patch context yet, but this
> > use of *two* sequential calls of tt_uuid_str() is somewhat
> > suspicious. tt_uuid_str() uses a single common buffer for
> > every call, which means one of the tt_uuid_str() results
> > will be overwritten, no?
> 
> Nope, the static buffer is cyclic. Each tt_uuid_str()
> call uses only a part of it. In total it is safe to
> make ~12 tt_uuid_str() calls in a row, because the
> static buffer size is 12KB and one tt_uuid_str() call
> consumes 1KB (which is not necessary, it could be much
> more compact).

Looks OK to me, thanks! I've been thinking about whether we
really need to allow joining from a different replicaset, but
I can't find any scenario. Thus ACK.


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-05-28 20:35 [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process Vladislav Shpilevoy via Tarantool-patches
  2021-05-28 22:06 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-01  8:29 ` Serge Petrenko via Tarantool-patches
  2021-06-01 21:34   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-01  8:29 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



On 28.05.2021 23:35, Vladislav Shpilevoy wrote:
> A remote node doing the subscribe might be from a different
> replicaset.
>
> Before this patch the subscribe would be retried infinitely,
> because the node couldn't be found in _cluster: the master
> assumed it must have joined via another node and that its ID
> should arrive shortly (ER_TOO_EARLY_SUBSCRIBE).
>
> The ID would never arrive, because the node belongs to another
> replicaset.
>
> The patch makes the master check whether the peer lives in the
> same replicaset. Since the peer is doing a subscribe, it must
> have joined already and should have a valid replicaset UUID,
> regardless of whether it is anonymous or not.
>
> The correct behaviour is to cut this peer off immediately,
> without retries.
>
> Closes #6094
> Part of #5613
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6094-replication-bad-error
> Issue: https://github.com/tarantool/tarantool/issues/6094
>
> The UUID ignorance on subscribe decode was introduced here:
> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>
> And I don't understand why. Maybe I am missing something? The tests
> have passed. Sergey, do you remember why it was really needed?
>
> A replicaset UUID mismatch definitely means the node can't connect.
> It is not related to whether the node is anonymous or not, because
> the check has nothing to do with _cluster.

Hi! Thanks for the patch!

That change was meant to help anonymous replicas fetch data from
multiple clusters. There was such an idea back then.
It never got implemented though, and I doubt it will.

So the idea was that the replica should check the replicaset UUID itself.
Obviously, it doesn't work now. And it looks like it hadn't worked even
before you introduced the ER_TOO_EARLY_SUBSCRIBE error.
The test checking the UUID on the replica side didn't catch this problem
because in it the replica was already registered on the master.

Long story short, I'm OK with this change, but now you should remove the
unnecessary replicaset UUID checks from the replica side (in applier_subscribe).

And it looks like the replication/gh-3704-misc-replica-checks-cluster-id
test is obsolete now.


>
>   .../unreleased/gh-6094-rs-uuid-mismatch.md    |  6 ++
>   src/box/box.cc                                | 17 ++++-
>   .../gh-6094-rs-uuid-mismatch.result           | 67 +++++++++++++++++++
>   .../gh-6094-rs-uuid-mismatch.test.lua         | 25 +++++++
>   test/replication/suite.cfg                    |  1 +
>   5 files changed, 114 insertions(+), 2 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
>   create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.result
>   create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.test.lua
>
> diff --git a/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
> new file mode 100644
> index 000000000..f4e47da3d
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
> @@ -0,0 +1,6 @@
> +## bugfix/replication
> +
> +* Fixed an error where a replica, on an attempt to subscribe to a foreign
> +  cluster (with a different replicaset UUID), didn't notice that it is
> +  impossible and instead got stuck in an infinite retry loop printing a "too
> +  early subscribe" error (gh-6094).
> diff --git a/src/box/box.cc b/src/box/box.cc
> index c10e0d8bf..5672073d6 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2686,17 +2686,30 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>   		tnt_raise(ClientError, ER_LOADING);
>   
>   	struct tt_uuid replica_uuid = uuid_nil;
> +	struct tt_uuid peer_replicaset_uuid = uuid_nil;
>   	struct vclock replica_clock;
>   	uint32_t replica_version_id;
>   	vclock_create(&replica_clock);
>   	bool anon;
>   	uint32_t id_filter;
> -	xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock,
> -				 &replica_version_id, &anon, &id_filter);
> +	xrow_decode_subscribe_xc(header, &peer_replicaset_uuid, &replica_uuid,
> +				 &replica_clock, &replica_version_id, &anon,
> +				 &id_filter);
>   
>   	/* Forbid connection to itself */
>   	if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID))
>   		tnt_raise(ClientError, ER_CONNECTION_TO_SELF);
> +	/*
> +	 * The peer should have bootstrapped from somebody since it already
> +	 * tries to subscribe. If it belongs to a different replicaset, it
> +	 * won't ever be found here and would keep reconnecting, thinking
> +	 * its replica ID wasn't replicated here yet. Reject it right away.
> +	 */
> +	if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
> +		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
> +			  tt_uuid_str(&REPLICASET_UUID),
> +			  tt_uuid_str(&peer_replicaset_uuid));
> +	}
>   
>   	/*
>   	 * Do not allow non-anonymous followers for anonymous
> diff --git a/test/replication/gh-6094-rs-uuid-mismatch.result b/test/replication/gh-6094-rs-uuid-mismatch.result
> new file mode 100644
> index 000000000..1ba189a37
> --- /dev/null
> +++ b/test/replication/gh-6094-rs-uuid-mismatch.result
> @@ -0,0 +1,67 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +--
> +-- gh-6094: the master instance didn't check whether the subscribing instance
> +-- has the same replicaset UUID as its own. As a result, if the peer is from a
> +-- different replicaset, the master couldn't find its record in _cluster, and
> +-- assumed it simply needed to wait a bit more, causing infinite re-subscribes.
> +--
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +
> +test_run:cmd('create server master2 with script="replication/master1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master2')
> + | ---
> + | - true
> + | ...
> +test_run:switch('master2')
> + | ---
> + | - true
> + | ...
> +replication = test_run:eval('default', 'return box.cfg.listen')[1]
> + | ---
> + | ...
> +box.cfg{replication = {replication}}
> + | ---
> + | ...
> +assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
> + | ---
> + | - ER_REPLICASET_UUID_MISMATCH
> + | ...
> +info = box.info
> + | ---
> + | ...
> +repl_info = info.replication[1]
> + | ---
> + | ...
> +assert(not repl_info.downstream and not repl_info.upstream)
> + | ---
> + | - true
> + | ...
> +assert(info.status == 'orphan')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server master2')
> + | ---
> + | - true
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> diff --git a/test/replication/gh-6094-rs-uuid-mismatch.test.lua b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
> new file mode 100644
> index 000000000..d3928f52b
> --- /dev/null
> +++ b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
> @@ -0,0 +1,25 @@
> +test_run = require('test_run').new()
> +
> +--
> +-- gh-6094: the master instance didn't check whether the subscribing instance
> +-- has the same replicaset UUID as its own. As a result, if the peer is from a
> +-- different replicaset, the master couldn't find its record in _cluster, and
> +-- assumed it simply needed to wait a bit more, causing infinite re-subscribes.
> +--
> +box.schema.user.grant('guest', 'super')
> +
> +test_run:cmd('create server master2 with script="replication/master1.lua"')
> +test_run:cmd('start server master2')
> +test_run:switch('master2')
> +replication = test_run:eval('default', 'return box.cfg.listen')[1]
> +box.cfg{replication = {replication}}
> +assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
> +info = box.info
> +repl_info = info.replication[1]
> +assert(not repl_info.downstream and not repl_info.upstream)
> +assert(info.status == 'orphan')
> +
> +test_run:switch('default')
> +test_run:cmd('stop server master2')
> +test_run:cmd('delete server master2')
> +box.schema.user.revoke('guest', 'super')
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index dc39e2f74..5acb28fd4 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -45,6 +45,7 @@
>       "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
>       "gh-5536-wal-limit.test.lua": {},
>       "gh-5566-final-join-synchro.test.lua": {},
> +    "gh-6094-rs-uuid-mismatch.test.lua": {},
>       "*": {
>           "memtx": {"engine": "memtx"},
>           "vinyl": {"engine": "vinyl"}

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-06-01  8:29 ` Serge Petrenko via Tarantool-patches
@ 2021-06-01 21:34   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-02  6:36     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-01 21:34 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Thanks for the review!

>> The UUID ignorance on subscribe decode was introduced here:
>> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>>
>> And I don't understand why. Maybe I am missing something? The tests
>> have passed. Sergey, do you remember why it was really needed?
>>
>> A replicaset UUID mismatch definitely means the node can't connect.
>> It is not related to whether the node is anonymous or not, because
>> the check has nothing to do with _cluster.
> 
> Hi! Thanks for the patch!
> 
> That change was meant to help anonymous replicas fetch data from
> multiple clusters. There was such an idea back then.
> It never got implemented though, and I doubt it will.

Wow, I am not sure how it is supposed to work at all. To start with,
different replicasets have conflicting replica IDs, so you won't be
able to keep track of the changes properly. You simply can't pack the
same replica_id from 2 different replicasets into one replica_id in
your own journal.

It does not matter if you are anon or not - the changes in the given
replicasets are generated by non-anon nodes, and their replica IDs
will clash.

> So the idea was that the replica should check the replicaset UUID itself.
> Obviously, it doesn't work now. And it looks like it hadn't worked even
> before you introduced the ER_TOO_EARLY_SUBSCRIBE error.
> The test checking the UUID on the replica side didn't catch this problem
> because in it the replica was already registered on the master.
> 
> Long story short, I'm OK with this change, but now you should remove the
> unnecessary replicaset UUID checks from the replica side (in applier_subscribe).

If it is not critical for you, I would leave as many checks as possible
on both sides. It has nothing to do with perf, but protects against
unexpected situations like this one. Let me know if you want them
deleted anyway.

> And it looks like the replication/gh-3704-misc-replica-checks-cluster-id
> test is obsolete now.

Yes, looks like it. I deleted it on the branch.


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-06-01 21:34   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-02  6:36     ` Serge Petrenko via Tarantool-patches
  2021-06-02 20:16       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-02  6:36 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



On 02.06.2021 00:34, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!

Thanks for the fixes!
>
>>> The UUID ignorance on subscribe decode was introduced here:
>>> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>>>
>>> And I don't understand why. Maybe I am missing something? The tests
>>> have passed. Sergey, do you remember why it was really needed?
>>>
>>> A replicaset UUID mismatch definitely means the node can't connect.
>>> It is not related to whether the node is anonymous or not, because
>>> the check has nothing to do with _cluster.
>> Hi! Thanks for the patch!
>>
>> That change was meant to help anonymous replicas fetch data from
>> multiple clusters. There was such an idea back then.
>> It never got implemented though, and I doubt it will.
> Wow, I am not sure how it is supposed to work at all. To start with,
> different replicasets have conflicting replica IDs, so you won't be
> able to keep track of the changes properly. You simply can't pack the
> same replica_id from 2 different replicasets into one replica_id in
> your own journal.
>
> It does not matter if you are anon or not - the changes in the given
> replicasets are generated by non-anon nodes, and their replica IDs
> will clash.

I guess you could assign the same replica_id to all the changes coming from
one cluster. Just assign LSNs in the order of arrival, or something.
Or you could maintain a vclock per cluster, assign the same replica_id to
each change coming from this cluster, and assign vclock_sum as the change's
LSN.
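
For what it's worth, a rough sketch of the second idea in C, assuming
invented glue - foreign_cluster_clock() and FOREIGN_CLUSTER_ID don't
exist; only vclock_follow(), vclock_sum() and the types are real:

#include "box/vclock.h"	/* vclock_follow(), vclock_sum() */
#include "box/xrow.h"	/* struct xrow_header */

/*
 * Remap a row coming from a foreign cluster: the per-cluster vclock
 * remembers the row's original (replica_id, lsn), while the local
 * journal sees one fixed replica_id with a flattened LSN.
 */
static void
remap_foreign_row(struct xrow_header *row,
		  const struct tt_uuid *peer_replicaset_uuid)
{
	/* Hypothetical lookup of the per-cluster vclock. */
	struct vclock *cluster_clock =
		foreign_cluster_clock(peer_replicaset_uuid);
	/* Track the row's origin in the per-cluster vclock. */
	vclock_follow(cluster_clock, row->replica_id, row->lsn);
	/* Journal it locally under one hypothetical fixed ID. */
	row->replica_id = FOREIGN_CLUSTER_ID;
	row->lsn = vclock_sum(cluster_clock);
}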


>
>> So the idea was that the replica should check the replicaset UUID itself.
>> Obviously, it doesn't work now. And it looks like it hadn't worked even
>> before you introduced the ER_TOO_EARLY_SUBSCRIBE error.
>> The test checking the UUID on the replica side didn't catch this problem
>> because in it the replica was already registered on the master.
>>
>> Long story short, I'm OK with this change, but now you should remove the
>> unnecessary replicaset UUID checks from the replica side (in applier_subscribe).
> If it is not critical for you, I would leave as many checks as possible
> on both sides. It has nothing to do with perf, but protects against
> unexpected situations like this one. Let me know if you want them
> deleted anyway.

Ok, I'm not against it.

>> And it looks like the replication/gh-3704-misc-replica-checks-cluster-id
>> test is obsolete now.
> Yes, looks like it. I deleted it on the branch.


Sorry, one more nit. I failed to notice this earlier, but there's an
obsolete comment in box_process_subscribe().

It goes like
"Tarantool > 2.1.1 master doesn't check that replica has
the same cluster uuid".

LGTM, once you delete the comment.

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-06-02  6:36     ` Serge Petrenko via Tarantool-patches
@ 2021-06-02 20:16       ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-03  7:14         ` Serge Petrenko via Tarantool-patches
  2021-06-03  7:18         ` Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-02 20:16 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Good job on the review!

>>>> The UUID ignorance on subscribe decode was introduced here:
>>>> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>>>>
>>>> And I don't understand why. Maybe I am missing something? The tests
>>>> have passed. Sergey, do you remember why it was really needed?
>>>>
>>>> A replicaset UUID mismatch definitely means the node can't connect.
>>>> It is not related to whether the node is anonymous or not, because
>>>> the check has nothing to do with _cluster.
>>> Hi! Thanks for the patch!
>>>
>>> That change was meant to help anonymous replicas fetch data from
>>> multiple clusters. There was such an idea back then.
>>> It never got implemented though, and I doubt it will.
>> Wow, I am not sure how it is supposed to work at all. To start with,
>> different replicasets have conflicting replica IDs, so you won't be
>> able to keep track of the changes properly. You simply can't pack the
>> same replica_id from 2 different replicasets into one replica_id in
>> your own journal.
>>
>> It does not matter if you are anon or not - the changes in the given
>> replicasets are generated by non-anon nodes, and their replica IDs
>> will clash.
> 
> I guess you could assign the same replica_id to all the changes coming from
> one cluster. Just assign LSNs in the order of arrival, or something.

How will you then resubscribe to get the changes from one RS starting
from a certain vclock? The same applies to the idea below.

When you resubscribe, you need to tell the other RS starting from which
vclock it should send data. If you changed the received replica IDs, then
you don't have the needed information persisted anywhere.

Absence of a resubscribe means a rejoin on each restart and disconnect,
which makes it hardly usable.

> Sorry, one more nit. I failed to notice this earlier, but there's an obsolete
> comment in box_process_subscribe().
> 
> It goes like
> "Tarantool > 2.1.1 master doesn't check that replica has
> the same cluster uuid".
> 
> LGTM, once you delete the comment.

Thanks for noticing. I didn't delete it, but applied the diff below,
pushed to master, 2.8, 2.7.

====================
@@ -2786,11 +2786,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 	 * the replica how many rows we have in stock for it,
 	 * and identify ourselves with our own replica id.
 	 *
-	 * Tarantool > 2.1.1 master doesn't check that replica
-	 * has the same cluster id. Instead it sends its cluster
-	 * id to replica, and replica checks that its cluster id
-	 * matches master's one. Older versions will just ignore
-	 * the additional field.
+	 * Master not only checks the replica has the same replicaset UUID, but
+	 * also sends the UUID to the replica so both Tarantools could perform
+	 * any checks they want depending on their version and supported
+	 * features.
+	 *
+	 * Older versions not supporting replicaset UUID in the response will
+	 * just ignore the additional field (these are < 2.1.1).
 	 */
 	struct xrow_header row;
 	xrow_encode_subscribe_response_xc(&row, &REPLICASET_UUID, &vclock);


* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-06-02 20:16       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-03  7:14         ` Serge Petrenko via Tarantool-patches
  2021-06-03  7:18         ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-03  7:14 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



On 02.06.2021 23:16, Vladislav Shpilevoy wrote:
> Hi! Good job on the review!

Thanks!
>
>>>>> The UUID ignorance on subscribe decode was introduced here:
>>>>> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>>>>>
>>>>> And I don't understand why. Maybe I am missing something? The tests
>>>>> have passed. Sergey, do you remember why it was really needed?
>>>>>
>>>>> A replicaset UUID mismatch definitely means the node can't connect.
>>>>> It is not related to whether the node is anonymous or not, because
>>>>> the check has nothing to do with _cluster.
>>>> Hi! Thanks for the patch!
>>>>
>>>> That change was meant to help anonymous replicas fetch data from
>>>> multiple clusters. There was such an idea back then.
>>>> It never got implemented though, and I doubt it will.
>>> Wow, I am not sure how it is supposed to work at all. To start with,
>>> different replicasets have conflicting replica IDs, so you won't be
>>> able to keep track of the changes properly. You simply can't pack the
>>> same replica_id from 2 different replicasets into one replica_id in
>>> your own journal.
>>>
>>> It does not matter if you are anon or not - the changes in the given
>>> replicasets are generated by non-anon nodes, and their replica IDs
>>> will clash.
>> I guess you could assign the same replica_id to all the changes coming from
>> one cluster. Just assign LSNs in the order of arrival, or something.
> How will you then resubscribe to get the changes from one RS starting
> from a certain vclock? The same applies to the idea below.
>
> When you resubscribe, you need to tell the other RS starting from which
> vclock it should send data. If you changed the received replica IDs, then
> you don't have the needed information persisted anywhere.
>
> Absence of a resubscribe means a rejoin on each restart and disconnect,
> which makes it hardly usable.
>
>> Sorry, one more nit. I failed to notice this earlier, but there's an obsolete
>> comment in box_process_subscribe().
>>
>> It goes like
>> "Tarantool > 2.1.1 master doesn't check that replica has
>> the same cluster uuid".
>>
>> LGTM, once you delete the comment.
> Thanks for noticing. I didn't delete it, but applied the diff below,
> pushed to master, 2.8, 2.7.
>
> ====================
> @@ -2786,11 +2786,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>   	 * the replica how many rows we have in stock for it,
>   	 * and identify ourselves with our own replica id.
>   	 *
> -	 * Tarantool > 2.1.1 master doesn't check that replica
> -	 * has the same cluster id. Instead it sends its cluster
> -	 * id to replica, and replica checks that its cluster id
> -	 * matches master's one. Older versions will just ignore
> -	 * the additional field.
> +	 * Master not only checks the replica has the same replicaset UUID, but
> +	 * also sends the UUID to the replica so both Tarantools could perform
> +	 * any checks they want depending on their version and supported
> +	 * features.
> +	 *
> +	 * Older versions not supporting replicaset UUID in the response will
> +	 * just ignore the additional field (these are < 2.1.1).
>   	 */

Nice comment!

>   	struct xrow_header row;
>   	xrow_encode_subscribe_response_xc(&row, &REPLICASET_UUID, &vclock);

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] replication: check rs uuid on subscribe process
  2021-06-02 20:16       ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-03  7:14         ` Serge Petrenko via Tarantool-patches
@ 2021-06-03  7:18         ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-03  7:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



On 02.06.2021 23:16, Vladislav Shpilevoy wrote:
> Hi! Good job on the review!
>
>>>>> The UUID ignorance on subscribe decode was introduced here:
>>>>> https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e
>>>>>
>>>>> And I don't understand why. Maybe I am missing something? The tests
>>>>> have passed. Sergey, do you remember why it was really needed?
>>>>>
>>>>> A replicaset UUID mismatch definitely means the node can't connect.
>>>>> It is not related to whether the node is anonymous or not, because
>>>>> the check has nothing to do with _cluster.
>>>> Hi! Thanks for the patch!
>>>>
>>>> That change was meant to help anonymous replicas fetch data from
>>>> multiple clusters. There was such an idea back then.
>>>> It never got implemented though, and I doubt it will.
>>> Wow, I am not sure how it is supposed to work at all. To start with,
>>> different replicasets have conflicting replica IDs, so you won't be
>>> able to keep track of the changes properly. You simply can't pack the
>>> same replica_id from 2 different replicasets into one replica_id in
>>> your own journal.
>>>
>>> It does not matter if you are anon or not - the changes in the given
>>> replicasets are generated by non-anon nodes, and their replica IDs
>>> will clash.
>> I guess you could assign the same replica_id to all the changes coming from
>> one cluster. Just assign LSNs in the order of arrival, or something.
> How will you then resubscribe to get the changes from one RS starting
> from a certain vclock? The same applies to the idea below.
>
> When you resubscribe, you need to tell the other RS starting from which
> vclock it should send data. If you changed the received replica IDs, then
> you don't have the needed information persisted anywhere.
>
> Absence of a resubscribe means a rejoin on each restart and disconnect,
> which makes it hardly usable.

Maybe save the received vclocks in some local space? It would have
[cluster_uuid; vclock] pairs. Then you could recover all the data first,
and know which vclock to send in a resubscribe to every cluster
representative.
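
A sketch of that bookkeeping, with invented names - the entry struct
and local_space_replace() don't exist; only the vclock/tt_uuid types
and vclock_follow() are real:

#include "box/vclock.h"	/* struct vclock, vclock_follow() */
#include "uuid/tt_uuid.h"	/* struct tt_uuid */

struct cluster_vclock_entry {
	/* The foreign replicaset the data comes from. */
	struct tt_uuid replicaset_uuid;
	/* The last vclock received from it. */
	struct vclock vclock;
};

/* After applying a foreign row: advance and persist the pair. */
static void
save_foreign_progress(struct cluster_vclock_entry *e,
		      uint32_t replica_id, int64_t lsn)
{
	vclock_follow(&e->vclock, replica_id, lsn);
	/* Hypothetical write into the [cluster_uuid; vclock] space. */
	local_space_replace(&e->replicaset_uuid, &e->vclock);
}

/*
 * On reconnect, the saved vclock is read back and sent in the
 * SUBSCRIBE request to that cluster - so no rejoin is needed.
 */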


>
>> Sorry, one more nit. I failed to notice this earlier, but there's an obsolete
>> comment in box_process_subscribe().
>>
>> It goes like
>> "Tarantool > 2.1.1 master doesn't check that replica has
>> the same cluster uuid".
>>
>> LGTM, once you delete the comment.
> Thanks for noticing. I didn't delete it, but applied the diff below,
> pushed to master, 2.8, 2.7.
>
> ====================
> @@ -2786,11 +2786,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>   	 * the replica how many rows we have in stock for it,
>   	 * and identify ourselves with our own replica id.
>   	 *
> -	 * Tarantool > 2.1.1 master doesn't check that replica
> -	 * has the same cluster id. Instead it sends its cluster
> -	 * id to replica, and replica checks that its cluster id
> -	 * matches master's one. Older versions will just ignore
> -	 * the additional field.
> +	 * Master not only checks the replica has the same replicaset UUID, but
> +	 * also sends the UUID to the replica so both Tarantools could perform
> +	 * any checks they want depending on their version and supported
> +	 * features.
> +	 *
> +	 * Older versions not supporting replicaset UUID in the response will
> +	 * just ignore the additional field (these are < 2.1.1).
>   	 */
>   	struct xrow_header row;
>   	xrow_encode_subscribe_response_xc(&row, &REPLICASET_UUID, &vclock);

-- 
Serge Petrenko


