From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: v.shpilevoy@tarantool.org, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: [Tarantool-patches] [PATCH v4 06/16] replication: send latest effective promote in initial join
Date: Wed, 14 Jul 2021 21:25:34 +0300
Message-ID: <de13e1f50454b0cf2cbcd2cdfeb5082e36c55ee6.1626287002.git.sergepetrenko@tarantool.org>
In-Reply-To: <cover.1626287002.git.sergepetrenko@tarantool.org>

A joining instance may never receive the latest PROMOTE request, which
is the only source of information about the limbo owner. Send out the
latest limbo state (i.e. the latest applied PROMOTE request) together
with the initial join snapshot.

Prerequisite #6034
---
 src/box/applier.cc                       |  5 ++
 src/box/relay.cc                         |  9 ++-
 test/replication/replica_rejoin.result   | 77 ++++++++++++++----------
 test/replication/replica_rejoin.test.lua | 50 +++++++--------
 4 files changed, 85 insertions(+), 56 deletions(-)
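
Note (not part of the commit): the exchange described above can be
modelled roughly as follows. This is a minimal standalone sketch, not the
code from this patch. RowType, Row, Limbo and both functions are
hypothetical stand-ins for the real xrow, txn_limbo, relay and applier
machinery. The model also flattens the metadata stage into a single
stream and assumes that terms never decrease within it.

```cpp
// Illustrative model only: every type and function here is hypothetical
// and merely mirrors the shape of the JOIN metadata exchange.
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class RowType { JoinMeta, Promote, JoinSnapshot, Other };

struct Row {
	RowType type;
	uint32_t replica_id; // limbo owner, meaningful for Promote rows only
	uint64_t term;       // promote term, meaningful for Promote rows only
};

struct Limbo {
	uint32_t owner_id;     // 0 means the limbo has no owner
	uint64_t promote_term; // term of the latest applied promote
};

// Master side: the metadata stage now carries the limbo state as a
// PROMOTE-like row placed between JOIN_META and JOIN_SNAPSHOT.
std::vector<Row>
relay_join_meta(const Limbo &limbo)
{
	return {
		{RowType::JoinMeta, 0, 0},
		{RowType::Promote, limbo.owner_id, limbo.promote_term},
		{RowType::JoinSnapshot, 0, 0},
	};
}

// Replica side: consume metadata rows until JOIN_SNAPSHOT, applying any
// promote row on the way and rejecting everything unexpected.
Limbo
applier_wait_snapshot(const std::vector<Row> &stream)
{
	Limbo limbo{0, 0};
	for (const Row &row : stream) {
		switch (row.type) {
		case RowType::JoinMeta:
			break;
		case RowType::Promote:
			// Assumption of this model: terms do not go backwards.
			assert(row.term >= limbo.promote_term);
			limbo.owner_id = row.replica_id;
			limbo.promote_term = row.term;
			break;
		case RowType::JoinSnapshot:
			return limbo;
		default:
			throw std::runtime_error("unknown request type");
		}
	}
	throw std::runtime_error("stream ended before JOIN_SNAPSHOT");
}
```

With a master-side Limbo of {3, 7}, passing relay_join_meta() straight
into applier_wait_snapshot() yields a replica-side Limbo with owner 3 and
term 7, which is the property the patch is after: the joining instance
learns the limbo owner even if it never sees the original PROMOTE.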

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 0f81b7cc4..4088fcc21 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -454,6 +454,11 @@ applier_wait_snapshot(struct applier *applier)
 			coio_read_xrow(coio, ibuf, &row);
 			if (iproto_type_is_error(row.type)) {
 				xrow_decode_error_xc(&row);
+			} else if (iproto_type_is_promote_request(row.type)) {
+				struct synchro_request req;
+				if (xrow_decode_synchro(&row, &req) != 0)
+					diag_raise();
+				txn_limbo_process(&txn_limbo, &req);
 			} else if (row.type != IPROTO_JOIN_SNAPSHOT) {
 				tnt_raise(ClientError, ER_UNKNOWN_REQUEST_TYPE,
 					  (uint32_t)row.type);
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 4ebe0fb06..4b102a777 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -427,6 +427,9 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
 	if (txn_limbo_wait_confirm(&txn_limbo) != 0)
 		diag_raise();
 
+	struct synchro_request req;
+	txn_limbo_checkpoint(&txn_limbo, &req);
+
 	/* Respond to the JOIN request with the current vclock. */
 	struct xrow_header row;
 	xrow_encode_vclock_xc(&row, vclock);
@@ -442,7 +445,11 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock,
 		row.type = IPROTO_JOIN_META;
 		coio_write_xrow(&relay->io, &row);
 
-		/* Empty at the moment. */
+		char body[XROW_SYNCHRO_BODY_LEN_MAX];
+		xrow_encode_synchro(&row, body, &req);
+		row.replica_id = req.replica_id;
+		row.sync = sync;
+		coio_write_xrow(&relay->io, &row);
 
 		/* Mark the end of the metadata stream. */
 		row.type = IPROTO_JOIN_SNAPSHOT;
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 843333a19..e489c150a 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -7,10 +7,19 @@ test_run = env.new()
 log = require('log')
 ---
 ...
-engine = test_run:get_cfg('engine')
+test_run:cmd("create server master with script='replication/master1.lua'")
 ---
+- true
 ...
-test_run:cleanup_cluster()
+test_run:cmd("start server master")
+---
+- true
+...
+test_run:switch("master")
+---
+- true
+...
+engine = test_run:get_cfg('engine')
 ---
 ...
 --
@@ -43,7 +52,7 @@ _ = box.space.test:insert{3}
 ---
 ...
 -- Join a replica, then stop it.
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
+test_run:cmd("create server replica with rpl_master=master, script='replication/replica_rejoin.lua'")
 ---
 - true
 ...
@@ -65,7 +74,7 @@ box.space.test:select()
   - [2]
   - [3]
 ...
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
@@ -75,7 +84,7 @@ test_run:cmd("stop server replica")
 ...
 -- Restart the server to purge the replica from
 -- the garbage collection state.
-test_run:cmd("restart server default")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 ---
 ...
@@ -146,7 +155,7 @@ box.space.test:select()
   - [20]
   - [30]
 ...
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
@@ -154,7 +163,7 @@ test_run:cmd("switch default")
 for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
 ---
 ...
-vclock = test_run:get_vclock('default')
+vclock = test_run:get_vclock('master')
 ---
 ...
 vclock[0] = nil
@@ -191,7 +200,7 @@ box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
 ---
 - [1, 2, 3]
 ...
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
@@ -199,7 +208,7 @@ test_run:cmd("stop server replica")
 ---
 - true
 ...
-test_run:cmd("restart server default")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 ---
 ...
@@ -253,7 +262,7 @@ box.space.test:select()
 -- from the replica.
 --
 -- Bootstrap a new replica.
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
@@ -295,7 +304,7 @@ box.cfg{replication = ''}
 ---
 ...
 -- Bump vclock on the master.
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
@@ -317,15 +326,15 @@ vclock = test_run:get_vclock('replica')
 vclock[0] = nil
 ---
 ...
-_ = test_run:wait_vclock('default', vclock)
+_ = test_run:wait_vclock('master', vclock)
 ---
 ...
 -- Restart the master and force garbage collection.
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 ---
 - true
 ...
-test_run:cmd("restart server default")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 ---
 ...
@@ -373,7 +382,7 @@ vclock = test_run:get_vclock('replica')
 vclock[0] = nil
 ---
 ...
-_ = test_run:wait_vclock('default', vclock)
+_ = test_run:wait_vclock('master', vclock)
 ---
 ...
 -- Restart the replica. It should successfully rebootstrap.
@@ -396,38 +405,42 @@ test_run:cmd("switch default")
 ---
 - true
 ...
-box.cfg{replication = ''}
+test_run:cmd("stop server replica")
 ---
+- true
 ...
-test_run:cmd("stop server replica")
+test_run:cmd("delete server replica")
 ---
 - true
 ...
-test_run:cmd("cleanup server replica")
+test_run:cmd("stop server master")
 ---
 - true
 ...
-test_run:cmd("delete server replica")
+test_run:cmd("delete server master")
 ---
 - true
 ...
-test_run:cleanup_cluster()
+--
+-- gh-4107: rebootstrap fails if the replica was deleted from
+-- the cluster on the master.
+--
+test_run:cmd("create server master with script='replication/master1.lua'")
 ---
+- true
 ...
-box.space.test:drop()
+test_run:cmd("start server master")
 ---
+- true
 ...
-box.schema.user.revoke('guest', 'replication')
+test_run:switch("master")
 ---
+- true
 ...
---
--- gh-4107: rebootstrap fails if the replica was deleted from
--- the cluster on the master.
---
 box.schema.user.grant('guest', 'replication')
 ---
 ...
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica_uuid.lua'")
+test_run:cmd("create server replica with rpl_master=master, script='replication/replica_uuid.lua'")
 ---
 - true
 ...
@@ -462,11 +475,11 @@ box.space._cluster:get(2) ~= nil
 ---
 - true
 ...
-test_run:cmd("stop server replica")
+test_run:switch("default")
 ---
 - true
 ...
-test_run:cmd("cleanup server replica")
+test_run:cmd("stop server replica")
 ---
 - true
 ...
@@ -474,9 +487,11 @@ test_run:cmd("delete server replica")
 ---
 - true
 ...
-box.schema.user.revoke('guest', 'replication')
+test_run:cmd("stop server master")
 ---
+- true
 ...
-test_run:cleanup_cluster()
+test_run:cmd("delete server master")
 ---
+- true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index c3ba9bf3f..2563177cf 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -1,9 +1,11 @@
 env = require('test_run')
 test_run = env.new()
 log = require('log')
-engine = test_run:get_cfg('engine')
 
-test_run:cleanup_cluster()
+test_run:cmd("create server master with script='replication/master1.lua'")
+test_run:cmd("start server master")
+test_run:switch("master")
+engine = test_run:get_cfg('engine')
 
 --
 -- gh-5806: this replica_rejoin test relies on the wal cleanup fiber
@@ -23,17 +25,17 @@ _ = box.space.test:insert{2}
 _ = box.space.test:insert{3}
 
 -- Join a replica, then stop it.
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
+test_run:cmd("create server replica with rpl_master=master, script='replication/replica_rejoin.lua'")
 test_run:cmd("start server replica")
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 test_run:cmd("stop server replica")
 
 -- Restart the server to purge the replica from
 -- the garbage collection state.
-test_run:cmd("restart server default")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 
 -- Make some checkpoints to remove old xlogs.
@@ -58,11 +60,11 @@ box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 
 -- Make sure the replica follows new changes.
 for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
-vclock = test_run:get_vclock('default')
+vclock = test_run:get_vclock('master')
 vclock[0] = nil
 _ = test_run:wait_vclock('replica', vclock)
 test_run:cmd("switch replica")
@@ -76,9 +78,9 @@ box.space.test:select()
 -- Check that rebootstrap is NOT initiated unless the replica
 -- is strictly behind the master.
 box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 test_run:cmd("stop server replica")
-test_run:cmd("restart server default")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
@@ -99,7 +101,7 @@ box.space.test:select()
 --
 
 -- Bootstrap a new replica.
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
@@ -113,17 +115,17 @@ box.cfg{replication = replica_listen}
 test_run:cmd("switch replica")
 box.cfg{replication = ''}
 -- Bump vclock on the master.
-test_run:cmd("switch default")
+test_run:cmd("switch master")
 box.space.test:replace{1}
 -- Bump vclock on the replica.
 test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
 vclock[0] = nil
-_ = test_run:wait_vclock('default', vclock)
+_ = test_run:wait_vclock('master', vclock)
 -- Restart the master and force garbage collection.
-test_run:cmd("switch default")
-test_run:cmd("restart server default")
+test_run:cmd("switch master")
+test_run:cmd("restart server master")
 box.cfg{wal_cleanup_delay = 0}
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil
@@ -139,7 +141,7 @@ test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
 vclock[0] = nil
-_ = test_run:wait_vclock('default', vclock)
+_ = test_run:wait_vclock('master', vclock)
 -- Restart the replica. It should successfully rebootstrap.
 test_run:cmd("restart server replica with args='true'")
 box.space.test:select()
@@ -148,20 +150,20 @@ box.space.test:replace{2}
 
 -- Cleanup.
 test_run:cmd("switch default")
-box.cfg{replication = ''}
 test_run:cmd("stop server replica")
-test_run:cmd("cleanup server replica")
 test_run:cmd("delete server replica")
-test_run:cleanup_cluster()
-box.space.test:drop()
-box.schema.user.revoke('guest', 'replication')
+test_run:cmd("stop server master")
+test_run:cmd("delete server master")
 
 --
 -- gh-4107: rebootstrap fails if the replica was deleted from
 -- the cluster on the master.
 --
+test_run:cmd("create server master with script='replication/master1.lua'")
+test_run:cmd("start server master")
+test_run:switch("master")
 box.schema.user.grant('guest', 'replication')
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica_uuid.lua'")
+test_run:cmd("create server replica with rpl_master=master, script='replication/replica_uuid.lua'")
 start_cmd = string.format("start server replica with args='%s'", require('uuid').new())
 box.space._cluster:get(2) == nil
 test_run:cmd(start_cmd)
@@ -170,8 +172,8 @@ test_run:cmd("cleanup server replica")
 box.space._cluster:delete(2) ~= nil
 test_run:cmd(start_cmd)
 box.space._cluster:get(2) ~= nil
+test_run:switch("default")
 test_run:cmd("stop server replica")
-test_run:cmd("cleanup server replica")
 test_run:cmd("delete server replica")
-box.schema.user.revoke('guest', 'replication')
-test_run:cleanup_cluster()
+test_run:cmd("stop server master")
+test_run:cmd("delete server master")
-- 
2.30.1 (Apple Git-130)

