[Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1)

Tarantool development patches archive
 help / color / mirror / Atom feed

* [Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1)
@ 2021-09-30  9:44 Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 1/3] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-30  9:44 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Please take a look once time permit.

v19 (by Vlad):
 - do not modify box_issue_promote and demote (while they are
   still simply utter code duplication but whatever)
 - make txn_limbo_process being void
 - make txn_limbo_process_begin/commit/rollback being void
 - the real processing of request under the lock named as
   txn_limbo_process_core
 - testcase completely reworked (kudos to SergeP)
 - note that if we import test to the master branch without
   ordering pass it will fire assertion
 - dropped off debug info from box.info interface

Guys, also I noted that the test while being passed pretty fine locally
sometimes doesn't pass on CI due to timeout in waiting record {2} propagation,
so any advices on the test is quite welcome!

branch gorcunov/gh-6036-rollback-confirm-19
issue https://github.com/tarantool/tarantool/issues/6036
previous series https://lists.tarantool.org/tarantool-patches/20210922130535.79479-1-gorcunov@gmail.com/

Cyrill Gorcunov (3):
  latch: add latch_is_locked helper
  qsync: order access to the limbo terms
  test: add gh-6036-qsync-order test

 src/box/applier.cc                            |  12 +-
 src/box/box.cc                                |  15 +-
 src/box/txn_limbo.c                           |  17 +-
 src/box/txn_limbo.h                           |  53 ++++-
 src/lib/core/latch.h                          |  11 +
 test/replication/gh-6036-qsync-master.lua     |   1 +
 test/replication/gh-6036-qsync-node.lua       |  35 +++
 test/replication/gh-6036-qsync-order.result   | 207 ++++++++++++++++++
 test/replication/gh-6036-qsync-order.test.lua |  95 ++++++++
 test/replication/gh-6036-qsync-replica1.lua   |   1 +
 test/replication/gh-6036-qsync-replica2.lua   |   1 +
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 13 files changed, 433 insertions(+), 18 deletions(-)
 create mode 120000 test/replication/gh-6036-qsync-master.lua
 create mode 100644 test/replication/gh-6036-qsync-node.lua
 create mode 100644 test/replication/gh-6036-qsync-order.result
 create mode 100644 test/replication/gh-6036-qsync-order.test.lua
 create mode 120000 test/replication/gh-6036-qsync-replica1.lua
 create mode 120000 test/replication/gh-6036-qsync-replica2.lua


base-commit: 33830978870bbe8cdc3c23fa06ba0f141a960124
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v19 1/3] latch: add latch_is_locked helper
  2021-09-30  9:44 [Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1) Cyrill Gorcunov via Tarantool-patches
@ 2021-09-30  9:44 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test Cyrill Gorcunov via Tarantool-patches
  2 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-30  9:44 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

To test if latch is locked.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/lib/core/latch.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/lib/core/latch.h b/src/lib/core/latch.h
index 49c59cf63..0aaa8b634 100644
--- a/src/lib/core/latch.h
+++ b/src/lib/core/latch.h
@@ -95,6 +95,17 @@ latch_owner(struct latch *l)
 	return l->owner;
 }
 
+/**
+ * Return true if the latch is locked.
+ *
+ * @param l - latch to be tested.
+ */
+static inline bool
+latch_is_locked(const struct latch *l)
+{
+	return l->owner != NULL;
+}
+
 /**
  * Lock a latch. If the latch is already locked by another fiber,
  * waits for timeout.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-09-30  9:44 [Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1) Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 1/3] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
@ 2021-09-30  9:44 ` Cyrill Gorcunov via Tarantool-patches
  2021-10-01 12:14   ` Serge Petrenko via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test Cyrill Gorcunov via Tarantool-patches
  2 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-30  9:44 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Limbo terms tracking is shared between appliers and when
one of appliers is waiting for write to complete inside
journal_write() routine, an other may need to access read
term value to figure out if promote request is valid to
apply. Due to cooperative multitasking access to the terms
is not consistent so we need to be sure that other fibers
read up to date terms (ie written to the WAL).

For this sake we use a latching mechanism, when one fiber
takes a lock for updating other readers are waiting until
the operation is complete.

For example here is a call graph of two appliers

applier 1
---------
applier_apply_tx
  (promote term = 3
   current max term = 2)
  applier_synchro_filter_tx
  apply_synchro_row
    journal_write
      (sleeping)

at this moment another applier comes in with obsolete
data and term 2

                              applier 2
                              ---------
                              applier_apply_tx
                                (term 2)
                                applier_synchro_filter_tx
                                  txn_limbo_is_replica_outdated -> false
                                journal_write (sleep)

applier 1
---------
journal wakes up
  apply_synchro_row_cb
    set max term to 3

So the applier 2 didn't notice that term 3 is already seen
and wrote obsolete data. With locking the applier 2 will
wait until applier 1 has finished its write.

We introduce the following helpers:

1) txn_limbo_process_begin: which takes a lock
2) txn_limbo_process_commit and txn_limbo_process_rollback
   which simply release the lock but have different names
   for better semantics
3) txn_limbo_process is a general function which uses x_begin
   and x_commit helper internally
4) txn_limbo_process_core to do a real job over processing the
   request, it implies that txn_limbo_process_begin been called

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc  | 12 +++++++---
 src/box/box.cc      | 15 ++++++++-----
 src/box/txn_limbo.c | 17 +++++++++++++--
 src/box/txn_limbo.h | 53 ++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index b981bd436..46c36e259 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -857,7 +857,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 		applier_rollback_by_wal_io(entry->res);
 	} else {
 		replica_txn_wal_write_cb(synchro_entry->rcb);
-		txn_limbo_process(&txn_limbo, synchro_entry->req);
+		txn_limbo_process_core(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
 	fiber_wakeup(synchro_entry->owner);
@@ -873,6 +873,8 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
+	txn_limbo_process_begin(&txn_limbo);
+
 	struct replica_cb_data rcb_data;
 	struct synchro_entry entry;
 	/*
@@ -910,12 +912,16 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 	 * transactions side, including the async ones.
 	 */
 	if (journal_write(&entry.base) != 0)
-		goto err;
+		goto err_rollback;
 	if (entry.base.res < 0) {
 		diag_set_journal_res(entry.base.res);
-		goto err;
+		goto err_rollback;
 	}
+	txn_limbo_process_commit(&txn_limbo);
 	return 0;
+
+err_rollback:
+	txn_limbo_process_rollback(&txn_limbo);
 err:
 	diag_log();
 	return -1;
diff --git a/src/box/box.cc b/src/box/box.cc
index e082e1a3d..6a9be745a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1677,8 +1677,6 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
 	struct raft *raft = box_raft();
 	assert(raft->volatile_term == raft->term);
 	assert(promote_lsn >= 0);
-	txn_limbo_write_promote(&txn_limbo, promote_lsn,
-				raft->term);
 	struct synchro_request req = {
 		.type = IPROTO_RAFT_PROMOTE,
 		.replica_id = prev_leader_id,
@@ -1686,8 +1684,11 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
 		.lsn = promote_lsn,
 		.term = raft->term,
 	};
-	txn_limbo_process(&txn_limbo, &req);
+	txn_limbo_process_begin(&txn_limbo);
+	txn_limbo_write_promote(&txn_limbo, req.lsn, req.term);
+	txn_limbo_process_core(&txn_limbo, &req);
 	assert(txn_limbo_is_empty(&txn_limbo));
+	txn_limbo_process_commit(&txn_limbo);
 }
 
 /** A guard to block multiple simultaneous promote()/demote() invocations. */
@@ -1699,8 +1700,6 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
 {
 	assert(box_raft()->volatile_term == box_raft()->term);
 	assert(promote_lsn >= 0);
-	txn_limbo_write_demote(&txn_limbo, promote_lsn,
-				box_raft()->term);
 	struct synchro_request req = {
 		.type = IPROTO_RAFT_DEMOTE,
 		.replica_id = prev_leader_id,
@@ -1708,8 +1707,12 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
 		.lsn = promote_lsn,
 		.term = box_raft()->term,
 	};
-	txn_limbo_process(&txn_limbo, &req);
+	txn_limbo_process_begin(&txn_limbo);
+	txn_limbo_write_demote(&txn_limbo, promote_lsn,
+				box_raft()->term);
+	txn_limbo_process_core(&txn_limbo, &req);
 	assert(txn_limbo_is_empty(&txn_limbo));
+	txn_limbo_process_commit(&txn_limbo);
 }
 
 int
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 70447caaf..855c98c98 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -47,6 +47,7 @@ txn_limbo_create(struct txn_limbo *limbo)
 	vclock_create(&limbo->vclock);
 	vclock_create(&limbo->promote_term_map);
 	limbo->promote_greatest_term = 0;
+	latch_create(&limbo->promote_latch);
 	limbo->confirmed_lsn = 0;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
@@ -724,11 +725,14 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout)
 }
 
 void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
+txn_limbo_process_core(struct txn_limbo *limbo,
+		       const struct synchro_request *req)
 {
+	assert(latch_is_locked(&limbo->promote_latch));
+
 	uint64_t term = req->term;
 	uint32_t origin = req->origin_id;
-	if (txn_limbo_replica_term(limbo, origin) < term) {
+	if (vclock_get(&limbo->promote_term_map, origin) < (int64_t)term) {
 		vclock_follow(&limbo->promote_term_map, origin, term);
 		if (term > limbo->promote_greatest_term)
 			limbo->promote_greatest_term = term;
@@ -786,6 +790,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	return;
 }
 
+void
+txn_limbo_process(struct txn_limbo *limbo,
+		  const struct synchro_request *req)
+{
+	txn_limbo_process_begin(limbo);
+	txn_limbo_process_core(limbo, req);
+	txn_limbo_process_commit(limbo);
+}
+
 void
 txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 53e52f676..fdb214711 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -31,6 +31,7 @@
  */
 #include "small/rlist.h"
 #include "vclock/vclock.h"
+#include "latch.h"
 
 #include <stdint.h>
 
@@ -147,6 +148,10 @@ struct txn_limbo {
 	 * limbo and raft are in sync and the terms are the same.
 	 */
 	uint64_t promote_greatest_term;
+	/**
+	 * To order access to the promote data.
+	 */
+	struct latch promote_latch;
 	/**
 	 * Maximal LSN gathered quorum and either already confirmed in WAL, or
 	 * whose confirmation is in progress right now. Any attempt to confirm
@@ -216,9 +221,12 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
  * @a replica_id.
  */
 static inline uint64_t
-txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
+txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id)
 {
-	return vclock_get(&limbo->promote_term_map, replica_id);
+	latch_lock(&limbo->promote_latch);
+	uint64_t v = vclock_get(&limbo->promote_term_map, replica_id);
+	latch_unlock(&limbo->promote_latch);
+	return v;
 }
 
 /**
@@ -226,11 +234,14 @@ txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
  * data from it. The check is only valid when elections are enabled.
  */
 static inline bool
-txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
+txn_limbo_is_replica_outdated(struct txn_limbo *limbo,
 			      uint32_t replica_id)
 {
-	return txn_limbo_replica_term(limbo, replica_id) <
-	       limbo->promote_greatest_term;
+	latch_lock(&limbo->promote_latch);
+	uint64_t v = vclock_get(&limbo->promote_term_map, replica_id);
+	bool res = v < limbo->promote_greatest_term;
+	latch_unlock(&limbo->promote_latch);
+	return res;
 }
 
 /**
@@ -300,7 +311,37 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn);
 int
 txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
 
-/** Execute a synchronous replication request. */
+/**
+ * Initiate execution of a synchronous replication request.
+ */
+static inline void
+txn_limbo_process_begin(struct txn_limbo *limbo)
+{
+	latch_lock(&limbo->promote_latch);
+}
+
+/** Commit a synchronous replication request. */
+static inline void
+txn_limbo_process_commit(struct txn_limbo *limbo)
+{
+	assert(latch_is_locked(&limbo->promote_latch));
+	latch_unlock(&limbo->promote_latch);
+}
+
+/** Rollback a synchronous replication request. */
+static inline void
+txn_limbo_process_rollback(struct txn_limbo *limbo)
+{
+	assert(latch_is_locked(&limbo->promote_latch));
+	latch_unlock(&limbo->promote_latch);
+}
+
+/** Core of processing synchronous replication request. */
+void
+txn_limbo_process_core(struct txn_limbo *limbo,
+		       const struct synchro_request *req);
+
+/** Process a synchronous replication request. */
 void
 txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-09-30  9:44 [Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1) Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 1/3] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
@ 2021-09-30  9:44 ` Cyrill Gorcunov via Tarantool-patches
  2021-10-01 12:30   ` Serge Petrenko via Tarantool-patches
  2 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-30  9:44 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

To test that promotion requests are handled only when appropriate
write to WAL completes, because we update memory data before the
write finishes.

Note that without the patch this test fires assertion

> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/gh-6036-qsync-master.lua     |   1 +
 test/replication/gh-6036-qsync-node.lua       |  35 +++
 test/replication/gh-6036-qsync-order.result   | 207 ++++++++++++++++++
 test/replication/gh-6036-qsync-order.test.lua |  95 ++++++++
 test/replication/gh-6036-qsync-replica1.lua   |   1 +
 test/replication/gh-6036-qsync-replica2.lua   |   1 +
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 8 files changed, 342 insertions(+), 1 deletion(-)
 create mode 120000 test/replication/gh-6036-qsync-master.lua
 create mode 100644 test/replication/gh-6036-qsync-node.lua
 create mode 100644 test/replication/gh-6036-qsync-order.result
 create mode 100644 test/replication/gh-6036-qsync-order.test.lua
 create mode 120000 test/replication/gh-6036-qsync-replica1.lua
 create mode 120000 test/replication/gh-6036-qsync-replica2.lua

diff --git a/test/replication/gh-6036-qsync-master.lua b/test/replication/gh-6036-qsync-master.lua
new file mode 120000
index 000000000..87bdb46ef
--- /dev/null
+++ b/test/replication/gh-6036-qsync-master.lua
@@ -0,0 +1 @@
+gh-6036-qsync-node.lua
\ No newline at end of file
diff --git a/test/replication/gh-6036-qsync-node.lua b/test/replication/gh-6036-qsync-node.lua
new file mode 100644
index 000000000..ba6213255
--- /dev/null
+++ b/test/replication/gh-6036-qsync-node.lua
@@ -0,0 +1,35 @@
+local INSTANCE_ID = string.match(arg[0], "gh%-6036%-qsync%-(.+)%.lua")
+
+local function unix_socket(name)
+    return "unix/:./" .. name .. '.sock';
+end
+
+require('console').listen(os.getenv('ADMIN'))
+
+local box_cfg_common = {
+        listen                      = unix_socket(INSTANCE_ID),
+        replication                 = {
+            unix_socket("master"),
+            unix_socket("replica1"),
+            unix_socket("replica2"),
+        },
+        replication_connect_quorum  = 1,
+        replication_synchro_quorum  = 1,
+        replication_synchro_timeout = 10000,
+}
+
+if INSTANCE_ID == "master" then
+    box_cfg_common['election_mode'] = "manual"
+    box.cfg(box_cfg_common)
+elseif INSTANCE_ID == "replica1" then
+    box_cfg_common['election_mode'] = "manual"
+    box.cfg(box_cfg_common)
+else
+    assert(INSTANCE_ID == "replica2")
+    box_cfg_common['election_mode'] = "manual"
+    box.cfg(box_cfg_common)
+end
+
+box.once("bootstrap", function()
+    box.schema.user.grant('guest', 'super')
+end)
diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
new file mode 100644
index 000000000..378ce917a
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.result
@@ -0,0 +1,207 @@
+-- test-run result file version 2
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/gh-6036-qsync-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica1 with script="replication/gh-6036-qsync-replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica2 with script="replication/gh-6036-qsync-replica2.lua"')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('start server master with wait=False')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1 with wait=False')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica2 with wait=False')
+ | ---
+ | - true
+ | ...
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+ | ---
+ | ...
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+s = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./master.sock",              \
+        "unix/:./replica2.sock",            \
+    },                                      \
+})
+ | ---
+ | ...
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./replica1.sock",            \
+        "unix/:./replica2.sock",            \
+    },                                      \
+})
+ | ---
+ | ...
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay and remember the max term seen so far.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+ | ---
+ | ...
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writting data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+ | ---
+ | ...
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writting while old leader's data ignored.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.space.test:select{}
+ | ---
+ | - - [1]
+ |   - [2]
+ | ...
+
+test_run:switch("default")
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
new file mode 100644
index 000000000..ac1dc5d91
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.test.lua
@@ -0,0 +1,95 @@
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/gh-6036-qsync-master.lua"')
+test_run:cmd('create server replica1 with script="replication/gh-6036-qsync-replica1.lua"')
+test_run:cmd('create server replica2 with script="replication/gh-6036-qsync-replica2.lua"')
+
+test_run:cmd('start server master with wait=False')
+test_run:cmd('start server replica1 with wait=False')
+test_run:cmd('start server replica2 with wait=False')
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+box.ctl.promote()
+s = box.schema.create_space('test', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+test_run:switch("replica1")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+test_run:switch("replica2")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./master.sock",              \
+        "unix/:./replica2.sock",            \
+    },                                      \
+})
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./replica1.sock",            \
+        "unix/:./replica2.sock",            \
+    },                                      \
+})
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay and remember the max term seen so far.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first.
+test_run:switch("replica1")
+box.ctl.promote()
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writting data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writting while old leader's data ignored.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.space.test:select{}
+
+test_run:switch("default")
+test_run:cmd('stop server master')
+test_run:cmd('stop server replica1')
+test_run:cmd('stop server replica2')
+
+test_run:cmd('delete server master')
+test_run:cmd('delete server replica1')
+test_run:cmd('delete server replica2')
diff --git a/test/replication/gh-6036-qsync-replica1.lua b/test/replication/gh-6036-qsync-replica1.lua
new file mode 120000
index 000000000..87bdb46ef
--- /dev/null
+++ b/test/replication/gh-6036-qsync-replica1.lua
@@ -0,0 +1 @@
+gh-6036-qsync-node.lua
\ No newline at end of file
diff --git a/test/replication/gh-6036-qsync-replica2.lua b/test/replication/gh-6036-qsync-replica2.lua
new file mode 120000
index 000000000..87bdb46ef
--- /dev/null
+++ b/test/replication/gh-6036-qsync-replica2.lua
@@ -0,0 +1 @@
+gh-6036-qsync-node.lua
\ No newline at end of file
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 3eee0803c..ed09b2087 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -59,6 +59,7 @@
     "gh-6094-rs-uuid-mismatch.test.lua": {},
     "gh-6127-election-join-new.test.lua": {},
     "gh-6035-applier-filter.test.lua": {},
+    "gh-6036-qsync-order.test.lua": {},
     "election-candidate-promote.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 77eb95f49..080e4fbf4 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
@ 2021-10-01 12:14   ` Serge Petrenko via Tarantool-patches
  2021-10-01 12:31     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-01 12:14 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy



30.09.2021 12:44, Cyrill Gorcunov пишет:
> Limbo terms tracking is shared between appliers and when
> one of appliers is waiting for write to complete inside
> journal_write() routine, an other may need to access read
> term value to figure out if promote request is valid to
> apply. Due to cooperative multitasking access to the terms
> is not consistent so we need to be sure that other fibers
> read up to date terms (ie written to the WAL).
>
> For this sake we use a latching mechanism, when one fiber
> takes a lock for updating other readers are waiting until
> the operation is complete.
>
> For example here is a call graph of two appliers
>
> applier 1
> ---------
> applier_apply_tx
>    (promote term = 3
>     current max term = 2)
>    applier_synchro_filter_tx
>    apply_synchro_row
>      journal_write
>        (sleeping)
>
> at this moment another applier comes in with obsolete
> data and term 2
>
>                                applier 2
>                                ---------
>                                applier_apply_tx
>                                  (term 2)
>                                  applier_synchro_filter_tx
>                                    txn_limbo_is_replica_outdated -> false
>                                  journal_write (sleep)
>
> applier 1
> ---------
> journal wakes up
>    apply_synchro_row_cb
>      set max term to 3
>
> So the applier 2 didn't notice that term 3 is already seen
> and wrote obsolete data. With locking the applier 2 will
> wait until applier 1 has finished its write.
>
> We introduce the following helpers:
>
> 1) txn_limbo_process_begin: which takes a lock
> 2) txn_limbo_process_commit and txn_limbo_process_rollback
>     which simply release the lock but have different names
>     for better semantics
> 3) txn_limbo_process is a general function which uses x_begin
>     and x_commit helper internally
> 4) txn_limbo_process_core to do a real job over processing the
>     request, it implies that txn_limbo_process_begin been called
>
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>   src/box/applier.cc  | 12 +++++++---
>   src/box/box.cc      | 15 ++++++++-----
>   src/box/txn_limbo.c | 17 +++++++++++++--
>   src/box/txn_limbo.h | 53 ++++++++++++++++++++++++++++++++++++++++-----
>   4 files changed, 80 insertions(+), 17 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index b981bd436..46c36e259 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -857,7 +857,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
>   		applier_rollback_by_wal_io(entry->res);
>   	} else {
>   		replica_txn_wal_write_cb(synchro_entry->rcb);
> -		txn_limbo_process(&txn_limbo, synchro_entry->req);
> +		txn_limbo_process_core(&txn_limbo, synchro_entry->req);
>   		trigger_run(&replicaset.applier.on_wal_write, NULL);
>   	}
>   	fiber_wakeup(synchro_entry->owner);
> @@ -873,6 +873,8 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
>   	if (xrow_decode_synchro(row, &req) != 0)
>   		goto err;
>   
> +	txn_limbo_process_begin(&txn_limbo);
> +
>   	struct replica_cb_data rcb_data;
>   	struct synchro_entry entry;
>   	/*
> @@ -910,12 +912,16 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
>   	 * transactions side, including the async ones.
>   	 */
>   	if (journal_write(&entry.base) != 0)
> -		goto err;
> +		goto err_rollback;
>   	if (entry.base.res < 0) {
>   		diag_set_journal_res(entry.base.res);
> -		goto err;
> +		goto err_rollback;
>   	}
> +	txn_limbo_process_commit(&txn_limbo);
>   	return 0;
> +
> +err_rollback:
> +	txn_limbo_process_rollback(&txn_limbo);
>   err:
>   	diag_log();
>   	return -1;
> diff --git a/src/box/box.cc b/src/box/box.cc
> index e082e1a3d..6a9be745a 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1677,8 +1677,6 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
>   	struct raft *raft = box_raft();
>   	assert(raft->volatile_term == raft->term);
>   	assert(promote_lsn >= 0);
> -	txn_limbo_write_promote(&txn_limbo, promote_lsn,
> -				raft->term);
>   	struct synchro_request req = {
>   		.type = IPROTO_RAFT_PROMOTE,
>   		.replica_id = prev_leader_id,
> @@ -1686,8 +1684,11 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
>   		.lsn = promote_lsn,
>   		.term = raft->term,
>   	};
> -	txn_limbo_process(&txn_limbo, &req);
> +	txn_limbo_process_begin(&txn_limbo);
> +	txn_limbo_write_promote(&txn_limbo, req.lsn, req.term);
> +	txn_limbo_process_core(&txn_limbo, &req);
>   	assert(txn_limbo_is_empty(&txn_limbo));
> +	txn_limbo_process_commit(&txn_limbo);
>   }
>   
>   /** A guard to block multiple simultaneous promote()/demote() invocations. */
> @@ -1699,8 +1700,6 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
>   {
>   	assert(box_raft()->volatile_term == box_raft()->term);
>   	assert(promote_lsn >= 0);
> -	txn_limbo_write_demote(&txn_limbo, promote_lsn,
> -				box_raft()->term);
>   	struct synchro_request req = {
>   		.type = IPROTO_RAFT_DEMOTE,
>   		.replica_id = prev_leader_id,
> @@ -1708,8 +1707,12 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
>   		.lsn = promote_lsn,
>   		.term = box_raft()->term,
>   	};
> -	txn_limbo_process(&txn_limbo, &req);
> +	txn_limbo_process_begin(&txn_limbo);
> +	txn_limbo_write_demote(&txn_limbo, promote_lsn,
> +				box_raft()->term);
> +	txn_limbo_process_core(&txn_limbo, &req);
>   	assert(txn_limbo_is_empty(&txn_limbo));
> +	txn_limbo_process_commit(&txn_limbo);
>   }
>   
>   int
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 70447caaf..855c98c98 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -47,6 +47,7 @@ txn_limbo_create(struct txn_limbo *limbo)
>   	vclock_create(&limbo->vclock);
>   	vclock_create(&limbo->promote_term_map);
>   	limbo->promote_greatest_term = 0;
> +	latch_create(&limbo->promote_latch);
>   	limbo->confirmed_lsn = 0;
>   	limbo->rollback_count = 0;
>   	limbo->is_in_rollback = false;
> @@ -724,11 +725,14 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout)
>   }
>   
>   void
> -txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> +txn_limbo_process_core(struct txn_limbo *limbo,
> +		       const struct synchro_request *req)
>   {
> +	assert(latch_is_locked(&limbo->promote_latch));
> +
>   	uint64_t term = req->term;
>   	uint32_t origin = req->origin_id;
> -	if (txn_limbo_replica_term(limbo, origin) < term) {
> +	if (vclock_get(&limbo->promote_term_map, origin) < (int64_t)term) {
>   		vclock_follow(&limbo->promote_term_map, origin, term);
>   		if (term > limbo->promote_greatest_term)
>   			limbo->promote_greatest_term = term;
> @@ -786,6 +790,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
>   	return;
>   }
>   
> +void
> +txn_limbo_process(struct txn_limbo *limbo,
> +		  const struct synchro_request *req)
> +{
> +	txn_limbo_process_begin(limbo);
> +	txn_limbo_process_core(limbo, req);
> +	txn_limbo_process_commit(limbo);
> +}
> +
>   void
>   txn_limbo_on_parameters_change(struct txn_limbo *limbo)
>   {
> diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
> index 53e52f676..fdb214711 100644
> --- a/src/box/txn_limbo.h
> +++ b/src/box/txn_limbo.h
> @@ -31,6 +31,7 @@
>    */
>   #include "small/rlist.h"
>   #include "vclock/vclock.h"
> +#include "latch.h"
>   
>   #include <stdint.h>
>   
> @@ -147,6 +148,10 @@ struct txn_limbo {
>   	 * limbo and raft are in sync and the terms are the same.
>   	 */
>   	uint64_t promote_greatest_term;
> +	/**
> +	 * To order access to the promote data.
> +	 */
> +	struct latch promote_latch;
>   	/**
>   	 * Maximal LSN gathered quorum and either already confirmed in WAL, or
>   	 * whose confirmation is in progress right now. Any attempt to confirm
> @@ -216,9 +221,12 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
>    * @a replica_id.
>    */
>   static inline uint64_t
> -txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
> +txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id)
>   {
> -	return vclock_get(&limbo->promote_term_map, replica_id);
> +	latch_lock(&limbo->promote_latch);
> +	uint64_t v = vclock_get(&limbo->promote_term_map, replica_id);
> +	latch_unlock(&limbo->promote_latch);
> +	return v;
>   }
>   
>   /**
> @@ -226,11 +234,14 @@ txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
>    * data from it. The check is only valid when elections are enabled.
>    */
>   static inline bool
> -txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
> +txn_limbo_is_replica_outdated(struct txn_limbo *limbo,
>   			      uint32_t replica_id)
>   {
> -	return txn_limbo_replica_term(limbo, replica_id) <
> -	       limbo->promote_greatest_term;
> +	latch_lock(&limbo->promote_latch);
> +	uint64_t v = vclock_get(&limbo->promote_term_map, replica_id);
> +	bool res = v < limbo->promote_greatest_term;
> +	latch_unlock(&limbo->promote_latch);
> +	return res;
>   }
>   
>   /**
> @@ -300,7 +311,37 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn);
>   int
>   txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
>   
> -/** Execute a synchronous replication request. */
> +/**
> + * Initiate execution of a synchronous replication request.
> + */
> +static inline void
> +txn_limbo_process_begin(struct txn_limbo *limbo)
> +{
> +	latch_lock(&limbo->promote_latch);
> +}
> +
> +/** Commit a synchronous replication request. */
> +static inline void
> +txn_limbo_process_commit(struct txn_limbo *limbo)
> +{
> +	assert(latch_is_locked(&limbo->promote_latch));
> +	latch_unlock(&limbo->promote_latch);
> +}
> +
> +/** Rollback a synchronous replication request. */
> +static inline void
> +txn_limbo_process_rollback(struct txn_limbo *limbo)
> +{
> +	assert(latch_is_locked(&limbo->promote_latch));
> +	latch_unlock(&limbo->promote_latch);
> +}
> +
> +/** Core of processing synchronous replication request. */
> +void
> +txn_limbo_process_core(struct txn_limbo *limbo,
> +		       const struct synchro_request *req);
> +
> +/** Process a synchronous replication request. */
>   void
>   txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);

Thanks for the patch!

Mostly ok with one question:

What about txn_limbo_write_confirm/txn_limbo_read_confirm pairs issued
inside txn_limbo_ack() and txn_limbo_on_parameters_change() ?

Shouldn't they take the latch as well? I mean, txn_limbo_ack() and
txn_limbo_on_parameters_change() as a whole.
>   

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test Cyrill Gorcunov via Tarantool-patches
@ 2021-10-01 12:30   ` Serge Petrenko via Tarantool-patches
  2021-10-04 21:16     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-01 12:30 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy



30.09.2021 12:44, Cyrill Gorcunov пишет:
> To test that promotion requests are handled only when appropriate
> write to WAL completes, because we update memory data before the
> write finishes.
>
> Note that without the patch this test fires assertion
>

Thanks for the test!
Please, find a couple of comments below.
I think the test won't be flaky anymore once you fix my comments.

>> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>   test/replication/gh-6036-qsync-master.lua     |   1 +
>   test/replication/gh-6036-qsync-node.lua       |  35 +++
>   test/replication/gh-6036-qsync-order.result   | 207 ++++++++++++++++++
>   test/replication/gh-6036-qsync-order.test.lua |  95 ++++++++
>   test/replication/gh-6036-qsync-replica1.lua   |   1 +
>   test/replication/gh-6036-qsync-replica2.lua   |   1 +
>   test/replication/suite.cfg                    |   1 +
>   test/replication/suite.ini                    |   2 +-
>   8 files changed, 342 insertions(+), 1 deletion(-)
>   create mode 120000 test/replication/gh-6036-qsync-master.lua
>   create mode 100644 test/replication/gh-6036-qsync-node.lua
>   create mode 100644 test/replication/gh-6036-qsync-order.result
>   create mode 100644 test/replication/gh-6036-qsync-order.test.lua
>   create mode 120000 test/replication/gh-6036-qsync-replica1.lua
>   create mode 120000 test/replication/gh-6036-qsync-replica2.lua
>
> diff --git a/test/replication/gh-6036-qsync-master.lua b/test/replication/gh-6036-qsync-master.lua
> new file mode 120000
> index 000000000..87bdb46ef
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-master.lua
> @@ -0,0 +1 @@
> +gh-6036-qsync-node.lua
> \ No newline at end of file
> diff --git a/test/replication/gh-6036-qsync-node.lua b/test/replication/gh-6036-qsync-node.lua
> new file mode 100644
> index 000000000..ba6213255
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-node.lua
> @@ -0,0 +1,35 @@
> +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-qsync%-(.+)%.lua")
> +
> +local function unix_socket(name)
> +    return "unix/:./" .. name .. '.sock';
> +end
> +
> +require('console').listen(os.getenv('ADMIN'))
> +
> +local box_cfg_common = {
> +        listen                      = unix_socket(INSTANCE_ID),
> +        replication                 = {
> +            unix_socket("master"),
> +            unix_socket("replica1"),
> +            unix_socket("replica2"),
> +        },
> +        replication_connect_quorum  = 1,
> +        replication_synchro_quorum  = 1,
> +        replication_synchro_timeout = 10000,
> +}
> +
> +if INSTANCE_ID == "master" then
> +    box_cfg_common['election_mode'] = "manual"
> +    box.cfg(box_cfg_common)
> +elseif INSTANCE_ID == "replica1" then
> +    box_cfg_common['election_mode'] = "manual"
> +    box.cfg(box_cfg_common)
> +else
> +    assert(INSTANCE_ID == "replica2")
> +    box_cfg_common['election_mode'] = "manual"
> +    box.cfg(box_cfg_common)
> +end
> +
> +box.once("bootstrap", function()
> +    box.schema.user.grant('guest', 'super')
> +end)

Looks like "election_replica.lua" suits our needs perfectly now.
No need to introduce a new instance file.

> diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
> new file mode 100644
> index 000000000..378ce917a
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-order.result
> @@ -0,0 +1,207 @@
> +-- test-run result file version 2
> +--
> +-- gh-6036: verify that terms are locked when we're inside journal
> +-- write routine, because parallel appliers may ignore the fact that
> +-- the term is updated already but not yet written leading to data
> +-- inconsistency.
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +test_run:cmd('create server master with script="replication/gh-6036-qsync-master.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica1 with script="replication/gh-6036-qsync-replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica2 with script="replication/gh-6036-qsync-replica2.lua"')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('start server master with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica1 with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica2 with wait=False')
> + | ---
> + | - true
> + | ...
> +
> +test_run:wait_fullmesh({"master", "replica1", "replica2"})
> + | ---
> + | ...
> +
> +--
> +-- Create a synchro space on the master node and make
> +-- sure the write processed just fine.
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +s = box.schema.create_space('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +s:insert{1}
> + | ---
> + | - [1]
> + | ...
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> + | ---
> + | - true
> + | ...
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> + | ---
> + | - true
> + | ...
> +


You may replace both calls with test_run:wait_lsn('master', 'replica')
Even without switching.

> +--
> +-- Drop connection between master and replica1.
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./master.sock",              \
> +        "unix/:./replica2.sock",            \
> +    },                                      \
> +})
> + | ---
> + | ...
> +--
> +-- Drop connection between replica1 and master.
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> + | ---
> + | - true
> + | ...
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./replica1.sock",            \
> +        "unix/:./replica2.sock",            \
> +    },                                      \
> +})
> + | ---
> + | ...
> +
> +--
> +-- Here we have the following scheme
> +--
> +--              replica2 (will be delayed)
> +--              /     \
> +--          master    replica1
> +
> +--
> +-- Initiate disk delay and remember the max term seen so far.
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', true)
> + | ---
> + | - ok
> + | ...
> +
> +--
> +-- Make replica1 been a leader and start writting data,
> +-- the PROMOTE request get queued on replica2 and not
> +-- yet processed, same time INSERT won't complete either
> +-- waiting for PROMOTE completion first.
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +_ = require('fiber').create(function() box.space.test:insert{2} end)
> + | ---
> + | ...
> +
> +--

Prior to doing something on master, you should make sure
replica2 has received the promote.
"wait_lsn" won't work here, because WAL is disabled. You may try
test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)

> +-- The master node has no clue that there is a new leader
> +-- and continue writting data with obsolete term. Since replica2
> +-- is delayed now the INSERT won't proceed yet but get queued.
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +_ = require('fiber').create(function() box.space.test:insert{3} end)
> + | ---
> + | ...
> +
Just like above, try to wait until replica2 receives insert{3}.
> +--
> +-- Finally enable replica2 back. Make sure the data from new replica1
> +-- leader get writting while old leader's data ignored.
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
> + | ---
> + | - true
> + | ...
> +box.space.test:select{}
> + | ---
> + | - - [1]
> + |   - [2]
> + | ...
> +
> +test_run:switch("default")
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica2')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('delete server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica2')
> + | ---
> + | - true
> + | ...
> diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
> new file mode 100644
> index 000000000..ac1dc5d91
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-order.test.lua
> @@ -0,0 +1,95 @@
> +--
> +-- gh-6036: verify that terms are locked when we're inside journal
> +-- write routine, because parallel appliers may ignore the fact that
> +-- the term is updated already but not yet written leading to data
> +-- inconsistency.
> +--
> +test_run = require('test_run').new()
> +
> +test_run:cmd('create server master with script="replication/gh-6036-qsync-master.lua"')
> +test_run:cmd('create server replica1 with script="replication/gh-6036-qsync-replica1.lua"')
> +test_run:cmd('create server replica2 with script="replication/gh-6036-qsync-replica2.lua"')
> +
> +test_run:cmd('start server master with wait=False')
> +test_run:cmd('start server replica1 with wait=False')
> +test_run:cmd('start server replica2 with wait=False')
> +
> +test_run:wait_fullmesh({"master", "replica1", "replica2"})
> +
> +--
> +-- Create a synchro space on the master node and make
> +-- sure the write processed just fine.
> +test_run:switch("master")
> +box.ctl.promote()
> +s = box.schema.create_space('test', {is_sync = true})
> +_ = s:create_index('pk')
> +s:insert{1}
> +test_run:switch("replica1")
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> +test_run:switch("replica2")
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> +
> +--
> +-- Drop connection between master and replica1.
> +test_run:switch("master")
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./master.sock",              \
> +        "unix/:./replica2.sock",            \
> +    },                                      \
> +})
> +--
> +-- Drop connection between replica1 and master.
> +test_run:switch("replica1")
> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./replica1.sock",            \
> +        "unix/:./replica2.sock",            \
> +    },                                      \
> +})
> +
> +--
> +-- Here we have the following scheme
> +--
> +--              replica2 (will be delayed)
> +--              /     \
> +--          master    replica1
> +
> +--
> +-- Initiate disk delay and remember the max term seen so far.
> +test_run:switch("replica2")
> +box.error.injection.set('ERRINJ_WAL_DELAY', true)
> +
> +--
> +-- Make replica1 been a leader and start writting data,
> +-- the PROMOTE request get queued on replica2 and not
> +-- yet processed, same time INSERT won't complete either
> +-- waiting for PROMOTE completion first.
> +test_run:switch("replica1")
> +box.ctl.promote()
> +_ = require('fiber').create(function() box.space.test:insert{2} end)
> +
> +--
> +-- The master node has no clue that there is a new leader
> +-- and continue writting data with obsolete term. Since replica2
> +-- is delayed now the INSERT won't proceed yet but get queued.
> +test_run:switch("master")
> +_ = require('fiber').create(function() box.space.test:insert{3} end)
> +
> +--
> +-- Finally enable replica2 back. Make sure the data from new replica1
> +-- leader get writting while old leader's data ignored.
> +test_run:switch("replica2")
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
> +box.space.test:select{}
> +
> +test_run:switch("default")
> +test_run:cmd('stop server master')
> +test_run:cmd('stop server replica1')
> +test_run:cmd('stop server replica2')
> +
> +test_run:cmd('delete server master')
> +test_run:cmd('delete server replica1')
> +test_run:cmd('delete server replica2')
> diff --git a/test/replication/gh-6036-qsync-replica1.lua b/test/replication/gh-6036-qsync-replica1.lua
> new file mode 120000
> index 000000000..87bdb46ef
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-replica1.lua
> @@ -0,0 +1 @@
> +gh-6036-qsync-node.lua
> \ No newline at end of file
> diff --git a/test/replication/gh-6036-qsync-replica2.lua b/test/replication/gh-6036-qsync-replica2.lua
> new file mode 120000
> index 000000000..87bdb46ef
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-replica2.lua
> @@ -0,0 +1 @@
> +gh-6036-qsync-node.lua
> \ No newline at end of file
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index 3eee0803c..ed09b2087 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -59,6 +59,7 @@
>       "gh-6094-rs-uuid-mismatch.test.lua": {},
>       "gh-6127-election-join-new.test.lua": {},
>       "gh-6035-applier-filter.test.lua": {},
> +    "gh-6036-qsync-order.test.lua": {},
>       "election-candidate-promote.test.lua": {},
>       "*": {
>           "memtx": {"engine": "memtx"},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index 77eb95f49..080e4fbf4 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>   script =  master.lua
>   description = tarantool/box, replication
>   disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
>   config = suite.cfg
>   lua_libs = lua/fast_replica.lua lua/rlimit.lua
>   use_unix_sockets = True

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-10-01 12:14   ` Serge Petrenko via Tarantool-patches
@ 2021-10-01 12:31     ` Cyrill Gorcunov via Tarantool-patches
  2021-10-01 12:37       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-01 12:31 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Fri, Oct 01, 2021 at 03:14:52PM +0300, Serge Petrenko wrote:
...
> > +/** Process a synchronous replication request. */
> >   void
> >   txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
> 
> Thanks for the patch!
> 
> Mostly ok with one question:
> 
> What about txn_limbo_write_confirm/txn_limbo_read_confirm pairs issued
> inside txn_limbo_ack() and txn_limbo_on_parameters_change() ?
> 
> Shouldn't they take the latch as well? I mean, txn_limbo_ack() and
> txn_limbo_on_parameters_change() as a whole.

Wait, Serge, currently we guard promote_map/max_term, so it won't be
read while there is its update on the fly. Thus If only I'm not missing
something obvious txn_limbo_on_parameters_change() can't interfere with
promote data or race with it anyhow. If you mean some other race then
it seems I don't see it yet, but I suspect I might be simply wrong :)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-10-01 12:31     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-10-01 12:37       ` Serge Petrenko via Tarantool-patches
  2021-10-04 21:53         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-01 12:37 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Vladislav Shpilevoy



01.10.2021 15:31, Cyrill Gorcunov пишет:
> On Fri, Oct 01, 2021 at 03:14:52PM +0300, Serge Petrenko wrote:
> ...
>>> +/** Process a synchronous replication request. */
>>>    void
>>>    txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
>> Thanks for the patch!
>>
>> Mostly ok with one question:
>>
>> What about txn_limbo_write_confirm/txn_limbo_read_confirm pairs issued
>> inside txn_limbo_ack() and txn_limbo_on_parameters_change() ?
>>
>> Shouldn't they take the latch as well? I mean, txn_limbo_ack() and
>> txn_limbo_on_parameters_change() as a whole.
> Wait, Serge, currently we guard promote_map/max_term, so it won't be
> read while there is its update on the fly. Thus If only I'm not missing
> something obvious txn_limbo_on_parameters_change() can't interfere with
> promote data or race with it anyhow. If you mean some other race then
> it seems I don't see it yet, but I suspect I might be simply wrong :)
Shouldn't we guard limbo->owner as well?
Otherwise you may start writing confirm for an old leader once promote 
for a new
one is already in progress.

I don't remember us discussing this before, so, maybe I'm just confused.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-10-01 12:30   ` Serge Petrenko via Tarantool-patches
@ 2021-10-04 21:16     ` Cyrill Gorcunov via Tarantool-patches
  2021-10-05 13:55       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-04 21:16 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Fri, Oct 01, 2021 at 03:30:41PM +0300, Serge Petrenko wrote:
> 
> Thanks for the test!
> Please, find a couple of comments below.
> I think the test won't be flaky anymore once you fix my comments.

Thanks for comments, Serge!

> > +
> > +box.once("bootstrap", function()
> > +    box.schema.user.grant('guest', 'super')
> > +end)
> 
> Looks like "election_replica.lua" suits our needs perfectly now.
> No need to introduce a new instance file.

yup, I updated the test, thanks!

> > +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
> > + | ---
> > + | - true
> > + | ...
> > +
> 
> 
> You may replace both calls with test_run:wait_lsn('master', 'replica')
> Even without switching.

Actually I need switch, otherwise i got stuck, so I use

test_run:switch("master")
box.ctl.promote()
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
s:insert{1}

test_run:switch("replica1")
test_run:wait_lsn('replica1', 'master')

test_run:switch("replica2")
test_run:wait_lsn('replica2', 'master')


which works just fine

> > +
> > +--
> > +-- Make replica1 been a leader and start writting data,
> > +-- the PROMOTE request get queued on replica2 and not
> > +-- yet processed, same time INSERT won't complete either
> > +-- waiting for PROMOTE completion first.
> > +test_run:switch("replica1")
> > + | ---
> > + | - true
> > + | ...
> > +box.ctl.promote()
> > + | ---
> > + | ...
> > +_ = require('fiber').create(function() box.space.test:insert{2} end)
> > + | ---
> > + | ...
> > +
> > +--
> 
> Prior to doing something on master, you should make sure
> replica2 has received the promote.
> "wait_lsn" won't work here, because WAL is disabled. You may try
> test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)

Wait, this moment is dubious. Look, once we issue promote it get
stuck inside journal write procedure so the next "insert" won't
proceed until the promote finished. I understand that you point
to the potential race here because even promote() command may get
finished but slowed down on nework level and simply reach replica2
out of other calls. I think without aditional debug output (such
as promote term exposed via box.info I did in previous series) we
can't be sure about timings and it seems that i've to return back
the box.info patch. I mean currently the command

test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)

stucks forever because promote yet not finished and next 'insert' simply
has not been applied.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-10-01 12:37       ` Serge Petrenko via Tarantool-patches
@ 2021-10-04 21:53         ` Cyrill Gorcunov via Tarantool-patches
  2021-10-05 13:25           ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-04 21:53 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Fri, Oct 01, 2021 at 03:37:18PM +0300, Serge Petrenko wrote:
> Shouldn't we guard limbo->owner as well?
> Otherwise you may start writing confirm for an old leader once promote for a
> new one is already in progress.
> 
> I don't remember us discussing this before, so, maybe I'm just confused.

Seems I still don't follow. Look, currently limbo::owner_id is assigned
from the one place:

txn_limbo_process_core
  txn_limbo_process_begin(limbo)
    txn_limbo_read_demote | txn_limbo_read_promote
      ...
      limbo->owner_id = replica_id;

The txn_limbo_process_core is called from console (via box.ctl.promote|demote
commands) or via applier. When it comes from console then we still take the
lock before write into the journal

static void
box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
{
	struct raft *raft = box_raft();
	assert(raft->volatile_term == raft->term);
	assert(promote_lsn >= 0);
	struct synchro_request req = {
		.type = IPROTO_RAFT_PROMOTE,
		.replica_id = prev_leader_id,
		.origin_id = instance_id,
		.lsn = promote_lsn,
		.term = raft->term,
	};
-->	txn_limbo_process_begin(&txn_limbo);
-->	txn_limbo_write_promote(&txn_limbo, req.lsn, req.term);
	txn_limbo_process_core(&txn_limbo, &req);
	assert(txn_limbo_is_empty(&txn_limbo));
	txn_limbo_process_commit(&txn_limbo);
}

thus console commands and applier are serialized. Or I miss something
obvious?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-10-04 21:53         ` Cyrill Gorcunov via Tarantool-patches
@ 2021-10-05 13:25           ` Serge Petrenko via Tarantool-patches
  2021-10-05 21:52             ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-05 13:25 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Vladislav Shpilevoy



05.10.2021 00:53, Cyrill Gorcunov пишет:
> On Fri, Oct 01, 2021 at 03:37:18PM +0300, Serge Petrenko wrote:
>> Shouldn't we guard limbo->owner as well?
>> Otherwise you may start writing confirm for an old leader once promote for a
>> new one is already in progress.
>>
>> I don't remember us discussing this before, so, maybe I'm just confused.
> Seems I still don't follow. Look, currently limbo::owner_id is assigned
> from the one place:
>
> txn_limbo_process_core
>    txn_limbo_process_begin(limbo)
>      txn_limbo_read_demote | txn_limbo_read_promote
>        ...
>        limbo->owner_id = replica_id;
>
> The txn_limbo_process_core is called from console (via box.ctl.promote|demote
> commands) or via applier. When it comes from console then we still take the
> lock before write into the journal
>
> static void
> box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
> {
> 	struct raft *raft = box_raft();
> 	assert(raft->volatile_term == raft->term);
> 	assert(promote_lsn >= 0);
> 	struct synchro_request req = {
> 		.type = IPROTO_RAFT_PROMOTE,
> 		.replica_id = prev_leader_id,
> 		.origin_id = instance_id,
> 		.lsn = promote_lsn,
> 		.term = raft->term,
> 	};
> -->	txn_limbo_process_begin(&txn_limbo);
> -->	txn_limbo_write_promote(&txn_limbo, req.lsn, req.term);
> 	txn_limbo_process_core(&txn_limbo, &req);
> 	assert(txn_limbo_is_empty(&txn_limbo));
> 	txn_limbo_process_commit(&txn_limbo);
> }
>
> thus console commands and applier are serialized. Or I miss something
> obvious?

Here's what can happen. You're the old leader, and you've received a
promote from someone else already, it's being written now.
Now some replica sends you an ack. You enter txn_limbo_ack
(it doesn't check the latch). txn_limbo_ack() may result in writing a 
CONFIRM.

The PROMOTE is already in progress, so you'll end up writing CONFIRM for
a wrong leader id (your own) after PROMOTE is written.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-10-04 21:16     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-10-05 13:55       ` Serge Petrenko via Tarantool-patches
  2021-10-05 22:26         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-05 13:55 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Vladislav Shpilevoy



05.10.2021 00:16, Cyrill Gorcunov пишет:
> On Fri, Oct 01, 2021 at 03:30:41PM +0300, Serge Petrenko wrote:
>> Thanks for the test!
>> Please, find a couple of comments below.
>> I think the test won't be flaky anymore once you fix my comments.
> Thanks for comments, Serge!
>
>>> +
>>> +box.once("bootstrap", function()
>>> +    box.schema.user.grant('guest', 'super')
>>> +end)
>> Looks like "election_replica.lua" suits our needs perfectly now.
>> No need to introduce a new instance file.
> yup, I updated the test, thanks!
>
>>> +test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
>>> + | ---
>>> + | - true
>>> + | ...
>>> +
>>
>> You may replace both calls with test_run:wait_lsn('master', 'replica')
>> Even without switching.
> Actually I need switch, otherwise i got stuck, so I use
>
> test_run:switch("master")
> box.ctl.promote()
> s = box.schema.create_space('test', {is_sync = true})
> _ = s:create_index('pk')
> s:insert{1}
>
> test_run:switch("replica1")
> test_run:wait_lsn('replica1', 'master')
>
> test_run:switch("replica2")
> test_run:wait_lsn('replica2', 'master')
>
>
> which works just fine

Yep, that's what I meant.

>
>>> +
>>> +--
>>> +-- Make replica1 been a leader and start writting data,
>>> +-- the PROMOTE request get queued on replica2 and not
>>> +-- yet processed, same time INSERT won't complete either
>>> +-- waiting for PROMOTE completion first.
>>> +test_run:switch("replica1")
>>> + | ---
>>> + | - true
>>> + | ...
>>> +box.ctl.promote()
>>> + | ---
>>> + | ...
>>> +_ = require('fiber').create(function() box.space.test:insert{2} end)
>>> + | ---
>>> + | ...
>>> +
>>> +--
>> Prior to doing something on master, you should make sure
>> replica2 has received the promote.
>> "wait_lsn" won't work here, because WAL is disabled. You may try
>> test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
> Wait, this moment is dubious. Look, once we issue promote it get
> stuck inside journal write procedure so the next "insert" won't
> proceed until the promote finished. I understand that you point
> to the potential race here because even promote() command may get
> finished but slowed down on nework level and simply reach replica2
> out of other calls. I think without aditional debug output (such
> as promote term exposed via box.info I did in previous series) we
> can't be sure about timings and it seems that i've to return back
> the box.info patch. I mean currently the command
>
> test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
>
> stucks forever because promote yet not finished and next 'insert' simply
> has not been applied.

Ok, I see. I didn't think of that at first.
Look, your `box.info` patch won't help here as well.
Since promote is blocked on its way to WAL, it isn't applied yet,
so we won't see the term increase.

There is a way to dect a blocked promote:
There's ERRINJ_WAL_WRITE_COUNT.
It's incremented each time you call wal_write. Even before the write is 
blocked.

So you need to save ERRINJ_WAL_WRITE_COUNT, then do promote on replica1, 
then
return to replica2 and wait until ERRINJ_WAL_WRITE_COUNT gets bigger 
than your saved
value.


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms
  2021-10-05 13:25           ` Serge Petrenko via Tarantool-patches
@ 2021-10-05 21:52             ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-05 21:52 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Tue, Oct 05, 2021 at 04:25:21PM +0300, Serge Petrenko wrote:
> 
> Here's what can happen. You're the old leader, and you've received a
> promote from someone else already, it's being written now.
> Now some replica sends you an ack. You enter txn_limbo_ack
> (it doesn't check the latch). txn_limbo_ack() may result in writing a
> CONFIRM.
> 
> The PROMOTE is already in progress, so you'll end up writing CONFIRM for
> a wrong leader id (your own) after PROMOTE is written.

This sounds like we need more general lock for limbo operations. And
as we have been discussing already for box change params as well. I'm
starting to doubt that promote latch is a good name. Anyway I'll prepare
a draft version and send it out.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-10-05 13:55       ` Serge Petrenko via Tarantool-patches
@ 2021-10-05 22:26         ` Cyrill Gorcunov via Tarantool-patches
  2021-10-05 22:32           ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-05 22:26 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Tue, Oct 05, 2021 at 04:55:24PM +0300, Serge Petrenko wrote:
> There is a way to dect a blocked promote:
> There's ERRINJ_WAL_WRITE_COUNT.
> It's incremented each time you call wal_write. Even before the write is
> blocked.
> 
> So you need to save ERRINJ_WAL_WRITE_COUNT, then do promote on replica1,
> then
> return to replica2 and wait until ERRINJ_WAL_WRITE_COUNT gets bigger than
> your saved
> value.

Here is a final version (which I didn't pushed outside yet, since I
started a new branch v20), anyway to share data early.
---
From: Cyrill Gorcunov <gorcunov@gmail.com>
Subject: [PATCH] test: add gh-6036-qsync-order test

To test that promotion requests are handled only when appropriate
write to WAL completes, because we update memory data before the
write finishes.

Note that without the patch this test fires assertion

> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/election_replica.lua         |   3 +-
 test/replication/gh-6036-qsync-order.result   | 222 ++++++++++++++++++
 test/replication/gh-6036-qsync-order.test.lua | 103 ++++++++
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 5 files changed, 329 insertions(+), 2 deletions(-)
 create mode 100644 test/replication/gh-6036-qsync-order.result
 create mode 100644 test/replication/gh-6036-qsync-order.test.lua

diff --git a/test/replication/election_replica.lua b/test/replication/election_replica.lua
index 3b4d9a123..1dbfa96dc 100644
--- a/test/replication/election_replica.lua
+++ b/test/replication/election_replica.lua
@@ -6,6 +6,7 @@ local SYNCHRO_QUORUM = arg[1] and tonumber(arg[1]) or 3
 local ELECTION_TIMEOUT = arg[2] and tonumber(arg[2]) or 0.1
 local ELECTION_MODE = arg[3] or 'candidate'
 local CONNECT_QUORUM = arg[4] and tonumber(arg[4]) or 3
+local SYNCHRO_TIMEOUT = arg[5] and tonumber(arg[5]) or 0.1
 
 local function instance_uri(instance_id)
     return SOCKET_DIR..'/election_replica'..instance_id..'.sock';
@@ -25,7 +26,7 @@ box.cfg({
     election_mode = ELECTION_MODE,
     election_timeout = ELECTION_TIMEOUT,
     replication_synchro_quorum = SYNCHRO_QUORUM,
-    replication_synchro_timeout = 0.1,
+    replication_synchro_timeout = SYNCHRO_TIMEOUT,
     -- To reveal more election logs.
     log_level = 6,
 })
diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
new file mode 100644
index 000000000..b0d8c126e
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.result
@@ -0,0 +1,222 @@
+-- test-run result file version 2
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/election_replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica1 with script="replication/election_replica2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica2 with script="replication/election_replica3.lua"')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd("start server master with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica1 with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica2 with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+ | ---
+ | ...
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+s = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica1', 'master')
+ | ---
+ | ...
+
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica2', 'master')
+ | ---
+ | ...
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica1.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+ | ---
+ | ...
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica2.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+ | ---
+ | ...
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay in a bit tricky way: the next write will
+-- fall into forever sleep.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first. Note that we
+-- enter replica2 as well just to be sure the PROMOTE
+-- reached it.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN') ~= 0 end)
+ | ---
+ | - true
+ | ...
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+ | ---
+ | ...
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writing data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+ | ---
+ | ...
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writing while old leader's data ignored.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.space.test:select{}
+ | ---
+ | - - [1]
+ |   - [2]
+ | ...
+
+test_run:switch("default")
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
new file mode 100644
index 000000000..4dac9c59a
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.test.lua
@@ -0,0 +1,103 @@
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/election_replica1.lua"')
+test_run:cmd('create server replica1 with script="replication/election_replica2.lua"')
+test_run:cmd('create server replica2 with script="replication/election_replica3.lua"')
+
+test_run:cmd("start server master with wait=False, args='1 nil manual 1 10000'")
+test_run:cmd("start server replica1 with wait=False, args='1 nil manual 1 10000'")
+test_run:cmd("start server replica2 with wait=False, args='1 nil manual 1 10000'")
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+box.ctl.promote()
+s = box.schema.create_space('test', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+
+test_run:switch("replica1")
+test_run:wait_lsn('replica1', 'master')
+
+test_run:switch("replica2")
+test_run:wait_lsn('replica2', 'master')
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica1.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica2.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay in a bit tricky way: the next write will
+-- fall into forever sleep.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first. Note that we
+-- enter replica2 as well just to be sure the PROMOTE
+-- reached it.
+test_run:switch("replica1")
+box.ctl.promote()
+test_run:switch("replica2")
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN') ~= 0 end)
+test_run:switch("replica1")
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writing data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writing while old leader's data ignored.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.space.test:select{}
+
+test_run:switch("default")
+test_run:cmd('stop server master')
+test_run:cmd('stop server replica1')
+test_run:cmd('stop server replica2')
+
+test_run:cmd('delete server master')
+test_run:cmd('delete server replica1')
+test_run:cmd('delete server replica2')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 3eee0803c..ed09b2087 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -59,6 +59,7 @@
     "gh-6094-rs-uuid-mismatch.test.lua": {},
     "gh-6127-election-join-new.test.lua": {},
     "gh-6035-applier-filter.test.lua": {},
+    "gh-6036-qsync-order.test.lua": {},
     "election-candidate-promote.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 77eb95f49..080e4fbf4 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-10-05 22:26         ` Cyrill Gorcunov via Tarantool-patches
@ 2021-10-05 22:32           ` Cyrill Gorcunov via Tarantool-patches
  2021-10-06  7:06             ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-05 22:32 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Wed, Oct 06, 2021 at 01:26:16AM +0300, Cyrill Gorcunov wrote:
> On Tue, Oct 05, 2021 at 04:55:24PM +0300, Serge Petrenko wrote:
> > There is a way to dect a blocked promote:
> > There's ERRINJ_WAL_WRITE_COUNT.
> > It's incremented each time you call wal_write. Even before the write is
> > blocked.
> > 
> > So you need to save ERRINJ_WAL_WRITE_COUNT, then do promote on replica1,
> > then
> > return to replica2 and wait until ERRINJ_WAL_WRITE_COUNT gets bigger than
> > your saved
> > value.
> 
> Here is a final version (which I didn't pushed outside yet, since I
> started a new branch v20), anyway to share data early.
> ---
> From: Cyrill Gorcunov <gorcunov@gmail.com>
> Subject: [PATCH] test: add gh-6036-qsync-order test
> 
> To test that promotion requests are handled only when appropriate
> write to WAL completes, because we update memory data before the
> write finishes.

Drop it please, need to rework :/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test
  2021-10-05 22:32           ` Cyrill Gorcunov via Tarantool-patches
@ 2021-10-06  7:06             ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-10-06  7:06 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml, Vladislav Shpilevoy

On Wed, Oct 06, 2021 at 01:32:20AM +0300, Cyrill Gorcunov wrote:
> 
> Drop it please, need to rework :/

Here is the last one, take a look please.
---
From ca074681354099a79105bee14ab804492964184a Mon Sep 17 00:00:00 2001
From: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Mon, 20 Sep 2021 17:22:38 +0300
Subject: [PATCH] test: add gh-6036-qsync-order test

To test that promotion requests are handled only when appropriate
write to WAL completes, because we update memory data before the
write finishes.

Note that without the patch this test fires assertion

> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/election_replica.lua         |   3 +-
 test/replication/gh-6036-qsync-order.result   | 224 ++++++++++++++++++
 test/replication/gh-6036-qsync-order.test.lua | 103 ++++++++
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 5 files changed, 331 insertions(+), 2 deletions(-)
 create mode 100644 test/replication/gh-6036-qsync-order.result
 create mode 100644 test/replication/gh-6036-qsync-order.test.lua

diff --git a/test/replication/election_replica.lua b/test/replication/election_replica.lua
index 3b4d9a123..1dbfa96dc 100644
--- a/test/replication/election_replica.lua
+++ b/test/replication/election_replica.lua
@@ -6,6 +6,7 @@ local SYNCHRO_QUORUM = arg[1] and tonumber(arg[1]) or 3
 local ELECTION_TIMEOUT = arg[2] and tonumber(arg[2]) or 0.1
 local ELECTION_MODE = arg[3] or 'candidate'
 local CONNECT_QUORUM = arg[4] and tonumber(arg[4]) or 3
+local SYNCHRO_TIMEOUT = arg[5] and tonumber(arg[5]) or 0.1
 
 local function instance_uri(instance_id)
     return SOCKET_DIR..'/election_replica'..instance_id..'.sock';
@@ -25,7 +26,7 @@ box.cfg({
     election_mode = ELECTION_MODE,
     election_timeout = ELECTION_TIMEOUT,
     replication_synchro_quorum = SYNCHRO_QUORUM,
-    replication_synchro_timeout = 0.1,
+    replication_synchro_timeout = SYNCHRO_TIMEOUT,
     -- To reveal more election logs.
     log_level = 6,
 })
diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
new file mode 100644
index 000000000..34a7e7803
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.result
@@ -0,0 +1,224 @@
+-- test-run result file version 2
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/election_replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica1 with script="replication/election_replica2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica2 with script="replication/election_replica3.lua"')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd("start server master with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica1 with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica2 with wait=False, args='1 nil manual 1 10000'")
+ | ---
+ | - true
+ | ...
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+ | ---
+ | ...
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+s = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica1', 'master')
+ | ---
+ | ...
+
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica2', 'master')
+ | ---
+ | ...
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica1.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+ | ---
+ | ...
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica2.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+ | ---
+ | ...
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay in a bit tricky way: the next write will
+-- fall into forever sleep.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+ | ---
+ | - ok
+ | ...
+cnt_before = box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN')
+ | ---
+ | ...
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first. Note that we
+-- enter replica2 as well just to be sure the PROMOTE
+-- reached it.
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN') < cnt_before end)
+ | ---
+ | - true
+ | ...
+test_run:switch("replica1")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+ | ---
+ | ...
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writing data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+ | ---
+ | ...
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writing while old leader's data ignored.
+test_run:switch("replica2")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.space.test:select{}
+ | ---
+ | - - [1]
+ |   - [2]
+ | ...
+
+test_run:switch("default")
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
new file mode 100644
index 000000000..47996998d
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.test.lua
@@ -0,0 +1,103 @@
+--
+-- gh-6036: verify that terms are locked when we're inside journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/election_replica1.lua"')
+test_run:cmd('create server replica1 with script="replication/election_replica2.lua"')
+test_run:cmd('create server replica2 with script="replication/election_replica3.lua"')
+
+test_run:cmd("start server master with wait=False, args='1 nil manual 1 10000'")
+test_run:cmd("start server replica1 with wait=False, args='1 nil manual 1 10000'")
+test_run:cmd("start server replica2 with wait=False, args='1 nil manual 1 10000'")
+
+test_run:wait_fullmesh({"master", "replica1", "replica2"})
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write processed just fine.
+test_run:switch("master")
+box.ctl.promote()
+s = box.schema.create_space('test', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+
+test_run:switch("replica1")
+test_run:wait_lsn('replica1', 'master')
+
+test_run:switch("replica2")
+test_run:wait_lsn('replica2', 'master')
+
+--
+-- Drop connection between master and replica1.
+test_run:switch("master")
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica1.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+--
+-- Drop connection between replica1 and master.
+test_run:switch("replica1")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+box.cfg({                                   \
+    replication = {                         \
+        "unix/:./election_replica2.sock",   \
+        "unix/:./election_replica3.sock",   \
+    },                                      \
+})
+
+--
+-- Here we have the following scheme
+--
+--              replica2 (will be delayed)
+--              /     \
+--          master    replica1
+
+--
+-- Initiate disk delay in a bit tricky way: the next write will
+-- fall into forever sleep.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+cnt_before = box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN')
+--
+-- Make replica1 been a leader and start writting data,
+-- the PROMOTE request get queued on replica2 and not
+-- yet processed, same time INSERT won't complete either
+-- waiting for PROMOTE completion first. Note that we
+-- enter replica2 as well just to be sure the PROMOTE
+-- reached it.
+test_run:switch("replica1")
+box.ctl.promote()
+test_run:switch("replica2")
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY_COUNTDOWN') < cnt_before end)
+test_run:switch("replica1")
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+
+--
+-- The master node has no clue that there is a new leader
+-- and continue writing data with obsolete term. Since replica2
+-- is delayed now the INSERT won't proceed yet but get queued.
+test_run:switch("master")
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+
+--
+-- Finally enable replica2 back. Make sure the data from new replica1
+-- leader get writing while old leader's data ignored.
+test_run:switch("replica2")
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.space.test:select{}
+
+test_run:switch("default")
+test_run:cmd('stop server master')
+test_run:cmd('stop server replica1')
+test_run:cmd('stop server replica2')
+
+test_run:cmd('delete server master')
+test_run:cmd('delete server replica1')
+test_run:cmd('delete server replica2')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 3eee0803c..ed09b2087 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -59,6 +59,7 @@
     "gh-6094-rs-uuid-mismatch.test.lua": {},
     "gh-6127-election-join-new.test.lua": {},
     "gh-6035-applier-filter.test.lua": {},
+    "gh-6036-qsync-order.test.lua": {},
     "election-candidate-promote.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 77eb95f49..080e4fbf4 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.31.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-10-06  7:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-30  9:44 [Tarantool-patches] [PATCH v19 0/3] qsync: implement packet filtering (part 1) Cyrill Gorcunov via Tarantool-patches
2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 1/3] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 2/3] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
2021-10-01 12:14   ` Serge Petrenko via Tarantool-patches
2021-10-01 12:31     ` Cyrill Gorcunov via Tarantool-patches
2021-10-01 12:37       ` Serge Petrenko via Tarantool-patches
2021-10-04 21:53         ` Cyrill Gorcunov via Tarantool-patches
2021-10-05 13:25           ` Serge Petrenko via Tarantool-patches
2021-10-05 21:52             ` Cyrill Gorcunov via Tarantool-patches
2021-09-30  9:44 ` [Tarantool-patches] [PATCH v19 3/3] test: add gh-6036-qsync-order test Cyrill Gorcunov via Tarantool-patches
2021-10-01 12:30   ` Serge Petrenko via Tarantool-patches
2021-10-04 21:16     ` Cyrill Gorcunov via Tarantool-patches
2021-10-05 13:55       ` Serge Petrenko via Tarantool-patches
2021-10-05 22:26         ` Cyrill Gorcunov via Tarantool-patches
2021-10-05 22:32           ` Cyrill Gorcunov via Tarantool-patches
2021-10-06  7:06             ` Cyrill Gorcunov via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox