Tarantool development patches archive
* [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback
@ 2020-04-04 16:15 Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment Cyrill Gorcunov
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

This series contains a few fixups, including a simple code cleanup.

I've assigned a separate bug to myself for the applier redesign,
since I need more time to understand the code better:
https://github.com/tarantool/tarantool/issues/4853

Issue https://github.com/tarantool/tarantool/issues/4730
Branch gorcunov/gh-4730-diag-raise-master-11

Cyrill Gorcunov (8):
  box: fix bootstrap comment
  box/alter: shrink txn_alter_trigger_new code
  box/request: add missing OutOfMemory diag_set
  box/applier: add missing diag_set on region_alloc failure
  box/replication: merge replica_by_id into replicaset
  applier: reduce applier_txn_rollback_cb code density
  box/applier: prevent nil dereference on applier rollback
  test: add replication/applier-rollback

 src/box/alter.cc                            |   4 +-
 src/box/applier.cc                          |  24 ++-
 src/box/box.cc                              |   2 +-
 src/box/replication.cc                      |   2 -
 src/box/replication.h                       |   2 +-
 src/box/request.c                           |   8 +-
 src/box/txn.c                               |  13 ++
 src/lib/core/errinj.h                       |   1 +
 test/box/errinj.result                      |   1 +
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 162 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  81 ++++++++++
 test/replication/suite.ini                  |   2 +-
 13 files changed, 305 insertions(+), 13 deletions(-)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

-- 
2.20.1


* [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-05  7:31   ` Konstantin Osipov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code Cyrill Gorcunov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

We're not starting a new master node but
a new instance instead. The comment is simply
a leftover from older modifications.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/box.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 765d64678..0c15ba5e9 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2414,7 +2414,7 @@ box_cfg_xc(void)
 		local_recovery(&instance_uuid, &replicaset_uuid,
 			       &checkpoint->vclock);
 	} else {
-		/* Bootstrap a new master */
+		/* Bootstrap a new instance */
 		bootstrap(&instance_uuid, &replicaset_uuid,
 			  &is_bootstrap_leader);
 	}
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-06  7:39   ` Konstantin Osipov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 3/8] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

Instead of calling memset, which is useless here,
just use the trigger_create helper.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/alter.cc | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/src/box/alter.cc b/src/box/alter.cc
index d73679fb8..dbbbcbc44 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -820,9 +820,7 @@ txn_alter_trigger_new(trigger_f run, void *data)
 		diag_set(OutOfMemory, size, "region", "struct trigger");
 		return NULL;
 	}
-	trigger = (struct trigger *)memset(trigger, 0, size);
-	trigger->run = run;
-	trigger->data = data;
+	trigger_create(trigger, run, data, NULL);
 	return trigger;
 }
 
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 3/8] box/request: add missing OutOfMemory diag_set
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 4/8] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

In request_create_from_tuple and request_handle_sequence
we may fail to allocate memory for tuples. Don't forget
to set the diag error in that case, otherwise diag_raise
will lead to a nil dereference.
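
For illustration only, here is a minimal sketch of the failure mode described above. It is a toy model, not Tarantool's diag API: `toy_diag`, `toy_diag_set`, `toy_diag_raise_code`, `toy_alloc` and `toy_copy_tuple` are hypothetical names, and the assumption is simply that the consumer of the error reads the last recorded error unconditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the diagnostics area; not the real Tarantool API. */
struct toy_error {
	int code;
};

struct toy_diag {
	struct toy_error *last; /* NULL means "no error recorded" */
};

enum { TOY_OUT_OF_MEMORY = 1 };

static struct toy_error toy_oom = { TOY_OUT_OF_MEMORY };

static void
toy_diag_set(struct toy_diag *diag, struct toy_error *e)
{
	diag->last = e;
}

/*
 * Models the consumer of the error: it reads the last error
 * unconditionally, so calling it with diag->last == NULL would
 * dereference a NULL pointer.
 */
static int
toy_diag_raise_code(const struct toy_diag *diag)
{
	return diag->last->code;
}

/* Allocation that can be forced to fail, standing in for region_alloc. */
static void *
toy_alloc(size_t size, int force_failure)
{
	return force_failure ? NULL : malloc(size);
}

/*
 * The fixed pattern: the "return -1" path records an error first,
 * so the caller can safely consume it.
 */
static int
toy_copy_tuple(struct toy_diag *diag, size_t size, int force_failure)
{
	void *buf = toy_alloc(size, force_failure);
	if (buf == NULL) {
		toy_diag_set(diag, &toy_oom);
		return -1;
	}
	free(buf);
	return 0;
}
```

In this model, a caller that sees -1 from `toy_copy_tuple` can call `toy_diag_raise_code` safely, because the failing path always leaves an error behind.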

Acked-by: Sergey Ostanevich <sergos@tarantool.org>
Acked-by: Konstantin Osipov <kostja.osipov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/box/request.c b/src/box/request.c
index 82232a155..994f2da62 100644
--- a/src/box/request.c
+++ b/src/box/request.c
@@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space,
 		 * the tuple data to WAL on commit.
 		 */
 		char *buf = region_alloc(&fiber()->gc, size);
-		if (buf == NULL)
+		if (buf == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc", "tuple");
 			return -1;
+		}
 		memcpy(buf, data, size);
 		request->tuple = buf;
 		request->tuple_end = buf + size;
@@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space)
 		size_t buf_size = (request->tuple_end - request->tuple) +
 						mp_sizeof_uint(UINT64_MAX);
 		char *tuple = region_alloc(&fiber()->gc, buf_size);
-		if (tuple == NULL)
+		if (tuple == NULL) {
+			diag_set(OutOfMemory, buf_size, "region_alloc", "tuple");
 			return -1;
+		}
 		char *tuple_end = mp_encode_array(tuple, len);
 
 		if (unlikely(key != data)) {
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 4/8] box/applier: add missing diag_set on region_alloc failure
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 3/8] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset Cyrill Gorcunov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

If we hit the memory limit while allocating triggers,
we should set a diag error to prevent a nil dereference
in the diag_raise call (for example from applier_apply_tx).

Note that there are region_alloc_xc helpers which
throw errors, but as far as I understand we need the
rollback action to run first instead of an immediate
throw/catch, thus we use diag_set.
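
A rough sketch of the "record the error, then take the common rollback path" pattern described above, as opposed to throwing right away. All names here (`toy_txn`, `toy_alloc`, `toy_apply_tx`) are hypothetical and the transaction state is reduced to two integers; this is not the applier code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified transaction state for illustration only. */
struct toy_txn {
	int rolled_back; /* set to 1 once the rollback path has run */
	int error_code;  /* 0 means "no error recorded" */
};

enum { TOY_OUT_OF_MEMORY = 1 };

/* Allocation that can be forced to fail, standing in for region_alloc. */
static void *
toy_alloc(size_t size, int force_failure)
{
	return force_failure ? NULL : malloc(size);
}

static int
toy_apply_tx(struct toy_txn *txn, int fail_alloc)
{
	void *on_rollback = toy_alloc(32, fail_alloc);
	void *on_commit = toy_alloc(32, fail_alloc);
	if (on_rollback == NULL || on_commit == NULL) {
		/* Record the error first ... */
		txn->error_code = TOY_OUT_OF_MEMORY;
		/* ... then let the common rollback path run. */
		goto rollback;
	}
	free(on_rollback);
	free(on_commit);
	return 0;
rollback:
	free(on_rollback); /* free(NULL) is a no-op */
	free(on_commit);
	txn->rolled_back = 1;
	return -1;
}
```

The point of the sketch is the ordering: by the time the caller sees -1, the rollback has already run and an error is available to report.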

Acked-by: Sergey Ostanevich <sergos@tarantool.org>
Acked-by: Konstantin Osipov <kostja.osipov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 47a26c366..2eb1e04fc 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -796,8 +796,11 @@ applier_apply_tx(struct stailq *rows)
 						     sizeof(struct trigger));
 	on_commit = (struct trigger *)region_alloc(&txn->region,
 						   sizeof(struct trigger));
-	if (on_rollback == NULL || on_commit == NULL)
+	if (on_rollback == NULL || on_commit == NULL) {
+		diag_set(OutOfMemory, sizeof(struct trigger),
+			 "region_alloc", "on_rollback/on_commit");
 		goto rollback;
+	}
 
 	trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 	txn_on_rollback(txn, on_rollback);
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (3 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 4/8] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-06  7:40   ` Konstantin Osipov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density Cyrill Gorcunov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

For some reason the replica_by_id member (which is an
array of pointers) is allocated dynamically. Moreover,
VCLOCK_MAX = 32 for now, and extending it to some new
limit will require far more effort than just increasing
the number.

Thus reserve memory for replica_by_id inside replicaset
statically. This allows us to simplify the code a bit and
drop the calloc/free calls.

The former code comes from edd76a2a0ae17e3d without any
explanation of why the dynamic member is needed.
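
As a standalone illustration of the idea (not the actual replication.h), the sketch below embeds a fixed-size pointer array in a global structure. The `toy_` names are hypothetical; `TOY_VCLOCK_MAX` mirrors the value 32 quoted above. Because the structure has static storage duration, every slot starts out as NULL, which is what calloc provided before.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_VCLOCK_MAX = 32 }; /* mirrors the value quoted in the commit message */

struct toy_replica {
	unsigned id;
};

/* The array lives inside the structure: no calloc() at init, no free() at teardown. */
struct toy_replicaset {
	struct toy_replica *replica_by_id[TOY_VCLOCK_MAX];
};

/* Static storage duration guarantees that every slot starts out as NULL. */
static struct toy_replicaset toy_replicaset;

static int
toy_replicaset_register(struct toy_replica *replica)
{
	if (replica->id >= TOY_VCLOCK_MAX)
		return -1;
	toy_replicaset.replica_by_id[replica->id] = replica;
	return 0;
}

static struct toy_replica *
toy_replica_by_id(unsigned id)
{
	return id < TOY_VCLOCK_MAX ? toy_replicaset.replica_by_id[id] : NULL;
}
```

The cost is a fixed 32 pointers inside the structure, which is negligible for a singleton.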

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/replication.cc | 2 --
 src/box/replication.h  | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1345f189b..7c10fb6f2 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -89,7 +89,6 @@ replication_init(void)
 	rlist_create(&replicaset.anon);
 	vclock_create(&replicaset.vclock);
 	fiber_cond_create(&replicaset.applier.cond);
-	replicaset.replica_by_id = (struct replica **)calloc(VCLOCK_MAX, sizeof(struct replica *));
 	latch_create(&replicaset.applier.order_latch);
 
 	vclock_create(&replicaset.applier.vclock);
@@ -112,7 +111,6 @@ replication_free(void)
 		relay_cancel(replica->relay);
 
 	diag_destroy(&replicaset.applier.diag);
-	free(replicaset.replica_by_id);
 }
 
 int
diff --git a/src/box/replication.h b/src/box/replication.h
index 2ef1255b3..9df91e611 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -251,7 +251,7 @@ struct replicaset {
 		struct diag diag;
 	} applier;
 	/** Map of all known replica_id's to correspponding replica's. */
-	struct replica **replica_by_id;
+	struct replica *replica_by_id[VCLOCK_MAX];
 };
 extern struct replicaset replicaset;
 
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (4 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-06  7:40   ` Konstantin Osipov
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

To make it a bit more readable.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2eb1e04fc..2f9c9c797 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -695,8 +695,10 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
 	/* Setup shared applier diagnostic area. */
 	diag_set(ClientError, ER_WAL_IO);
 	diag_move(&fiber()->diag, &replicaset.applier.diag);
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
+
 	/* Rollback applier vclock to the committed one. */
 	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
 	return 0;
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (5 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-07 10:36   ` Serge Petrenko
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback Cyrill Gorcunov
  2020-04-07 10:46 ` [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Serge Petrenko
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

Currently, when a transaction rollback happens we just drop the existing
error, setting a ClientError in replicaset.applier.diag. This action
leaves the current fiber with diag=nil, which in turn leads to a SIGSEGV once
diag_raise() is called right after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |       applier_on_rollback
 |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag))
 |         fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

Thus:
 - use diag_set_error() instead of diag_move() to not drop the error
   from the current fiber(), preventing a nil dereference;
 - put a FIXME mark into the code: we need to rework it in a
   more sensible way.
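
To make the difference concrete, here is a toy model of the two operations. It only assumes what the text above states: a move leaves the source diagnostics area empty, while setting the error on the shared area leaves the fiber's own error in place. `toy_diag_move` and `toy_diag_set_error` are hypothetical stand-ins and the reference counting is deliberately simplistic.

```c
#include <assert.h>
#include <stddef.h>

struct toy_error {
	int code;
	int refs; /* number of diagnostics areas pointing at this error */
};

struct toy_diag {
	struct toy_error *last; /* NULL means "no error recorded" */
};

/* Move semantics: the destination takes the error, the source is left empty. */
static void
toy_diag_move(struct toy_diag *from, struct toy_diag *to)
{
	to->last = from->last;
	from->last = NULL;
}

/* Share semantics: the destination takes its own reference, the source keeps its one. */
static void
toy_diag_set_error(struct toy_diag *diag, struct toy_error *e)
{
	if (e != NULL)
		e->refs++;
	if (diag->last != NULL)
		diag->last->refs--;
	diag->last = e;
}
```

With the move variant, anything that later reads the fiber's last error finds NULL; with the share variant both areas point at the same error.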

Fixes #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2f9c9c797..68de3c08c 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,22 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
-	/* Setup shared applier diagnostic area. */
+
+	/*
+	 * Setup shared applier diagnostic area.
+	 *
+	 * FIXME: We should consider redesign this
+	 * moment and instead of carrying one shared
+	 * diag use per-applier diag instead all the time
+	 * (which actually already present in the structure).
+	 *
+	 * But remember that transactions are asynchronous
+	 * and rollback may happen a way latter after it
+	 * passed to the journal engine.
+	 */
 	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+	diag_set_error(&replicaset.applier.diag,
+		       diag_last_error(diag_get()));
 
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
-- 
2.20.1


* [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (6 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
@ 2020-04-04 16:15 ` Cyrill Gorcunov
  2020-04-07 10:26   ` Serge Petrenko
  2020-04-07 10:46 ` [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Serge Petrenko
  8 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-04 16:15 UTC (permalink / raw)
  To: tml

Test that the nil dereference in diag_raise doesn't happen if an
async transaction fails inside the replication procedure.

Side note: I don't like merging tests with patches in
general and I hate doing so for big tests with a passion
because it hides the patch code itself. So here is a
separate patch on top of the fix.
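
For readers unfamiliar with error injection, the mechanism the test relies on can be sketched roughly as follows. This is a simplified model, not Tarantool's ERROR_INJECT macro: `TOY_ERROR_INJECT`, `toy_inject_commit_failure` and `toy_txn_commit_async` are hypothetical, and the flag is a plain global rather than an entry in ERRINJ_LIST.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical injection switch, flipped by a test to force the failure path. */
static bool toy_inject_commit_failure = false;

/* Run CODE only when the flag is set. */
#define TOY_ERROR_INJECT(flag, CODE) \
	do { if (flag) CODE; } while (0)

struct toy_txn {
	int rolled_back; /* set to 1 once the transaction has been rolled back */
};

static int
toy_txn_commit_async(struct toy_txn *txn)
{
	/* With the flag set, fail before doing any real work. */
	TOY_ERROR_INJECT(toy_inject_commit_failure, {
		txn->rolled_back = 1;
		return -1;
	});
	return 0;
}
```

A test then flips the flag on the replica, triggers a write on the master, and checks that the replica survives the forced rollback.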

Test-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/txn.c                               |  13 ++
 src/lib/core/errinj.h                       |   1 +
 test/box/errinj.result                      |   1 +
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 162 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  81 ++++++++++
 test/replication/suite.ini                  |   2 +-
 7 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

diff --git a/src/box/txn.c b/src/box/txn.c
index f9c3e3675..488aa4bdd 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -34,6 +34,7 @@
 #include "journal.h"
 #include <fiber.h>
 #include "xrow.h"
+#include "errinj.h"
 
 double too_long_threshold;
 
@@ -576,6 +577,18 @@ txn_commit_async(struct txn *txn)
 {
 	struct journal_entry *req;
 
+	ERROR_INJECT(ERRINJ_TXN_COMMIT_ASYNC, {
+		diag_set(ClientError, ER_INJECTION,
+			 "txn commit async injection");
+		/*
+		 * Log it for the testing sake: we grep
+		 * output to mark this event.
+		 */
+		diag_log();
+		txn_rollback(txn);
+		return -1;
+	});
+
 	if (txn_prepare(txn) != 0) {
 		txn_rollback(txn);
 		return -1;
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index ee6c57a0d..7577ed11a 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -139,6 +139,7 @@ struct errinj {
 	_(ERRINJ_FIBER_MPROTECT, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_RELAY_FASTER_THAN_TX, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_INDEX_RESERVE, ERRINJ_BOOL, {.bparam = false})\
+	_(ERRINJ_TXN_COMMIT_ASYNC, ERRINJ_BOOL, {.bparam = false})\
 
 ENUM0(errinj_id, ERRINJ_LIST);
 extern struct errinj errinjs[];
diff --git a/test/box/errinj.result b/test/box/errinj.result
index 0d3fedeb3..de877b708 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -76,6 +76,7 @@ evals
   - ERRINJ_TUPLE_ALLOC: false
   - ERRINJ_TUPLE_FIELD: false
   - ERRINJ_TUPLE_FORMAT_COUNT: -1
+  - ERRINJ_TXN_COMMIT_ASYNC: false
   - ERRINJ_VYRUN_DATA_READ: false
   - ERRINJ_VY_COMPACTION_DELAY: false
   - ERRINJ_VY_DELAY_PK_LOOKUP: false
diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua
new file mode 100644
index 000000000..26fb10055
--- /dev/null
+++ b/test/replication/applier-rollback-slave.lua
@@ -0,0 +1,16 @@
+--
+-- vim: ts=4 sw=4 et
+--
+
+print('arg', arg)
+
+box.cfg({
+    replication                 = os.getenv("MASTER"),
+    listen                      = os.getenv("LISTEN"),
+    memtx_memory                = 107374182,
+    replication_timeout         = 0.1,
+    replication_connect_timeout = 0.5,
+    read_only                   = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result
new file mode 100644
index 000000000..3c659f460
--- /dev/null
+++ b/test/replication/applier-rollback.result
@@ -0,0 +1,162 @@
+-- test-run result file version 2
+#!/usr/bin/env tarantool
+ | ---
+ | ...
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+errinj = box.error.injection
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+_ = box.schema.space.create('test', {engine=engine})
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+ | ---
+ | ...
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+ | ---
+ | ...
+s:insert({1, '1'})
+ | ---
+ | - [1, '1']
+ | ...
+s:insert({2, '2'})
+ | ---
+ | - [2, '2']
+ | ...
+s:insert({3, '3'})
+ | ---
+ | - [3, '3']
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+
+--
+-- To fail inserting new record.
+errinj = box.error.injection
+ | ---
+ | ...
+errinj.set('ERRINJ_TXN_COMMIT_ASYNC', true)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+s:insert({4, '4'})
+ | ---
+ | - [4, '4']
+ | ...
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
+ | ---
+ | ...
+
+----
+---- Such error cause the applier to be
+---- cancelled and reaped, thus stop the
+---- slave node and cleanup
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("delete server replica_slave")
+ | ---
+ | - true
+ | ...
+box.cfg{replication=""}
+ | ---
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua
new file mode 100644
index 000000000..2c32af5c6
--- /dev/null
+++ b/test/replication/applier-rollback.test.lua
@@ -0,0 +1,81 @@
+#!/usr/bin/env tarantool
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+
+errinj = box.error.injection
+engine = test_run:get_cfg('engine')
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+test_run:cmd("start server replica_slave")
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+
+_ = box.schema.space.create('test', {engine=engine})
+s = box.space.test
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+s:insert({1, '1'})
+s:insert({2, '2'})
+s:insert({3, '3'})
+
+--
+-- To make sure we're running
+box.info.status
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+
+--
+-- To make sure we're running
+box.info.status
+
+--
+-- To fail inserting new record.
+errinj = box.error.injection
+errinj.set('ERRINJ_TXN_COMMIT_ASYNC', true)
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+s:insert({4, '4'})
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+fiber = require('fiber')
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
+
+----
+---- Such error cause the applier to be
+---- cancelled and reaped, thus stop the
+---- slave node and cleanup
+test_run:cmd('switch default')
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+test_run:cmd("delete server replica_slave")
+box.cfg{replication=""}
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index b4e09744a..f6c924762 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua applier-rollback.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.20.1


* Re: [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment Cyrill Gorcunov
@ 2020-04-05  7:31   ` Konstantin Osipov
  2020-04-05  7:56     ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Konstantin Osipov @ 2020-04-05  7:31 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:
> We're not starting new master node but
> a new instance instead. The comment simply
> leftover from older modifications.

I wrote it and it had a meaning. If we call bootstrap, it's
definitely a new master here, because we're creating a new
replicaset uuid.


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment
  2020-04-05  7:31   ` Konstantin Osipov
@ 2020-04-05  7:56     ` Cyrill Gorcunov
  2020-04-05  8:35       ` Konstantin Osipov
  0 siblings, 1 reply; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-05  7:56 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Sun, Apr 05, 2020 at 10:31:40AM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:
> > We're not starting new master node but
> > a new instance instead. The comment simply
> > leftover from older modifications.
> 
> I wrote it and it had a meaning. If we call bootstrap, it's
> definitely a new master here, because we're creating a new
> replicaset uuid.

Wait, this looks vague. Here is what we have inside:

/**
 * Bootstrap a new instance either as the first master in a
 * replica set or as a replica of an existing master.
 *
 * @param[out] is_bootstrap_leader  set if this instance is
 *                                  the leader of a new cluster
 */
static void
bootstrap(const struct tt_uuid *instance_uuid,
	  const struct tt_uuid *replicaset_uuid,
	  bool *is_bootstrap_leader)
{
	...
	/* Use the first replica by URI as a bootstrap leader */
	struct replica *master = replicaset_leader();
	assert(master == NULL || master->applier != NULL);

	if (master != NULL && !tt_uuid_is_equal(&master->uuid, &INSTANCE_UUID)) {
		bootstrap_from_master(master);
		/* Check replica set UUID */
		if (!tt_uuid_is_nil(replicaset_uuid) &&
		    !tt_uuid_is_equal(replicaset_uuid, &REPLICASET_UUID)) {
			tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
				  tt_uuid_str(replicaset_uuid),
				  tt_uuid_str(&REPLICASET_UUID));
		}
	} else {
		bootstrap_master(replicaset_uuid);
		*is_bootstrap_leader = true;
	}
	...
}

Either the comment in the function description is wrong, or the
comment I modified was wrong. Or maybe I don't understand the terminology.
Who is the master and who is the slave node?


* Re: [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment
  2020-04-05  7:56     ` Cyrill Gorcunov
@ 2020-04-05  8:35       ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2020-04-05  8:35 UTC (permalink / raw)
  To: Cyrill Gorcunov, tarantool-patches

* Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 11:00]:
> On Sun, Apr 05, 2020 at 10:31:40AM +0300, Konstantin Osipov wrote:
> > * Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:
> > > We're not starting new master node but
> > > a new instance instead. The comment simply
> > > leftover from older modifications.
> > 
> > I wrote it and it had a meaning. If we call bootstrap, it's
> > definitely a new master here, because we're creating a new
> > replicaset uuid.
> 
> Wait, this looks vague. Here what we have inside

Sorry, I looked at it out of context. My comment is indeed obsolete,
perhaps some code rot. The patch is lgtm.


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code Cyrill Gorcunov
@ 2020-04-06  7:39   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2020-04-06  7:39 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:
> Instead of calling memset which is useless here
> just use trigger_create helper.
lgtm


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset Cyrill Gorcunov
@ 2020-04-06  7:40   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2020-04-06  7:40 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:
> For some reason the replica_by_id member (which is an
> array of pointers) is allocated dynamically. Moreover
> VCLOCK_MAX = 32 by now and extending it to some new
> limit will require a way more efforts than just increase
> the number.
> 
> Thus reserve memory for replica_by_id inside replicaset
> statically. This allows to simplify code a bit and
> drop calloc/free calls.
> 
> The former code comes from edd76a2a0ae17e3d without any
> explanation why the dynamic member is needed.

lgtm


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density Cyrill Gorcunov
@ 2020-04-06  7:40   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2020-04-06  7:40 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/04/05 05:06]:

Trivial, doesn't require a review.

> To make it a bit more readable.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  src/box/applier.cc | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 2eb1e04fc..2f9c9c797 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -695,8 +695,10 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
>  	/* Setup shared applier diagnostic area. */
>  	diag_set(ClientError, ER_WAL_IO);
>  	diag_move(&fiber()->diag, &replicaset.applier.diag);
> +
>  	/* Broadcast the rollback event across all appliers. */
>  	trigger_run(&replicaset.applier.on_rollback, event);
> +
>  	/* Rollback applier vclock to the committed one. */
>  	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>  	return 0;
> -- 
> 2.20.1
> 

-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-04-07 10:26   ` Serge Petrenko
  2020-04-07 10:55     ` Cyrill Gorcunov
  0 siblings, 1 reply; 20+ messages in thread
From: Serge Petrenko @ 2020-04-07 10:26 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

Hi! Thanks for the patch!
Please find my comments below.

> On 4 Apr 2020, at 19:15, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> Test that diag_raise doesn't happen if async transaction
> fails inside replication procedure.
> 
> Side note: I don't like merging tests with patches in
> general and I hate doing so for big tests with a passion
> because it hides the patch code itself. So here is a
> separate patch on top of the fix.

I like that. It was easy to check the test without the fix.

> 
> Test-of #4730
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
> src/box/txn.c                               |  13 ++
> src/lib/core/errinj.h                       |   1 +
> test/box/errinj.result                      |   1 +
> test/replication/applier-rollback-slave.lua |  16 ++
> test/replication/applier-rollback.result    | 162 ++++++++++++++++++++
> test/replication/applier-rollback.test.lua  |  81 ++++++++++

It’s a bugfix, so please name the test gh-4730-something-something.test.lua

> test/replication/suite.ini                  |   2 +-
> 7 files changed, 275 insertions(+), 1 deletion(-)
> create mode 100644 test/replication/applier-rollback-slave.lua
> create mode 100644 test/replication/applier-rollback.result
> create mode 100644 test/replication/applier-rollback.test.lua
> 
> diff --git a/src/box/txn.c b/src/box/txn.c
> index f9c3e3675..488aa4bdd 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -34,6 +34,7 @@
> #include "journal.h"
> #include <fiber.h>
> #include "xrow.h"
> +#include "errinj.h"
> 
> double too_long_threshold;
> 
> @@ -576,6 +577,18 @@ txn_commit_async(struct txn *txn)
> {
> 	struct journal_entry *req;
> 
> +	ERROR_INJECT(ERRINJ_TXN_COMMIT_ASYNC, {
> +		diag_set(ClientError, ER_INJECTION,
> +			 "txn commit async injection");
> +		/*
> +		 * Log it for the testing sake: we grep
> +		 * output to mark this event.
> +		 */
> +		diag_log();
> +		txn_rollback(txn);
> +		return -1;
> +	});
> +
> 	if (txn_prepare(txn) != 0) {
> 		txn_rollback(txn);
> 		return -1;
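
For readers unfamiliar with the pattern, a rough sketch of how a
conditional error-injection macro of this kind can work. The DEMO_*
names and the boolean table are assumptions made for illustration; the
real ERROR_INJECT and errinj table are defined in errinj.h and may
differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical injection table; the real one lives in errinj.h. */
enum demo_errinj_id { DEMO_ERRINJ_TXN_COMMIT_ASYNC, demo_errinj_id_MAX };
static bool demo_errinjs[demo_errinj_id_MAX];

/* Run CODE only when the given injection is switched on. */
#define DEMO_ERROR_INJECT(ID, CODE) \
	do { \
		if (demo_errinjs[ID]) \
			CODE; \
	} while (0)

/* Counts simulated rollbacks so a caller can observe the failure path. */
static int demo_rollback_count;

/* Stand-in for a commit routine with an early injected failure. */
static int
demo_commit_async(void)
{
	DEMO_ERROR_INJECT(DEMO_ERRINJ_TXN_COMMIT_ASYNC, {
		demo_rollback_count++;
		return -1;
	});
	return 0;
}
```

With the flag off the function takes the normal path; setting
demo_errinjs[DEMO_ERRINJ_TXN_COMMIT_ASYNC] to true forces the failure
branch, which mirrors what the test does via errinj.set().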
> diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
> index ee6c57a0d..7577ed11a 100644
> --- a/src/lib/core/errinj.h
> +++ b/src/lib/core/errinj.h
> @@ -139,6 +139,7 @@ struct errinj {
> 	_(ERRINJ_FIBER_MPROTECT, ERRINJ_INT, {.iparam = -1}) \
> 	_(ERRINJ_RELAY_FASTER_THAN_TX, ERRINJ_BOOL, {.bparam = false}) \
> 	_(ERRINJ_INDEX_RESERVE, ERRINJ_BOOL, {.bparam = false})\
> +	_(ERRINJ_TXN_COMMIT_ASYNC, ERRINJ_BOOL, {.bparam = false})\
> 
> ENUM0(errinj_id, ERRINJ_LIST);
> extern struct errinj errinjs[];
> diff --git a/test/box/errinj.result b/test/box/errinj.result
> index 0d3fedeb3..de877b708 100644
> --- a/test/box/errinj.result
> +++ b/test/box/errinj.result
> @@ -76,6 +76,7 @@ evals
>   - ERRINJ_TUPLE_ALLOC: false
>   - ERRINJ_TUPLE_FIELD: false
>   - ERRINJ_TUPLE_FORMAT_COUNT: -1
> +  - ERRINJ_TXN_COMMIT_ASYNC: false
>   - ERRINJ_VYRUN_DATA_READ: false
>   - ERRINJ_VY_COMPACTION_DELAY: false
>   - ERRINJ_VY_DELAY_PK_LOOKUP: false
> diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua
> new file mode 100644
> index 000000000..26fb10055
> --- /dev/null
> +++ b/test/replication/applier-rollback-slave.lua

Better name it replica_applier_rollback.lua for the sake of consistency
with other instance file names.

> @@ -0,0 +1,16 @@
> +--
> +-- vim: ts=4 sw=4 et
> +--
> +
> +print('arg', arg)
> +
> +box.cfg({
> +    replication                 = os.getenv("MASTER"),
> +    listen                      = os.getenv("LISTEN"),
> +    memtx_memory                = 107374182,
> +    replication_timeout         = 0.1,
> +    replication_connect_timeout = 0.5,
> +    read_only                   = true,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result
> new file mode 100644
> index 000000000..3c659f460
> --- /dev/null
> +++ b/test/replication/applier-rollback.result
> @@ -0,0 +1,162 @@
> +-- test-run result file version 2
> +#!/usr/bin/env tarantool
> + | ---
> + | ...
> +--
> +-- vim: ts=4 sw=4 et
> +--
> +
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +errinj = box.error.injection
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +--
> +-- Allow replica to connect to us
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +
> +--
> +-- Create replica instance, we're the master and
> +-- start it, no data to sync yet though
> +test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
> + | ---
> + | - true
> + | ...
> +test_run:cmd("start server replica_slave")
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Fill initial data on the master instance
> +test_run:cmd('switch default')
> + | ---
> + | - true
> + | ...
> +
> +_ = box.schema.space.create('test', {engine=engine})
> + | ---
> + | ...
> +s = box.space.test
> + | ---
> + | ...
> +
> +s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
> + | ---
> + | ...
> +
> +_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
> + | ---
> + | ...
> +s:insert({1, '1'})
> + | ---
> + | - [1, '1']
> + | ...
> +s:insert({2, '2'})
> + | ---
> + | - [2, '2']
> + | ...
> +s:insert({3, '3'})
> + | ---
> + | - [3, '3']
> + | ...
> +
> +--
> +-- To make sure we're running
> +box.info.status
> + | ---
> + | - running
> + | ...
> +
> +--
> +-- Wait for data from master get propagated
> +test_run:wait_lsn('replica_slave', 'default')
> + | ---
> + | ...
> +
> +--
> +-- Now inject error into slave instance
> +test_run:cmd('switch replica_slave')
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- To make sure we're running
> +box.info.status
> + | ---
> + | - running
> + | ...
> +
> +--
> +-- To fail inserting new record.
> +errinj = box.error.injection
> + | ---
> + | ...
> +errinj.set('ERRINJ_TXN_COMMIT_ASYNC', true)
> + | ---
> + | - ok
> + | ...
> +
> +--
> +-- Jump back to master node and write new
> +-- entry which should cause error to happen
> +-- on slave instance
> +test_run:cmd('switch default')
> + | ---
> + | - true
> + | ...
> +s:insert({4, '4'})
> + | ---
> + | - [4, '4']
> + | ...
> +
> +--
> +-- Wait for error to trigger
> +test_run:cmd('switch replica_slave')
> + | ---
> + | - true
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
> + | ---
> + | ...
> +
> +----
> +---- Such error cause the applier to be
> +---- cancelled and reaped, thus stop the
> +---- slave node and cleanup
> +test_run:cmd('switch default')
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Cleanup
> +test_run:cmd("stop server replica_slave")
> + | ---
> + | - true
> + | ...
> +test_run:cmd("delete server replica_slave")
> + | ---
> + | - true
> + | ...
> +box.cfg{replication=""}
> + | ---
> + | ...
> +box.space.test:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'replication')
> + | ---
> + | ...
> diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua
> new file mode 100644
> index 000000000..2c32af5c6
> --- /dev/null
> +++ b/test/replication/applier-rollback.test.lua
> @@ -0,0 +1,81 @@
> +#!/usr/bin/env tarantool
> +--
> +-- vim: ts=4 sw=4 et
> +--
> +
> +test_run = require('test_run').new()
> +
> +errinj = box.error.injection

You don’t need the errinj on master, only on replica, AFAICS.

> +engine = test_run:get_cfg('engine')

Why test both engines? I suggest you run the test with no arguments.
(you’ll have to add a line to replication/suite.cfg for that)

> +
> +--
> +-- Allow replica to connect to us
> +box.schema.user.grant('guest', 'replication')
> +
> +--
> +-- Create replica instance, we're the master and
> +-- start it, no data to sync yet though
> +test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
> +test_run:cmd("start server replica_slave")
> +
> +--
> +-- Fill initial data on the master instance
> +test_run:cmd('switch default')
> +
> +_ = box.schema.space.create('test', {engine=engine})
> +s = box.space.test
> +
> +s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
> +
> +_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
> +s:insert({1, '1'})
> +s:insert({2, '2'})
> +s:insert({3, '3'})
> +
> +--
> +-- To make sure we're running
> +box.info.status
> +

This always evaluates to ‘running’: you’ve just inserted some values
and got a result back. I’d omit this check.

> +--
> +-- Wait for data from master get propagated
> +test_run:wait_lsn('replica_slave', 'default')
> +
> +--
> +-- Now inject error into slave instance
> +test_run:cmd('switch replica_slave')
> +
> +--
> +-- To make sure we're running
> +box.info.status
> +
> +--
> +-- To fail inserting new record.
> +errinj = box.error.injection
> +errinj.set('ERRINJ_TXN_COMMIT_ASYNC', true)
> +
> +--
> +-- Jump back to master node and write new
> +-- entry which should cause error to happen
> +-- on slave instance
> +test_run:cmd('switch default')
> +s:insert({4, '4'})
> +
> +--
> +-- Wait for error to trigger
> +test_run:cmd('switch replica_slave')
> +fiber = require('fiber')
> +while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
> +
> +----
> +---- Such error cause the applier to be
> +---- cancelled and reaped, thus stop the
> +---- slave node and cleanup
> +test_run:cmd('switch default')
> +
> +--
> +-- Cleanup
> +test_run:cmd("stop server replica_slave")
> +test_run:cmd("delete server replica_slave")
> +box.cfg{replication=""}

You didn’t set box.cfg.replication, so you shouldn’t reset it.

> +box.space.test:drop()
> +box.schema.user.revoke('guest', 'replication')
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index b4e09744a..f6c924762 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
> script =  master.lua
> description = tarantool/box, replication
> disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua applier-rollback.test.lua
> config = suite.cfg
> lua_libs = lua/fast_replica.lua lua/rlimit.lua
> use_unix_sockets = True
> -- 
> 2.20.1
> 


--
Serge Petrenko
sergepetrenko@tarantool.org


* Re: [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
@ 2020-04-07 10:36   ` Serge Petrenko
  0 siblings, 0 replies; 20+ messages in thread
From: Serge Petrenko @ 2020-04-07 10:36 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

Hi! Thanks for the patch.

> On 4 Apr 2020, at 19:15, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> Currently when transaction rollback happens we just drop an existing
> error setting ClientError to the replicaset.applier.diag. This action
> leaves current fiber with diag=nil, which in turn leads to sigsegv once
> diag_raise() called right after applier_apply_tx():
> 
> | applier_f
> |   try {
> |   applier_subscribe
> |     applier_apply_tx
> |       // error happens
> |       txn_rollback
> |         diag_set(ClientError, ER_WAL_IO)
> |         diag_move(&fiber()->diag, &replicaset.applier.diag)
> |         // fiber->diag = nil
> |       applier_on_rollback
> |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
> |         fiber_cancel(applier->reader);
> |     diag_raise() -> NULL dereference
> |   } catch { ... }
> 
> Thus:
> - use diag_set_error() instead of diag_move() to not drop error
>   from a current fiber() preventing a nil dereference;
> - put fixme mark into the code: we need to rework it in a
>   more sense way.
> 
> Fixes #4730
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
> src/box/applier.cc | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 2f9c9c797..68de3c08c 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -692,9 +692,22 @@ static int
> applier_txn_rollback_cb(struct trigger *trigger, void *event)
> {
> 	(void) trigger;
> -	/* Setup shared applier diagnostic area. */
> +
> +	/*
> +	 * Setup shared applier diagnostic area.
> +	 *
> +	 * FIXME: We should consider redesign this
> +	 * moment and instead of carrying one shared
> +	 * diag use per-applier diag instead all the time
> +	 * (which actually already present in the structure).
> +	 *
> +	 * But remember that transactions are asynchronous
> +	 * and rollback may happen a way latter after it
> +	 * passed to the journal engine.
> +	 */
> 	diag_set(ClientError, ER_WAL_IO);
> -	diag_move(&fiber()->diag, &replicaset.applier.diag);
> +	diag_set_error(&replicaset.applier.diag,
> +		       diag_last_error(diag_get()));
> 
> 	/* Broadcast the rollback event across all appliers. */
> 	trigger_run(&replicaset.applier.on_rollback, event);
> -- 
> 2.20.1
> 

LGTM.
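
To spell out the difference the commit message relies on, here is a
small sketch with a hypothetical refcounted error. The demo_* types and
functions are illustrative assumptions, not the real diag API; they
only model "move empties the source, set shares the error".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted error and diagnostics area. */
struct demo_error {
	int refs;
	int code;
};

struct demo_diag {
	struct demo_error *last;
};

/* Drop one reference; a NULL error is ignored. */
static void
demo_error_unref(struct demo_error *e)
{
	if (e != NULL)
		e->refs--;
}

/* Share the error: dst gets a reference, the caller keeps its own. */
static void
demo_diag_set_error(struct demo_diag *dst, struct demo_error *e)
{
	e->refs++;
	demo_error_unref(dst->last);
	dst->last = e;
}

/* Move the error: dst takes it over and src is left empty. */
static void
demo_diag_move(struct demo_diag *src, struct demo_diag *dst)
{
	demo_error_unref(dst->last);
	dst->last = src->last;
	src->last = NULL;
}
```

After demo_diag_move() the source diag holds NULL, so any later code
that reads its last error unconditionally dereferences a null pointer.
After demo_diag_set_error() both areas point at the same error.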


--
Serge Petrenko
sergepetrenko@tarantool.org


* Re: [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback
  2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (7 preceding siblings ...)
  2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-04-07 10:46 ` Serge Petrenko
  2020-04-07 11:00   ` Cyrill Gorcunov
  8 siblings, 1 reply; 20+ messages in thread
From: Serge Petrenko @ 2020-04-07 10:46 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml



> On 4 Apr 2020, at 19:15, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> In the series a few fixups including simple code cleanup.
> 
> I've assigned a separate bug for myself for applier redesign
> since I need more time to understand code better
> https://github.com/tarantool/tarantool/issues/4853
> 
> Issue https://github.com/tarantool/tarantool/issues/4730
> Branch gorcunov/gh-4730-diag-raise-master-11
> 
> Cyrill Gorcunov (8):
>  box: fix bootstrap comment
>  box/alter: shrink txn_alter_trigger_new code
>  box/request: add missing OutOfMemory diag_set
>  box/applier: add missing diag_set on region_alloc failure
>  box/replication: merge replica_by_id into replicaset
>  applier: reduce applier_txn_rollback_cb code density
>  box/applier: prevent nil dereference on applier rollback
>  test: add replication/applier-rollback
> 
> src/box/alter.cc                            |   4 +-
> src/box/applier.cc                          |  24 ++-
> src/box/box.cc                              |   2 +-
> src/box/replication.cc                      |   2 -
> src/box/replication.h                       |   2 +-
> src/box/request.c                           |   8 +-
> src/box/txn.c                               |  13 ++
> src/lib/core/errinj.h                       |   1 +
> test/box/errinj.result                      |   1 +
> test/replication/applier-rollback-slave.lua |  16 ++
> test/replication/applier-rollback.result    | 162 ++++++++++++++++++++
> test/replication/applier-rollback.test.lua  |  81 ++++++++++
> test/replication/suite.ini                  |   2 +-
> 13 files changed, 305 insertions(+), 13 deletions(-)
> create mode 100644 test/replication/applier-rollback-slave.lua
> create mode 100644 test/replication/applier-rollback.result
> create mode 100644 test/replication/applier-rollback.test.lua
> 
> -- 
> 2.20.1
> 

Hi! Thanks for the patchset!

Commits 1, 2, 5 and 6 LGTM, with one comment:
please use the ‘applier’ prefix instead of ‘box/applier’
in the commit titles.
Similarly, use ‘replication’ instead of ‘box/replication’ and
‘alter’ instead of ‘box/alter’.

--
Serge Petrenko
sergepetrenko@tarantool.org


* Re: [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback
  2020-04-07 10:26   ` Serge Petrenko
@ 2020-04-07 10:55     ` Cyrill Gorcunov
  0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-07 10:55 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml

On Tue, Apr 07, 2020 at 01:26:34PM +0300, Serge Petrenko wrote:
> > test/replication/applier-rollback.test.lua  |  81 ++++++++++
> 
> It’s a bugfix, so please name the test gh-4730-something-something.test.lua

OK, will do.

> > --- /dev/null
> > +++ b/test/replication/applier-rollback-slave.lua
> 
> Better name it replica_applier_rollback.lua for the sake of consistency
> with other instance file names.

ok. I don't have any preference on naming, will do.

> > +
> > +errinj = box.error.injection
> 
> You don’t need the errinj on master, only on replica, AFAICS.

yea, sorry, thanks!

> 
> > +engine = test_run:get_cfg('engine')
> 
> Why test both engines? I suggest you run the test with no arguments.
> (you’ll have to add a line to replication/suite.cfg for that)

Will take a look, thanks!

> > +--
> > +-- To make sure we're running
> > +box.info.status
> > +
> 
> This should always evaluate to ‘running’, you’ve just inserted some
> values and got a result back, so I’d omit this check.

Indeed, thanks!

> > +--
> > +-- Cleanup
> > +test_run:cmd("stop server replica_slave")
> > +test_run:cmd("delete server replica_slave")
> > +box.cfg{replication=""}
> 
> You didn’t set box.cfg.replication, so you shouldn’t reset it.

+1

	Cyrill


* Re: [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback
  2020-04-07 10:46 ` [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Serge Petrenko
@ 2020-04-07 11:00   ` Cyrill Gorcunov
  0 siblings, 0 replies; 20+ messages in thread
From: Cyrill Gorcunov @ 2020-04-07 11:00 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml

On Tue, Apr 07, 2020 at 01:46:08PM +0300, Serge Petrenko wrote:
> 
> Hi! Thanks for the patchset!
> 
> Commits 1,2, 5,6 LGTM except one comment:
> Please use ‘applier’ instead of ‘box/applier’
> prefixes in the commit titles.
> Similarly, ‘replication’ instead of ‘box/replication’ and
> ‘alter’ instead of ‘box/alter’. 

OK, thanks, will do.



Thread overview: 20+ messages
2020-04-04 16:15 [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Cyrill Gorcunov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 1/8] box: fix bootstrap comment Cyrill Gorcunov
2020-04-05  7:31   ` Konstantin Osipov
2020-04-05  7:56     ` Cyrill Gorcunov
2020-04-05  8:35       ` Konstantin Osipov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 2/8] box/alter: shrink txn_alter_trigger_new code Cyrill Gorcunov
2020-04-06  7:39   ` Konstantin Osipov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 3/8] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 4/8] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 5/8] box/replication: merge replica_by_id into replicaset Cyrill Gorcunov
2020-04-06  7:40   ` Konstantin Osipov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 6/8] applier: reduce applier_txn_rollback_cb code density Cyrill Gorcunov
2020-04-06  7:40   ` Konstantin Osipov
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 7/8] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
2020-04-07 10:36   ` Serge Petrenko
2020-04-04 16:15 ` [Tarantool-patches] [PATCH v11 8/8] test: add replication/applier-rollback Cyrill Gorcunov
2020-04-07 10:26   ` Serge Petrenko
2020-04-07 10:55     ` Cyrill Gorcunov
2020-04-07 10:46 ` [Tarantool-patches] [PATCH v11 0/8] box/replication: prevent nil dereference on applier rollback Serge Petrenko
2020-04-07 11:00   ` Cyrill Gorcunov
