Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv
@ 2020-01-27 21:53 Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 1/5] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

The series is on top of master branch and sits in
gorcunov/gh-4730-diag-raise-master-5

v2-4 are skipped for internal dev sake

Cyrill Gorcunov (5):
  box/request: add missing OutOfMemory diag_set
  box/applier: add missing diag_set on region_alloc failure
  box/applier: fix nil dereference in applier rollback
  errinj: add ERRINJ_REPLICA_TXN_WRITE
  test: add replication/applier-rollback

 src/box/applier.cc                          |  54 +++++-
 src/box/request.c                           |   8 +-
 src/lib/core/errinj.h                       |   1 +
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 182 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  85 +++++++++
 6 files changed, 339 insertions(+), 7 deletions(-)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v5 1/5] box/request: add missing OutOfMemory diag_set
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
@ 2020-01-27 21:53 ` Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 2/5] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

In request_create_from_tuple and request_handle_sequence
we may be unable to request memory for tuples, don't
forget to setup diag error otherwise diag_raise will
lead to nil dereference.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/box/request.c b/src/box/request.c
index 82232a155..994f2da62 100644
--- a/src/box/request.c
+++ b/src/box/request.c
@@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space,
 		 * the tuple data to WAL on commit.
 		 */
 		char *buf = region_alloc(&fiber()->gc, size);
-		if (buf == NULL)
+		if (buf == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc", "tuple");
 			return -1;
+		}
 		memcpy(buf, data, size);
 		request->tuple = buf;
 		request->tuple_end = buf + size;
@@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space)
 		size_t buf_size = (request->tuple_end - request->tuple) +
 						mp_sizeof_uint(UINT64_MAX);
 		char *tuple = region_alloc(&fiber()->gc, buf_size);
-		if (tuple == NULL)
+		if (tuple == NULL) {
+			diag_set(OutOfMemory, buf_size, "region_alloc", "tuple");
 			return -1;
+		}
 		char *tuple_end = mp_encode_array(tuple, len);
 
 		if (unlikely(key != data)) {
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v5 2/5] box/applier: add missing diag_set on region_alloc failure
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 1/5] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
@ 2020-01-27 21:53 ` Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

In case if we're hitting memory limit allocating triggers
we should setup diag error to prevent nil dereference
in diag_raise call (for example from applier_apply_tx).

Note that there are region_alloc_xc helpers which are
throwing errors but as far as I understand we need the
rollback action to process first instead of immediate
throw/catch thus we use diag_set.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index ae3d281a5..2ed5125d0 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -796,8 +796,11 @@ applier_apply_tx(struct stailq *rows)
 						     sizeof(struct trigger));
 	on_commit = (struct trigger *)region_alloc(&txn->region,
 						   sizeof(struct trigger));
-	if (on_rollback == NULL || on_commit == NULL)
+	if (on_rollback == NULL || on_commit == NULL) {
+		diag_set(OutOfMemory, sizeof(struct trigger),
+			 "region_alloc", "on_rollback/on_commit");
 		goto rollback;
+	}
 
 	trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 	txn_on_rollback(txn, on_rollback);
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 1/5] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 2/5] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
@ 2020-01-27 21:53 ` Cyrill Gorcunov
  2020-02-04 22:11   ` Konstantin Osipov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE Cyrill Gorcunov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

Currently when transaction rollback happens we just drop an existing
error setting ClientError to the replicaset.applier.diag. This action
leaves current fiber with diag=nil, which in turn leads to sigsegv once
diag_raise() called right after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |       applier_on_rollback
 |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
 |         fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

The applier_f works in try/catch cycle and handles errors depending on
what exactly happened during transaction application. It might reconnect
appliers in some cases, the applier is simply cancelled and reaped out in
others.

The problem is that the shared replicaset.applier.diag is handled on
FiberIsCancelled exception only (while it is set inside transaction
rollback action) and we never trigger this specific exception. But
even if we would the former error which has been causing the applier
abort is vanished by ClientError which is too general.

Thus:

 - on transaction rollback save the origin error which caused
   the transaction abort to the replicaset.applier.diag;

 - there are cases (such as xlog error injection) where diag
   is explicitly clear on error path, for this sake we setup
   ClientError instead;

 - trigger FiberIsCancelled exception which will log the
   problem and zap the applier;

 - put fixme mark into the code: we need to figure out
   if underlierd error is really critical one maybe we
   could retry the applier iteration instead.

Part-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..967dc91de 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,31 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
-	/* Setup shared applier diagnostic area. */
-	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+
+	/*
+	 * We must not loose the origin error, instead
+	 * lets keep it in replicaset diag instance.
+	 *
+	 * FIXME: We need to revisit this code and
+	 * figure out if we can reconnect and retry
+	 * the prelication process instead of cancelling
+	 * applier with FiberIsCancelled.
+	 */
+	struct error *e = diag_last_error(diag_get());
+	if (!e) {
+		/*
+		 * If information is already lost
+		 * (say xlog cleared diag instance)
+		 * setup general ClientError, seriously
+		 * we need to unweave this mess, if error
+		 * happened it must never been cleared
+		 * until error handling in rollback.
+		 */
+		diag_set(ClientError, ER_WAL_IO);
+		e = diag_last_error(diag_get());
+	}
+	diag_add_error(&replicaset.applier.diag, e);
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
@@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event)
 		diag_add_error(&applier->diag,
 			       diag_last_error(&replicaset.applier.diag));
 	}
-	/* Stop the applier fiber. */
+
+	/*
+	 * Something really bad happened, we can't proceed
+	 * thus stop the applier and throw FiberIsCancelled
+	 * exception which will be catched by the caller
+	 * and the fiber gracefully finish.
+	 *
+	 * FIXME: Need to make sure that this is a really
+	 * final error where we can't longer proceed and should
+	 * zap the applier, probably we could reconnect and
+	 * retry instead?
+	 */
 	fiber_cancel(applier->reader);
+	diag_set(FiberIsCancelled);
 	return 0;
 }
 
@@ -1098,6 +1132,7 @@ applier_f(va_list ap)
 		} catch (FiberIsCancelled *e) {
 			if (!diag_is_empty(&applier->diag)) {
 				diag_move(&applier->diag, &fiber()->diag);
+				diag_log();
 				applier_disconnect(applier, APPLIER_STOPPED);
 				break;
 			}
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
@ 2020-01-27 21:53 ` Cyrill Gorcunov
  2020-02-04 22:11   ` Konstantin Osipov
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 5/5] test: add replication/applier-rollback Cyrill Gorcunov
  2020-01-28 14:23 ` [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  5 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

To test rollback error nil dereference

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc    | 6 ++++++
 src/lib/core/errinj.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 967dc91de..e739f23e2 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -51,6 +51,7 @@
 #include "txn.h"
 #include "box.h"
 #include "scoped_guard.h"
+#include "errinj.h"
 
 STRS(applier_state, applier_STATE);
 
@@ -830,6 +831,11 @@ applier_apply_tx(struct stailq *rows)
 	trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL);
 	txn_on_commit(txn, on_commit);
 
+	ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, {
+		diag_set(ClientError, ER_INJECTION, "replica txn write injection");
+		goto rollback;
+	});
+
 	if (txn_write(txn) < 0)
 		goto fail;
 
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 672da2119..da6adfe97 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -135,6 +135,7 @@ struct errinj {
 	_(ERRINJ_COIO_SENDFILE_CHUNK, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_SWIM_FD_ONLY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_DYN_MODULE_COUNT, ERRINJ_INT, {.iparam = 0}) \
+	_(ERRINJ_REPLICA_TXN_WRITE, ERRINJ_BOOL, {.bparam = false}) \
 
 ENUM0(errinj_id, ERRINJ_LIST);
 extern struct errinj errinjs[];
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v5 5/5] test: add replication/applier-rollback
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (3 preceding siblings ...)
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE Cyrill Gorcunov
@ 2020-01-27 21:53 ` Cyrill Gorcunov
  2020-01-28  8:26   ` [Tarantool-patches] [PATCH v6 " Cyrill Gorcunov
  2020-01-28 14:23 ` [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  5 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 21:53 UTC (permalink / raw)
  To: tml

In the test force error injection ERRINJ_REPLICA_TXN_WRITE
to happen which will initiate applier transaction rollback.
Without the fix it will cause SIGSEGV due to lack of error
propagation.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 182 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  85 +++++++++
 3 files changed, 283 insertions(+)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua
new file mode 100644
index 000000000..26fb10055
--- /dev/null
+++ b/test/replication/applier-rollback-slave.lua
@@ -0,0 +1,16 @@
+--
+-- vim: ts=4 sw=4 et
+--
+
+print('arg', arg)
+
+box.cfg({
+    replication                 = os.getenv("MASTER"),
+    listen                      = os.getenv("LISTEN"),
+    memtx_memory                = 107374182,
+    replication_timeout         = 0.1,
+    replication_connect_timeout = 0.5,
+    read_only                   = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result
new file mode 100644
index 000000000..5a83f1e7d
--- /dev/null
+++ b/test/replication/applier-rollback.result
@@ -0,0 +1,182 @@
+-- test-run result file version 2
+#!/usr/bin/env tarantool
+ | ---
+ | ...
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+vclock_diff = require('fast_replica').vclock_diff
+ | ---
+ | ...
+errinj = box.error.injection
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+_ = box.schema.space.create('test', {engine=engine})
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+ | ---
+ | ...
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+ | ---
+ | ...
+s:insert({1, '1'})
+ | ---
+ | - [1, '1']
+ | ...
+s:insert({2, '2'})
+ | ---
+ | - [2, '2']
+ | ...
+s:insert({3, '3'})
+ | ---
+ | - [3, '3']
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+box.info.id
+ | ---
+ | - 1
+ | ...
+box.info.lsn
+ | ---
+ | - 9
+ | ...
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+
+box.info.status
+ | ---
+ | - running
+ | ...
+box.info.id
+ | ---
+ | - 2
+ | ...
+box.info.lsn
+ | ---
+ | - 0
+ | ...
+
+fiber = require('fiber')
+ | ---
+ | ...
+errinj = box.error.injection
+ | ---
+ | ...
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+s:insert({4, '4'})
+ | ---
+ | - [4, '4']
+ | ...
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(1) end
+ | ---
+ | ...
+
+--
+-- Such error cause the applier to be
+-- cancelled and reaped, thus stop the
+-- slave node and restart it back
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Wait until the new data fetched and sync'ed
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("cleanup server replica_slave")
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua
new file mode 100644
index 000000000..c164c58eb
--- /dev/null
+++ b/test/replication/applier-rollback.test.lua
@@ -0,0 +1,85 @@
+#!/usr/bin/env tarantool
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+
+vclock_diff = require('fast_replica').vclock_diff
+errinj = box.error.injection
+engine = test_run:get_cfg('engine')
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+test_run:cmd("start server replica_slave")
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+
+_ = box.schema.space.create('test', {engine=engine})
+s = box.space.test
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+s:insert({1, '1'})
+s:insert({2, '2'})
+s:insert({3, '3'})
+
+--
+-- To make sure we're running
+box.info.status
+box.info.id
+box.info.lsn
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+
+box.info.status
+box.info.id
+box.info.lsn
+
+fiber = require('fiber')
+errinj = box.error.injection
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+s:insert({4, '4'})
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(1) end
+
+--
+-- Such error cause the applier to be
+-- cancelled and reaped, thus stop the
+-- slave node and restart it back
+test_run:cmd('switch default')
+test_run:cmd("stop server replica_slave")
+test_run:cmd("start server replica_slave")
+
+--
+-- Wait until the new data fetched and sync'ed
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+test_run:cmd("cleanup server replica_slave")
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v6 5/5] test: add replication/applier-rollback
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 5/5] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-01-28  8:26   ` Cyrill Gorcunov
  0 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-28  8:26 UTC (permalink / raw)
  To: tml

In the test force error injection ERRINJ_REPLICA_TXN_WRITE
to happen which will initiate applier transaction rollback.
Without the fix it will cause SIGSEGV due to lack of error
propagation.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---

I've updated test a bit since it didn't pass on travis due
to redundant mutuable information printed. Lets see if it
pass tests on gitlab

 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 166 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  81 ++++++++++
 3 files changed, 263 insertions(+)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua
new file mode 100644
index 000000000..26fb10055
--- /dev/null
+++ b/test/replication/applier-rollback-slave.lua
@@ -0,0 +1,16 @@
+--
+-- vim: ts=4 sw=4 et
+--
+
+print('arg', arg)
+
+box.cfg({
+    replication                 = os.getenv("MASTER"),
+    listen                      = os.getenv("LISTEN"),
+    memtx_memory                = 107374182,
+    replication_timeout         = 0.1,
+    replication_connect_timeout = 0.5,
+    read_only                   = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result
new file mode 100644
index 000000000..0349d5723
--- /dev/null
+++ b/test/replication/applier-rollback.result
@@ -0,0 +1,166 @@
+-- test-run result file version 2
+#!/usr/bin/env tarantool
+ | ---
+ | ...
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+vclock_diff = require('fast_replica').vclock_diff
+ | ---
+ | ...
+errinj = box.error.injection
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+_ = box.schema.space.create('test', {engine=engine})
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+ | ---
+ | ...
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+ | ---
+ | ...
+s:insert({1, '1'})
+ | ---
+ | - [1, '1']
+ | ...
+s:insert({2, '2'})
+ | ---
+ | - [2, '2']
+ | ...
+s:insert({3, '3'})
+ | ---
+ | - [3, '3']
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+
+box.info.status
+ | ---
+ | - running
+ | ...
+
+fiber = require('fiber')
+ | ---
+ | ...
+errinj = box.error.injection
+ | ---
+ | ...
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+s:insert({4, '4'})
+ | ---
+ | - [4, '4']
+ | ...
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(1) end
+ | ---
+ | ...
+
+--
+-- Such error cause the applier to be
+-- cancelled and reaped, thus stop the
+-- slave node and restart it back
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Wait until the new data fetched and sync'ed
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("cleanup server replica_slave")
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua
new file mode 100644
index 000000000..d74f8df0e
--- /dev/null
+++ b/test/replication/applier-rollback.test.lua
@@ -0,0 +1,81 @@
+#!/usr/bin/env tarantool
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+
+vclock_diff = require('fast_replica').vclock_diff
+errinj = box.error.injection
+engine = test_run:get_cfg('engine')
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+test_run:cmd("start server replica_slave")
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+
+_ = box.schema.space.create('test', {engine=engine})
+s = box.space.test
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+s:insert({1, '1'})
+s:insert({2, '2'})
+s:insert({3, '3'})
+
+--
+-- To make sure we're running
+box.info.status
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+
+box.info.status
+
+fiber = require('fiber')
+errinj = box.error.injection
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+s:insert({4, '4'})
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(1) end
+
+--
+-- Such error cause the applier to be
+-- cancelled and reaped, thus stop the
+-- slave node and restart it back
+test_run:cmd('switch default')
+test_run:cmd("stop server replica_slave")
+test_run:cmd("start server replica_slave")
+
+--
+-- Wait until the new data fetched and sync'ed
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+test_run:cmd("cleanup server replica_slave")
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv
  2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (4 preceding siblings ...)
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 5/5] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-01-28 14:23 ` Cyrill Gorcunov
  5 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-01-28 14:23 UTC (permalink / raw)
  To: tml

On Tue, Jan 28, 2020 at 12:53:01AM +0300, Cyrill Gorcunov wrote:
> The series is on top of master branch and sits in
> gorcunov/gh-4730-diag-raise-master-5
> 
> v2-4 are skipped for internal dev sake
>

Drop this series please, I will send an update.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
@ 2020-02-04 22:11   ` Konstantin Osipov
  2020-02-05  8:27     ` Cyrill Gorcunov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-04 22:11 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/01/28 10:16]:
> +
> +	/*
> +	 * We must not loose the origin error, instead
> +	 * lets keep it in replicaset diag instance.
> +	 *
> +	 * FIXME: We need to revisit this code and
> +	 * figure out if we can reconnect and retry
> +	 * the prelication process instead of cancelling
> +	 * applier with FiberIsCancelled.

First of all, we're dealing with a regression introduced by the 
parallel applier patch. 

Could you please describe what is triggering the error? 

> +		/*
> +		 * If information is already lost
> +		 * (say xlog cleared diag instance)

I don't understand this comment. How can it be lost exactly?

> +		 * setup general ClientError, seriously
> +		 * we need to unweave this mess, if error
> +		 * happened it must never been cleared
> +		 * until error handling in rollback.

:-/

> +		 */
> +		diag_set(ClientError, ER_WAL_IO);
> +		e = diag_last_error(diag_get());
> +	}
> +	diag_add_error(&replicaset.applier.diag, e);
> +
>  	/* Broadcast the rollback event across all appliers. */
>  	trigger_run(&replicaset.applier.on_rollback, event);
>  	/* Rollback applier vclock to the committed one. */
> @@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event)
>  		diag_add_error(&applier->diag,
>  			       diag_last_error(&replicaset.applier.diag));
>  	}
> -	/* Stop the applier fiber. */
> +
> +	/*
> +	 * Something really bad happened, we can't proceed
> +	 * thus stop the applier and throw FiberIsCancelled
> +	 * exception which will be catched by the caller
> +	 * and the fiber gracefully finish.
> +	 *
> +	 * FIXME: Need to make sure that this is a really
> +	 * final error where we can't longer proceed and should
> +	 * zap the applier, probably we could reconnect and
> +	 * retry instead?
> +	 */
>  	fiber_cancel(applier->reader);

Let's begin by explaining why we need to cancel the reader fiber here.




> +	diag_set(FiberIsCancelled);

This is clearly a clutch: first you make an effort to set
replicaset.applier.diag and then it is not used by diag_raise().


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE
  2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE Cyrill Gorcunov
@ 2020-02-04 22:11   ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-04 22:11 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/01/28 10:16]:
> To test rollback error nil dereference

What's the point of injecting an impossible event?


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
  2020-02-04 22:11   ` Konstantin Osipov
@ 2020-02-05  8:27     ` Cyrill Gorcunov
  2020-02-05  9:55       ` Konstantin Osipov
  0 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-05  8:27 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Wed, Feb 05, 2020 at 01:11:10AM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/01/28 10:16]:
> > +
> > +	/*
> > +	 * We must not loose the origin error, instead
> > +	 * lets keep it in replicaset diag instance.
> > +	 *
> > +	 * FIXME: We need to revisit this code and
> > +	 * figure out if we can reconnect and retry
> > +	 * the prelication process instead of cancelling
> > +	 * applier with FiberIsCancelled.
> 
> First of all, we're dealing with a regression introduced by the 
> parallel applier patch. 
> 
> Could you please describe what is triggering the error?

Kostya, it is vague area where we only managed to narrow down
that the issue is coming once we started rollback procedure
(i think real trigger was somewhere inside vynil processing,
 for example inside b+ tree failue)

> 
> > +		/*
> > +		 * If information is already lost
> > +		 * (say xlog cleared diag instance)
> 
> I don't understand this comment. How can it be lost exactly?

Hmmm, I think you're right. Actually unweaving the all possible
call traces by hands (which I had to do) is quite exhausting task
so I might be wrong here.

> 
> > +		 * setup general ClientError, seriously
> > +		 * we need to unweave this mess, if error
> > +		 * happened it must never been cleared
> > +		 * until error handling in rollback.
> 
> :-/
> 
> > +		 */
> > +		diag_set(ClientError, ER_WAL_IO);
> > +		e = diag_last_error(diag_get());
> > +	}
> > +	diag_add_error(&replicaset.applier.diag, e);
> > +
> >  	/* Broadcast the rollback event across all appliers. */
> >  	trigger_run(&replicaset.applier.on_rollback, event);
> >  	/* Rollback applier vclock to the committed one. */
> > @@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event)
> >  		diag_add_error(&applier->diag,
> >  			       diag_last_error(&replicaset.applier.diag));
> >  	}
> > -	/* Stop the applier fiber. */
> > +
> > +	/*
> > +	 * Something really bad happened, we can't proceed
> > +	 * thus stop the applier and throw FiberIsCancelled
> > +	 * exception which will be catched by the caller
> > +	 * and the fiber gracefully finish.
> > +	 *
> > +	 * FIXME: Need to make sure that this is a really
> > +	 * final error where we can't longer proceed and should
> > +	 * zap the applier, probably we could reconnect and
> > +	 * retry instead?
> > +	 */
> >  	fiber_cancel(applier->reader);
> 
> Let's begin by explaining why we need to cancel the reader fiber here.

This fiber_cancel has been already here, I only added diag_set(FiberIsCancelled)
to throw an exception thus the caller would zap this applier fiber.
Actually I think we could retry instead of reaping off the fiber
completely but it requies more deep understanding of how applier
works. So I left it in comment.

> 
> 
> 
> 
> > +	diag_set(FiberIsCancelled);
> 
> This is clearly a clutch: first you make an effort to set
> replicaset.applier.diag and then it is not used by diag_raise().

Not exactly, if I understand the initial logic of this applier
try/cath branch  we need to setup replicaset.applier.diag and
then on FiberIsCancelled we should move it from replicaset.applier.diag
back to current fiber->diag.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
  2020-02-05  8:27     ` Cyrill Gorcunov
@ 2020-02-05  9:55       ` Konstantin Osipov
  2020-02-05 10:48         ` Cyrill Gorcunov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-05  9:55 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 11:31]:
> > I don't understand this comment. How can it be lost exactly?
> 
> Hmmm, I think you're right. Actually unweaving the all possible
> call traces by hands (which I had to do) is quite exhausting task
> so I might be wrong here.

It was added by parallel applier patch, so the most likely cause
of the bug is that this way of cancelling parallel appliers on
conflict was not tested well enough. Previously we only had a
single applier per peer so did not need to coordinate.

> > Let's begin by explaining why we need to cancel the reader fiber here.
> 
> This fiber_cancel has been already here, I only added diag_set(FiberIsCancelled)
> to throw an exception thus the caller would zap this applier fiber.
> Actually I think we could retry instead of reaping off the fiber
> completely but it requies more deep understanding of how applier
> works. So I left it in comment.

No, we can't and shouldn't retry. Retry handling is done elsewhere already - see replication_skip_conflict.

If replication is stopped by an apply error, it's most likely a
transaction conflict, indicating that active-active setup is
broken, so it has to be resolved by a DBA (which can set
replication_skip_conflict). This is why it's critical to preserve
the original error.
> 
> Not exactly, if I understand the initial logic of this applier
> try/cath branch  we need to setup replicaset.applier.diag and
> then on FiberIsCancelled we should move it from replicaset.applier.diag
> back to current fiber->diag.

Please dig into what is "current" here. Which fiber is current if
there are many fibers handling a single peer?

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
  2020-02-05  9:55       ` Konstantin Osipov
@ 2020-02-05 10:48         ` Cyrill Gorcunov
  0 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-05 10:48 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Wed, Feb 05, 2020 at 12:55:24PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 11:31]:
> > > I don't understand this comment. How can it be lost exactly?
> > 
> > Hmmm, I think you're right. Actually unweaving the all possible
> > call traces by hands (which I had to do) is quite exhausting task
> > so I might be wrong here.
> 
> It was added by parallel applier patch, so the most likely cause
> of the bug is that this way of cancelling parallel appliers on
> conflict was not tested well enough. Previously we only had a
> single applier per peer so did not need to coordinate.

I see, thanks!

> > > Let's begin by explaining why we need to cancel the reader fiber here.
> > 
> > This fiber_cancel has been already here, I only added diag_set(FiberIsCancelled)
> > to throw an exception thus the caller would zap this applier fiber.
> > Actually I think we could retry instead of reaping off the fiber
> > completely but it requies more deep understanding of how applier
> > works. So I left it in comment.
> 
> No, we can't and shouldn't retry. Retry handling is done elsewhere already -
> see replication_skip_conflict.

Sure, will do, thanks for pointing.

> If replication is stopped by an apply error, it's most likely a
> transaction conflict, indicating that active-active setup is
> broken, so it has to be resolved by a DBA (which can set
> replication_skip_conflict). This is why it's critical to preserve
> the original error.
> > 
> > Not exactly, if I understand the initial logic of this applier
> > try/cath branch  we need to setup replicaset.applier.diag and
> > then on FiberIsCancelled we should move it from replicaset.applier.diag
> > back to current fiber->diag.
> 
> Please dig into what is "current" here. Which fiber is current if
> there are many fibers handling a single peer?

Will do, tahnks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-05 10:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-27 21:53 [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 1/5] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 2/5] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
2020-02-04 22:11   ` Konstantin Osipov
2020-02-05  8:27     ` Cyrill Gorcunov
2020-02-05  9:55       ` Konstantin Osipov
2020-02-05 10:48         ` Cyrill Gorcunov
2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 4/5] errinj: add ERRINJ_REPLICA_TXN_WRITE Cyrill Gorcunov
2020-02-04 22:11   ` Konstantin Osipov
2020-01-27 21:53 ` [Tarantool-patches] [PATCH v5 5/5] test: add replication/applier-rollback Cyrill Gorcunov
2020-01-28  8:26   ` [Tarantool-patches] [PATCH v6 " Cyrill Gorcunov
2020-01-28 14:23 ` [Tarantool-patches] [PATCH v5 0/5] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox