Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv
@ 2020-02-14 14:03 Cyrill Gorcunov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw)
  To: tml

In v10 I tried to sanitize code a bit. Still there are few problems we will
need (well, I'll need) to solve:

 - move error injection deeper into transaction code
 - keep former error
 - make sure that we really need separated replicaset diag instance
   (i must confess i still not follow why we need it)

Anyway with the patch we stop causing nil dereference and write
error log. Same time since I'll have to deal with inmemory wal
task I won't leave these FIXME and continue working on them.

Cyrill Gorcunov (4):
  box/request: add missing OutOfMemory diag_set
  box/applier: add missing diag_set on region_alloc failure
  box/applier: prevent nil dereference on applier rollback
  test: add replication/applier-rollback

 src/box/applier.cc                          |  43 +++++-
 src/box/request.c                           |   8 +-
 src/lib/core/errinj.h                       |   1 +
 test/box/errinj.result                      |   1 +
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 160 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  79 ++++++++++
 test/replication/suite.ini                  |   2 +-
 8 files changed, 305 insertions(+), 5 deletions(-)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua


base-commit: 95b9a48daf00b6686b8e1993a023daa0755b460e
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set
  2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
@ 2020-02-14 14:03 ` Cyrill Gorcunov
  2020-02-15 17:33   ` Konstantin Osipov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw)
  To: tml

In request_create_from_tuple and request_handle_sequence
we may be unable to request memory for tuples, don't
forget to setup diag error otherwise diag_raise will
lead to nil dereference.

Acked-by: Sergey Ostanevich <sergos@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/box/request.c b/src/box/request.c
index 82232a155..994f2da62 100644
--- a/src/box/request.c
+++ b/src/box/request.c
@@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space,
 		 * the tuple data to WAL on commit.
 		 */
 		char *buf = region_alloc(&fiber()->gc, size);
-		if (buf == NULL)
+		if (buf == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc", "tuple");
 			return -1;
+		}
 		memcpy(buf, data, size);
 		request->tuple = buf;
 		request->tuple_end = buf + size;
@@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space)
 		size_t buf_size = (request->tuple_end - request->tuple) +
 						mp_sizeof_uint(UINT64_MAX);
 		char *tuple = region_alloc(&fiber()->gc, buf_size);
-		if (tuple == NULL)
+		if (tuple == NULL) {
+			diag_set(OutOfMemory, buf_size, "region_alloc", "tuple");
 			return -1;
+		}
 		char *tuple_end = mp_encode_array(tuple, len);
 
 		if (unlikely(key != data)) {
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure
  2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
@ 2020-02-14 14:03 ` Cyrill Gorcunov
  2020-02-15 17:34   ` Konstantin Osipov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw)
  To: tml

In case if we're hitting memory limit allocating triggers
we should setup diag error to prevent nil dereference
in diag_raise call (for example from applier_apply_tx).

Note that there are region_alloc_xc helpers which are
throwing errors but as far as I understand we need the
rollback action to process first instead of immediate
throw/catch thus we use diag_set.

Acked-by: Sergey Ostanevich <sergos@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index ae3d281a5..2ed5125d0 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -796,8 +796,11 @@ applier_apply_tx(struct stailq *rows)
 						     sizeof(struct trigger));
 	on_commit = (struct trigger *)region_alloc(&txn->region,
 						   sizeof(struct trigger));
-	if (on_rollback == NULL || on_commit == NULL)
+	if (on_rollback == NULL || on_commit == NULL) {
+		diag_set(OutOfMemory, sizeof(struct trigger),
+			 "region_alloc", "on_rollback/on_commit");
 		goto rollback;
+	}
 
 	trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 	txn_on_rollback(txn, on_rollback);
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback
  2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
@ 2020-02-14 14:03 ` Cyrill Gorcunov
  2020-02-15 17:37   ` Konstantin Osipov
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov
  2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  4 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw)
  To: tml

Currently when transaction rollback happens we just drop an existing
error setting ClientError to the replicaset.applier.diag. This action
leaves current fiber with diag=nil, which in turn leads to sigsegv once
diag_raise() called right after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |       applier_on_rollback
 |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
 |         fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

The applier_f works in try/catch cycle and handles errors depending on
what exactly happened during transaction application. It might reconnect
appliers in some cases, the applier is simply cancelled and reaped out in
others.

The problem is that the shared replicaset.applier.diag is handled on
FiberIsCancelled exception only (while it is set inside transaction
rollback action) and we never trigger this specific exception. But
even if we would the former error which has been causing the applier
abort is vanished by ClientError which is too general.

Thus:

 - log the former error, but leave ClientError as a new one
   to be preserved in replicaset diag instance (I don't want
   to make an intrusive patch which would change a logic since
   I'm far from being expert in this code);

 - put fixme mark into the code: we need to rework it in a
   more sense way.

Part-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..a4a73d383 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,33 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
+	/*
+	 * FIXME: Do not clear fiber()->diag since it
+	 * cause nil dereference
+	 *
+	 *   applier_subscribe
+	 *     applier_apply_tx
+	 *       diag_raise
+	 *
+	 * In turn we need to redesign this code:
+	 *  - preserve original error or log it somewhere
+	 *  - make the error path more clear
+	 *
+	 *  We must never reach this point with clean diag
+	 *  area, if we do it means we're simply screwed
+	 *  somewhere and there is a bug.
+	 */
+
+	if (!diag_is_empty(diag_get()))
+		diag_log();
+	else
+		say_warn_ratelimited("applier_txn_rollback_cb: empty diag");
+
 	/* Setup shared applier diagnostic area. */
 	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+	diag_add_error(&replicaset.applier.diag,
+		       diag_last_error(&fiber()->diag));
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback
  2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
@ 2020-02-14 14:03 ` Cyrill Gorcunov
  2020-02-15 17:38   ` Konstantin Osipov
  2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  4 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw)
  To: tml

In the test force error injection ERRINJ_REPLICA_TXN_WRITE
to happen which will initiate applier transaction rollback.
Without the fix it will cause SIGSEGV due to lack of error
propagation.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc                          |  12 ++
 src/lib/core/errinj.h                       |   1 +
 test/box/errinj.result                      |   1 +
 test/replication/applier-rollback-slave.lua |  16 ++
 test/replication/applier-rollback.result    | 160 ++++++++++++++++++++
 test/replication/applier-rollback.test.lua  |  79 ++++++++++
 test/replication/suite.ini                  |   2 +-
 7 files changed, 270 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/applier-rollback-slave.lua
 create mode 100644 test/replication/applier-rollback.result
 create mode 100644 test/replication/applier-rollback.test.lua

diff --git a/src/box/applier.cc b/src/box/applier.cc
index a4a73d383..574d086e0 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -51,6 +51,7 @@
 #include "txn.h"
 #include "box.h"
 #include "scoped_guard.h"
+#include "errinj.h"
 
 STRS(applier_state, applier_STATE);
 
@@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows)
 	trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL);
 	txn_on_commit(txn, on_commit);
 
+	/*
+	 * FIXME: Move this injection somewhere inside
+	 * txn_write, but since it has own bug (gh 4776)
+	 * we will cure it a bit later.
+	 */
+	ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, {
+		diag_set(ClientError, ER_INJECTION,
+			 "replica txn write injection");
+		goto rollback;
+	});
+
 	if (txn_write(txn) < 0)
 		goto fail;
 
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 672da2119..da6adfe97 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -135,6 +135,7 @@ struct errinj {
 	_(ERRINJ_COIO_SENDFILE_CHUNK, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_SWIM_FD_ONLY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_DYN_MODULE_COUNT, ERRINJ_INT, {.iparam = 0}) \
+	_(ERRINJ_REPLICA_TXN_WRITE, ERRINJ_BOOL, {.bparam = false}) \
 
 ENUM0(errinj_id, ERRINJ_LIST);
 extern struct errinj errinjs[];
diff --git a/test/box/errinj.result b/test/box/errinj.result
index f043c6689..60db28938 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -63,6 +63,7 @@ evals
   - ERRINJ_RELAY_SEND_DELAY: false
   - ERRINJ_RELAY_TIMEOUT: 0
   - ERRINJ_REPLICA_JOIN_DELAY: false
+  - ERRINJ_REPLICA_TXN_WRITE: false
   - ERRINJ_SIO_READ_MAX: -1
   - ERRINJ_SNAP_COMMIT_DELAY: false
   - ERRINJ_SNAP_WRITE_DELAY: false
diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua
new file mode 100644
index 000000000..26fb10055
--- /dev/null
+++ b/test/replication/applier-rollback-slave.lua
@@ -0,0 +1,16 @@
+--
+-- vim: ts=4 sw=4 et
+--
+
+print('arg', arg)
+
+box.cfg({
+    replication                 = os.getenv("MASTER"),
+    listen                      = os.getenv("LISTEN"),
+    memtx_memory                = 107374182,
+    replication_timeout         = 0.1,
+    replication_connect_timeout = 0.5,
+    read_only                   = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result
new file mode 100644
index 000000000..3209fc7fd
--- /dev/null
+++ b/test/replication/applier-rollback.result
@@ -0,0 +1,160 @@
+-- test-run result file version 2
+#!/usr/bin/env tarantool
+ | ---
+ | ...
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+errinj = box.error.injection
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+ | ---
+ | - true
+ | ...
+test_run:cmd("start server replica_slave")
+ | ---
+ | - true
+ | ...
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+_ = box.schema.space.create('test', {engine=engine})
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+ | ---
+ | ...
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+ | ---
+ | ...
+s:insert({1, '1'})
+ | ---
+ | - [1, '1']
+ | ...
+s:insert({2, '2'})
+ | ---
+ | - [2, '2']
+ | ...
+s:insert({3, '3'})
+ | ---
+ | - [3, '3']
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+ | ---
+ | ...
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+
+--
+-- To make sure we're running
+box.info.status
+ | ---
+ | - running
+ | ...
+
+errinj = box.error.injection
+ | ---
+ | ...
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+ | ---
+ | - ok
+ | ...
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+s:insert({4, '4'})
+ | ---
+ | - [4, '4']
+ | ...
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
+ | ---
+ | ...
+
+----
+---- Such error cause the applier to be
+---- cancelled and reaped, thus stop the
+---- slave node and cleanup
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+ | ---
+ | - true
+ | ...
+test_run:cmd("delete server replica_slave")
+ | ---
+ | - true
+ | ...
+box.cfg{replication=""}
+ | ---
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua
new file mode 100644
index 000000000..d31eff9f0
--- /dev/null
+++ b/test/replication/applier-rollback.test.lua
@@ -0,0 +1,79 @@
+#!/usr/bin/env tarantool
+--
+-- vim: ts=4 sw=4 et
+--
+
+test_run = require('test_run').new()
+
+errinj = box.error.injection
+engine = test_run:get_cfg('engine')
+
+--
+-- Allow replica to connect to us
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Create replica instance, we're the master and
+-- start it, no data to sync yet though
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'")
+test_run:cmd("start server replica_slave")
+
+--
+-- Fill initial data on the master instance
+test_run:cmd('switch default')
+
+_ = box.schema.space.create('test', {engine=engine})
+s = box.space.test
+
+s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}})
+
+_ = s:create_index('primary', {type = 'tree', parts = {'id'}})
+s:insert({1, '1'})
+s:insert({2, '2'})
+s:insert({3, '3'})
+
+--
+-- To make sure we're running
+box.info.status
+
+--
+-- Wait for data from master get propagated
+test_run:wait_lsn('replica_slave', 'default')
+
+--
+-- Now inject error into slave instance
+test_run:cmd('switch replica_slave')
+
+--
+-- To make sure we're running
+box.info.status
+
+errinj = box.error.injection
+errinj.set('ERRINJ_REPLICA_TXN_WRITE', true)
+
+--
+-- Jump back to master node and write new
+-- entry which should cause error to happen
+-- on slave instance
+test_run:cmd('switch default')
+s:insert({4, '4'})
+
+--
+-- Wait for error to trigger
+test_run:cmd('switch replica_slave')
+fiber = require('fiber')
+while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end
+
+----
+---- Such error cause the applier to be
+---- cancelled and reaped, thus stop the
+---- slave node and cleanup
+test_run:cmd('switch default')
+
+--
+-- Cleanup
+test_run:cmd("stop server replica_slave")
+test_run:cmd("delete server replica_slave")
+box.cfg{replication=""}
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index ed1de3140..b804b85f6 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua applier-rollback.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.20.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv
  2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (3 preceding siblings ...)
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-02-14 14:12 ` Cyrill Gorcunov
  4 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-14 14:12 UTC (permalink / raw)
  To: tml

On Fri, Feb 14, 2020 at 05:03:35PM +0300, Cyrill Gorcunov wrote:
> In v10 I tried to sanitize code a bit. Still there are few problems we will
> need (well, I'll need) to solve:

https://github.com/tarantool/tarantool/issues/4730
https://github.com/tarantool/tarantool/tree/gorcunov/gh-4730-diag-raise-master-10

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
@ 2020-02-15 17:33   ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-15 17:33 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
> In request_create_from_tuple and request_handle_sequence
> we may be unable to request memory for tuples, don't
> forget to setup diag error otherwise diag_raise will
> lead to nil dereference.
> 
> Acked-by: Sergey Ostanevich <sergos@tarantool.org>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  src/box/request.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 

lgmt

> diff --git a/src/box/request.c b/src/box/request.c
> index 82232a155..994f2da62 100644
> --- a/src/box/request.c
> +++ b/src/box/request.c
> @@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space,
>  		 * the tuple data to WAL on commit.
>  		 */
>  		char *buf = region_alloc(&fiber()->gc, size);
> -		if (buf == NULL)
> +		if (buf == NULL) {
> +			diag_set(OutOfMemory, size, "region_alloc", "tuple");
>  			return -1;
> +		}
>  		memcpy(buf, data, size);
>  		request->tuple = buf;
>  		request->tuple_end = buf + size;
> @@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space)
>  		size_t buf_size = (request->tuple_end - request->tuple) +
>  						mp_sizeof_uint(UINT64_MAX);
>  		char *tuple = region_alloc(&fiber()->gc, buf_size);
> -		if (tuple == NULL)
> +		if (tuple == NULL) {
> +			diag_set(OutOfMemory, buf_size, "region_alloc", "tuple");
>  			return -1;
> +		}
>  		char *tuple_end = mp_encode_array(tuple, len);
>  
>  		if (unlikely(key != data)) {
> -- 
> 2.20.1

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
@ 2020-02-15 17:34   ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-15 17:34 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
> In case if we're hitting memory limit allocating triggers
> we should setup diag error to prevent nil dereference
> in diag_raise call (for example from applier_apply_tx).
> 
> Note that there are region_alloc_xc helpers which are
> throwing errors but as far as I understand we need the
> rollback action to process first instead of immediate
> throw/catch thus we use diag_set.
> 
> Acked-by: Sergey Ostanevich <sergos@tarantool.org>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

lgtm

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
@ 2020-02-15 17:37   ` Konstantin Osipov
  2020-02-15 18:24     ` Cyrill Gorcunov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-15 17:37 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
> +	/*
> +	 * FIXME: Do not clear fiber()->diag since it
> +	 * cause nil dereference
> +	 *
> +	 *   applier_subscribe
> +	 *     applier_apply_tx
> +	 *       diag_raise
> +	 *
> +	 * In turn we need to redesign this code:
> +	 *  - preserve original error or log it somewhere
> +	 *  - make the error path more clear
> +	 *
> +	 *  We must never reach this point with clean diag
> +	 *  area, if we do it means we're simply screwed
> +	 *  somewhere and there is a bug.

I think this comment is obsolete now with the fix below.

You no longer clear fiber->diag.

> +	 */
> +
> +	if (!diag_is_empty(diag_get()))
> +		diag_log();
> +	else
> +		say_warn_ratelimited("applier_txn_rollback_cb: empty diag");

You can also add assert here for debug mode. It should never
happen.

> +
>  	/* Setup shared applier diagnostic area. */
>  	diag_set(ClientError, ER_WAL_IO);
> -	diag_move(&fiber()->diag, &replicaset.applier.diag);
> +	diag_add_error(&replicaset.applier.diag,
> +		       diag_last_error(&fiber()->diag));
> +

It would be nice to explain in the comment why you want to
preserve the original error in the fiber here: because when later
this fiber is joined in (add call site here), we may want to check
its diagnostics area.

>  	/* Broadcast the rollback event across all appliers. */
>  	trigger_run(&replicaset.applier.on_rollback, event);
>  	/* Rollback applier vclock to the committed one. */

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback
  2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov
@ 2020-02-15 17:38   ` Konstantin Osipov
  2020-02-15 17:59     ` Cyrill Gorcunov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-15 17:38 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
>  STRS(applier_state, applier_STATE);
>  
> @@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows)
>  	trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL);
>  	txn_on_commit(txn, on_commit);
>  
> +	/*
> +	 * FIXME: Move this injection somewhere inside
> +	 * txn_write, but since it has own bug (gh 4776)
> +	 * we will cure it a bit later.
> +	 */
> +	ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, {
> +		diag_set(ClientError, ER_INJECTION,
> +			 "replica txn write injection");
> +		goto rollback;
> +	});

I thought yo wrote in the changes in v10 that you pushed the error
down into txn layer. But here you don't do it. What am I missing?


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback
  2020-02-15 17:38   ` Konstantin Osipov
@ 2020-02-15 17:59     ` Cyrill Gorcunov
  0 siblings, 0 replies; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-15 17:59 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Sat, Feb 15, 2020 at 08:38:51PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
> >  STRS(applier_state, applier_STATE);
> >  
> > @@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows)
> >  	trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL);
> >  	txn_on_commit(txn, on_commit);
> >  
> > +	/*
> > +	 * FIXME: Move this injection somewhere inside
> > +	 * txn_write, but since it has own bug (gh 4776)
> > +	 * we will cure it a bit later.
> > +	 */
> > +	ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, {
> > +		diag_set(ClientError, ER_INJECTION,
> > +			 "replica txn write injection");
> > +		goto rollback;
> > +	});
> 
> I thought yo wrote in the changes in v10 that you pushed the error
> down into txn layer. But here you don't do it. What am I missing?

Not yet. Pushing it sown cause https://github.com/tarantool/tarantool/issues/4776
I plan to fix it and then get back to this issue. Look this
all code is still a bit vague for me, so I've to understand every
byte before doing some more deep changes. For exactly this reason
the patch in the series does NOT change the current logic of
code flow much but tries to leave everything as it were except
a hot fix for nil dereference.

That said, Kostya, I remember about your proposal and FIXME
left here for exactly that.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback
  2020-02-15 17:37   ` Konstantin Osipov
@ 2020-02-15 18:24     ` Cyrill Gorcunov
  2020-02-15 18:51       ` Konstantin Osipov
  0 siblings, 1 reply; 13+ messages in thread
From: Cyrill Gorcunov @ 2020-02-15 18:24 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Sat, Feb 15, 2020 at 08:37:41PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]:
> > +	/*
> > +	 * FIXME: Do not clear fiber()->diag since it
> > +	 * cause nil dereference
> > +	 *
> > +	 *   applier_subscribe
> > +	 *     applier_apply_tx
> > +	 *       diag_raise
> > +	 *
> > +	 * In turn we need to redesign this code:
> > +	 *  - preserve original error or log it somewhere
> > +	 *  - make the error path more clear
> > +	 *
> > +	 *  We must never reach this point with clean diag
> > +	 *  area, if we do it means we're simply screwed
> > +	 *  somewhere and there is a bug.
> 
> I think this comment is obsolete now with the fix below.

The thing is that I saw once the error path which may end
up here without setting diag (well, to be honest i can't
plot a call graph right now but it was related to our
"light" salad helpers and [probably] bptree). That's why
I don't wanted to put assert. But if you prefer sure, I
may update the comment and put assert.

> 
> You no longer clear fiber->diag.
> 
> > +	 */
> > +
> > +	if (!diag_is_empty(diag_get()))
> > +		diag_log();
> > +	else
> > +		say_warn_ratelimited("applier_txn_rollback_cb: empty diag");
> 
> You can also add assert here for debug mode. It should never
> happen.
> 
> > +
> >  	/* Setup shared applier diagnostic area. */
> >  	diag_set(ClientError, ER_WAL_IO);
> > -	diag_move(&fiber()->diag, &replicaset.applier.diag);
> > +	diag_add_error(&replicaset.applier.diag,
> > +		       diag_last_error(&fiber()->diag));
> > +
> 
> It would be nice to explain in the comment why you want to
> preserve the original error in the fiber here: because when later
> this fiber is joined in (add call site here), we may want to check
> its diagnostics area.

You know I think we might need to drop replicaset.diag completely
and use applier.diag instead. But again, Kostya, I don't fully
understand this code yet, so I need more time to come with some
clean solution (which I plan to implement on the top).

> 
> >  	/* Broadcast the rollback event across all appliers. */
> >  	trigger_run(&replicaset.applier.on_rollback, event);
> >  	/* Rollback applier vclock to the committed one. */
> 
> -- 
> Konstantin Osipov, Moscow, Russia
> 

	Cyrill

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback
  2020-02-15 18:24     ` Cyrill Gorcunov
@ 2020-02-15 18:51       ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2020-02-15 18:51 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/15 21:26]:
> > It would be nice to explain in the comment why you want to
> > preserve the original error in the fiber here: because when later
> > this fiber is joined in (add call site here), we may want to check
> > its diagnostics area.
> 
> You know I think we might need to drop replicaset.diag completely
> and use applier.diag instead.

I don't mind, but that could be too intrusive for a bug fix.

I still don't see why you can't patch the crash site - i.e. the
place which joins the fiber with broken diag?


Let me remind you that this is just a regression from the parallel
applier patch, and not a big one - just a coding bug. So let's fix
the coding bug first and refactor the patch second.

> understand this code yet, so I need more time to come with some
> clean solution (which I plan to implement on the top).
> 
> > 
> > >  	/* Broadcast the rollback event across all appliers. */
> > >  	trigger_run(&replicaset.applier.on_rollback, event);
> > >  	/* Rollback applier vclock to the committed one. */


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-15 18:51 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
2020-02-15 17:33   ` Konstantin Osipov
2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
2020-02-15 17:34   ` Konstantin Osipov
2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov
2020-02-15 17:37   ` Konstantin Osipov
2020-02-15 18:24     ` Cyrill Gorcunov
2020-02-15 18:51       ` Konstantin Osipov
2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov
2020-02-15 17:38   ` Konstantin Osipov
2020-02-15 17:59     ` Cyrill Gorcunov
2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox