Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv
@ 2020-01-26 22:30 Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 1/3] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-01-26 22:30 UTC (permalink / raw)
  To: tml

The series is on top of master branch and sits in
gorcunov/gh-4730-diag-raise-master-4

Cyrill Gorcunov (3):
  box/request: add missing OutOfMemory diag_set
  box/applier: add missing diag_set on region_alloc failure
  box/applier: fix nil dereference in applier rollback

 src/box/applier.cc | 31 ++++++++++++++++++++++++++-----
 src/box/request.c  |  8 ++++++--
 2 files changed, 32 insertions(+), 7 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Tarantool-patches] [PATCH 1/3] box/request: add missing OutOfMemory diag_set
  2020-01-26 22:30 [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
@ 2020-01-26 22:30 ` Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 2/3] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-01-26 22:30 UTC (permalink / raw)
  To: tml

In request_create_from_tuple and request_handle_sequence
we may be unable to request memory for tuples, don't
forget to setup diag error otherwise diag_raise will
lead to nil dereference.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/request.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/box/request.c b/src/box/request.c
index 82232a155..994f2da62 100644
--- a/src/box/request.c
+++ b/src/box/request.c
@@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space,
 		 * the tuple data to WAL on commit.
 		 */
 		char *buf = region_alloc(&fiber()->gc, size);
-		if (buf == NULL)
+		if (buf == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc", "tuple");
 			return -1;
+		}
 		memcpy(buf, data, size);
 		request->tuple = buf;
 		request->tuple_end = buf + size;
@@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space)
 		size_t buf_size = (request->tuple_end - request->tuple) +
 						mp_sizeof_uint(UINT64_MAX);
 		char *tuple = region_alloc(&fiber()->gc, buf_size);
-		if (tuple == NULL)
+		if (tuple == NULL) {
+			diag_set(OutOfMemory, buf_size, "region_alloc", "tuple");
 			return -1;
+		}
 		char *tuple_end = mp_encode_array(tuple, len);
 
 		if (unlikely(key != data)) {
-- 
2.20.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Tarantool-patches] [PATCH 2/3] box/applier: add missing diag_set on region_alloc failure
  2020-01-26 22:30 [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 1/3] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
@ 2020-01-26 22:30 ` Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
  2020-01-27 16:19 ` [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  3 siblings, 0 replies; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-01-26 22:30 UTC (permalink / raw)
  To: tml

In case if we're hitting memory limit allocating triggers
we should setup diag error to prevent nil dereference
in diag_raise call (for example from applier_apply_tx).

Note that there are region_alloc_xc helpers which are
throwing errors but as far as I understand we need the
rollback action to process first instead of immediate
throw/catch thus we use diag_set.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index ae3d281a5..2ed5125d0 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -796,8 +796,11 @@ applier_apply_tx(struct stailq *rows)
 						     sizeof(struct trigger));
 	on_commit = (struct trigger *)region_alloc(&txn->region,
 						   sizeof(struct trigger));
-	if (on_rollback == NULL || on_commit == NULL)
+	if (on_rollback == NULL || on_commit == NULL) {
+		diag_set(OutOfMemory, sizeof(struct trigger),
+			 "region_alloc", "on_rollback/on_commit");
 		goto rollback;
+	}
 
 	trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 	txn_on_rollback(txn, on_rollback);
-- 
2.20.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-01-26 22:30 [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 1/3] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 2/3] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
@ 2020-01-26 22:30 ` Cyrill Gorcunov
  2020-02-04 22:04   ` Konstantin Osipov
  2020-01-27 16:19 ` [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
  3 siblings, 1 reply; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-01-26 22:30 UTC (permalink / raw)
  To: tml

Currently when transaction rollback happens we just drop an existing
error setting ClientError to the replicaset.applier.diag. This action
leaves current fiber with diag=nil, which in turn leads to sigsegv once
diag_raise() called right after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |       applier_on_rollback
 |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
 |         fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

The applier_f works in try/catch cycle and handles errors depending on
what exactly happened during transaction application. It might reconnect
appliers in some cases, the applier is simply cancelled and reaped out in
others.

The problem is that the shared replicaset.applier.diag is handled on
FiberIsCancelled exception only (while it is set inside transaction
rollback action) and we never trigger this specific exception. But
even if we would the former error which has been causing the applier
abort is vanished by ClientError which is too general.

Thus:

 - on transaction rollback save the origin error which caused
   the transaction abort to the replicaset.applier.diag;

 - trigger FiberIsCancelled exception which will log the
   problem and zap the applier;

 - put fixme mark into the code: we need to figure out
   if underlierd error is really critical one maybe we
   could retry the applier iteration instead.

Part-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..8c6b00f76 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,19 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
-	/* Setup shared applier diagnostic area. */
-	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+
+	/*
+	 * We must not loose the origin error, instead
+	 * lets keep it in replicaset diag instance.
+	 *
+	 * FIXME: We need to revisit this code and
+	 * figure out if we can reconnect and retry
+	 * the prelication process instead of cancelling
+	 * applier with FiberIsCancelled.
+	 */
+	struct error *e = diag_last_error(diag_get());
+	diag_add_error(&replicaset.applier.diag, e);
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
@@ -849,8 +859,15 @@ applier_on_rollback(struct trigger *trigger, void *event)
 		diag_add_error(&applier->diag,
 			       diag_last_error(&replicaset.applier.diag));
 	}
-	/* Stop the applier fiber. */
+
+	/*
+	 * Something really bad happened, we can't proceed
+	 * thus stop the applier and throw FiberIsCancelled
+	 * exception which will be catched by the caller
+	 * and the fiber gracefully finish.
+	 */
 	fiber_cancel(applier->reader);
+	diag_set(FiberIsCancelled);
 	return 0;
 }
 
@@ -1098,6 +1115,7 @@ applier_f(va_list ap)
 		} catch (FiberIsCancelled *e) {
 			if (!diag_is_empty(&applier->diag)) {
 				diag_move(&applier->diag, &fiber()->diag);
+				diag_log();
 				applier_disconnect(applier, APPLIER_STOPPED);
 				break;
 			}
-- 
2.20.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv
  2020-01-26 22:30 [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
@ 2020-01-27 16:19 ` Cyrill Gorcunov
  3 siblings, 0 replies; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-01-27 16:19 UTC (permalink / raw)
  To: tml

On Mon, Jan 27, 2020 at 01:30:20AM +0300, Cyrill Gorcunov wrote:
> The series is on top of master branch and sits in
> gorcunov/gh-4730-diag-raise-master-4
> 
> Cyrill Gorcunov (3):
>   box/request: add missing OutOfMemory diag_set
>   box/applier: add missing diag_set on region_alloc failure
>   box/applier: fix nil dereference in applier rollback

Ignore this serie please, I'll resend a new one once I finish with test...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-01-26 22:30 ` [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
@ 2020-02-04 22:04   ` Konstantin Osipov
  2020-02-05  8:18     ` Cyrill Gorcunov
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Osipov @ 2020-02-04 22:04 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/01/27 10:14]:
> Currently when transaction rollback happens we just drop an existing
> error setting ClientError to the replicaset.applier.diag. This action
> leaves current fiber with diag=nil, which in turn leads to sigsegv once
> diag_raise() called right after applier_apply_tx():
> 
>  | applier_f
>  |   try {
>  |   applier_subscribe
>  |     applier_apply_tx
>  |       // error happens
>  |       txn_rollback
>  |         diag_set(ClientError, ER_WAL_IO)
>  |         diag_move(&fiber()->diag, &replicaset.applier.diag)
>  |         // fiber->diag = nil

>  |       applier_on_rollback
>  |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
>  |         fiber_cancel(applier->reader);
>  |     diag_raise() -> NULL dereference
>  |   } catch { ... }

Where exactly does the error happen in applier_apply_tx?

Looks like this:

>  |         diag_set(ClientError, ER_WAL_IO)
>  |         diag_move(&fiber()->diag, &replicaset.applier.diag)


overwrites the original error. 

Instead, the original error should be preserved and copied to the
shared diagnostics area (replicaset.applier.error).


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-02-04 22:04   ` Konstantin Osipov
@ 2020-02-05  8:18     ` Cyrill Gorcunov
  2020-02-05  9:50       ` Konstantin Osipov
  0 siblings, 1 reply; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-02-05  8:18 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Wed, Feb 05, 2020 at 01:04:30AM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/01/27 10:14]:
> > Currently when transaction rollback happens we just drop an existing
> > error setting ClientError to the replicaset.applier.diag. This action
> > leaves current fiber with diag=nil, which in turn leads to sigsegv once
> > diag_raise() called right after applier_apply_tx():
> > 
> >  | applier_f
> >  |   try {
> >  |   applier_subscribe
> >  |     applier_apply_tx
> >  |       // error happens
> >  |       txn_rollback
> >  |         diag_set(ClientError, ER_WAL_IO)
> >  |         diag_move(&fiber()->diag, &replicaset.applier.diag)
> >  |         // fiber->diag = nil
> 
> >  |       applier_on_rollback
> >  |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag)
> >  |         fiber_cancel(applier->reader);
> >  |     diag_raise() -> NULL dereference
> >  |   } catch { ... }
> 
> Where exactly does the error happen in applier_apply_tx?

The reporter pointed somwhere into a deep dive into vynil, the
problem is that its been runnin release build first time it
triggered. Actually it doesn't matter where exactly it failed,
the only important thing is that it failed the way we need
to run a rollback procedure.

> 
> Looks like this:
> 
> >  |         diag_set(ClientError, ER_WAL_IO)
> >  |         diag_move(&fiber()->diag, &replicaset.applier.diag)
> 
> 
> overwrites the original error. 

True

> 
> Instead, the original error should be preserved and copied to the
> shared diagnostics area (replicaset.applier.error).

Sounds reasonable

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-02-05  8:18     ` Cyrill Gorcunov
@ 2020-02-05  9:50       ` Konstantin Osipov
  2020-02-05 10:12         ` Cyrill Gorcunov
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Osipov @ 2020-02-05  9:50 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 11:21]:
> The reporter pointed somwhere into a deep dive into vynil, the
> problem is that its been runnin release build first time it
> triggered. Actually it doesn't matter where exactly it failed,
> the only important thing is that it failed the way we need
> to run a rollback procedure.

You can easily get and reconstruct a transaction conflict in
engine_prepare() in applier if you run active-active and vinyl.

I'd put the injection in engine_prepare() of vinyl then, to be
closer to the real world scenario.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-02-05  9:50       ` Konstantin Osipov
@ 2020-02-05 10:12         ` Cyrill Gorcunov
  2020-02-05 10:45           ` Konstantin Osipov
  0 siblings, 1 reply; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-02-05 10:12 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml

On Wed, Feb 05, 2020 at 12:50:59PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 11:21]:
> > The reporter pointed somwhere into a deep dive into vynil, the
> > problem is that its been runnin release build first time it
> > triggered. Actually it doesn't matter where exactly it failed,
> > the only important thing is that it failed the way we need
> > to run a rollback procedure.
> 
> You can easily get and reconstruct a transaction conflict in
> engine_prepare() in applier if you run active-active and vinyl.

Could you please clarify the "active-active" term here? I don't
get it yet.

> 
> I'd put the injection in engine_prepare() of vinyl then, to be
> closer to the real world scenario.

I could but you know I think we should step away from backend engine
and assume that error may happen in any engine (memtx, vynil or whatever
else could be here in future). This way allows us to cover any possible
error. Though I don't have a strong opinion here, since you prefer
vynil error I'll try to implement it.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
  2020-02-05 10:12         ` Cyrill Gorcunov
@ 2020-02-05 10:45           ` Konstantin Osipov
  0 siblings, 0 replies; 10+ messages in thread
From: Konstantin Osipov @ 2020-02-05 10:45 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

* Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 13:42]:
> On Wed, Feb 05, 2020 at 12:50:59PM +0300, Konstantin Osipov wrote:
> > * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/05 11:21]:
> > > The reporter pointed somwhere into a deep dive into vynil, the
> > > problem is that its been runnin release build first time it
> > > triggered. Actually it doesn't matter where exactly it failed,
> > > the only important thing is that it failed the way we need
> > > to run a rollback procedure.
> > 
> > You can easily get and reconstruct a transaction conflict in
> > engine_prepare() in applier if you run active-active and vinyl.
> 
> Could you please clarify the "active-active" term here? I don't
> get it yet.

Two replicas, both accepting changes on the same data set.

> > I'd put the injection in engine_prepare() of vinyl then, to be
> > closer to the real world scenario.
> 
> I could but you know I think we should step away from backend engine
> and assume that error may happen in any engine (memtx, vynil or whatever
> else could be here in future). This way allows us to cover any possible
> error. Though I don't have a strong opinion here, since you prefer
> vynil error I'll try to implement it.

txn_prepare() never throws in memtx.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-02-05 10:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-26 22:30 [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
2020-01-26 22:30 ` [Tarantool-patches] [PATCH 1/3] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov
2020-01-26 22:30 ` [Tarantool-patches] [PATCH 2/3] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov
2020-01-26 22:30 ` [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback Cyrill Gorcunov
2020-02-04 22:04   ` Konstantin Osipov
2020-02-05  8:18     ` Cyrill Gorcunov
2020-02-05  9:50       ` Konstantin Osipov
2020-02-05 10:12         ` Cyrill Gorcunov
2020-02-05 10:45           ` Konstantin Osipov
2020-01-27 16:19 ` [Tarantool-patches] [PATCH 0/3] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox