Tarantool development patches archive
From: Cyrill Gorcunov <gorcunov@gmail.com>
To: Konstantin Osipov <kostja.osipov@gmail.com>
Cc: tml <tarantool-patches@dev.tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
Date: Wed, 5 Feb 2020 11:27:21 +0300
Message-ID: <20200205082721.GJ12445@uranus>
In-Reply-To: <20200204221110.GC20146@atlas>

On Wed, Feb 05, 2020 at 01:11:10AM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov <gorcunov@gmail.com> [20/01/28 10:16]:
> > +
> > +	/*
> > +	 * We must not lose the origin error, instead
> > +	 * let's keep it in replicaset diag instance.
> > +	 *
> > +	 * FIXME: We need to revisit this code and
> > +	 * figure out if we can reconnect and retry
> > +	 * the replication process instead of cancelling
> > +	 * applier with FiberIsCancelled.
> 
> First of all, we're dealing with a regression introduced by the 
> parallel applier patch. 
> 
> Could you please describe what is triggering the error?

Kostya, this is a vague area: we only managed to narrow it down to
the issue appearing once the rollback procedure has started
(I think the real trigger was somewhere inside vinyl processing,
 for example a B+ tree failure).

> 
> > +		/*
> > +		 * If information is already lost
> > +		 * (say xlog cleared diag instance)
> 
> I don't understand this comment. How can it be lost exactly?

Hmmm, I think you're right. Actually, unweaving all the possible
call traces by hand (which I had to do) is quite an exhausting task,
so I might be wrong here.

> 
> > +		 * setup general ClientError, seriously
> > +		 * we need to unweave this mess, if error
> > +		 * happened it must never be cleared
> > +		 * until error handling in rollback.
> 
> :-/
> 
> > +		 */
> > +		diag_set(ClientError, ER_WAL_IO);
> > +		e = diag_last_error(diag_get());
> > +	}
> > +	diag_add_error(&replicaset.applier.diag, e);
> > +
> >  	/* Broadcast the rollback event across all appliers. */
> >  	trigger_run(&replicaset.applier.on_rollback, event);
> >  	/* Rollback applier vclock to the committed one. */
> > @@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event)
> >  		diag_add_error(&applier->diag,
> >  			       diag_last_error(&replicaset.applier.diag));
> >  	}
> > -	/* Stop the applier fiber. */
> > +
> > +	/*
> > +	 * Something really bad happened, we can't proceed
> > +	 * thus stop the applier and throw FiberIsCancelled
> > +	 * exception which will be caught by the caller
> > +	 * and the fiber gracefully finishes.
> > +	 *
> > +	 * FIXME: Need to make sure that this is a really
> > +	 * final error where we can no longer proceed and should
> > +	 * zap the applier, probably we could reconnect and
> > +	 * retry instead?
> > +	 */
> >  	fiber_cancel(applier->reader);
> 
> Let's begin by explaining why we need to cancel the reader fiber here.

This fiber_cancel was already here; I only added diag_set(FiberIsCancelled)
to throw an exception so that the caller would zap this applier fiber.
Actually, I think we could retry instead of reaping the fiber
completely, but that requires a deeper understanding of how the applier
works. So I left it as a comment.

> 
> 
> 
> 
> > +	diag_set(FiberIsCancelled);
> 
> This is clearly a crutch: first you make an effort to set
> replicaset.applier.diag and then it is not used by diag_raise().

Not exactly. If I understand the initial logic of this applier
try/catch branch, we need to set up replicaset.applier.diag and
then, on FiberIsCancelled, move the error from replicaset.applier.diag
back to the current fiber->diag.
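
To make the intended flow a bit more concrete, here is a minimal
sketch of it as a toy model. It is not the real applier code: the
toy_error/toy_diag types and all toy_* functions are hypothetical
stand-ins invented for illustration, and the real diag API works
differently (reference counting, per-fiber storage and so on). It
only shows the idea: the rollback path stashes the origin error in
a shared diag and marks the fiber as cancelled, and the catch path
moves the origin error back into the fiber's own diag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for illustration only, NOT the Tarantool types. */
struct toy_error {
	const char *type;
	const char *msg;
};

struct toy_diag {
	const struct toy_error *last;
};

static const struct toy_error TOY_CANCELLED = {
	"FiberIsCancelled", "fiber is cancelled"
};

static void
toy_diag_set(struct toy_diag *diag, const struct toy_error *e)
{
	diag->last = e;
}

static const struct toy_error *
toy_diag_last(const struct toy_diag *diag)
{
	return diag->last;
}

/*
 * Rollback path: keep the origin error in the shared diag,
 * then mark the fiber's own diag as cancelled.
 */
static void
toy_on_rollback(struct toy_diag *shared, struct toy_diag *fiber,
		const struct toy_error *origin)
{
	toy_diag_set(shared, origin);
	toy_diag_set(fiber, &TOY_CANCELLED);
}

/*
 * Catch path: if the fiber was cancelled and the shared diag
 * holds an origin error, move that error back to the fiber diag.
 */
static void
toy_on_catch(struct toy_diag *shared, struct toy_diag *fiber)
{
	const struct toy_error *e = toy_diag_last(fiber);
	if (e != NULL && strcmp(e->type, "FiberIsCancelled") == 0 &&
	    toy_diag_last(shared) != NULL)
		toy_diag_set(fiber, toy_diag_last(shared));
}

/* Walk through both steps and return the final error type. */
static const char *
toy_demo(void)
{
	static const struct toy_error wal_io = {
		"ER_WAL_IO", "failed to write to disk"
	};
	struct toy_diag shared = {NULL};
	struct toy_diag fiber = {NULL};

	toy_on_rollback(&shared, &fiber, &wal_io);
	/* Right after rollback the fiber only sees the cancel marker. */
	assert(toy_diag_last(&fiber) == &TOY_CANCELLED);

	toy_on_catch(&shared, &fiber);
	return toy_diag_last(&fiber)->type;
}
```

In this model the caller ends up seeing the origin error rather
than the cancel marker, which is the behaviour described above.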
