Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Sergei Kalashnikov <ztarvos@gmail.com>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] [PATCH v2] xlog: add request details to panic message for broken LSN
Date: Mon, 27 Aug 2018 20:04:07 +0300	[thread overview]
Message-ID: <20180827170407.2igpxsf4zprp7vdu@esperanza> (raw)
In-Reply-To: <1535368944-23713-1-git-send-email-ztarvos@gmail.com>

On Mon, Aug 27, 2018 at 02:22:24PM +0300, Sergei Kalashnikov wrote:
> Aid the debugging of replication issues related to out-of-order
> requests. Adds the details of request/tuple to the diagnostic
> message whenever possible.
> 
> Closes #3105
> ---
> https://github.com/tarantool/tarantool/issues/3105
> https://github.com/tarantool/tarantool/commits/ztarvos/gh-3105-log-request-broken-lsn

Please write a brief change log here, after the hyperlinks,
when you re-submit a patch, e.g.

https://www.freelists.org/post/tarantool-patches/PATCH-v3-replication-do-not-ignore-replication-connect-quorum

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 138e66d..2d0b651 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -678,4 +678,12 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
>  				      packet->replica_id)) {
>  		relay_send(relay, packet);
>  	}
> +#ifndef NDEBUG

You don't really need 'ifndef NDEBUG' here. Please remove.

> +	struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT);
> +	if (inj != NULL && packet->lsn == inj->iparam) {
> +		relay_send(relay, packet);
> +		diag_set(ClientError, ER_INJECTION,
> +			 "relay broken lsn injection lsn");

There's no point in setting diag here. Please remove. At the same time,
I think it would be useful if you logged the LSN of the row you mangled
with say_warn() or say_error().

Also, IMO the code would be clearer if you modified LSN of the row
before sending it instead of sending two rows.

> +	}
> +#endif
>  }
> diff --git a/src/box/xlog.c b/src/box/xlog.c
> index d399a72..46cf7ed 100644
> --- a/src/box/xlog.c
> +++ b/src/box/xlog.c
> @@ -1246,6 +1246,14 @@ xlog_write_row(struct xlog *log, const struct xrow_header *packet)
>  	size_t page_offset = obuf_size(&log->obuf);
>  	/** encode row into iovec */
>  	struct iovec iov[XROW_IOVMAX];
> +#ifndef NDEBUG

Again, ifndef and diag aren't really needed here while logging would
probably make sense.

> +	struct errinj *inj = errinj(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT);
> +	if (inj != NULL && inj->iparam == packet->lsn) {
> +		((struct xrow_header *)packet)->lsn = inj->iparam - 1;
> +		diag_set(ClientError, ER_INJECTION,
> +			 "xlog broken lsn injection lsn");
> +	}
> +#endif

I'd expect to see this error injection in wal.c - xlog_write_row() is
used not only for writing xlogs, but also run and snap files, we don't
want to mangle those.

>  	/** don't write sync to the disk */
>  	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
>  	if (iovcnt < 0) {
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index 4721670..f7eda91 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -522,6 +522,35 @@ xrow_to_iovec(const struct xrow_header *row, struct iovec *out);
>  void
>  xrow_decode_error(struct xrow_header *row);
>  
> +/**
> + * Update vclock with the next LSN value for given replica id.
> + * The function will cause panic if the next LSN happens to be
> + * out of order. The details of provided row are included into
> + * diagnostic message.
> + *
> + * @param vclock Vector clock.
> + * @param row Data row.
> + * @return Previous LSN value.
> + */
> +static inline int64_t
> +vclock_follow_xrow(struct vclock* vclock, const struct xrow_header *row)
> +{
> +	assert(row);
> +	assert(row->replica_id < VCLOCK_MAX);
> +	if (row->lsn <= vclock->lsn[row->replica_id]) {
> +		struct request req;
> +		xrow_decode_dml((struct xrow_header *)row, &req, 0);

xrow_decode_dml() may fail, you should handle it.

> +		/* Never confirm LSN out of order. */
> +		panic("LSN for %u is used twice or COMMIT order is broken: "
> +		      "confirmed: %lld, new: %lld, req: %s",
> +		      (unsigned) row->replica_id,
> +		      (long long) vclock->lsn[row->replica_id],
> +		      (long long) row->lsn,
> +		      request_str(&req));
> +	}
> +	return vclock_follow(vclock, row->replica_id, row->lsn);
> +}
> +
>  #if defined(__cplusplus)
>  } /* extern "C" */
>  
> diff --git a/test/xlog/log_request_broken_lsn_panic.test.lua b/test/xlog/log_request_broken_lsn_panic.test.lua

The test doesn't pass when run twice in a row:

  yes xlog/log_request_broken_lsn_panic.test.lua | head -n 10 | xargs ./test-run.py -j 1 --long

Please fix.

Also, the test name is rather cumbersome IMHO. What about
xlog/panic_on_broken_lsn.test.lua ?

> new file mode 100644
> index 0000000..e0b3295
> --- /dev/null
> +++ b/test/xlog/log_request_broken_lsn_panic.test.lua
> @@ -0,0 +1,78 @@
> +-- Issue 3105: Test logging of request with broken lsn before panicking
> +-- Two cases are covered: Recovery and Initial Join
> +env = require('test_run')
> +test_run = env.new()
> +
> +fio = require('fio')
> +test_run:cmd("setopt delimiter ';'")
> +function grep_file_tail(filepath, bytes, pattern)
> +    local fh = fio.open(filepath, {'O_RDONLY'})
> +    local size = fh:seek(0, 'SEEK_END')
> +    if size < bytes then
> +        bytes = size
> +    end
> +    fh:seek(-bytes, 'SEEK_END')
> +    local line = fh:read(bytes)
> +    fh:close()
> +    return string.match(line, pattern)
> +end;
> +test_run:cmd("setopt delimiter ''");
> +
> +-- Testing case of panic on recovery
> +test_run:cmd('create server replica with script="xlog/panic_replica.lua"')
> +test_run:cmd('start server replica')
> +test_run:switch('replica')
> +
> +box.space._schema:replace{"t0", "v0"}
> +lsn = box.info.vclock[1]
> +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", lsn + 1)
> +box.space._schema:replace{"t0", "v1"}
> +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", -1)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +
> +dirname = fio.pathjoin(fio.cwd(), "panic_replica")
> +xlogs = fio.glob(dirname .. "/*.xlog")
> +fio.unlink(xlogs[#xlogs])
> +
> +test_run:cmd('start server replica with crash_expected=True')
> +
> +logpath = fio.pathjoin(fio.cwd(), 'panic_replica.log')
> +
> +-- Check that log contains the mention of broken LSN and the request printout
> +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}")
> +
> +test_run:cmd('cleanup server replica')
> +test_run:cmd('delete server replica')
> +
> +-- Testing case of panic on initial join

In tarantool the process of joining a new replica consists of two
phases: initial join, when the last checkpoint is sent, and final join,
when xlogs are sent. So this is final join, actually. Let's use phrase
'joining a new replica' in comments to avoid confusion.

> +test_run:cmd('create server leader with script="xlog/replica.lua"')

Why do you need a separate master? Why can't you simply bootstrap a
replica from the default instance? Then you wouldn't need to add the
new script, xlog/panic_replica.lua. I surmise this is because you need
to access box.info.replication below while on the default server the
replica doesn't necessarily have id 2. If this is the reason, then you
might want to look at test_run.cleanup_cluster().

> +test_run:cmd("start server leader")
> +test_run:switch('leader')
> +
> +box.schema.user.grant('guest', 'super')

box.schema.user.grant('guest', 'replication')

> +_ = box.schema.space.create('test')
> +_ = box.space.test:create_index('pk')
> +
> +box.space.test:auto_increment{'v0'}
> +lsn = box.info.vclock[1]
> +box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
> +box.space.test:auto_increment{'v1'}
> +
> +test_run:cmd('create server replica with rpl_master=leader, script="xlog/panic_replica.lua"')
> +test_run:cmd('start server replica with crash_expected=True')
> +fiber = require('fiber')
> +while box.info.replication == nil do fiber.sleep(0.001) end
> +while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end
> +
> +test_run:switch('default')
> +-- Check that log contains the mention of broken LSN and the request printout
> +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 6, new: 6, req: {.*}")
> +
> +test_run:cmd('stop server leader')
> +test_run:cmd('cleanup server leader')
> +test_run:cmd('delete server leader')
> +
> +test_run:cmd('cleanup server replica')
> +test_run:cmd('delete server replica')

  reply	other threads:[~2018-08-27 17:04 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-27 11:22 Sergei Kalashnikov
2018-08-27 17:04 ` Vladimir Davydov [this message]
2018-08-28 16:27   ` Sergei Kalashnikov
2018-08-29  9:52     ` Vladimir Davydov
2018-08-29 15:21       ` Sergei Kalashnikov
2018-08-29 16:31         ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180827170407.2igpxsf4zprp7vdu@esperanza \
    --to=vdavydov.dev@gmail.com \
    --cc=tarantool-patches@freelists.org \
    --cc=ztarvos@gmail.com \
    --subject='Re: [tarantool-patches] [PATCH v2] xlog: add request details to panic message for broken LSN' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox