From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Return-Path: Date: Tue, 28 Aug 2018 19:27:41 +0300 From: Sergei Kalashnikov Subject: Re: [tarantool-patches] [PATCH v2] xlog: add request details to panic message for broken LSN Message-ID: <20180828162740.GA11135@localhost.localdomain> References: <1535368944-23713-1-git-send-email-ztarvos@gmail.com> <20180827170407.2igpxsf4zprp7vdu@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180827170407.2igpxsf4zprp7vdu@esperanza> To: Vladimir Davydov Cc: tarantool-patches List-ID: Hi Vladimir, Thank you for review. Please find my answers inline. I also included the fixed patch at the end of this mail. On Mon, Aug 27, 2018 at 08:04:07PM +0300, Vladimir Davydov wrote: > On Mon, Aug 27, 2018 at 02:22:24PM +0300, Sergei Kalashnikov wrote: > > Aid the debugging of replication issues related to out-of-order > > requests. Adds the details of request/tuple to the diagnostic > > message whenever possible. > > > > Closes #3105 > > --- > > https://github.com/tarantool/tarantool/issues/3105 > > https://github.com/tarantool/tarantool/commits/ztarvos/gh-3105-log-request-broken-lsn > > Please write a brief change log here, after the hyperlinks, > when you re-submit a patch, e.g. > > https://www.freelists.org/post/tarantool-patches/PATCH-v3-replication-do-not-ignore-replication-connect-quorum > OK, got it! > > diff --git a/src/box/relay.cc b/src/box/relay.cc > > index 138e66d..2d0b651 100644 > > --- a/src/box/relay.cc > > +++ b/src/box/relay.cc > > @@ -678,4 +678,12 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet) > > packet->replica_id)) { > > relay_send(relay, packet); > > } > > +#ifndef NDEBUG > > You don't really need 'ifndef NDEBUG' here. Please remove. > Done. > > + struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT); > > + if (inj != NULL && packet->lsn == inj->iparam) { > > + relay_send(relay, packet); > > + diag_set(ClientError, ER_INJECTION, > > + "relay broken lsn injection lsn"); > > There's no point in setting diag here. Please remove. At the same time, > I think it would be useful if you logged the LSN of the row you mangled > with say_warn() or say_error(). > Fixed. Added say_warn with the lsn. > Also, IMO the code would be clearer if you modified LSN of the row > before sending it instead of sending two rows. > Done. > > + } > > +#endif > > } > > diff --git a/src/box/xlog.c b/src/box/xlog.c > > index d399a72..46cf7ed 100644 > > --- a/src/box/xlog.c > > +++ b/src/box/xlog.c > > @@ -1246,6 +1246,14 @@ xlog_write_row(struct xlog *log, const struct xrow_header *packet) > > size_t page_offset = obuf_size(&log->obuf); > > /** encode row into iovec */ > > struct iovec iov[XROW_IOVMAX]; > > +#ifndef NDEBUG > > Again, ifndef and diag aren't really needed here while logging would > probably make sense. > Fixed. > > + struct errinj *inj = errinj(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT); > > + if (inj != NULL && inj->iparam == packet->lsn) { > > + ((struct xrow_header *)packet)->lsn = inj->iparam - 1; > > + diag_set(ClientError, ER_INJECTION, > > + "xlog broken lsn injection lsn"); > > + } > > +#endif > > I'd expect to see this error injection in wal.c - xlog_write_row() is > used not only for writing xlogs, but also run and snap files, we don't > want to mangle those. > Moved to wal.c > > /** don't write sync to the disk */ > > int iovcnt = xrow_header_encode(packet, 0, iov, 0); > > if (iovcnt < 0) { > > diff --git a/src/box/xrow.h b/src/box/xrow.h > > index 4721670..f7eda91 100644 > > --- a/src/box/xrow.h > > +++ b/src/box/xrow.h > > @@ -522,6 +522,35 @@ xrow_to_iovec(const struct xrow_header *row, struct iovec *out); > > void > > xrow_decode_error(struct xrow_header *row); > > > > +/** > > + * Update vclock with the next LSN value for given replica id. > > + * The function will cause panic if the next LSN happens to be > > + * out of order. The details of provided row are included into > > + * diagnostic message. > > + * > > + * @param vclock Vector clock. > > + * @param row Data row. > > + * @return Previous LSN value. > > + */ > > +static inline int64_t > > +vclock_follow_xrow(struct vclock* vclock, const struct xrow_header *row) > > +{ > > + assert(row); > > + assert(row->replica_id < VCLOCK_MAX); > > + if (row->lsn <= vclock->lsn[row->replica_id]) { > > + struct request req; > > + xrow_decode_dml((struct xrow_header *)row, &req, 0); > > xrow_decode_dml() may fail, you should handle it. > Agree, fixed. > > + /* Never confirm LSN out of order. */ > > + panic("LSN for %u is used twice or COMMIT order is broken: " > > + "confirmed: %lld, new: %lld, req: %s", > > + (unsigned) row->replica_id, > > + (long long) vclock->lsn[row->replica_id], > > + (long long) row->lsn, > > + request_str(&req)); > > + } > > + return vclock_follow(vclock, row->replica_id, row->lsn); > > +} > > + > > #if defined(__cplusplus) > > } /* extern "C" */ > > > > diff --git a/test/xlog/log_request_broken_lsn_panic.test.lua b/test/xlog/log_request_broken_lsn_panic.test.lua > > The test doesn't pass when run twice in a row: > > yes xlog/log_request_broken_lsn_panic.test.lua | head -n 10 | xargs ./test-run.py -j 1 --long > > Please fix. I corrected the test. My understanding is that the issue stemmed from the fact that I used the xlog/replica.lua script without specifying repl_master parameter and on the second test run, the environment already contained the MASTER variable set from the previous run, so it caused a failed replication attempt. In order to fix this, I replaced the replica.lua script with panic.lua since the latter doesn't expect replication config. > > Also, the test name is rather cumbersome IMHO. What about > xlog/panic_on_broken_lsn.test.lua ? > OK. > > new file mode 100644 > > index 0000000..e0b3295 > > --- /dev/null > > +++ b/test/xlog/log_request_broken_lsn_panic.test.lua > > @@ -0,0 +1,78 @@ > > +-- Issue 3105: Test logging of request with broken lsn before panicking > > +-- Two cases are covered: Recovery and Initial Join > > +env = require('test_run') > > +test_run = env.new() > > + > > +fio = require('fio') > > +test_run:cmd("setopt delimiter ';'") > > +function grep_file_tail(filepath, bytes, pattern) > > + local fh = fio.open(filepath, {'O_RDONLY'}) > > + local size = fh:seek(0, 'SEEK_END') > > + if size < bytes then > > + bytes = size > > + end > > + fh:seek(-bytes, 'SEEK_END') > > + local line = fh:read(bytes) > > + fh:close() > > + return string.match(line, pattern) > > +end; > > +test_run:cmd("setopt delimiter ''"); > > + > > +-- Testing case of panic on recovery > > +test_run:cmd('create server replica with script="xlog/panic_replica.lua"') > > +test_run:cmd('start server replica') > > +test_run:switch('replica') > > + > > +box.space._schema:replace{"t0", "v0"} > > +lsn = box.info.vclock[1] > > +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", lsn + 1) > > +box.space._schema:replace{"t0", "v1"} > > +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", -1) > > + > > +test_run:switch('default') > > +test_run:cmd('stop server replica') > > + > > +dirname = fio.pathjoin(fio.cwd(), "panic_replica") > > +xlogs = fio.glob(dirname .. "/*.xlog") > > +fio.unlink(xlogs[#xlogs]) > > + > > +test_run:cmd('start server replica with crash_expected=True') > > + > > +logpath = fio.pathjoin(fio.cwd(), 'panic_replica.log') > > + > > +-- Check that log contains the mention of broken LSN and the request printout > > +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}") > > + > > +test_run:cmd('cleanup server replica') > > +test_run:cmd('delete server replica') > > + > > +-- Testing case of panic on initial join > > In tarantool the process of joining a new replica consists of two > phases: initial join, when the last checkpoint is sent, and final join, > when xlogs are sent. So this is final join, actually. Let's use phrase > 'joining a new replica' in comments to avoid confusion. > OK. > > +test_run:cmd('create server leader with script="xlog/replica.lua"') > > Why do you need a separate master? Why can't you simply bootstrap a > replica from the default instance? Then you wouldn't need to add the > new script, xlog/panic_replica.lua. I surmise this is because you need > to access box.info.replication below while on the default server the > replica doesn't necessarily have id 2. If this is the reason, then you > might want to look at test_run.cleanup_cluster(). > In the fixed patch, I removed the extra script xlog/panic_replica.lua. I also added the default server restart to the beginning and the very end of the test. I expect this will guarantee that replica will always have the id 2. > > +test_run:cmd("start server leader") > > +test_run:switch('leader') > > + > > +box.schema.user.grant('guest', 'super') > > box.schema.user.grant('guest', 'replication') > > > +_ = box.schema.space.create('test') > > +_ = box.space.test:create_index('pk') > > + > > +box.space.test:auto_increment{'v0'} > > +lsn = box.info.vclock[1] > > +box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1) > > +box.space.test:auto_increment{'v1'} > > + > > +test_run:cmd('create server replica with rpl_master=leader, script="xlog/panic_replica.lua"') > > +test_run:cmd('start server replica with crash_expected=True') > > +fiber = require('fiber') > > +while box.info.replication == nil do fiber.sleep(0.001) end > > +while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end > > + > > +test_run:switch('default') > > +-- Check that log contains the mention of broken LSN and the request printout > > +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 6, new: 6, req: {.*}") > > + > > +test_run:cmd('stop server leader') > > +test_run:cmd('cleanup server leader') > > +test_run:cmd('delete server leader') > > + > > +test_run:cmd('cleanup server replica') > > +test_run:cmd('delete server replica') Please find the fixed patch below. ============================================================================ >From 8299d697f0dff8d552c84d1450f70cf5fbde7e55 Mon Sep 17 00:00:00 2001 From: Sergei Kalashnikov Date: Tue, 28 Aug 2018 18:05:34 +0300 Subject: [PATCH v2] xlog: add request details to panic message for broken LSN Aid the debugging of replication issues related to out-of-order requests. Adds the details of request/tuple to the diagnostic message whenever possible. Closes #3105 --- https://github.com/tarantool/tarantool/issues/3105 https://github.com/tarantool/tarantool/commits/ztarvos/gh-3105-log-request-broken-lsn Changes in v2: - Removed vclock_follow_unsafe() - Renamed the new function vclock_follow_xrow(), removed extra params and moved it to the header. - Added error injections and simplified the test. - Added test for replica join case. - Removed trailing spaces. src/box/applier.cc | 6 +- src/box/recovery.cc | 4 +- src/box/relay.cc | 6 ++ src/box/vclock.c | 8 +- src/box/vclock.h | 8 ++ src/box/wal.c | 13 ++- src/box/xrow.h | 31 ++++++ src/errinj.h | 2 + test/box/errinj.result | 4 + test/xlog/panic_on_broken_lsn.result | 168 +++++++++++++++++++++++++++++++++ test/xlog/panic_on_broken_lsn.test.lua | 72 ++++++++++++++ test/xlog/suite.ini | 2 +- 12 files changed, 305 insertions(+), 19 deletions(-) create mode 100644 test/xlog/panic_on_broken_lsn.result create mode 100644 test/xlog/panic_on_broken_lsn.test.lua diff --git a/src/box/applier.cc b/src/box/applier.cc index dbb4d05..5d87867 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -336,8 +336,7 @@ applier_join(struct applier *applier) coio_read_xrow(coio, ibuf, &row); applier->last_row_time = ev_monotonic_now(loop()); if (iproto_type_is_dml(row.type)) { - vclock_follow(&replicaset.vclock, row.replica_id, - row.lsn); + vclock_follow_xrow(&replicaset.vclock, &row); xstream_write_xc(applier->subscribe_stream, &row); } else if (row.type == IPROTO_OK) { /* @@ -503,8 +502,7 @@ applier_subscribe(struct applier *applier) * the row is skipped when the replication * is resumed. */ - vclock_follow(&replicaset.vclock, row.replica_id, - row.lsn); + vclock_follow_xrow(&replicaset.vclock, &row); struct replica *replica = replica_by_id(row.replica_id); struct latch *latch = (replica ? &replica->order_latch : &replicaset.applier.order_latch); diff --git a/src/box/recovery.cc b/src/box/recovery.cc index fe14def..77357f0 100644 --- a/src/box/recovery.cc +++ b/src/box/recovery.cc @@ -135,7 +135,7 @@ recovery_scan(struct recovery *r, struct vclock *end_vclock) return; struct xrow_header row; while (xlog_cursor_next(&cursor, &row, true) == 0) - vclock_follow(end_vclock, row.replica_id, row.lsn); + vclock_follow_xrow(end_vclock, &row); xlog_cursor_close(&cursor, false); } @@ -266,7 +266,7 @@ recover_xlog(struct recovery *r, struct xstream *stream, * in case of forced recovery, when we skip the * failed row anyway. */ - vclock_follow(&r->vclock, row.replica_id, row.lsn); + vclock_follow_xrow(&r->vclock, &row); if (xstream_write(stream, &row) == 0) { ++row_count; if (row_count % 100000 == 0) diff --git a/src/box/relay.cc b/src/box/relay.cc index 138e66d..9f07655 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -676,6 +676,12 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet) if (packet->replica_id != relay->replica->id || packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe, packet->replica_id)) { + struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN, + ERRINJ_INT); + if (inj != NULL && packet->lsn == inj->iparam) { + packet->lsn = inj->iparam - 1; + say_warn("injected broken lsn: %lld", packet->lsn); + } relay_send(relay, packet); } } diff --git a/src/box/vclock.c b/src/box/vclock.c index 2189ea8..b5eb280 100644 --- a/src/box/vclock.c +++ b/src/box/vclock.c @@ -42,13 +42,7 @@ vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn) assert(lsn >= 0); assert(replica_id < VCLOCK_MAX); int64_t prev_lsn = vclock->lsn[replica_id]; - if (lsn <= prev_lsn) { - /* Never confirm LSN out of order. */ - panic("LSN for %u is used twice or COMMIT order is broken: " - "confirmed: %lld, new: %lld", - (unsigned) replica_id, - (long long) prev_lsn, (long long) lsn); - } + assert(lsn > prev_lsn); /* Easier add each time than check. */ vclock->map |= 1 << replica_id; vclock->lsn[replica_id] = lsn; diff --git a/src/box/vclock.h b/src/box/vclock.h index 928717c..111e291 100644 --- a/src/box/vclock.h +++ b/src/box/vclock.h @@ -199,6 +199,14 @@ vclock_sum(const struct vclock *vclock) return vclock->signature; } +/** + * Update vclock with the next LSN value for given replica id. + * + * @param vclock Vector clock. + * @param replica_id Replica identifier. + * @param lsn Next lsn. + * @return previous lsn value. + */ int64_t vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn); diff --git a/src/box/wal.c b/src/box/wal.c index cd068a4..ade2c71 100644 --- a/src/box/wal.c +++ b/src/box/wal.c @@ -186,6 +186,11 @@ xlog_write_entry(struct xlog *l, struct journal_entry *entry) struct xrow_header **row = entry->rows; for (; row < entry->rows + entry->n_rows; row++) { (*row)->tm = ev_now(loop()); + struct errinj *inj = errinj(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT); + if (inj != NULL && inj->iparam == (*row)->lsn) { + (*row)->lsn = inj->iparam - 1; + say_warn("injected broken lsn: %lld", (*row)->lsn); + } if (xlog_write_row(l, *row) < 0) { /* * Rollback all un-written rows @@ -652,8 +657,7 @@ wal_assign_lsn(struct wal_writer *writer, struct xrow_header **row, (*row)->lsn = vclock_inc(&writer->vclock, instance_id); (*row)->replica_id = instance_id; } else { - vclock_follow(&writer->vclock, (*row)->replica_id, - (*row)->lsn); + vclock_follow_xrow(&writer->vclock, *row); } } } @@ -878,9 +882,8 @@ wal_write(struct journal *journal, struct journal_entry *entry) */ if (vclock_get(&replicaset.vclock, instance_id) < (*last)->lsn) { - vclock_follow(&replicaset.vclock, - instance_id, - (*last)->lsn); + vclock_follow_xrow(&replicaset.vclock, + *last); } break; } diff --git a/src/box/xrow.h b/src/box/xrow.h index 4721670..3fc007a 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -522,6 +522,37 @@ xrow_to_iovec(const struct xrow_header *row, struct iovec *out); void xrow_decode_error(struct xrow_header *row); +/** + * Update vclock with the next LSN value for given replica id. + * The function will cause panic if the next LSN happens to be + * out of order. The details of provided row are included into + * diagnostic message. + * + * @param vclock Vector clock. + * @param row Data row. + * @return Previous LSN value. + */ +static inline int64_t +vclock_follow_xrow(struct vclock* vclock, const struct xrow_header *row) +{ + assert(row); + assert(row->replica_id < VCLOCK_MAX); + if (row->lsn <= vclock->lsn[row->replica_id]) { + struct request req; + const char *req_str = "n/a"; + if (xrow_decode_dml((struct xrow_header *)row, &req, 0) == 0) + req_str = request_str(&req); + /* Never confirm LSN out of order. */ + panic("LSN for %u is used twice or COMMIT order is broken: " + "confirmed: %lld, new: %lld, req: %s", + (unsigned) row->replica_id, + (long long) vclock->lsn[row->replica_id], + (long long) row->lsn, + req_str); + } + return vclock_follow(vclock, row->replica_id, row->lsn); +} + #if defined(__cplusplus) } /* extern "C" */ diff --git a/src/errinj.h b/src/errinj.h index 64d13b0..c658cd7 100644 --- a/src/errinj.h +++ b/src/errinj.h @@ -118,6 +118,8 @@ struct errinj { _(ERRINJ_VY_LOG_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \ _(ERRINJ_VY_RUN_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \ _(ERRINJ_VY_INDEX_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \ + _(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \ + _(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \ ENUM0(errinj_id, ERRINJ_LIST); extern struct errinj errinjs[]; diff --git a/test/box/errinj.result b/test/box/errinj.result index c7e4ce2..dca2c2d 100644 --- a/test/box/errinj.result +++ b/test/box/errinj.result @@ -16,6 +16,8 @@ errinj.info() state: 0 ERRINJ_WAL_WRITE: state: false + ERRINJ_XLOG_BREAK_LSN: + state: -1 ERRINJ_HTTPC_EXECUTE: state: false ERRINJ_VYRUN_DATA_READ: @@ -60,6 +62,8 @@ errinj.info() state: false ERRINJ_WAL_WRITE_DISK: state: false + ERRINJ_RELAY_BREAK_LSN: + state: -1 ERRINJ_VY_LOG_FILE_RENAME: state: false ERRINJ_VY_RUN_WRITE: diff --git a/test/xlog/panic_on_broken_lsn.result b/test/xlog/panic_on_broken_lsn.result new file mode 100644 index 0000000..b14514b --- /dev/null +++ b/test/xlog/panic_on_broken_lsn.result @@ -0,0 +1,168 @@ +-- Issue 3105: Test logging of request with broken lsn before panicking +-- Two cases are covered: Recovery and Joining a new replica +env = require('test_run') +--- +... +test_run = env.new() +--- +... +test_run:cmd('restart server default with cleanup=True') +fio = require('fio') +--- +... +test_run:cmd("setopt delimiter ';'") +--- +- true +... +function grep_file_tail(filepath, bytes, pattern) + local fh = fio.open(filepath, {'O_RDONLY'}) + local size = fh:seek(0, 'SEEK_END') + if size < bytes then + bytes = size + end + fh:seek(-bytes, 'SEEK_END') + local line = fh:read(bytes) + fh:close() + return string.match(line, pattern) +end; +--- +... +test_run:cmd("setopt delimiter ''"); +--- +- true +... +-- Testing case of panic on recovery +test_run:cmd('create server panic with script="xlog/panic.lua"') +--- +- true +... +test_run:cmd('start server panic') +--- +- true +... +test_run:switch('panic') +--- +- true +... +box.space._schema:replace{"t0", "v0"} +--- +- ['t0', 'v0'] +... +lsn = box.info.vclock[1] +--- +... +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", lsn + 1) +--- +- ok +... +box.space._schema:replace{"t0", "v1"} +--- +- ['t0', 'v1'] +... +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", -1) +--- +- ok +... +test_run:switch('default') +--- +- true +... +test_run:cmd('stop server panic') +--- +- true +... +dirname = fio.pathjoin(fio.cwd(), "panic") +--- +... +xlogs = fio.glob(dirname .. "/*.xlog") +--- +... +fio.unlink(xlogs[#xlogs]) +--- +- true +... +test_run:cmd('start server panic with crash_expected=True') +--- +- false +... +logpath = fio.pathjoin(fio.cwd(), 'panic.log') +--- +... +-- Check that log contains the mention of broken LSN and the request printout +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}") +--- +- 'LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {type: + ''REPLACE'', lsn: 1, space_id: 272, index_id: 0, tuple: ["t0", "v1"]}' +... +test_run:cmd('cleanup server panic') +--- +- true +... +test_run:cmd('delete server panic') +--- +- true +... +-- Testing case of panic on joining a new replica +box.schema.user.grant('guest', 'super') +--- +... +_ = box.schema.space.create('test') +--- +... +_ = box.space.test:create_index('pk') +--- +... +box.space.test:auto_increment{'v0'} +--- +- [1, 'v0'] +... +lsn = box.info.vclock[1] +--- +... +box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1) +--- +- ok +... +box.space.test:auto_increment{'v1'} +--- +- [2, 'v1'] +... +test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"') +--- +- true +... +test_run:cmd('start server replica with crash_expected=True') +--- +- false +... +fiber = require('fiber') +--- +... +while box.info.replication == nil do fiber.sleep(0.001) end +--- +... +while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end +--- +... +test_run:switch('default') +--- +- true +... +logpath = fio.pathjoin(fio.cwd(), 'replica.log') +--- +... +-- Check that log contains the mention of broken LSN and the request printout +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 5, new: 5, req: {.*}") +--- +- 'LSN for 1 is used twice or COMMIT order is broken: confirmed: 5, new: 5, req: {type: + ''INSERT'', lsn: 5, space_id: 512, index_id: 0, tuple: [2, "v1"]}' +... +test_run:cmd('cleanup server replica') +--- +- true +... +test_run:cmd('delete server replica') +--- +- true +... +test_run:cmd('restart server default with cleanup=True') diff --git a/test/xlog/panic_on_broken_lsn.test.lua b/test/xlog/panic_on_broken_lsn.test.lua new file mode 100644 index 0000000..d4abf6f --- /dev/null +++ b/test/xlog/panic_on_broken_lsn.test.lua @@ -0,0 +1,72 @@ +-- Issue 3105: Test logging of request with broken lsn before panicking +-- Two cases are covered: Recovery and Joining a new replica +env = require('test_run') +test_run = env.new() +test_run:cmd('restart server default with cleanup=True') + +fio = require('fio') +test_run:cmd("setopt delimiter ';'") +function grep_file_tail(filepath, bytes, pattern) + local fh = fio.open(filepath, {'O_RDONLY'}) + local size = fh:seek(0, 'SEEK_END') + if size < bytes then + bytes = size + end + fh:seek(-bytes, 'SEEK_END') + local line = fh:read(bytes) + fh:close() + return string.match(line, pattern) +end; +test_run:cmd("setopt delimiter ''"); + +-- Testing case of panic on recovery +test_run:cmd('create server panic with script="xlog/panic.lua"') +test_run:cmd('start server panic') +test_run:switch('panic') + +box.space._schema:replace{"t0", "v0"} +lsn = box.info.vclock[1] +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", lsn + 1) +box.space._schema:replace{"t0", "v1"} +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", -1) + +test_run:switch('default') +test_run:cmd('stop server panic') + +dirname = fio.pathjoin(fio.cwd(), "panic") +xlogs = fio.glob(dirname .. "/*.xlog") +fio.unlink(xlogs[#xlogs]) + +test_run:cmd('start server panic with crash_expected=True') + +logpath = fio.pathjoin(fio.cwd(), 'panic.log') + +-- Check that log contains the mention of broken LSN and the request printout +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}") + +test_run:cmd('cleanup server panic') +test_run:cmd('delete server panic') + +-- Testing case of panic on joining a new replica +box.schema.user.grant('guest', 'super') +_ = box.schema.space.create('test') +_ = box.space.test:create_index('pk') +box.space.test:auto_increment{'v0'} +lsn = box.info.vclock[1] +box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1) +box.space.test:auto_increment{'v1'} + +test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"') +test_run:cmd('start server replica with crash_expected=True') +fiber = require('fiber') +while box.info.replication == nil do fiber.sleep(0.001) end +while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end + +test_run:switch('default') +logpath = fio.pathjoin(fio.cwd(), 'replica.log') +-- Check that log contains the mention of broken LSN and the request printout +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 5, new: 5, req: {.*}") + +test_run:cmd('cleanup server replica') +test_run:cmd('delete server replica') +test_run:cmd('restart server default with cleanup=True') diff --git a/test/xlog/suite.ini b/test/xlog/suite.ini index 8ee9720..ffc812e 100644 --- a/test/xlog/suite.ini +++ b/test/xlog/suite.ini @@ -4,7 +4,7 @@ description = tarantool write ahead log tests script = xlog.lua disabled = snap_io_rate.test.lua valgrind_disabled = -release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua +release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua panic_on_broken_lsn.test.lua config = suite.cfg use_unix_sockets = True long_run = snap_io_rate.test.lua -- 1.8.3.1