[tarantool-patches] [PATCH v2] xlog: add request details to panic message for broken LSN
Sergei Kalashnikov
ztarvos at gmail.com
Wed Aug 29 18:21:15 MSK 2018
On Wed, Aug 29, 2018 at 12:52:47PM +0300, Vladimir Davydov wrote:
> On Tue, Aug 28, 2018 at 07:27:41PM +0300, Sergei Kalashnikov wrote:
> > diff --git a/src/box/relay.cc b/src/box/relay.cc
> > index 138e66d..9f07655 100644
> > --- a/src/box/relay.cc
> > +++ b/src/box/relay.cc
> > @@ -676,6 +676,12 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
> > if (packet->replica_id != relay->replica->id ||
> > packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
> > packet->replica_id)) {
> > + struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
> > + ERRINJ_INT);
> > + if (inj != NULL && packet->lsn == inj->iparam) {
> > + packet->lsn = inj->iparam - 1;
> > + say_warn("injected broken lsn: %lld", packet->lsn);
>
> Nit: int64_t isn't guaranteed to be equivalent to long long. Please
> converse packet->lsn to long long explicitly.
>
OK, fixed.
> > + }
> > relay_send(relay, packet);
> > }
> > }
> > diff --git a/src/box/wal.c b/src/box/wal.c
> > index cd068a4..ade2c71 100644
> > --- a/src/box/wal.c
> > +++ b/src/box/wal.c
> > @@ -186,6 +186,11 @@ xlog_write_entry(struct xlog *l, struct journal_entry *entry)
> > struct xrow_header **row = entry->rows;
> > for (; row < entry->rows + entry->n_rows; row++) {
> > (*row)->tm = ev_now(loop());
> > + struct errinj *inj = errinj(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT);
> > + if (inj != NULL && inj->iparam == (*row)->lsn) {
> > + (*row)->lsn = inj->iparam - 1;
> > + say_warn("injected broken lsn: %lld", (*row)->lsn);
>
> Ditto.
>
Fixed.
> > + }
> > if (xlog_write_row(l, *row) < 0) {
> > /*
> > * Rollback all un-written rows
> > diff --git a/src/errinj.h b/src/errinj.h
> > index 64d13b0..c658cd7 100644
> > --- a/src/errinj.h
> > +++ b/src/errinj.h
> > @@ -118,6 +118,8 @@ struct errinj {
> > _(ERRINJ_VY_LOG_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
> > _(ERRINJ_VY_RUN_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
> > _(ERRINJ_VY_INDEX_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
> > + _(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
> > + _(ERRINJ_XLOG_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
>
> Nit: please rename ERRINJ_XLOG_BREAK_LSN to ERRINJ_WAL_BREAK_LSN -
> it's used by the WAL thread, after all.
>
Done.
> > diff --git a/test/xlog/panic_on_broken_lsn.test.lua b/test/xlog/panic_on_broken_lsn.test.lua
> > new file mode 100644
> > index 0000000..d4abf6f
> > --- /dev/null
> > +++ b/test/xlog/panic_on_broken_lsn.test.lua
> > @@ -0,0 +1,72 @@
> > +-- Issue 3105: Test logging of request with broken lsn before panicking
> > +-- Two cases are covered: Recovery and Joining a new replica
> > +env = require('test_run')
> > +test_run = env.new()
> > +test_run:cmd('restart server default with cleanup=True')
>
> We strive to make our tests run as fast as possible, that's why we don't
> restart the default instance before/after each test. Instead we clean up
> only what we have to. In your case, it means revoking privileges you
> granted, dropping spaces you created, and disabling error injections you
> enabled.
>
> Since in your test you assume that a new replica is assigned id 2, you
> should call test_run:cleanup_cluster() in the beginning of this test.
> This is much cheaper than restarting the default instance.
>
> If you don't restart the default instance, LSNs and space ids might be
> different from run to run. That's OK, you can filter them out in the
> test output, e.g. see
>
> https://github.com/tarantool/tarantool/blob/ac17838c9634ba7ebe329290ebd3314e37ec2255/test/replication/status.test.lua#L117
>
I changed the test accordingly. Thanks for valuable info.
> > +
> > +fio = require('fio')
> > +test_run:cmd("setopt delimiter ';'")
> > +function grep_file_tail(filepath, bytes, pattern)
> > + local fh = fio.open(filepath, {'O_RDONLY'})
> > + local size = fh:seek(0, 'SEEK_END')
> > + if size < bytes then
> > + bytes = size
> > + end
> > + fh:seek(-bytes, 'SEEK_END')
> > + local line = fh:read(bytes)
> > + fh:close()
> > + return string.match(line, pattern)
> > +end;
> > +test_run:cmd("setopt delimiter ''");
> > +
> > +-- Testing case of panic on recovery
> > +test_run:cmd('create server panic with script="xlog/panic.lua"')
> > +test_run:cmd('start server panic')
> > +test_run:switch('panic')
> > +
> > +box.space._schema:replace{"t0", "v0"}
> > +lsn = box.info.vclock[1]
> > +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", lsn + 1)
> > +box.space._schema:replace{"t0", "v1"}
> > +box.error.injection.set("ERRINJ_XLOG_BREAK_LSN", -1)
> > +
> > +test_run:switch('default')
> > +test_run:cmd('stop server panic')
> > +
> > +dirname = fio.pathjoin(fio.cwd(), "panic")
> > +xlogs = fio.glob(dirname .. "/*.xlog")
> > +fio.unlink(xlogs[#xlogs])
>
> I don't understand why you need to delete the last xlog here.
>
This last xlog is always empty, so recovery_scan() skips the vclock_follow()
call and it doesn't panic unless I delete the file. I'm not sure why it gets
created since I didn't create checkpoints in the test explicitly. Later on, the
actual row recovery happens, but recover_xlog() silently skips unordered lsn.
> > +
> > +test_run:cmd('start server panic with crash_expected=True')
> > +
> > +logpath = fio.pathjoin(fio.cwd(), 'panic.log')
> > +
> > +-- Check that log contains the mention of broken LSN and the request printout
> > +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}")
> > +
> > +test_run:cmd('cleanup server panic')
> > +test_run:cmd('delete server panic')
> > +
> > +-- Testing case of panic on joining a new replica
> > +box.schema.user.grant('guest', 'super')
>
> Please instead use
>
> box.schema.user.grant('guest', 'replication')
>
Done.
> > +_ = box.schema.space.create('test')
> > +_ = box.space.test:create_index('pk')
> > +box.space.test:auto_increment{'v0'}
> > +lsn = box.info.vclock[1]
> > +box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
> > +box.space.test:auto_increment{'v1'}
> > +
> > +test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
> > +test_run:cmd('start server replica with crash_expected=True')
> > +fiber = require('fiber')
> > +while box.info.replication == nil do fiber.sleep(0.001) end
> > +while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end
> > +
> > +test_run:switch('default')
>
> Pointless 'switch' - the test is already running on behalf of the
> default instance here.
>
Fixed.
> > +logpath = fio.pathjoin(fio.cwd(), 'replica.log')
> > +-- Check that log contains the mention of broken LSN and the request printout
> > +grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 5, new: 5, req: {.*}")
> > +
> > +test_run:cmd('cleanup server replica')
> > +test_run:cmd('delete server replica')
> > +test_run:cmd('restart server default with cleanup=True')
Please find below the fixed patch.
===================================
>From 2199d4e7e44fc8f6d93aca2774890b2dfd71f716 Mon Sep 17 00:00:00 2001
From: Sergei Kalashnikov <ztarvos at gmail.com>
Date: Wed, 29 Aug 2018 17:35:42 +0300
Subject: [PATCH v2] xlog: add request details to panic message for broken LSN
Aid the debugging of replication issues related to out-of-order
requests. Adds the details of request/tuple to the diagnostic
message whenever possible.
Closes #3105
---
https://github.com/tarantool/tarantool/issues/3105
https://github.com/tarantool/tarantool/commits/ztarvos/gh-3105-log-request-broken-lsn
Changes in v2:
- Removed vclock_follow_unsafe()
- Renamed the new function vclock_follow_xrow(),
removed extra params and moved it to the header.
- Added error injections and simplified the test.
- Added test for replica join case.
- Removed trailing spaces.
src/box/applier.cc | 6 +-
src/box/recovery.cc | 4 +-
src/box/relay.cc | 7 ++
src/box/vclock.c | 8 +-
src/box/vclock.h | 8 ++
src/box/wal.c | 14 ++-
src/box/xrow.h | 31 ++++++
src/errinj.h | 2 +
test/box/errinj.result | 4 +
test/xlog/panic_on_broken_lsn.result | 187 +++++++++++++++++++++++++++++++++
test/xlog/panic_on_broken_lsn.test.lua | 77 ++++++++++++++
test/xlog/suite.ini | 2 +-
12 files changed, 331 insertions(+), 19 deletions(-)
create mode 100644 test/xlog/panic_on_broken_lsn.result
create mode 100644 test/xlog/panic_on_broken_lsn.test.lua
diff --git a/src/box/applier.cc b/src/box/applier.cc
index dbb4d05..5d87867 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -336,8 +336,7 @@ applier_join(struct applier *applier)
coio_read_xrow(coio, ibuf, &row);
applier->last_row_time = ev_monotonic_now(loop());
if (iproto_type_is_dml(row.type)) {
- vclock_follow(&replicaset.vclock, row.replica_id,
- row.lsn);
+ vclock_follow_xrow(&replicaset.vclock, &row);
xstream_write_xc(applier->subscribe_stream, &row);
} else if (row.type == IPROTO_OK) {
/*
@@ -503,8 +502,7 @@ applier_subscribe(struct applier *applier)
* the row is skipped when the replication
* is resumed.
*/
- vclock_follow(&replicaset.vclock, row.replica_id,
- row.lsn);
+ vclock_follow_xrow(&replicaset.vclock, &row);
struct replica *replica = replica_by_id(row.replica_id);
struct latch *latch = (replica ? &replica->order_latch :
&replicaset.applier.order_latch);
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index fe14def..77357f0 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -135,7 +135,7 @@ recovery_scan(struct recovery *r, struct vclock *end_vclock)
return;
struct xrow_header row;
while (xlog_cursor_next(&cursor, &row, true) == 0)
- vclock_follow(end_vclock, row.replica_id, row.lsn);
+ vclock_follow_xrow(end_vclock, &row);
xlog_cursor_close(&cursor, false);
}
@@ -266,7 +266,7 @@ recover_xlog(struct recovery *r, struct xstream *stream,
* in case of forced recovery, when we skip the
* failed row anyway.
*/
- vclock_follow(&r->vclock, row.replica_id, row.lsn);
+ vclock_follow_xrow(&r->vclock, &row);
if (xstream_write(stream, &row) == 0) {
++row_count;
if (row_count % 100000 == 0)
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 138e66d..b0bc03e 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -676,6 +676,13 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
if (packet->replica_id != relay->replica->id ||
packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
packet->replica_id)) {
+ struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
+ ERRINJ_INT);
+ if (inj != NULL && packet->lsn == inj->iparam) {
+ packet->lsn = inj->iparam - 1;
+ say_warn("injected broken lsn: %lld",
+ (long long) packet->lsn);
+ }
relay_send(relay, packet);
}
}
diff --git a/src/box/vclock.c b/src/box/vclock.c
index 2189ea8..b5eb280 100644
--- a/src/box/vclock.c
+++ b/src/box/vclock.c
@@ -42,13 +42,7 @@ vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
assert(lsn >= 0);
assert(replica_id < VCLOCK_MAX);
int64_t prev_lsn = vclock->lsn[replica_id];
- if (lsn <= prev_lsn) {
- /* Never confirm LSN out of order. */
- panic("LSN for %u is used twice or COMMIT order is broken: "
- "confirmed: %lld, new: %lld",
- (unsigned) replica_id,
- (long long) prev_lsn, (long long) lsn);
- }
+ assert(lsn > prev_lsn);
/* Easier add each time than check. */
vclock->map |= 1 << replica_id;
vclock->lsn[replica_id] = lsn;
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 928717c..111e291 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -199,6 +199,14 @@ vclock_sum(const struct vclock *vclock)
return vclock->signature;
}
+/**
+ * Update vclock with the next LSN value for given replica id.
+ *
+ * @param vclock Vector clock.
+ * @param replica_id Replica identifier.
+ * @param lsn Next lsn.
+ * @return previous lsn value.
+ */
int64_t
vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn);
diff --git a/src/box/wal.c b/src/box/wal.c
index cd068a4..a11e0ad 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -186,6 +186,12 @@ xlog_write_entry(struct xlog *l, struct journal_entry *entry)
struct xrow_header **row = entry->rows;
for (; row < entry->rows + entry->n_rows; row++) {
(*row)->tm = ev_now(loop());
+ struct errinj *inj = errinj(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT);
+ if (inj != NULL && inj->iparam == (*row)->lsn) {
+ (*row)->lsn = inj->iparam - 1;
+ say_warn("injected broken lsn: %lld",
+ (long long) (*row)->lsn);
+ }
if (xlog_write_row(l, *row) < 0) {
/*
* Rollback all un-written rows
@@ -652,8 +658,7 @@ wal_assign_lsn(struct wal_writer *writer, struct xrow_header **row,
(*row)->lsn = vclock_inc(&writer->vclock, instance_id);
(*row)->replica_id = instance_id;
} else {
- vclock_follow(&writer->vclock, (*row)->replica_id,
- (*row)->lsn);
+ vclock_follow_xrow(&writer->vclock, *row);
}
}
}
@@ -878,9 +883,8 @@ wal_write(struct journal *journal, struct journal_entry *entry)
*/
if (vclock_get(&replicaset.vclock,
instance_id) < (*last)->lsn) {
- vclock_follow(&replicaset.vclock,
- instance_id,
- (*last)->lsn);
+ vclock_follow_xrow(&replicaset.vclock,
+ *last);
}
break;
}
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 4721670..3fc007a 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -522,6 +522,37 @@ xrow_to_iovec(const struct xrow_header *row, struct iovec *out);
void
xrow_decode_error(struct xrow_header *row);
+/**
+ * Update vclock with the next LSN value for given replica id.
+ * The function will cause panic if the next LSN happens to be
+ * out of order. The details of provided row are included into
+ * diagnostic message.
+ *
+ * @param vclock Vector clock.
+ * @param row Data row.
+ * @return Previous LSN value.
+ */
+static inline int64_t
+vclock_follow_xrow(struct vclock* vclock, const struct xrow_header *row)
+{
+ assert(row);
+ assert(row->replica_id < VCLOCK_MAX);
+ if (row->lsn <= vclock->lsn[row->replica_id]) {
+ struct request req;
+ const char *req_str = "n/a";
+ if (xrow_decode_dml((struct xrow_header *)row, &req, 0) == 0)
+ req_str = request_str(&req);
+ /* Never confirm LSN out of order. */
+ panic("LSN for %u is used twice or COMMIT order is broken: "
+ "confirmed: %lld, new: %lld, req: %s",
+ (unsigned) row->replica_id,
+ (long long) vclock->lsn[row->replica_id],
+ (long long) row->lsn,
+ req_str);
+ }
+ return vclock_follow(vclock, row->replica_id, row->lsn);
+}
+
#if defined(__cplusplus)
} /* extern "C" */
diff --git a/src/errinj.h b/src/errinj.h
index 64d13b0..b6d4a4c 100644
--- a/src/errinj.h
+++ b/src/errinj.h
@@ -118,6 +118,8 @@ struct errinj {
_(ERRINJ_VY_LOG_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
_(ERRINJ_VY_RUN_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
_(ERRINJ_VY_INDEX_FILE_RENAME, ERRINJ_BOOL, {.bparam = false}) \
+ _(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
+ _(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
ENUM0(errinj_id, ERRINJ_LIST);
extern struct errinj errinjs[];
diff --git a/test/box/errinj.result b/test/box/errinj.result
index c7e4ce2..8dae761 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -58,8 +58,12 @@ errinj.info()
state: 0
ERRINJ_XLOG_META:
state: false
+ ERRINJ_WAL_BREAK_LSN:
+ state: -1
ERRINJ_WAL_WRITE_DISK:
state: false
+ ERRINJ_RELAY_BREAK_LSN:
+ state: -1
ERRINJ_VY_LOG_FILE_RENAME:
state: false
ERRINJ_VY_RUN_WRITE:
diff --git a/test/xlog/panic_on_broken_lsn.result b/test/xlog/panic_on_broken_lsn.result
new file mode 100644
index 0000000..ad10217
--- /dev/null
+++ b/test/xlog/panic_on_broken_lsn.result
@@ -0,0 +1,187 @@
+-- Issue 3105: Test logging of request with broken lsn before panicking
+-- Two cases are covered: Recovery and Joining a new replica
+env = require('test_run')
+---
+...
+test_run = env.new()
+---
+...
+test_run:cleanup_cluster()
+---
+...
+fio = require('fio')
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function grep_file_tail(filepath, bytes, pattern)
+ local fh = fio.open(filepath, {'O_RDONLY'})
+ local size = fh:seek(0, 'SEEK_END')
+ if size < bytes then
+ bytes = size
+ end
+ fh:seek(-bytes, 'SEEK_END')
+ local line = fh:read(bytes)
+ fh:close()
+ return string.match(line, pattern)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+-- Testing case of panic on recovery
+test_run:cmd('create server panic with script="xlog/panic.lua"')
+---
+- true
+...
+test_run:cmd('start server panic')
+---
+- true
+...
+test_run:switch('panic')
+---
+- true
+...
+box.space._schema:replace{"t0", "v0"}
+---
+- ['t0', 'v0']
+...
+lsn = box.info.vclock[1]
+---
+...
+box.error.injection.set("ERRINJ_WAL_BREAK_LSN", lsn + 1)
+---
+- ok
+...
+box.space._schema:replace{"t0", "v1"}
+---
+- ['t0', 'v1']
+...
+box.error.injection.set("ERRINJ_WAL_BREAK_LSN", -1)
+---
+- ok
+...
+test_run:switch('default')
+---
+- true
+...
+test_run:cmd('stop server panic')
+---
+- true
+...
+dirname = fio.pathjoin(fio.cwd(), "panic")
+---
+...
+xlogs = fio.glob(dirname .. "/*.xlog")
+---
+...
+fio.unlink(xlogs[#xlogs])
+---
+- true
+...
+test_run:cmd('start server panic with crash_expected=True')
+---
+- false
+...
+logpath = fio.pathjoin(fio.cwd(), 'panic.log')
+---
+...
+-- Check that log contains the mention of broken LSN and the request printout
+grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}")
+---
+- 'LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {type:
+ ''REPLACE'', lsn: 1, space_id: 272, index_id: 0, tuple: ["t0", "v1"]}'
+...
+test_run:cmd('cleanup server panic')
+---
+- true
+...
+test_run:cmd('delete server panic')
+---
+- true
+...
+-- Testing case of panic on joining a new replica
+box.schema.user.grant('guest', 'replication')
+---
+...
+_ = box.schema.space.create('test')
+---
+...
+_ = box.space.test:create_index('pk')
+---
+...
+box.space.test:auto_increment{'v0'}
+---
+- [1, 'v0']
+...
+lsn = box.info.vclock[1]
+---
+...
+box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
+---
+- ok
+...
+box.space.test:auto_increment{'v1'}
+---
+- [2, 'v1']
+...
+test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
+---
+- true
+...
+test_run:cmd('start server replica with crash_expected=True')
+---
+- false
+...
+fiber = require('fiber')
+---
+...
+while box.info.replication == nil do fiber.sleep(0.001) end
+---
+...
+while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end
+---
+...
+box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", -1)
+---
+- ok
+...
+logpath = fio.pathjoin(fio.cwd(), 'replica.log')
+---
+...
+-- Check that log contains the mention of broken LSN and the request printout
+test_run:cmd("push filter 'lsn: "..lsn..", space_id: [0-9]+' to 'lsn: <lsn>, space_id: <space_id>'")
+---
+- true
+...
+test_run:cmd("push filter 'confirmed: "..lsn..", new: "..lsn.."' to '<lsn>'")
+---
+- true
+...
+grep_file_tail(logpath, 256, "(LSN for 1 is used twice or COMMIT order is broken: confirmed: "..lsn..", new: "..lsn.."), req: ({.*})")
+---
+- 'LSN for 1 is used twice or COMMIT order is broken: <lsn>'
+- '{type: ''INSERT'', lsn: <lsn>, space_id: <space_id>, index_id: 0, tuple: [2, "v1"]}'
+...
+test_run:cmd("clear filter")
+---
+- true
+...
+test_run:cmd('cleanup server replica')
+---
+- true
+...
+test_run:cmd('delete server replica')
+---
+- true
+...
+box.space.test:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/xlog/panic_on_broken_lsn.test.lua b/test/xlog/panic_on_broken_lsn.test.lua
new file mode 100644
index 0000000..dd450d1
--- /dev/null
+++ b/test/xlog/panic_on_broken_lsn.test.lua
@@ -0,0 +1,77 @@
+-- Issue 3105: Test logging of request with broken lsn before panicking
+-- Two cases are covered: Recovery and Joining a new replica
+env = require('test_run')
+test_run = env.new()
+test_run:cleanup_cluster()
+
+fio = require('fio')
+test_run:cmd("setopt delimiter ';'")
+function grep_file_tail(filepath, bytes, pattern)
+ local fh = fio.open(filepath, {'O_RDONLY'})
+ local size = fh:seek(0, 'SEEK_END')
+ if size < bytes then
+ bytes = size
+ end
+ fh:seek(-bytes, 'SEEK_END')
+ local line = fh:read(bytes)
+ fh:close()
+ return string.match(line, pattern)
+end;
+test_run:cmd("setopt delimiter ''");
+
+-- Testing case of panic on recovery
+test_run:cmd('create server panic with script="xlog/panic.lua"')
+test_run:cmd('start server panic')
+test_run:switch('panic')
+
+box.space._schema:replace{"t0", "v0"}
+lsn = box.info.vclock[1]
+box.error.injection.set("ERRINJ_WAL_BREAK_LSN", lsn + 1)
+box.space._schema:replace{"t0", "v1"}
+box.error.injection.set("ERRINJ_WAL_BREAK_LSN", -1)
+
+test_run:switch('default')
+test_run:cmd('stop server panic')
+
+dirname = fio.pathjoin(fio.cwd(), "panic")
+xlogs = fio.glob(dirname .. "/*.xlog")
+fio.unlink(xlogs[#xlogs])
+
+test_run:cmd('start server panic with crash_expected=True')
+
+logpath = fio.pathjoin(fio.cwd(), 'panic.log')
+
+-- Check that log contains the mention of broken LSN and the request printout
+grep_file_tail(logpath, 256, "LSN for 1 is used twice or COMMIT order is broken: confirmed: 1, new: 1, req: {.*}")
+
+test_run:cmd('cleanup server panic')
+test_run:cmd('delete server panic')
+
+-- Testing case of panic on joining a new replica
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test')
+_ = box.space.test:create_index('pk')
+box.space.test:auto_increment{'v0'}
+lsn = box.info.vclock[1]
+box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
+box.space.test:auto_increment{'v1'}
+
+test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
+test_run:cmd('start server replica with crash_expected=True')
+fiber = require('fiber')
+while box.info.replication == nil do fiber.sleep(0.001) end
+while box.info.replication[2].downstream.status ~= "stopped" do fiber.sleep(0.001) end
+box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", -1)
+
+logpath = fio.pathjoin(fio.cwd(), 'replica.log')
+-- Check that log contains the mention of broken LSN and the request printout
+test_run:cmd("push filter 'lsn: "..lsn..", space_id: [0-9]+' to 'lsn: <lsn>, space_id: <space_id>'")
+test_run:cmd("push filter 'confirmed: "..lsn..", new: "..lsn.."' to '<lsn>'")
+grep_file_tail(logpath, 256, "(LSN for 1 is used twice or COMMIT order is broken: confirmed: "..lsn..", new: "..lsn.."), req: ({.*})")
+test_run:cmd("clear filter")
+
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/xlog/suite.ini b/test/xlog/suite.ini
index 8ee9720..ffc812e 100644
--- a/test/xlog/suite.ini
+++ b/test/xlog/suite.ini
@@ -4,7 +4,7 @@ description = tarantool write ahead log tests
script = xlog.lua
disabled = snap_io_rate.test.lua
valgrind_disabled =
-release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua
+release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua panic_on_broken_lsn.test.lua
config = suite.cfg
use_unix_sockets = True
long_run = snap_io_rate.test.lua
--
1.8.3.1
More information about the Tarantool-patches
mailing list