[patches] Re: [PATCH] Fix force_recovery on empty xlog
Vladimir Davydov
vdavydov.dev at gmail.com
Wed Jan 31 16:18:28 MSK 2018
On Wed, Jan 31, 2018 at 03:18:33PM +0300, Konstantin Belyavskiy wrote:
> From 6b54f6e0234a66155fb5a9782117160ffe3799ef Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> Date: Thu, 25 Jan 2018 15:46:22 +0300
> Subject: [PATCH] Fix force_recovery on empty xlog
>
> * Fix force_recovery behaviour on empty xlog files and ones with corrupted header.
> * Add a test
>
> Closes #3026, #3076
> ---
> src/box/recovery.cc | 29 +++++----
> src/box/xlog.c | 2 -
> test/xlog/force_recovery.lua | 8 +++
> test/xlog/force_recovery.result | 125 ++++++++++++++++++++++++++++++++++++++
> test/xlog/force_recovery.test.lua | 63 +++++++++++++++++++
> 5 files changed, 214 insertions(+), 13 deletions(-)
> create mode 100644 test/xlog/force_recovery.lua
> create mode 100644 test/xlog/force_recovery.result
> create mode 100644 test/xlog/force_recovery.test.lua
The test doesn't pass on the branch. Looks like you forgot to update the
result file.
>
> diff --git a/src/box/recovery.cc b/src/box/recovery.cc
> index 281ac1838..dc1189092 100644
> --- a/src/box/recovery.cc
> +++ b/src/box/recovery.cc
> @@ -42,6 +42,8 @@
> #include "session.h"
> #include "coio_file.h"
> #include "error.h"
> +#include <sys/stat.h>
> +
access() is declared in unistd.h
>
> /*
> * Recovery subsystem
> @@ -330,20 +332,26 @@ recovery_finalize(struct recovery *r, struct xstream *stream)
> recovery_close_log(r);
>
> /*
> - * Check that the last xlog file has rows.
> + * Rename last corrupted xlog if any. Cases:
> + * - file has corrupted rows
> + * - file has corrupted header
> + * - file has zero size
> */
> - if (vclockset_last(&r->wal_dir.index) != NULL &&
> - vclock_sum(&r->vclock) ==
> - vclock_sum(vclockset_last(&r->wal_dir.index))) {
> - /*
> - * Delete the last empty xlog file.
> - */
> + if (vclockset_last(&r->wal_dir.index) != NULL) {
> char *name = xdir_format_filename(&r->wal_dir,
> vclock_sum(&r->vclock),
> NONE);
> - if (unlink(name) != 0) {
> - tnt_raise(SystemError, "%s: failed to unlink file",
> - name);
> + if (access(name, F_OK) == 0 ||
> + vclock_sum(&r->vclock) ==
> + vclock_sum(vclockset_last(&r->wal_dir.index))) {
If there's an xlog file corresponding to r->vclock in r->wal_dir.index,
it must exist, i.e. access() must return 0 for it. That said, the second
check looks redundant.
Come to think of it, we have scanned the whole xlog directory in
xdir_scan() by the time we get here so using access() here looks
strange. Can we call rename() right from xdir_scan() for files that
we failed to index (corrupted header, empty)?
> + say_info("rename corrupted xlog %s", name);
> + char to[PATH_MAX];
> + snprintf(to, sizeof(to), "%s.corrupted", name);
> + if (rename(name, to) != 0) {
> + tnt_raise(SystemError,
> + "%s: can't rename corrupted xlog",
> + name);
> + }
> }
> }
> }
> diff --git a/test/xlog/force_recovery.test.lua b/test/xlog/force_recovery.test.lua
> new file mode 100644
> index 000000000..575a21ede
> --- /dev/null
> +++ b/test/xlog/force_recovery.test.lua
> @@ -0,0 +1,63 @@
> +#!/usr/bin/env tarantool
> +
> +env = require('test_run')
> +fio = require('fio')
> +test_run = env.new()
> +
> +box.cfg{}
> +
> +test_run:cmd('create server test with script = "xlog/force_recovery.lua"')
> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")
> +box.space._schema:replace({'test'})
I don't think it's a good idea to use a system space for tests.
Please create a normal space, like we do in other tests.
> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")
> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")
> +box.space._schema:replace({'lost'})
> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")
You don't need to switch to 'default' for restarting a server, instead
you can use
test_run:cmd('restart server test')
Ought to make the test shorter.
> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")
> +box.space._schema:replace({'tost'})
> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")
> +
> +-- corrupted (empty) in the middle (old behavior: goto error on recovery)
> +path = fio.pathjoin(box.cfg.wal_dir, string.format('../force_recovery/%020d.xlog', 1))
That looks weird - using wal_dir from 'default' to look up an xlog
created by 'test'. Can't we corrupt xlog files from the 'test' context?
More information about the Tarantool-patches
mailing list