[patches] Re: [PATCH] Fix force_recovery on empty xlog

Wed Jan 31 16:18:28 MSK 2018

On Wed, Jan 31, 2018 at 03:18:33PM +0300, Konstantin Belyavskiy wrote:

> From 6b54f6e0234a66155fb5a9782117160ffe3799ef Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> Date: Thu, 25 Jan 2018 15:46:22 +0300
> Subject: [PATCH] Fix force_recovery on empty xlog
> 
> * Fix force_recovery behaviour on empty xlog files and ones with corrupted header.
> * Add a test
> 
> Closes #3026, #3076
> ---
>  src/box/recovery.cc               |  29 +++++----
>  src/box/xlog.c                    |   2 -
>  test/xlog/force_recovery.lua      |   8 +++
>  test/xlog/force_recovery.result   | 125 ++++++++++++++++++++++++++++++++++++++
>  test/xlog/force_recovery.test.lua |  63 +++++++++++++++++++
>  5 files changed, 214 insertions(+), 13 deletions(-)
>  create mode 100644 test/xlog/force_recovery.lua
>  create mode 100644 test/xlog/force_recovery.result
>  create mode 100644 test/xlog/force_recovery.test.lua

The test fails on the branch. It looks like you forgot to update the
result file.

> 
> diff --git a/src/box/recovery.cc b/src/box/recovery.cc
> index 281ac1838..dc1189092 100644
> --- a/src/box/recovery.cc
> +++ b/src/box/recovery.cc
> @@ -42,6 +42,8 @@
>  #include "session.h"
>  #include "coio_file.h"
>  #include "error.h"
> +#include <sys/stat.h>
> +

access() is declared in <unistd.h>, not <sys/stat.h>.
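
For reference, a minimal standalone sketch of an existence check with the
right header. The helper name path_exists() is made up for illustration and
is not part of the patch or of the tree.

```c
#include <unistd.h>	/* access(), F_OK */

/*
 * Hypothetical helper: return 1 if `path` is visible to the calling
 * process, 0 otherwise (including when a path component is not
 * searchable).
 */
int
path_exists(const char *path)
{
	return access(path, F_OK) == 0;
}
```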

>  
>  /*
>   * Recovery subsystem
> @@ -330,20 +332,26 @@ recovery_finalize(struct recovery *r, struct xstream *stream)
>  	recovery_close_log(r);
>  
>  	/*
> -	 * Check that the last xlog file has rows.
> +	 * Rename last corrupted xlog if any. Cases:
> +	 *  - file has corrupted rows
> +	 *  - file has corrupted header
> +	 *  - file has zero size
>  	 */
> -	if (vclockset_last(&r->wal_dir.index) != NULL &&
> -	    vclock_sum(&r->vclock) ==
> -	    vclock_sum(vclockset_last(&r->wal_dir.index))) {
> -		/*
> -		 * Delete the last empty xlog file.
> -		 */
> +	if (vclockset_last(&r->wal_dir.index) != NULL) {
>  		char *name = xdir_format_filename(&r->wal_dir,
>  						  vclock_sum(&r->vclock),
>  						  NONE);
> -		if (unlink(name) != 0) {
> -			tnt_raise(SystemError, "%s: failed to unlink file",
> -				  name);
> +		if (access(name, F_OK) == 0 ||

> +		    vclock_sum(&r->vclock) ==
> +		    vclock_sum(vclockset_last(&r->wal_dir.index))) {

If there's an xlog file corresponding to r->vclock in r->wal_dir.index,
it must exist, i.e. access() must return 0 for it. So the second
check looks redundant.

Come to think of it, we have scanned the whole xlog directory in
xdir_scan() by the time we get here, so using access() here looks
strange. Can we call rename() right from xdir_scan() for files that
we failed to index (corrupted header, empty)?
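
Roughly, the idea is to do the rename at scan time. The sketch below uses
plain POSIX calls only and does not touch the real xdir code. The names
quarantine_unindexable_xlogs() and xlog_is_indexable() are made up, and the
"indexable" check is a placeholder that only rejects empty files; the real
check would be whatever xdir_scan() already does when it opens a file and
reads its header.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

enum { PATH_BUF_SIZE = 4096 };

/* Placeholder for the real header check: only rejects empty files. */
static int
xlog_is_indexable(const char *path)
{
	struct stat st;
	return stat(path, &st) == 0 && st.st_size > 0;
}

static int
has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);
	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Scan `dirname` and rename every *.xlog that fails the check to
 * *.xlog.corrupted. Returns the number of renamed files, -1 on error.
 */
int
quarantine_unindexable_xlogs(const char *dirname)
{
	DIR *dir = opendir(dirname);
	if (dir == NULL)
		return -1;
	int renamed = 0;
	struct dirent *ent;
	while ((ent = readdir(dir)) != NULL) {
		/* Already renamed files end in ".corrupted" and are skipped. */
		if (!has_suffix(ent->d_name, ".xlog"))
			continue;
		char from[PATH_BUF_SIZE], to[PATH_BUF_SIZE];
		int n = snprintf(from, sizeof(from), "%s/%s",
				 dirname, ent->d_name);
		if (n < 0 || (size_t)n >= sizeof(from))
			goto fail;
		if (xlog_is_indexable(from))
			continue;
		n = snprintf(to, sizeof(to), "%s.corrupted", from);
		if (n < 0 || (size_t)n >= sizeof(to))
			goto fail;
		if (rename(from, to) != 0)
			goto fail;
		renamed++;
	}
	closedir(dir);
	return renamed;
fail:
	closedir(dir);
	return -1;
}
```

With something like this in the scan path, recovery_finalize() would not
need access() at all, since a file that could not be indexed is already out
of the way by the time we get there.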

> +			say_info("rename corrupted xlog %s", name);
> +			char to[PATH_MAX];
> +			snprintf(to, sizeof(to), "%s.corrupted", name);
> +			if (rename(name, to) != 0) {
> +				tnt_raise(SystemError,
> +					  "%s: can't rename corrupted xlog",
> +					  name);
> +			}
>  		}
>  	}
>  }

> diff --git a/test/xlog/force_recovery.test.lua b/test/xlog/force_recovery.test.lua
> new file mode 100644
> index 000000000..575a21ede
> --- /dev/null
> +++ b/test/xlog/force_recovery.test.lua
> @@ -0,0 +1,63 @@
> +#!/usr/bin/env tarantool
> +
> +env = require('test_run')
> +fio = require('fio')
> +test_run = env.new()
> +
> +box.cfg{}
> +
> +test_run:cmd('create server test with script = "xlog/force_recovery.lua"')
> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")

> +box.space._schema:replace({'test'})

I don't think it's a good idea to use a system space for tests.
Please create a normal space, like we do in other tests.

> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")
> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")
> +box.space._schema:replace({'lost'})
> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")

You don't need to switch to 'default' to restart a server. Instead,
you can use

  test_run:cmd('restart server test')

That ought to make the test shorter.

> +
> +test_run:cmd("start server test")
> +test_run:cmd("switch test")
> +box.space._schema:replace({'tost'})
> +test_run:cmd("switch default")
> +test_run:cmd("stop server test")
> +
> +-- corrupted (empty) in the middle (old behavior: goto error on recovery)
> +path = fio.pathjoin(box.cfg.wal_dir, string.format('../force_recovery/%020d.xlog', 1))

That looks weird - using wal_dir from 'default' to look up an xlog
created by 'test'. Can't we corrupt xlog files from the 'test' context?