Tarantool development patches archive
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: kostja@tarantool.org
Cc: tarantool-patches@freelists.org
Subject: Re: [PATCH] xlog: fix fallocate vs read race
Date: Fri, 14 Dec 2018 15:04:42 +0300
Message-ID: <20181214120442.vdcg7nt5kzwvx3cn@esperanza>
In-Reply-To: <8548a4bd8439a1e4a7f78ff37216c170c61a33c3.1544783335.git.vdavydov.dev@gmail.com>

On Fri, Dec 14, 2018 at 01:29:44PM +0300, Vladimir Davydov wrote:
> posix_fallocate(), which is used for preallocating disk space for WAL
> files, increases the file size and fills the allocated space with zeros.
> The problem is that a WAL file may be read by a relay thread while it
> is being written to. We try to handle the zeroed space in xlog_cursor
> (see xlog_cursor_next_tx()), but this turns out not to be enough:
> transactions are not written atomically, so a reader may see a
> transaction that the writer has only half written. Without fallocate,
> the reader would stop at EOF until the rest of the transaction is
> written, but with fallocate it reads zeros instead and concludes that
> the xlog file is corrupted when it actually is not.
> 
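
A toy model may help to see why EOF and zero-filled space behave so
differently for a reader. The sketch below is not the xlog format or the
xlog_cursor code: the three-byte header (marker, length, checksum), the
READ_* statuses and both function names are invented for illustration.
The reader treats a zero marker as "not written yet", similar in spirit
to the zeroed-space handling mentioned above, and still misjudges a
half-written record once the visible file size covers the preallocated
space.

```c
#include <stddef.h>
#include <string.h>

/* Made-up toy record: [marker][payload length][checksum][payload...] */
static const unsigned char REC_MAGIC = 0xAB;
static const size_t REC_HDR = 3;

enum read_status { READ_OK, READ_WAIT, READ_CORRUPT };

/* Sum of all bytes modulo 256; enough to detect a torn payload here. */
static unsigned char
checksum(const unsigned char *p, size_t n)
{
	unsigned sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += p[i];
	return (unsigned char)sum;
}

/*
 * Simulate an interrupted writer: store the complete header for
 * `payload`, but only its first `done` bytes. Returns the offset just
 * past the last byte actually written.
 */
static size_t
write_partial(unsigned char *file, const char *payload, size_t len,
	      size_t done)
{
	file[0] = REC_MAGIC;
	file[1] = (unsigned char)len;
	file[2] = checksum((const unsigned char *)payload, len);
	memcpy(file + REC_HDR, payload, done);
	return REC_HDR + done;
}

/*
 * Try to read one record at `pos`, given that the reader currently
 * sees `size` bytes of the file. On READ_OK, *next is set to the
 * offset of the following record.
 */
static enum read_status
read_record(const unsigned char *file, size_t size, size_t pos,
	    size_t *next)
{
	if (pos > size || size - pos < REC_HDR)
		return READ_WAIT;	/* EOF inside the header */
	const unsigned char *hdr = file + pos;
	if (hdr[0] == 0)
		return READ_WAIT;	/* zeroed space, nothing written yet */
	if (hdr[0] != REC_MAGIC)
		return READ_CORRUPT;
	size_t len = hdr[1];
	if (size - pos - REC_HDR < len)
		return READ_WAIT;	/* EOF inside the payload */
	if (checksum(hdr + REC_HDR, len) != hdr[2])
		return READ_CORRUPT;	/* all bytes visible, content wrong */
	*next = pos + REC_HDR + len;
	return READ_OK;
}
```

With "hello world" as the payload and only 5 of its 11 bytes written,
read_record() returns READ_WAIT when the visible size is the 8 bytes
actually written, and READ_CORRUPT when the visible size is a
zero-filled 64-byte buffer, because the zeros stand in for the missing
payload bytes and the checksum no longer matches.
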
> Fix this issue by using fallocate() with the FALLOC_FL_KEEP_SIZE flag
> instead of posix_fallocate(). With this flag, fallocate() doesn't
> increase the file size; it only allocates disk space beyond EOF.
> 
> The test will be added shortly.
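
The difference between the two calls can be observed with a small
Linux-only C sketch. It is a standalone demo, not code from the patch:
observe_prealloc(), struct prealloc_view, DATA_LEN and the temporary
file path are assumptions made for the illustration. It writes a few
bytes, preallocates space behind them and records the file size and the
result of a read right past the written data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* fallocate(), FALLOC_FL_KEEP_SIZE (Linux/glibc) */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define DATA_LEN 16	/* bytes of "real" data written in this demo */

/* What a concurrent reader would observe after preallocation. */
struct prealloc_view {
	off_t size;		/* st_size reported by fstat() */
	ssize_t tail_read;	/* result of reading 8 bytes past the data */
};

/*
 * Hypothetical demo helper: write DATA_LEN bytes to a temporary file,
 * preallocate `extra` bytes behind them and record what a reader sees.
 * keep_size != 0 uses fallocate(FALLOC_FL_KEEP_SIZE), keep_size == 0
 * uses posix_fallocate(). Returns 0 on success, -1 with errno set on
 * failure (e.g. EOPNOTSUPP if the filesystem lacks fallocate support).
 */
static int
observe_prealloc(off_t extra, int keep_size, struct prealloc_view *view)
{
	char path[] = "/tmp/prealloc_demo_XXXXXX";
	char tail[8];
	struct stat st;
	int rc = -1, saved_errno;

	int fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file goes away once fd is closed */

	if (write(fd, "0123456789abcdef", DATA_LEN) != DATA_LEN)
		goto out;
	if (keep_size) {
		if (fallocate(fd, FALLOC_FL_KEEP_SIZE, DATA_LEN, extra) != 0)
			goto out;
	} else {
		/* posix_fallocate() returns the error, errno is not set. */
		int err = posix_fallocate(fd, DATA_LEN, extra);
		if (err != 0) {
			errno = err;
			goto out;
		}
	}
	if (fstat(fd, &st) != 0)
		goto out;
	view->size = st.st_size;
	view->tail_read = pread(fd, tail, sizeof(tail), DATA_LEN);
	if (view->tail_read < 0)
		goto out;
	rc = 0;
out:
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return rc;
}
```

On a filesystem that supports the flag, observe_prealloc(4096, 1, &v)
is expected to leave v.size at 16 and v.tail_read at 0 (EOF), while
observe_prealloc(4096, 0, &v) should report a size of 4112 and a full
8-byte read from the zero-filled tail.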

Here goes the test:

diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 95e94e5a..984d2e81 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,6 +6,7 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
+    "wal_rw_stress.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
diff --git a/test/replication/wal_rw_stress.result b/test/replication/wal_rw_stress.result
new file mode 100644
index 00000000..cc68877b
--- /dev/null
+++ b/test/replication/wal_rw_stress.result
@@ -0,0 +1,106 @@
+test_run = require('test_run').new()
+---
+...
+--
+-- gh-3893: Replication failure: relay may report that an xlog
+-- is corrupted if it is currently being written to.
+--
+s = box.schema.space.create('test')
+---
+...
+_ = s:create_index('primary')
+---
+...
+-- Deploy a replica.
+box.schema.user.grant('guest', 'replication')
+---
+...
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+-- Setup replica => master channel.
+box.cfg{replication = test_run:cmd("eval replica 'return box.cfg.listen'")}
+---
+...
+-- Disable master => replica channel.
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {}}
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Write some xlogs on the master.
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for i = 1, 100 do
+    box.begin()
+    for j = 1, 100 do
+        s:replace{1, require('digest').urandom(1000)}
+    end
+    box.commit()
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+-- Enable master => replica channel and wait for the replica to catch up.
+-- The relay handling replica => master channel on the replica will read
+-- an xlog while the applier is writing to it. Although applier and relay
+-- are running in different threads, there shouldn't be any rw errors.
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{replication = replication}
+---
+...
+box.info.replication[1].downstream.status ~= 'stopped' or box.info
+---
+- true
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Cleanup.
+box.cfg{replication = {}}
+---
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+test_run:cmd("delete server replica")
+---
+- true
+...
+test_run:cleanup_cluster()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
+s:drop()
+---
+...
diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
new file mode 100644
index 00000000..08570b28
--- /dev/null
+++ b/test/replication/wal_rw_stress.test.lua
@@ -0,0 +1,51 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3893: Replication failure: relay may report that an xlog
+-- is corrupted if it is currently being written to.
+--
+s = box.schema.space.create('test')
+_ = s:create_index('primary')
+
+-- Deploy a replica.
+box.schema.user.grant('guest', 'replication')
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+
+-- Setup replica => master channel.
+box.cfg{replication = test_run:cmd("eval replica 'return box.cfg.listen'")}
+
+-- Disable master => replica channel.
+test_run:cmd("switch replica")
+replication = box.cfg.replication
+box.cfg{replication = {}}
+test_run:cmd("switch default")
+
+-- Write some xlogs on the master.
+test_run:cmd("setopt delimiter ';'")
+for i = 1, 100 do
+    box.begin()
+    for j = 1, 100 do
+        s:replace{1, require('digest').urandom(1000)}
+    end
+    box.commit()
+end;
+test_run:cmd("setopt delimiter ''");
+
+-- Enable master => replica channel and wait for the replica to catch up.
+-- The relay handling replica => master channel on the replica will read
+-- an xlog while the applier is writing to it. Although applier and relay
+-- are running in different threads, there shouldn't be any rw errors.
+test_run:cmd("switch replica")
+box.cfg{replication = replication}
+box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:cmd("switch default")
+
+-- Cleanup.
+box.cfg{replication = {}}
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+test_run:cmd("delete server replica")
+test_run:cleanup_cluster()
+box.schema.user.revoke('guest', 'replication')
+s:drop()
