Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH] xlog: fix fallocate vs read race
@ 2018-12-14 10:29 Vladimir Davydov
  2018-12-14 11:07 ` Konstantin Osipov
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Vladimir Davydov @ 2018-12-14 10:29 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

posix_fallocate(), which is used for preallocating disk space for WAL
files, increases the file size and fills the allocated space with zeros.
The problem is a WAL file may be read by a relay thread at the same time
it is written to. We try to handle the zeroed space in xlog_cursor (see
xlog_cursor_next_tx()), however this turns out to be not enough, because
transactions are written not atomically so it may occur that a writer
writes half a transaction when a reader reads it. Without fallocate, the
reader would stop at EOF until the rest of the transaction is written,
but with fallocate it reads zeroes instead and thinks that the xlog file
is corrupted while actually it is not.

Fix this issue by using fallocate() with FALLOC_FL_KEEP_SIZE flag
instead of posix_fallocate(). With the flag fallocate() won't increase
the file size, it will only allocate disk space beyond EOF.

The test will be added shortly.

Closes #3883
---
https://github.com/tarantool/tarantool/issues/3883
https://github.com/tarantool/tarantool/tree/dv/gh-3883-xlog-fallocate-fix

 CMakeLists.txt            |  2 +-
 src/box/xlog.c            | 32 +++++++++++++++++++-------------
 src/trivia/config.h.cmake |  2 +-
 3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 2b43e0c6..c2606de2 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -85,7 +85,7 @@ check_symbol_exists(fdatasync unistd.h HAVE_FDATASYNC)
 check_symbol_exists(pthread_yield pthread.h HAVE_PTHREAD_YIELD)
 check_symbol_exists(sched_yield sched.h HAVE_SCHED_YIELD)
 check_symbol_exists(posix_fadvise fcntl.h HAVE_POSIX_FADVISE)
-check_symbol_exists(posix_fallocate fcntl.h HAVE_POSIX_FALLOCATE)
+check_symbol_exists(fallocate fcntl.h HAVE_FALLOCATE)
 check_symbol_exists(mremap sys/mman.h HAVE_MREMAP)
 
 check_function_exists(sync_file_range HAVE_SYNC_FILE_RANGE)
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 191fadcd..280c036f 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -996,10 +996,25 @@ xdir_create_xlog(struct xdir *dir, struct xlog *xlog,
 ssize_t
 xlog_fallocate(struct xlog *log, size_t len)
 {
-#ifdef HAVE_POSIX_FALLOCATE
-	int rc = posix_fallocate(log->fd, log->offset + log->allocated, len);
+#ifdef HAVE_FALLOCATE
+	static bool fallocate_not_supported = false;
+	if (fallocate_not_supported)
+		return 0;
+	/*
+	 * Keep the file size, because it is used to sync
+	 * concurrent readers vs the writer: xlog_cursor
+	 * assumes that everything written before EOF is
+	 * valid data.
+	 */
+	int rc = fallocate(log->fd, FALLOC_FL_KEEP_SIZE,
+			   log->offset + log->allocated, len);
 	if (rc != 0) {
-		errno = rc;
+		if (errno == ENOSYS || errno == EOPNOTSUPP) {
+			say_warn("fallocate is not supported, "
+				 "proceeding without it");
+			fallocate_not_supported = true;
+			return 0;
+		}
 		diag_set(SystemError, "%s: can't allocate disk space",
 			 log->filename);
 		return -1;
@@ -1010,7 +1025,7 @@ xlog_fallocate(struct xlog *log, size_t len)
 	(void)log;
 	(void)len;
 	return 0;
-#endif /* HAVE_POSIX_FALLOCATE */
+#endif /* HAVE_FALLOCATE */
 }
 
 /**
@@ -1832,15 +1847,6 @@ xlog_cursor_next_tx(struct xlog_cursor *i)
 		return -1;
 	if (rc > 0)
 		return 1;
-	if (load_u32(i->rbuf.rpos) == 0) {
-		/*
-		 * Space preallocated with xlog_fallocate().
-		 * Treat as eof and clear the buffer.
-		 */
-		i->read_offset -= ibuf_used(&i->rbuf);
-		ibuf_reset(&i->rbuf);
-		return 1;
-	}
 	if (load_u32(i->rbuf.rpos) == eof_marker) {
 		/* eof marker found */
 		goto eof_found;
diff --git a/src/trivia/config.h.cmake b/src/trivia/config.h.cmake
index 53eae2fe..f5c2e2c5 100644
--- a/src/trivia/config.h.cmake
+++ b/src/trivia/config.h.cmake
@@ -166,7 +166,7 @@
 #cmakedefine HAVE_PTHREAD_YIELD 1
 #cmakedefine HAVE_SCHED_YIELD 1
 #cmakedefine HAVE_POSIX_FADVISE 1
-#cmakedefine HAVE_POSIX_FALLOCATE 1
+#cmakedefine HAVE_FALLOCATE 1
 #cmakedefine HAVE_MREMAP 1
 #cmakedefine HAVE_SYNC_FILE_RANGE 1
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] xlog: fix fallocate vs read race
  2018-12-14 10:29 [PATCH] xlog: fix fallocate vs read race Vladimir Davydov
@ 2018-12-14 11:07 ` Konstantin Osipov
  2018-12-14 11:12   ` Vladimir Davydov
  2018-12-14 12:04 ` Vladimir Davydov
  2018-12-14 12:30 ` Vladimir Davydov
  2 siblings, 1 reply; 5+ messages in thread
From: Konstantin Osipov @ 2018-12-14 11:07 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/12/14 13:30]:
> posix_fallocate(), which is used for preallocating disk space for WAL
> files, increases the file size and fills the allocated space with zeros.
> The problem is a WAL file may be read by a relay thread at the same time
> it is written to. We try to handle the zeroed space in xlog_cursor (see
> xlog_cursor_next_tx()), however this turns out to be not enough, because
> transactions are written not atomically so it may occur that a writer
> writes half a transaction when a reader reads it. Without fallocate, the
> reader would stop at EOF until the rest of the transaction is written,
> but with fallocate it reads zeroes instead and thinks that the xlog file
> is corrupted while actually it is not.

You should use check_program_runs() not check_symbol_exists and
avoid checks at runtime. 

Please push this patch and then proceed to fix it according to the
review.
> 
> Fix this issue by using fallocate() with FALLOC_FL_KEEP_SIZE flag
> instead of posix_fallocate(). With the flag fallocate() won't increase
> the file size, it will only allocate disk space beyond EOF.
> 
> The test will be added shortly.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] xlog: fix fallocate vs read race
  2018-12-14 11:07 ` Konstantin Osipov
@ 2018-12-14 11:12   ` Vladimir Davydov
  0 siblings, 0 replies; 5+ messages in thread
From: Vladimir Davydov @ 2018-12-14 11:12 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Fri, Dec 14, 2018 at 02:07:41PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/12/14 13:30]:
> > posix_fallocate(), which is used for preallocating disk space for WAL
> > files, increases the file size and fills the allocated space with zeros.
> > The problem is a WAL file may be read by a relay thread at the same time
> > it is written to. We try to handle the zeroed space in xlog_cursor (see
> > xlog_cursor_next_tx()), however this turns out to be not enough, because
> > transactions are written not atomically so it may occur that a writer
> > writes half a transaction when a reader reads it. Without fallocate, the
> > reader would stop at EOF until the rest of the transaction is written,
> > but with fallocate it reads zeroes instead and thinks that the xlog file
> > is corrupted while actually it is not.
> 
> You should use check_program_runs() not check_symbol_exists and
> avoid checks at runtime. 

At compile time we don't know if the filesystem that will be used for
storing WALs supports fallocate() so the runtime check is a must.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] xlog: fix fallocate vs read race
  2018-12-14 10:29 [PATCH] xlog: fix fallocate vs read race Vladimir Davydov
  2018-12-14 11:07 ` Konstantin Osipov
@ 2018-12-14 12:04 ` Vladimir Davydov
  2018-12-14 12:30 ` Vladimir Davydov
  2 siblings, 0 replies; 5+ messages in thread
From: Vladimir Davydov @ 2018-12-14 12:04 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

On Fri, Dec 14, 2018 at 01:29:44PM +0300, Vladimir Davydov wrote:
> posix_fallocate(), which is used for preallocating disk space for WAL
> files, increases the file size and fills the allocated space with zeros.
> The problem is a WAL file may be read by a relay thread at the same time
> it is written to. We try to handle the zeroed space in xlog_cursor (see
> xlog_cursor_next_tx()), however this turns out to be not enough, because
> transactions are written not atomically so it may occur that a writer
> writes half a transaction when a reader reads it. Without fallocate, the
> reader would stop at EOF until the rest of the transaction is written,
> but with fallocate it reads zeroes instead and thinks that the xlog file
> is corrupted while actually it is not.
> 
> Fix this issue by using fallocate() with FALLOC_FL_KEEP_SIZE flag
> instead of posix_fallocate(). With the flag fallocate() won't increase
> the file size, it will only allocate disk space beyond EOF.
> 
> The test will be added shortly.

Here goes the test:

diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 95e94e5a..984d2e81 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,6 +6,7 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
+    "wal_rw_stress.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
diff --git a/test/replication/wal_rw_stress.result b/test/replication/wal_rw_stress.result
new file mode 100644
index 00000000..cc68877b
--- /dev/null
+++ b/test/replication/wal_rw_stress.result
@@ -0,0 +1,106 @@
+test_run = require('test_run').new()
+---
+...
+--
+-- gh-3893: Replication failure: relay may report that an xlog
+-- is corrupted if it it currently being written to.
+--
+s = box.schema.space.create('test')
+---
+...
+_ = s:create_index('primary')
+---
+...
+-- Deploy a replica.
+box.schema.user.grant('guest', 'replication')
+---
+...
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+-- Setup replica => master channel.
+box.cfg{replication = test_run:cmd("eval replica 'return box.cfg.listen'")}
+---
+...
+-- Disable master => replica channel.
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {}}
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Write some xlogs on the master.
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for i = 1, 100 do
+    box.begin()
+    for j = 1, 100 do
+        s:replace{1, require('digest').urandom(1000)}
+    end
+    box.commit()
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+-- Enable master => replica channel and wait for the replica to catch up.
+-- The relay handling replica => master channel on the replica will read
+-- an xlog while the applier is writing to it. Although applier and relay
+-- are running in different threads, there shouldn't be any rw errors.
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{replication = replication}
+---
+...
+box.info.replication[1].downstream.status ~= 'stopped' or box.info
+---
+- true
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Cleanup.
+box.cfg{replication = {}}
+---
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+test_run:cmd("delete server replica")
+---
+- true
+...
+test_run:cleanup_cluster()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
+s:drop()
+---
+...
diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
new file mode 100644
index 00000000..08570b28
--- /dev/null
+++ b/test/replication/wal_rw_stress.test.lua
@@ -0,0 +1,51 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3893: Replication failure: relay may report that an xlog
+-- is corrupted if it it currently being written to.
+--
+s = box.schema.space.create('test')
+_ = s:create_index('primary')
+
+-- Deploy a replica.
+box.schema.user.grant('guest', 'replication')
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+
+-- Setup replica => master channel.
+box.cfg{replication = test_run:cmd("eval replica 'return box.cfg.listen'")}
+
+-- Disable master => replica channel.
+test_run:cmd("switch replica")
+replication = box.cfg.replication
+box.cfg{replication = {}}
+test_run:cmd("switch default")
+
+-- Write some xlogs on the master.
+test_run:cmd("setopt delimiter ';'")
+for i = 1, 100 do
+    box.begin()
+    for j = 1, 100 do
+        s:replace{1, require('digest').urandom(1000)}
+    end
+    box.commit()
+end;
+test_run:cmd("setopt delimiter ''");
+
+-- Enable master => replica channel and wait for the replica to catch up.
+-- The relay handling replica => master channel on the replica will read
+-- an xlog while the applier is writing to it. Although applier and relay
+-- are running in different threads, there shouldn't be any rw errors.
+test_run:cmd("switch replica")
+box.cfg{replication = replication}
+box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:cmd("switch default")
+
+-- Cleanup.
+box.cfg{replication = {}}
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+test_run:cmd("delete server replica")
+test_run:cleanup_cluster()
+box.schema.user.revoke('guest', 'replication')
+s:drop()

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] xlog: fix fallocate vs read race
  2018-12-14 10:29 [PATCH] xlog: fix fallocate vs read race Vladimir Davydov
  2018-12-14 11:07 ` Konstantin Osipov
  2018-12-14 12:04 ` Vladimir Davydov
@ 2018-12-14 12:30 ` Vladimir Davydov
  2 siblings, 0 replies; 5+ messages in thread
From: Vladimir Davydov @ 2018-12-14 12:30 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

On Fri, Dec 14, 2018 at 01:29:44PM +0300, Vladimir Davydov wrote:
> posix_fallocate(), which is used for preallocating disk space for WAL
> files, increases the file size and fills the allocated space with zeros.
> The problem is a WAL file may be read by a relay thread at the same time
> it is written to. We try to handle the zeroed space in xlog_cursor (see
> xlog_cursor_next_tx()), however this turns out to be not enough, because
> transactions are written not atomically so it may occur that a writer
> writes half a transaction when a reader reads it. Without fallocate, the
> reader would stop at EOF until the rest of the transaction is written,
> but with fallocate it reads zeroes instead and thinks that the xlog file
> is corrupted while actually it is not.
> 
> Fix this issue by using fallocate() with FALLOC_FL_KEEP_SIZE flag
> instead of posix_fallocate(). With the flag fallocate() won't increase
> the file size, it will only allocate disk space beyond EOF.
> 
> The test will be added shortly.
> 
> Closes #3883

Pushed to 2.1 and 1.10.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-12-14 12:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-14 10:29 [PATCH] xlog: fix fallocate vs read race Vladimir Davydov
2018-12-14 11:07 ` Konstantin Osipov
2018-12-14 11:12   ` Vladimir Davydov
2018-12-14 12:04 ` Vladimir Davydov
2018-12-14 12:30 ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox