From: Vladimir Davydov <vdavydov.dev@gmail.com> To: kostja@tarantool.org Cc: tarantool-patches@freelists.org Subject: [PATCH] xlog: fix fallocate vs read race Date: Fri, 14 Dec 2018 13:29:44 +0300 [thread overview] Message-ID: <8548a4bd8439a1e4a7f78ff37216c170c61a33c3.1544783335.git.vdavydov.dev@gmail.com> (raw) posix_fallocate(), which is used for preallocating disk space for WAL files, increases the file size and fills the allocated space with zeros. The problem is a WAL file may be read by a relay thread at the same time it is written to. We try to handle the zeroed space in xlog_cursor (see xlog_cursor_next_tx()), however this turns out to be not enough, because transactions are written not atomically so it may occur that a writer writes half a transaction when a reader reads it. Without fallocate, the reader would stop at EOF until the rest of the transaction is written, but with fallocate it reads zeroes instead and thinks that the xlog file is corrupted while actually it is not. Fix this issue by using fallocate() with FALLOC_FL_KEEP_SIZE flag instead of posix_fallocate(). With the flag fallocate() won't increase the file size, it will only allocate disk space beyond EOF. The test will be added shortly. Closes #3883 --- https://github.com/tarantool/tarantool/issues/3883 https://github.com/tarantool/tarantool/tree/dv/gh-3883-xlog-fallocate-fix CMakeLists.txt | 2 +- src/box/xlog.c | 32 +++++++++++++++++++------------- src/trivia/config.h.cmake | 2 +- 3 files changed, 21 insertions(+), 15 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 2b43e0c6..c2606de2 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -85,7 +85,7 @@ check_symbol_exists(fdatasync unistd.h HAVE_FDATASYNC) check_symbol_exists(pthread_yield pthread.h HAVE_PTHREAD_YIELD) check_symbol_exists(sched_yield sched.h HAVE_SCHED_YIELD) check_symbol_exists(posix_fadvise fcntl.h HAVE_POSIX_FADVISE) -check_symbol_exists(posix_fallocate fcntl.h HAVE_POSIX_FALLOCATE) +check_symbol_exists(fallocate fcntl.h HAVE_FALLOCATE) check_symbol_exists(mremap sys/mman.h HAVE_MREMAP) check_function_exists(sync_file_range HAVE_SYNC_FILE_RANGE) diff --git a/src/box/xlog.c b/src/box/xlog.c index 191fadcd..280c036f 100644 --- a/src/box/xlog.c +++ b/src/box/xlog.c @@ -996,10 +996,25 @@ xdir_create_xlog(struct xdir *dir, struct xlog *xlog, ssize_t xlog_fallocate(struct xlog *log, size_t len) { -#ifdef HAVE_POSIX_FALLOCATE - int rc = posix_fallocate(log->fd, log->offset + log->allocated, len); +#ifdef HAVE_FALLOCATE + static bool fallocate_not_supported = false; + if (fallocate_not_supported) + return 0; + /* + * Keep the file size, because it is used to sync + * concurrent readers vs the writer: xlog_cursor + * assumes that everything written before EOF is + * valid data. + */ + int rc = fallocate(log->fd, FALLOC_FL_KEEP_SIZE, + log->offset + log->allocated, len); if (rc != 0) { - errno = rc; + if (errno == ENOSYS || errno == EOPNOTSUPP) { + say_warn("fallocate is not supported, " + "proceeding without it"); + fallocate_not_supported = true; + return 0; + } diag_set(SystemError, "%s: can't allocate disk space", log->filename); return -1; @@ -1010,7 +1025,7 @@ xlog_fallocate(struct xlog *log, size_t len) (void)log; (void)len; return 0; -#endif /* HAVE_POSIX_FALLOCATE */ +#endif /* HAVE_FALLOCATE */ } /** @@ -1832,15 +1847,6 @@ xlog_cursor_next_tx(struct xlog_cursor *i) return -1; if (rc > 0) return 1; - if (load_u32(i->rbuf.rpos) == 0) { - /* - * Space preallocated with xlog_fallocate(). - * Treat as eof and clear the buffer. - */ - i->read_offset -= ibuf_used(&i->rbuf); - ibuf_reset(&i->rbuf); - return 1; - } if (load_u32(i->rbuf.rpos) == eof_marker) { /* eof marker found */ goto eof_found; diff --git a/src/trivia/config.h.cmake b/src/trivia/config.h.cmake index 53eae2fe..f5c2e2c5 100644 --- a/src/trivia/config.h.cmake +++ b/src/trivia/config.h.cmake @@ -166,7 +166,7 @@ #cmakedefine HAVE_PTHREAD_YIELD 1 #cmakedefine HAVE_SCHED_YIELD 1 #cmakedefine HAVE_POSIX_FADVISE 1 -#cmakedefine HAVE_POSIX_FALLOCATE 1 +#cmakedefine HAVE_FALLOCATE 1 #cmakedefine HAVE_MREMAP 1 #cmakedefine HAVE_SYNC_FILE_RANGE 1 -- 2.11.0
next reply other threads:[~2018-12-14 10:29 UTC|newest] Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-12-14 10:29 Vladimir Davydov [this message] 2018-12-14 11:07 ` Konstantin Osipov 2018-12-14 11:12 ` Vladimir Davydov 2018-12-14 12:04 ` Vladimir Davydov 2018-12-14 12:30 ` Vladimir Davydov
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=8548a4bd8439a1e4a7f78ff37216c170c61a33c3.1544783335.git.vdavydov.dev@gmail.com \ --to=vdavydov.dev@gmail.com \ --cc=kostja@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='Re: [PATCH] xlog: fix fallocate vs read race' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox