From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v3] wal: introduce limits on simultaneous writes
Date: Tue, 9 Mar 2021 18:10:48 +0300
Message-ID: <6977fff4-4781-c7b6-b310-0f15433254d9@tarantool.org>
In-Reply-To: <8b33b97e-fb7b-6ae5-9007-3f82e3cb49ea@tarantool.org>

On 03.03.2021 23:59, Vladislav Shpilevoy wrote:

>>>>>> @@ -159,6 +276,13 @@ journal_write(struct journal_entry *entry)
>>>>>>  static inline int
>>>>>>  journal_write_async(struct journal_entry *entry)
>>>>>>  {
>>>>>> +	/*
>>>>>> +	 * It's the job of the caller to check whether the queue is full prior
>>>>>> +	 * to submitting the request.
>>>>>> +	 */
>>>>>> +	assert(!journal_queue_is_full() || journal_queue.is_ready);
>>>>>> +	journal_queue_on_append(entry);
>>>>> 8. Probably must assert that the waiters list is empty. Otherwise you
>>>>> could go out of order.
>>>> It's not empty by the time the first entry gets to 'journal_write_async'.
>>>> Everyone else is woken up, but not yet removed from the queue.
>>>>
>>>> Looks like we cannot determine whether a write is called after waiting in queue
>>>> or not.
>>> It bothers me a lot, the rest looks good. We don't have any protection against
>>> a reorder, not even an assertion. How about this? (I didn't test):
>> Thanks for thinking this through!
>> Yes, indeed, this is a problem. Unfortunately, your solution will allow reordering.
>> We needed the waiter count to prevent anyone from sliding in front when the queue
>> is already woken, but the fibers are not yet scheduled.
>>
>> We also discussed that a reorder may happen if the applier tries to apply
>> a vinyl tx, which may yield after it has waited in queue.
>> In this case, if some tx is committed (synchronously) and added to the queue
>> while the vinyl tx sleeps, the vinyl tx will bypass that synchronous tx in the queue.
>>
>> So, there shouldn't be any yields between journal_queue_wait() and
>> journal_write(_async)().
>>
>> We cannot yield in memtx transactions, so the only solution is to put
>> journal_queue_wait() after txn_prepare() inside txn_commit_async().
>>
>> Take a look at this patch (on top of the branch):
>> I decided to move journal_queue_wait() to journal_write_async() to be
>> consistent with journal_write(), which also checks the queue.
>>
>> Now reordering is not a problem since waiting in the queue is done right before
>> the write request is submitted. No yields in between.
>>
>> Also, I renamed journal_write_async() to journal_write_try_async() and
>> txn_commit_async() to txn_commit_try_async() to indicate that they may
>> yield occasionally now.
>>
>> What do you think of this approach?
> Looks good.

Ok, squashed.

>> P.S. There's another thing I wanted to discuss: should we set the defaults for
>> wal_queue_max_size(len) to some finite value? If yes, then what value should
>> it be? wal_queue_max_size = default memtx_memory (256 Mb)?
>> Don't know which value to choose for wal_queue_max_len at all.
> Probably we could try something sanely big. For instance, max_len = 1000000
> on the assumption that you might replicate a million rows per second, and a WAL
> write is likely to end faster than a second. But if you still have a queue
> bigger than a million, then it looks wrong. On the other hand, it should give
> some bit of protection against OOM assuming the transactions aren't big.
>
> For max_size we could be a bit braver than 256MB. Perhaps 5GB? 10GB? Just
> to have some sane limit up to which it is still imaginable that the instance
> really has that much memory. Better than infinity.
>
> Don't know.
> Also can leave it unlimited, or ask Mons for good defaults since he
> has experience configuring Tarantool in prod.

Asked Mons. He suggests 10-100k transactions and a 10-50 Mb buffer.
Let's settle on 100k transactions and a 100 Mb buffer, simply because I
think bigger is better.

> See 2 comments below.
>
>> diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
>> new file mode 100644
>> index 000000000..f7799baa8
>> --- /dev/null
>> +++ b/test/replication/gh-5536-wal-limit.result
>> @@ -0,0 +1,132 @@
>> +-- test-run result file version 2
>> +test_run = require('test_run').new()
>> + | ---
>> + | ...
>> +fiber = require('fiber')
>> + | ---
>> + | ...
>> +
>> +--
>> +-- gh-5536: out of memory on a joining replica. Introduce a WAL queue limit so
>> +-- that appliers stop reading new transactions from master once the queue is
>> +-- full.
>> +--
>> +box.schema.user.grant('guest', 'replication')
>> + | ---
>> + | ...
>> +_ = box.schema.space.create('test')
>> + | ---
>> + | ...
>> +_ = box.space.test:create_index('pk')
>> + | ---
>> + | ...
>> +
>> +replication_timeout = box.cfg.replication_timeout
> 1. Need to save the WAL limit in case it won't be defaulted
> to 0 in future. To properly restore it at the end of the test.

Actually, no. The WAL limits are only set on the replica, which is
destroyed at the end of the test anyway.

>
>> + | ---
>> + | ...
>> +box.cfg{replication_timeout=1000}
>> + | ---
>> + | ...
> 2. Would also be good to have at least one test for the WAL row count
> limit, not only for the byte size.

Sure.
Here's the incremental diff (I've also added the changelog entry):

===============================

diff --git a/changelogs/unreleased/wal-queue-limit.md b/changelogs/unreleased/wal-queue-limit.md
new file mode 100644
index 000000000..393932456
--- /dev/null
+++ b/changelogs/unreleased/wal-queue-limit.md
@@ -0,0 +1,9 @@
+## feature/core
+
+* Introduce the concept of WAL queue and 2 new configuration options:
+  `wal_queue_max_len`, measured in transactions, with 100k default and
+  `wal_queue_max_size`, measured in bytes, with 100 Mb default.
+  The options help limit the pace at which replica submits new transactions
+  to WAL: the limits are checked every time a transaction from master is
+  submitted to replica's WAL, and the space taken by a transaction is
+  considered empty once it's successfully written (gh-5536).

diff --git a/src/box/journal.c b/src/box/journal.c
index 92a773684..4319f8b33 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -35,9 +35,9 @@ struct journal *current_journal = NULL;
 
 struct journal_queue journal_queue = {
-	.max_size = INT64_MAX,
+	.max_size = 100 * 1024 * 1024, /* 100 megabytes */
 	.size = 0,
-	.max_len = INT64_MAX,
+	.max_len = 100000, /* 100k journal entries */
 	.len = 0,
 	.waiters = RLIST_HEAD_INITIALIZER(journal_queue.waiters),
 	.waiter_count = 0,

diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index c11a9e103..c16ebd290 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -71,8 +71,8 @@ local default_cfg = {
     wal_mode            = "write",
     wal_max_size        = 256 * 1024 * 1024,
     wal_dir_rescan_delay= 2,
-    wal_queue_max_size  = 0,
-    wal_queue_max_len   = 0,
+    wal_queue_max_size  = 100 * 1024 * 1024,
+    wal_queue_max_len   = 100000,
     force_recovery      = false,
     replication         = nil,
     instance_uuid       = nil,

diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 7a224e50e..2964f21bb 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -56,8 +56,8 @@ wal_dir:.
 wal_dir_rescan_delay:2
 wal_max_size:268435456
 wal_mode:write
-wal_queue_max_len:0
-wal_queue_max_size:0
+wal_queue_max_len:100000
+wal_queue_max_size:104857600
 worker_pool_threads:4
 --
 -- Test insert from detached fiber

diff --git a/test/box/admin.result b/test/box/admin.result
index c818f4f9f..91b193d1e 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -134,9 +134,9 @@ cfg_filter(box.cfg)
 - - wal_mode
   - write
 - - wal_queue_max_len
-  - 0
+  - 100000
 - - wal_queue_max_size
-  - 0
+  - 104857600
 - - worker_pool_threads
   - 4
 ...

diff --git a/test/box/cfg.result b/test/box/cfg.result
index 19f322e7d..3adfbd9df 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -122,9 +122,9 @@ cfg_filter(box.cfg)
 | - - wal_mode
 |   - write
 | - - wal_queue_max_len
-|   - 0
+|   - 100000
 | - - wal_queue_max_size
-|   - 0
+|   - 104857600
 | - - worker_pool_threads
 |   - 4
 | ...
@@ -241,9 +241,9 @@ cfg_filter(box.cfg)
 | - - wal_mode
 |   - write
 | - - wal_queue_max_len
-|   - 0
+|   - 100000
 | - - wal_queue_max_size
-|   - 0
+|   - 104857600
 | - - worker_pool_threads
 |   - 4
 | ...

diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
index f7799baa8..2f6e7644b 100644
--- a/test/replication/gh-5536-wal-limit.result
+++ b/test/replication/gh-5536-wal-limit.result
@@ -42,9 +42,19 @@ test_run:switch('replica')
  | ---
  | - true
  | ...
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
+ | ---
+ | ...
 -- Huge replication timeout to not cause reconnects while applier is blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+ | ---
+ | ...
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+ | ---
+ | ...
+box.cfg(cfg_tbl)
  | ---
  | ...
 write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")

diff --git a/test/replication/gh-5536-wal-limit.test.lua b/test/replication/gh-5536-wal-limit.test.lua
index 1e7d61ff7..c32fbb08f 100644
--- a/test/replication/gh-5536-wal-limit.test.lua
+++ b/test/replication/gh-5536-wal-limit.test.lua
@@ -18,9 +18,13 @@ test_run:cmd('create server replica with rpl_master=default,\
 test_run:cmd('start server replica with wait=True, wait_load=True')
 test_run:switch('replica')
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
 -- Huge replication timeout to not cause reconnects while applier is blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+box.cfg(cfg_tbl)
 write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
 -- Block WAL writes so that we may test queue overflow.
 box.error.injection.set("ERRINJ_WAL_DELAY", true)

diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 7e7004592..e7f660e47 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -37,7 +37,10 @@
     "gh-4928-tx-boundaries.test.lua": {},
     "gh-5440-qsync-ro.test.lua": {},
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
-    "gh-5536-wal-limit.test.lua": {},
+    "gh-5536-wal-limit.test.lua": {
+        "wal_queue_max_len": {"cfg_option": "wal_queue_max_len"},
+        "wal_queue_max_size": {"cfg_option": "wal_queue_max_size"}
+    },
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}

-- 
Serge Petrenko
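[Editor's note] The queue semantics the thread converges on, that an entry occupies the queue from submission until its WAL write completes, that both a length limit and a byte-size limit gate new writers, and that a waiter count keeps newcomers from sliding in front of already-woken fibers, can be sketched as a toy model. This is illustrative Python, not Tarantool's C implementation: the field names (`max_len`, `max_size`, `waiter_count`) mirror the patch, but everything else is an assumption made for the sketch.

```python
class JournalQueue:
    """Toy model of the WAL queue limits discussed in the patch."""

    def __init__(self, max_len, max_size):
        self.max_len = max_len      # limit in entries (wal_queue_max_len)
        self.max_size = max_size    # limit in bytes (wal_queue_max_size)
        self.len = 0
        self.size = 0
        self.waiter_count = 0       # fibers parked in journal_queue_wait()

    def is_full(self):
        # Either limit alone is enough to make writers wait.
        return self.len >= self.max_len or self.size >= self.max_size

    def would_block(self):
        # A new writer must wait if the queue is full OR someone is already
        # waiting -- the waiter count preserves FIFO order even while woken
        # fibers have not been scheduled yet.
        return self.is_full() or self.waiter_count > 0

    def on_append(self, entry_size):
        # The caller must have waited until the queue had room before
        # submitting (cf. the assert in journal_write_async above).
        assert not self.is_full()
        self.len += 1
        self.size += entry_size

    def on_complete(self, entry_size):
        # Space is reclaimed only once the write has finished (gh-5536).
        self.len -= 1
        self.size -= entry_size


# With max_size=1, exactly one entry fits at a time -- the setting the
# test above uses to force queue overflow.
q = JournalQueue(max_len=100000, max_size=1)
q.on_append(10)
assert q.is_full()          # the next writer would have to wait
q.on_complete(10)
assert not q.would_block()  # space is freed once the write completed
```

The model also shows why the yield-free window matters: checking `would_block()` and calling `on_append()` must happen back to back, since any yield in between lets another writer slip into the queue first.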