To: Vladislav Shpilevoy, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
References: <20210224193549.70017-1-sergepetrenko@tarantool.org>
 <469ef629-699c-0180-facb-f166d6d3c073@tarantool.org>
 <17046c4d-b239-f3d8-f150-941026096b1e@tarantool.org>
 <97734227-410b-3e72-2ee7-30e7d6f63217@tarantool.org>
 <8b33b97e-fb7b-6ae5-9007-3f82e3cb49ea@tarantool.org>
Message-ID: <6977fff4-4781-c7b6-b310-0f15433254d9@tarantool.org>
Date: Tue, 9 Mar 2021 18:10:48 +0300
In-Reply-To: <8b33b97e-fb7b-6ae5-9007-3f82e3cb49ea@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v3] wal: introduce limits on simultaneous writes
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko

03.03.2021 23:59, Vladislav Shpilevoy wrote:
>>>>>> @@ -159,6 +276,13 @@ journal_write(struct journal_entry *entry)
>>>>>>  static inline int
>>>>>>  journal_write_async(struct journal_entry *entry)
>>>>>>  {
>>>>>> +    /*
>>>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>>>> +     * to submitting the request.
>>>>>> +     */
>>>>>> +    assert(!journal_queue_is_full() || journal_queue.is_ready);
>>>>>> +    journal_queue_on_append(entry);
>>>>> 8. Probably must assert that the waiters list is empty. Otherwise you
>>>>> could go out of order.
>>>> It's not empty by the time the first entry gets to journal_write_async().
>>>> Everyone else has been woken up, but not yet removed from the queue.
>>>>
>>>> Looks like we cannot determine whether a write is called after waiting
>>>> in the queue or not.
>>> It bothers me a lot, the rest looks good. We don't have any protection
>>> against a reorder, not even an assertion. How about this? (I didn't test):
>> Thanks for thinking this through!
>> Yes, indeed, this is a problem. Unfortunately, your solution will still
>> allow reordering. We needed the waiter count to prevent anyone from sliding
>> in front when the queue is already woken, but the fibers are not yet
>> scheduled.
>>
>> We also discussed that a reorder may happen if an applier tries to apply
>> a vinyl tx, which may yield after it has waited in the queue.
>> In this case, if some tx is committed (synchronously) and added to the
>> queue while the vinyl tx sleeps, the vinyl tx will bypass that synchronous
>> tx in the queue.
>>
>> So there shouldn't be any yields between journal_queue_wait() and
>> journal_write(_async)().
>>
>> We cannot yield in memtx transactions, so the only solution is to put
>> journal_queue_wait() after txn_prepare() inside txn_commit_async().
>>
>> Take a look at this patch (on top of the branch):
>> I decided to move journal_queue_wait() into journal_write_async() to be
>> consistent with journal_write(), which also checks the queue.
>>
>> Now reordering is not a problem, since waiting in the queue is done right
>> before the write request is submitted. No yields in between.
>>
>> Also, I renamed journal_write_async() to journal_write_try_async() and
>> txn_commit_async() to txn_commit_try_async() to indicate that they may
>> yield occasionally now.
>>
>> What do you think of this approach?
> Looks good.

Ok, squashed.
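For the record, after squashing, the write path has roughly this shape (a
paraphrase of the patched function, not the literal code; journal_queue_wait()
is the only call that may yield, and nothing can slip in between it and the
submission):

    static inline int
    journal_write_try_async(struct journal_entry *entry)
    {
        /* May yield while the queue is full, hence "try_async". */
        journal_queue_wait();
        /*
         * Once the queue has room, account the entry and submit it
         * immediately. No yields in between, so no reordering.
         */
        journal_queue_on_append(entry);
        return current_journal->write_async(current_journal, entry);
    }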
>> P.S. There's another thing I wanted to discuss: should we set the defaults
>> for wal_queue_max_size(len) to some finite value? If yes, then what value
>> should it be? wal_queue_max_size = default memtx_memory (256 MB)?
>> I don't know which value to choose for wal_queue_max_len at all.
> Probably we could try something sanely big. For instance, max_len = 1000000,
> on the assumption that you might replicate a million rows per second, and a
> WAL write is likely to finish in under a second. But if you still have a
> queue bigger than a million, then something looks wrong. On the other hand,
> it should give some protection against OOM, assuming the transactions
> aren't big.
>
> For max_size we could be a bit braver than 256 MB. Perhaps 5 GB? 10 GB?
> Just to have some sane limit up to which it is still imaginable that the
> instance really has that much memory. Better than infinity.
>
> Don't know. We could also leave it unlimited, or ask Mons for good
> defaults, since he has experience configuring Tarantool in production.

I asked Mons. He suggests 10-100k transactions and a 10-50 MB buffer. Let's
settle on 100k transactions and a 100 MB buffer, simply because I think
bigger is better.

> See 2 comments below.
>
>> diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
>> new file mode 100644
>> index 000000000..f7799baa8
>> --- /dev/null
>> +++ b/test/replication/gh-5536-wal-limit.result
>> @@ -0,0 +1,132 @@
>> +-- test-run result file version 2
>> +test_run = require('test_run').new()
>> + | ---
>> + | ...
>> +fiber = require('fiber')
>> + | ---
>> + | ...
>> +
>> +--
>> +-- gh-5536: out of memory on a joining replica. Introduce a WAL queue limit so
>> +-- that appliers stop reading new transactions from master once the queue is
>> +-- full.
>> +--
>> +box.schema.user.grant('guest', 'replication')
>> + | ---
>> + | ...
>> +_ = box.schema.space.create('test')
>> + | ---
>> + | ...
>> +_ = box.space.test:create_index('pk')
>> + | ---
>> + | ...
>> +
>> +replication_timeout = box.cfg.replication_timeout
> 1. Need to save the WAL limit in case it won't default to 0 in the
> future, to properly restore it at the end of the test.

Actually, no. The WAL limits are only set on the replica, which is destroyed
at the end of the test anyway.

>
>> + | ---
>> + | ...
>> +box.cfg{replication_timeout=1000}
>> + | ---
>> + | ...

> 2. Would also be good to have at least one test for the WAL row count
> limit, not only for the byte size.

Sure.
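By the way, with this in place the limits are ordinary box.cfg options, so
overriding them on a replica is a one-liner. A usage sketch (the values shown
are simply the new defaults; previously both options defaulted to 0, i.e. no
limit):

    box.cfg{
        -- At most 100k transactions may wait in the WAL queue.
        wal_queue_max_len = 100000,
        -- At most 100 MB of not-yet-written data may wait in the queue.
        wal_queue_max_size = 100 * 1024 * 1024,
    }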
Here's the incremental diff (I've also added a changelog entry):

===============================

diff --git a/changelogs/unreleased/wal-queue-limit.md b/changelogs/unreleased/wal-queue-limit.md
new file mode 100644
index 000000000..393932456
--- /dev/null
+++ b/changelogs/unreleased/wal-queue-limit.md
@@ -0,0 +1,9 @@
+## feature/core
+
+* Introduce the concept of a WAL queue and 2 new configuration options:
+  `wal_queue_max_len`, measured in transactions, with a 100k default, and
+  `wal_queue_max_size`, measured in bytes, with a 100 MB default.
+  The options help limit the pace at which a replica submits new transactions
+  to the WAL: the limits are checked every time a transaction from the master
+  is submitted to the replica's WAL, and the space taken by a transaction is
+  freed once it is successfully written (gh-5536).

diff --git a/src/box/journal.c b/src/box/journal.c
index 92a773684..4319f8b33 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -35,9 +35,9 @@
 struct journal *current_journal = NULL;
 
 struct journal_queue journal_queue = {
-    .max_size = INT64_MAX,
+    .max_size = 100 * 1024 * 1024, /* 100 megabytes */
     .size = 0,
-    .max_len = INT64_MAX,
+    .max_len = 100000, /* 100k journal entries */
     .len = 0,
     .waiters = RLIST_HEAD_INITIALIZER(journal_queue.waiters),
     .waiter_count = 0,

diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index c11a9e103..c16ebd290 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -71,8 +71,8 @@ local default_cfg = {
     wal_mode            = "write",
     wal_max_size        = 256 * 1024 * 1024,
     wal_dir_rescan_delay= 2,
-    wal_queue_max_size  = 0,
-    wal_queue_max_len   = 0,
+    wal_queue_max_size  = 100 * 1024 * 1024,
+    wal_queue_max_len   = 100000,
     force_recovery      = false,
     replication         = nil,
     instance_uuid       = nil,

diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 7a224e50e..2964f21bb 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -56,8 +56,8 @@ wal_dir:.
 wal_dir_rescan_delay:2
 wal_max_size:268435456
 wal_mode:write
-wal_queue_max_len:0
-wal_queue_max_size:0
+wal_queue_max_len:100000
+wal_queue_max_size:104857600
 worker_pool_threads:4
 --
 -- Test insert from detached fiber

diff --git a/test/box/admin.result b/test/box/admin.result
index c818f4f9f..91b193d1e 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -134,9 +134,9 @@ cfg_filter(box.cfg)
   - - wal_mode
     - write
   - - wal_queue_max_len
-    - 0
+    - 100000
   - - wal_queue_max_size
-    - 0
+    - 104857600
   - - worker_pool_threads
     - 4
 ...

diff --git a/test/box/cfg.result b/test/box/cfg.result
index 19f322e7d..3adfbd9df 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -122,9 +122,9 @@ cfg_filter(box.cfg)
 |   - - wal_mode
 |     - write
 |   - - wal_queue_max_len
-|     - 0
+|     - 100000
 |   - - wal_queue_max_size
-|     - 0
+|     - 104857600
 |   - - worker_pool_threads
 |     - 4
 | ...
@@ -241,9 +241,9 @@ cfg_filter(box.cfg)
 |   - - wal_mode
 |     - write
 |   - - wal_queue_max_len
-|     - 0
+|     - 100000
 |   - - wal_queue_max_size
-|     - 0
+|     - 104857600
 |   - - worker_pool_threads
 |     - 4
 | ...

diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
index f7799baa8..2f6e7644b 100644
--- a/test/replication/gh-5536-wal-limit.result
+++ b/test/replication/gh-5536-wal-limit.result
@@ -42,9 +42,19 @@ test_run:switch('replica')
  | ---
  | - true
  | ...
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
+ | ---
+ | ...
 -- Huge replication timeout to not cause reconnects while applier is blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+ | ---
+ | ...
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+ | ---
+ | ...
+box.cfg(cfg_tbl)
  | ---
  | ...
 write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")

diff --git a/test/replication/gh-5536-wal-limit.test.lua b/test/replication/gh-5536-wal-limit.test.lua
index 1e7d61ff7..c32fbb08f 100644
--- a/test/replication/gh-5536-wal-limit.test.lua
+++ b/test/replication/gh-5536-wal-limit.test.lua
@@ -18,9 +18,13 @@ test_run:cmd('create server replica with rpl_master=default,\
 test_run:cmd('start server replica with wait=True, wait_load=True')
 
 test_run:switch('replica')
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
 -- Huge replication timeout to not cause reconnects while applier is blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+box.cfg(cfg_tbl)
 write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
 -- Block WAL writes so that we may test queue overflow.
 box.error.injection.set("ERRINJ_WAL_DELAY", true)

diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 7e7004592..e7f660e47 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -37,7 +37,10 @@
     "gh-4928-tx-boundaries.test.lua": {},
     "gh-5440-qsync-ro.test.lua": {},
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
-    "gh-5536-wal-limit.test.lua": {},
+    "gh-5536-wal-limit.test.lua": {
+        "wal_queue_max_len":  {"cfg_option": "wal_queue_max_len"},
+        "wal_queue_max_size": {"cfg_option": "wal_queue_max_size"}
+    },
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}

-- 
Serge Petrenko