[Tarantool-patches] [PATCH v3] wal: introduce limits on simultaneous writes

Tue Mar 9 18:10:48 MSK 2021

03.03.2021 23:59, Vladislav Shpilevoy пишет:
>>>>>> @@ -159,6 +276,13 @@ journal_write(struct journal_entry *entry)
>>>>>>     static inline int
>>>>>>     journal_write_async(struct journal_entry *entry)
>>>>>>     {
>>>>>> +    /*
>>>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>>>> +     * to submitting the request.
>>>>>> +     */
>>>>>> +    assert(!journal_queue_is_full() || journal_queue.is_ready);
>>>>>> +    journal_queue_on_append(entry);
>>>>> 8. Probably must assert that waiters list is empty. Otherwise you
>>>>> could go out of order.
>>>> It's not empty by the time the first entry gets to 'journal_write_async'.
>>>> Everyone else is waken up, but not yet removed from the queue.
>>>>
>>>> Looks like we cannot determine whether a write is called after waiting in queue
>>>> or not.
>>> It bothers me a lot, the rest looks good. We don't have any protection against
>>> a reorder, even not an assertion. How about this? (I didn't test):
>> Thanks for thinking this through!
>> Yes, indeed, this is a problem. Unfortunately, your solution will allow reordering.
>> We needed the waiter count to prevent anyone from sliding in front when the qeueue
>> is already waken, but the fibers are not yet scheduled.
>>
>> We also discussed that a reorder may happen if applier tries to apply
>> a vinyl tx, which may yield after it has waited in queue.
>> In this case, if some tx is committed (synchronously) and added to the queue
>> while the vinyl tx sleeps, the vinyl tx will bypass that synchronous tx in queue.
>>
>> So, there shouldn't be any yields between journal_queue_wait() and
>> journal_write(_async)().
>>
>> We cannot yield in memtx transactions, so the only solution is to put
>> journal_queue_wait() after txn_prepare() inside txn_commit_async().
>>
>> Take a look at this patch (on top of the branch):
>> I decided to move journal_queue_wait() to journal_write_async() to be
>> consistent with journal_write(), which also checks queue.
>>
>> Now reordering is not a problem since waiting in queue is done right before
>> the write request is submitted. No yields in between.
>>
>> Also, I renamed journal_write_async() to journal_write_try_async() and
>> txn_commit_async() to txn_commit_try_async() to indicate that they may
>> yield occasionally now.
>>
>> What do you think of this approach?
> Looks good.

Ok,  squashed.

>
>> P.S. there's another thing I wanted to discuss: should we set defaults for
>> wal_queue_max_size(len) to some finite value? If yes, then what value should
>> it be? wal_queue_max_size=default memtx_memory (256 Mb)?
>> Don't know which value to choose for wal_queue_max_len at all.
> Probably we could try something sanely big. For instance, max_len = 1000000
> in assumption that you might replicate a million rows per second, and a WAL
> write is likely to end faster than a second. But if you still have a queue
> bigger than million, then it looks wrong. On the other hand, should give some
> bit of protection against OOM assuming the transactions aren't big.
>
> For max_size we could be a bit more brave than 256MB. Perhaps 5GB? 10GB? Just
> to have some sane limit up to which it is still imaginable that the instance
> really has that much memory. Better than infinity.
>
> Don't know. Also can leave unlimited, or ask Mons for good defaults since he
> has experience of configuring Tarantool in prod.

Asked Mons. He suggests 10-100k transactions and 10-50 Mb buffer.
Let's settle with 100k transactions and 100Mb buffer, just cause I think
bigger is better.

>
> See 2 comments below.
>
>> diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
>> new file mode 100644
>> index 000000000..f7799baa8
>> --- /dev/null
>> +++ b/test/replication/gh-5536-wal-limit.result
>> @@ -0,0 +1,132 @@
>> +-- test-run result file version 2
>> +test_run = require('test_run').new()
>> + | ---
>> + | ...
>> +fiber = require('fiber')
>> + | ---
>> + | ...
>> +
>> +--
>> +-- gh-5536: out of memory on a joining replica. Introduce a WAL queue limit so
>> +-- that appliers stop reading new transactions from master once the queue is
>> +-- full.
>> +--
>> +box.schema.user.grant('guest', 'replication')
>> + | ---
>> + | ...
>> +_ = box.schema.space.create('test')
>> + | ---
>> + | ...
>> +_ = box.space.test:create_index('pk')
>> + | ---
>> + | ...
>> +
>> +replication_timeout = box.cfg.replication_timeout
> 1. Need to save the WAL limit in case it won't be defaulted
> to 0 in future. To properly restore it in the end of the test.

Actually, no. WAL limits are only set on replica. Which's destroyed
in the end of the test anyway.
>
>> + | ---
>> + | ...
>> +box.cfg{replication_timeout=1000}
>> + | ---
>> + | ...=
> 2. Would be also good to have at least one test for WAL row count
> limit, not only for the byte size.

Sure. Here's the incremental diff (I've also added the changelog entry):

===============================

diff --git a/changelogs/unreleased/wal-queue-limit.md 
b/changelogs/unreleased/wal-queue-limit.md
new file mode 100644
index 000000000..393932456
--- /dev/null
+++ b/changelogs/unreleased/wal-queue-limit.md
@@ -0,0 +1,9 @@
+## feature/core
+
+* Introduce the concept of WAL queue and 2 new configuration options:
+  `wal_queue_max_len`, measured in transactions, with 100k default and
+  `wal_queue_max_size`, measured in bytes, with 100 Mb default.
+  The options help limit the pace at which replica submits new transactions
+  to WAL: the limits are checked every time a transaction from master is
+  submitted to replica's WAL, and the space taken by a transaction is
+  considered empty once it's successfully written (gh-5536).
diff --git a/src/box/journal.c b/src/box/journal.c
index 92a773684..4319f8b33 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -35,9 +35,9 @@
  struct journal *current_journal = NULL;

  struct journal_queue journal_queue = {
-    .max_size = INT64_MAX,
+    .max_size = 100 * 1024 * 1024, /* 100 megabytes */
      .size = 0,
-    .max_len = INT64_MAX,
+    .max_len = 100000,  /* 100k journal entries */
      .len = 0,
      .waiters = RLIST_HEAD_INITIALIZER(journal_queue.waiters),
      .waiter_count = 0,
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index c11a9e103..c16ebd290 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -71,8 +71,8 @@ local default_cfg = {
      wal_mode            = "write",
      wal_max_size        = 256 * 1024 * 1024,
      wal_dir_rescan_delay= 2,
-    wal_queue_max_size  = 0,
-    wal_queue_max_len   = 0,
+    wal_queue_max_size  = 100 * 1024 * 1024,
+    wal_queue_max_len   = 100000,
      force_recovery      = false,
      replication         = nil,
      instance_uuid       = nil,
diff --git a/test/app-tap/init_script.result 
b/test/app-tap/init_script.result
index 7a224e50e..2964f21bb 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -56,8 +56,8 @@ wal_dir:.
  wal_dir_rescan_delay:2
  wal_max_size:268435456
  wal_mode:write
-wal_queue_max_len:0
-wal_queue_max_size:0
+wal_queue_max_len:100000
+wal_queue_max_size:104857600
  worker_pool_threads:4
  --
  -- Test insert from detached fiber
diff --git a/test/box/admin.result b/test/box/admin.result
index c818f4f9f..91b193d1e 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -134,9 +134,9 @@ cfg_filter(box.cfg)
    - - wal_mode
      - write
    - - wal_queue_max_len
-    - 0
+    - 100000
    - - wal_queue_max_size
-    - 0
+    - 104857600
    - - worker_pool_threads
      - 4
  ...
diff --git a/test/box/cfg.result b/test/box/cfg.result
index 19f322e7d..3adfbd9df 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -122,9 +122,9 @@ cfg_filter(box.cfg)
   |   - - wal_mode
   |     - write
   |   - - wal_queue_max_len
- |     - 0
+ |     - 100000
   |   - - wal_queue_max_size
- |     - 0
+ |     - 104857600
   |   - - worker_pool_threads
   |     - 4
   | ...
@@ -241,9 +241,9 @@ cfg_filter(box.cfg)
   |   - - wal_mode
   |     - write
   |   - - wal_queue_max_len
- |     - 0
+ |     - 100000
   |   - - wal_queue_max_size
- |     - 0
+ |     - 104857600
   |   - - worker_pool_threads
   |     - 4
   | ...
diff --git a/test/replication/gh-5536-wal-limit.result 
b/test/replication/gh-5536-wal-limit.result
index f7799baa8..2f6e7644b 100644
--- a/test/replication/gh-5536-wal-limit.result
+++ b/test/replication/gh-5536-wal-limit.result
@@ -42,9 +42,19 @@ test_run:switch('replica')
   | ---
   | - true
   | ...
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
+ | ---
+ | ...
  -- Huge replication timeout to not cause reconnects while applier is 
blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+ | ---
+ | ...
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+ | ---
+ | ...
+box.cfg(cfg_tbl)
   | ---
   | ...
  write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
diff --git a/test/replication/gh-5536-wal-limit.test.lua 
b/test/replication/gh-5536-wal-limit.test.lua
index 1e7d61ff7..c32fbb08f 100644
--- a/test/replication/gh-5536-wal-limit.test.lua
+++ b/test/replication/gh-5536-wal-limit.test.lua
@@ -18,9 +18,13 @@ test_run:cmd('create server replica with 
rpl_master=default,\
  test_run:cmd('start server replica with wait=True, wait_load=True')

  test_run:switch('replica')
+
+queue_cfg_opt = test_run:get_cfg('cfg_option')
  -- Huge replication timeout to not cause reconnects while applier is 
blocked.
--- Tiny queue size (in bytes) to allow exactly one queue entry at a time.
-box.cfg{wal_queue_max_size=1, replication_timeout=1000}
+cfg_tbl = {replication_timeout=1000}
+-- Tiny queue size to allow exactly one queue entry at a time.
+cfg_tbl[queue_cfg_opt] = 1
+box.cfg(cfg_tbl)
  write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
  -- Block WAL writes so that we may test queue overflow.
  box.error.injection.set("ERRINJ_WAL_DELAY", true)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 7e7004592..e7f660e47 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -37,7 +37,10 @@
      "gh-4928-tx-boundaries.test.lua": {},
      "gh-5440-qsync-ro.test.lua": {},
      "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
-    "gh-5536-wal-limit.test.lua": {},
+    "gh-5536-wal-limit.test.lua": {
+        "wal_queue_max_len":  {"cfg_option": "wal_queue_max_len"},
+        "wal_queue_max_size": {"cfg_option": "wal_queue_max_size"}
+    },
      "*": {
          "memtx": {"engine": "memtx"},
          "vinyl": {"engine": "vinyl"}


-- 
Serge Petrenko