[Tarantool-patches] [PATCH v3] wal: introduce limits on simultaneous writes

Wed Mar 3 23:59:02 MSK 2021

>>>>> @@ -159,6 +276,13 @@ journal_write(struct journal_entry *entry)
>>>>>    static inline int
>>>>>    journal_write_async(struct journal_entry *entry)
>>>>>    {
>>>>> +    /*
>>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>>> +     * to submitting the request.
>>>>> +     */
>>>>> +    assert(!journal_queue_is_full() || journal_queue.is_ready);
>>>>> +    journal_queue_on_append(entry);
>>>> 8. Probably must assert that waiters list is empty. Otherwise you
>>>> could go out of order.
>>> It's not empty by the time the first entry gets to 'journal_write_async'.
>>> Everyone else is waken up, but not yet removed from the queue.
>>>
>>> Looks like we cannot determine whether a write is called after waiting in queue
>>> or not.
>> It bothers me a lot, the rest looks good. We don't have any protection against
>> a reorder, even not an assertion. How about this? (I didn't test):
> 
> Thanks for thinking this through!
> Yes, indeed, this is a problem. Unfortunately, your solution will allow reordering.
> We needed the waiter count to prevent anyone from sliding in front when the qeueue
> is already waken, but the fibers are not yet scheduled.
> 
> We also discussed that a reorder may happen if applier tries to apply
> a vinyl tx, which may yield after it has waited in queue.
> In this case, if some tx is committed (synchronously) and added to the queue
> while the vinyl tx sleeps, the vinyl tx will bypass that synchronous tx in queue.
> 
> So, there shouldn't be any yields between journal_queue_wait() and
> journal_write(_async)().
> 
> We cannot yield in memtx transactions, so the only solution is to put
> journal_queue_wait() after txn_prepare() inside txn_commit_async().
> 
> Take a look at this patch (on top of the branch):
> I decided to move journal_queue_wait() to journal_write_async() to be
> consistent with journal_write(), which also checks queue.
> 
> Now reordering is not a problem since waiting in queue is done right before
> the write request is submitted. No yields in between.
> 
> Also, I renamed journal_write_async() to journal_write_try_async() and
> txn_commit_async() to txn_commit_try_async() to indicate that they may
> yield occasionally now.
> 
> What do you think of this approach?

Looks good.

> P.S. there's another thing I wanted to discuss: should we set defaults for
> wal_queue_max_size(len) to some finite value? If yes, then what value should
> it be? wal_queue_max_size=default memtx_memory (256 Mb)?
> Don't know which value to choose for wal_queue_max_len at all.

Probably we could try something sanely big. For instance, max_len = 1000000
in assumption that you might replicate a million rows per second, and a WAL
write is likely to end faster than a second. But if you still have a queue
bigger than million, then it looks wrong. On the other hand, should give some
bit of protection against OOM assuming the transactions aren't big.

For max_size we could be a bit more brave than 256MB. Perhaps 5GB? 10GB? Just
to have some sane limit up to which it is still imaginable that the instance
really has that much memory. Better than infinity.

Don't know. Also can leave unlimited, or ask Mons for good defaults since he
has experience of configuring Tarantool in prod.

See 2 comments below.

> diff --git a/test/replication/gh-5536-wal-limit.result b/test/replication/gh-5536-wal-limit.result
> new file mode 100644
> index 000000000..f7799baa8
> --- /dev/null
> +++ b/test/replication/gh-5536-wal-limit.result
> @@ -0,0 +1,132 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +
> +--
> +-- gh-5536: out of memory on a joining replica. Introduce a WAL queue limit so
> +-- that appliers stop reading new transactions from master once the queue is
> +-- full.
> +--
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +_ = box.schema.space.create('test')
> + | ---
> + | ...
> +_ = box.space.test:create_index('pk')
> + | ---
> + | ...
> +
> +replication_timeout = box.cfg.replication_timeout

1. Need to save the WAL limit in case it won't be defaulted
to 0 in future. To properly restore it in the end of the test.

> + | ---
> + | ...
> +box.cfg{replication_timeout=1000}
> + | ---
> + | ...=
2. Would be also good to have at least one test for WAL row count
limit, not only for the byte size.