Tarantool development patches archive
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2] wal: introduce limits on simultaneous writes
Date: Wed, 24 Feb 2021 22:32:00 +0300	[thread overview]
Message-ID: <2e8b0be7-52ba-70b6-d1d9-cdf976106a2d@tarantool.org> (raw)
In-Reply-To: <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org>



24.02.2021 01:19, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patch!

Thanks for the review!
Please find my answers below and v3 of the patch in your inbox.

>
>>>> diff --git a/src/box/journal.c b/src/box/journal.c
>>>> index cb320b557..49441e596 100644
>>>> --- a/src/box/journal.c
>>>> +++ b/src/box/journal.c
>>>> @@ -55,3 +55,66 @@ journal_entry_new(size_t n_rows, struct region *region,
>>>>                     complete_data);
>>>>        return entry;
>>>>    }
>>>> +
>>>> +struct journal_queue_entry {
>>>> +    /** The fiber waiting for queue space to free. */
>>>> +    struct fiber *fiber;
>>>> +    /** Whether the fiber should be woken up regardless of queue size. */
>>>> +    bool is_ready;
>>>> +    /** A link in all waiting fibers list. */
>>>> +    struct rlist in_queue;
>>>> +};
>>>> +
>>>> +/**
>>>> + * Wake up the next waiter in journal queue.
>>>> + */
>>>> +static inline void
>>>> +journal_queue_wakeup_next(struct rlist *link, bool force_ready)
>>> 1. The flag is known in all usage places at compilation time. Is it
>>> possible to split the function into force/normal versions? The same
>>> for journal_queue_wakeup() from which this runtime uncertainty arises.
>> Actually, the parameter is not known at compile time when wakeup_next()
>> is called from journal_wait_queue(). For now wakeup_next() only has a single
>> check for force_ready, so moving the check outside would only increase the
>> number of branches.
>>
>> journal_queue_wakeup() is called only once per a whole queue wakeup, so
>> I suppose it doesn't hurt much that it has a compile-time known parameter.
> Is it called once? Then why does it have `if (current_journal->queue_is_woken)`
> check?

I was trying to say 'it's called once per a bunch of wakeup_next() calls'.
Just ignore this, it's irrelevant.
Actually, no, it may be called multiple times, from every
journal_async_complete(). But it is a no-op for each subsequent call
except the first one (while the queue is being cleared).
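
The no-op behavior described above can be sketched in isolation. This is a stub purely for illustration (only the queue_is_woken flag mirrors the patch; the surrounding journal state is invented here):

```c
#include <stdbool.h>

/* Stub of the relevant journal state (illustrative only). */
struct journal_stub {
	/* Set once a wakeup chain is already running. */
	bool queue_is_woken;
	/* Counts how many wakeup chains actually started. */
	int wakeups_started;
};

static struct journal_stub current = {false, 0};

/*
 * Mirrors journal_queue_wakeup(): every call after the first is a
 * no-op until the chain finishes and queue_is_woken is cleared.
 */
void
journal_queue_wakeup_stub(void)
{
	if (current.queue_is_woken)
		return;
	current.queue_is_woken = true;
	current.wakeups_started++;
}

/* Called once the queue has been fully drained. */
void
journal_queue_drained_stub(void)
{
	current.queue_is_woken = false;
}
```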

>
>>>> +{
>>>> +    /* Empty queue or last entry in queue. */
>>>> +    if (link == rlist_last(&current_journal->waiters)) {
>>> 2. I am not sure I understand what is happening here. Why is this
>>> function in one place called with the pointer at the list itself,
>>> and in another place with the pointer at one element?
>> Well, <list head> -> next is the first list entry, right?
> Perhaps. TBH, I don't remember and when see such tricky things in
> the code, it takes time to understand it.
>
>> In queue_wakeup() I wake the first waiter up.
>>
>> Once any waiter gets woken up, it wakes up the next waiter.
>> Which is <in_queue> -> next.
>>
>> That's why I have a common helper for these two cases.
> Ok, I see now. But it seems you could make it simpler, right?
>
> ====================
> @@ -69,10 +69,10 @@ struct journal_queue_entry {
>    * Wake up the next waiter in journal queue.
>    */
>   static inline void
> -journal_queue_wakeup_next(struct rlist *link, bool force_ready)
> +journal_queue_wakeup_first(bool force_ready)
>   {
>   	/* Empty queue or last entry in queue. */
> -	if (link == rlist_last(&current_journal->waiters)) {
> +	if (rlist_empty(&current_journal->waiters)) {
>   		current_journal->queue_is_woken = false;
>   		return;
>   	}
> @@ -97,7 +97,7 @@ journal_queue_wakeup(bool force_ready)
>   	if (current_journal->queue_is_woken)
>   		return;
>   	current_journal->queue_is_woken = true;
> -	journal_queue_wakeup_next(&current_journal->waiters, force_ready);
> +	journal_queue_wakeup_first(force_ready);
>   }
>   
>   void
> @@ -114,7 +114,7 @@ journal_wait_queue(void)
>   	while (journal_queue_is_full() && !entry.is_ready)
>   		fiber_yield();
>   
> -	journal_queue_wakeup_next(&entry.in_queue, entry.is_ready);
>   	assert(&entry.in_queue == rlist_first(&current_journal->waiters));
>   	rlist_del(&entry.in_queue);
> +	journal_queue_wakeup_first(entry.is_ready);
>   }
> ====================
>
> (I didn't test it.)

Yes, indeed. Thanks!
Applied with minor changes.
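
For anyone puzzled by the rlist discussion above: in a circular intrusive list, the head's next pointer is the first entry and its prev pointer is the last one, so an empty list is one whose head points at itself. A simplified stand-in (not Tarantool's actual rlist implementation) shows the invariants:

```c
#include <stdbool.h>

/* A simplified circular intrusive list, standing in for rlist. */
struct node {
	struct node *prev, *next;
};

/* An empty list's head points at itself in both directions. */
static void
list_create(struct node *head)
{
	head->prev = head->next = head;
}

static bool
list_empty(const struct node *head)
{
	return head->next == head;
}

/* <head>->next is the first entry; <head>->prev is the last. */
static struct node *
list_first(struct node *head)
{
	return head->next;
}

static struct node *
list_last(struct node *head)
{
	return head->prev;
}

static void
list_add_tail(struct node *head, struct node *item)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void
list_del(struct node *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->prev = item->next = item;
}
```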

>>> 5. Can rlist_del be done along with fiber_wakeup()? Then you
>>> wouldn't need is_woken maybe.
>> Looks like it can't.
>> Say we have only one waiter. And remove it from the list on wakeup.
>> The list would become empty and there'd be no way to check whether
>> journal has any waiters, and we may reorder the entries (put new ones before
>> the waiting one). This is not necessarily bad, because I put entries into queue
>> before txn_begin(), but someone may call journal_wait_queue() from inside the
>> transaction, or right before txn_commit(). Then it might be bad to put other
>> transactions before this one.
> Order change is definitely not acceptable.
>
>> So while removing is_woken we would have to introduce a queue_has_waiters
>> flag for the sake of this single waiter.
> It would rather become a counter - number of waiters. Because there can
> be many. But yeah, I see the problem.
>
>>>> +}
>>>> +
>>>> +/**
>>>> + * Check whether anyone is waiting for the journal queue to empty. If there are
>>>> + * other waiters we must go after them to preserve write order.
>>>> + */
>>>> +static inline bool
>>>> +journal_queue_has_waiters(void)
>>>> +{
>>>> +    return !rlist_empty(&current_journal->waiters);
>>>> +}
>>>> +
>>>> +/** Yield until there's some space in the journal queue. */
>>>> +void
>>>> +journal_wait_queue(void);
>>>> +
>>>> +/** Set maximal journal queue size in bytes. */
>>>> +static inline void
>>>> +journal_queue_set_max_size(struct journal *j, int64_t size)
>>> 7. Why do we have journal parameter here, but don't have it in
>>> the other functions? The same journal_queue_set_max_len.
>> This is my attempt to make sure only wal_writer's journal has a queue.
>> I explicitly set queue_max_... parameters only for wal_writer's journal.
>> And then there's an assert that journal_queue_set_...() is only called with
>> the current journal.
> Or the assertion could be done in wal_set_queue_*() functions. To keep the
> journal API consistent.

Actually, struct journal has a ton of queue_* members now, so I'm following
your older advice and extracting everything related to queues into
struct journal_queue.
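
A hypothetical shape for that extraction. Field names here are assumptions inferred from the discussion, not the actual v3 layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for Tarantool's intrusive list head. */
struct rlist {
	struct rlist *prev, *next;
};

/*
 * All queue-related state gathered in one place instead of being
 * spread over struct journal (sketch only).
 */
struct journal_queue {
	/** Maximal total size of enqueued entries, in bytes. */
	int64_t max_size;
	/** Current approximate size of enqueued entries. */
	int64_t size;
	/** Maximal allowed number of enqueued entries. */
	int64_t max_len;
	/** Current number of enqueued entries. */
	int64_t len;
	/** Fibers waiting for queue space to free. */
	struct rlist waiters;
	/** Whether the queue is in the middle of a wakeup chain. */
	bool is_woken;
	/** Whether waiters may proceed regardless of queue size. */
	bool is_ready;
};
```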

>
> I just realized, journal can be easily unit-tested. It does not depend on
> anything except small/ and core/ libs. Although seems like a lot of work so
> maybe not now. Probably later, for something more complex and harder to test
> via functional tests. However if you would write tests now, it would be
> greatly appreciated.
>
>>>> +{
>>>> +    assert(j == current_journal);
>>>> +    j->queue_max_size = size;
>>>> +    if (journal_queue_has_waiters() && !journal_queue_is_full())
>>>> +        journal_queue_wakeup(false);
>>>> +}
>>>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>>>    static inline int
>>>>    journal_write_async(struct journal_entry *entry)
>>>>    {
>>>> +    /*
>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>> +     * to submitting the request.
>>> 8. Maybe add an assert though.
>> I wanted to, but it's impossible.
>> The queue may be full when all the waiters are forcefully woken up by a
>> synchronous commit. And it's hard to tell whether it was a "force" wakeup
>> or not. So let's just hope no one misuses this API.
> Yeah, I see now.
>
>> Or, even better, I can remove is_ready field from queue entries and add a new field
>> to the journal: queue_is_ready or something. In addition to queue_is_awake.
>> Then every entry will check queue_is_ready instead of entry.is_ready and
>> it'll be possible to add an assert here: !journal_queue_is_full || journal_queue_is_ready
>> Looks like this'll also allow us to extract queue_wakeup_(next)_force, like you suggested
>> in paragraph 1.
>> What do you think ?
> Sounds good, worth doing.

I introduced queue_is_ready and removed entry.is_ready.

The code looks cleaner now, and together with your suggestion regarding
journal_queue_wakeup_first() it doesn't have parameters at all.
It does have a check for queue_is_ready internally, but there's no point
in separating _force and normal versions. This would simply move the
check outside the function call.
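
The resulting wait condition can be condensed into a pure predicate. This is a sketch under assumed names; the real code parks the fiber with fiber_yield() in a loop over the same condition:

```c
#include <stdbool.h>
#include <stdint.h>

/* Journal-wide queue state relevant to the wait decision (stub). */
struct queue_state {
	int64_t size, max_size;
	/* Set on forced wakeup, e.g. by a synchronous commit. */
	bool is_ready;
};

/* Whether the queue holds more than it is configured to. */
static bool
queue_is_full(const struct queue_state *q)
{
	return q->size > q->max_size;
}

/*
 * Whether a waiter must keep sleeping: a full queue blocks it unless
 * the whole queue was force-woken (is_ready), in which case everyone
 * proceeds regardless of size.
 */
static bool
waiter_must_wait(const struct queue_state *q)
{
	return queue_is_full(q) && !q->is_ready;
}
```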

>
> See 2 comments below.
>
>>>> +     */
>>>> +    journal_queue_on_append(entry);
>>>> +
>>>>        return current_journal->write_async(current_journal, entry);
>>>>    }
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 553db76fc..7c2452d2b 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -967,6 +967,15 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>>   		goto success;
>>   	}
>>   
>> +	/*
>> +	 * Do not spam WAL with excess write requests, let it process what's
>> +	 * piled up first.
>> +	 * This is done before opening the transaction to avoid problems with
>> +	 * yielding inside it.
>> +	 */
>> +	if (journal_queue_is_full())
>> +		journal_wait_queue();
> 1. I just noticed you do the waiting before starting the
> transaction. In case of Vinyl the transaction can yield. So
> by the time you get to commit, the queue could be full.
>
> Don't know what to do with this. We can't wait before
> txn_commit_async() because it would kill the memtx transactions.
>
> Maybe we could just not care for now. Because overpopulation never
> will exceed number of appliers, which is tiny.
>
> But when async transactions will go to the public API, we
> will face this issue anyway. I assume we will need to extract
> txn_prepare to the "public" part of txn.h and use it separately
> from writing to the journal. So in our code it would look like
> this:
>
>      sync:  txn_begin() ... txn_commit()
>      async: txn_begin() ... txn_prepare() journal_wait() txn_persist()
>
> or something similar. But don't know for sure.
>
> Summary: leave it as is if don't want to tear commit_async()
> and commit() up into parts now.

Let's leave it as is for now then.

>
>> +
>>   	/**
>>   	 * Explicitly begin the transaction so that we can
>>   	 * control fiber->gc life cycle and, in case of apply
>> diff --git a/src/box/journal.h b/src/box/journal.h
>> index 5d8d5a726..d295dfa4b 100644
>> --- a/src/box/journal.h
>> +++ b/src/box/journal.h
>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>   static inline int
>>   journal_write_async(struct journal_entry *entry)
>>   {
>> +	/*
>> +	 * It's the job of the caller to check whether the queue is full prior
>> +	 * to submitting the request.
>> +	 */
>> +	journal_queue_on_append(entry);
>> +
>>   	return current_journal->write_async(current_journal, entry);
> 2. What if write_async() is called by some applier when the queue is
> not full, but also not empty? It seems it will bypass the existing
> waiters and lead to the transaction order change. No?

Yes, you're correct, and thanks for noticing this.
This is fixed simply:

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 7c2452d2b..27ddd0f29 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -973,7 +973,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
          * This is done before opening the transaction to avoid problems with
          * yielding inside it.
          */
-       if (journal_queue_is_full())
+       if (journal_queue_is_full() || journal_queue_has_waiters())
                 journal_wait_queue();

         /**


Having this fix applied, nothing else could go wrong here AFAICS.
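
The ordering argument behind that one-line fix can be expressed as a predicate: a new transaction must queue up not only when the queue is over its limit, but whenever anyone is already waiting, so it cannot bypass earlier writers. A sketch with assumed names:

```c
#include <stdbool.h>
#include <stdint.h>

struct queue_stub {
	int64_t size, max_size;
	/* Number of fibers currently parked in the queue. */
	int waiters;
};

static bool
queue_is_full(const struct queue_stub *q)
{
	return q->size > q->max_size;
}

static bool
queue_has_waiters(const struct queue_stub *q)
{
	return q->waiters > 0;
}

/*
 * The fixed applier check: go to the back of the line either when the
 * queue is over its limit or when earlier writers are still waiting,
 * preserving write order.
 */
static bool
must_wait(const struct queue_stub *q)
{
	return queue_is_full(q) || queue_has_waiters(q);
}
```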

>
> I start thinking that we need to queue the journal_entry objects right
> in the journal object. So if their queue is not empty,
> journal_write_async() adds the entry to the queue and does not call
> write_async().

Why?

>
> Also would be cool to add a test how the applier can reorder WAL writes
> in the current patch.

-- 
Serge Petrenko


Thread overview: 9+ messages
2021-02-11 12:17 Serge Petrenko via Tarantool-patches
2021-02-15 11:17 ` Cyrill Gorcunov via Tarantool-patches
2021-02-16 12:47   ` Serge Petrenko via Tarantool-patches
2021-02-16 12:49     ` Cyrill Gorcunov via Tarantool-patches
2021-02-17 20:46 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-18 20:06   ` Serge Petrenko via Tarantool-patches
2021-02-23 22:19     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-24 19:32       ` Serge Petrenko via Tarantool-patches [this message]
2021-02-26  0:58         ` Vladislav Shpilevoy via Tarantool-patches
