Tarantool development patches archive
From: Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Serge Petrenko <sergepetrenko@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2] wal: introduce limits on simultaneous writes
Date: Tue, 23 Feb 2021 23:19:04 +0100	[thread overview]
Message-ID: <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org> (raw)
In-Reply-To: <e44edd9b-b352-5ac0-2a50-1697e612b7a6@tarantool.org>

Hi! Thanks for the patch!

>>> diff --git a/src/box/journal.c b/src/box/journal.c
>>> index cb320b557..49441e596 100644
>>> --- a/src/box/journal.c
>>> +++ b/src/box/journal.c
>>> @@ -55,3 +55,66 @@ journal_entry_new(size_t n_rows, struct region *region,
>>>                    complete_data);
>>>       return entry;
>>>   }
>>> +
>>> +struct journal_queue_entry {
>>> +    /** The fiber waiting for queue space to free. */
>>> +    struct fiber *fiber;
>>> +    /** Whether the fiber should be waken up regardless of queue size. */
>>> +    bool is_ready;
>>> +    /** A link in all waiting fibers list. */
>>> +    struct rlist in_queue;
>>> +};
>>> +
>>> +/**
>>> + * Wake up the next waiter in journal queue.
>>> + */
>>> +static inline void
>>> +journal_queue_wakeup_next(struct rlist *link, bool force_ready)
>> 1. The flag is known in all usage places at compilation time. Is it
>> possible to split the function into force/normal versions? The same
>> for journal_queue_wakeup() from which this runtime uncertainty arises.
> 
> Actually, the parameter is not known at compile time when wakeup_next()
> is called from journal_wait_queue(). For now wakeup_next() only has a single
> check for force_ready, so moving the check outside would only increase the
> number of branches.
> 
> journal_queue_wakeup() is called only once per whole queue wakeup, so
> I suppose it doesn't hurt much that it has a compile-time known parameter.

Is it called once? Then why does it have `if (current_journal->queue_is_woken)`
check?

>>> +{
>>> +    /* Empty queue or last entry in queue. */
>>> +    if (link == rlist_last(&current_journal->waiters)) {
>> 2. I am not sure I understand what is happening here. Why is this
>> function in one place called with the pointer at the list itself,
>> and in another place with the pointer at one element?
> 
> Well, <list head> -> next is the first list entry, right?

Perhaps. TBH, I don't remember, and when I see such tricky things in
the code, it takes time to understand them.

> In queue_wakeup() I wake the first waiter up.
> 
> Once any waiter gets woken up, it wakes up the next waiter.
> Which is <in_queue> -> next.
> 
> That's why I have a common helper for these two cases.

Ok, I see now. But it seems you could make it simpler, right?

====================
@@ -69,10 +69,10 @@ struct journal_queue_entry {
  * Wake up the next waiter in journal queue.
  */
 static inline void
-journal_queue_wakeup_next(struct rlist *link, bool force_ready)
+journal_queue_wakeup_first(bool force_ready)
 {
 	/* Empty queue or last entry in queue. */
-	if (link == rlist_last(&current_journal->waiters)) {
+	if (rlist_empty(&current_journal->waiters)) {
 		current_journal->queue_is_woken = false;
 		return;
 	}
@@ -97,7 +97,7 @@ journal_queue_wakeup(bool force_ready)
 	if (current_journal->queue_is_woken)
 		return;
 	current_journal->queue_is_woken = true;
-	journal_queue_wakeup_next(&current_journal->waiters, force_ready);
+	journal_queue_wakeup_first(force_ready);
 }
 
 void
@@ -114,7 +114,7 @@ journal_wait_queue(void)
 	while (journal_queue_is_full() && !entry.is_ready)
 		fiber_yield();
 
-	journal_queue_wakeup_next(&entry.in_queue, entry.is_ready);
 	assert(&entry.in_queue == rlist_first(&current_journal->waiters));
 	rlist_del(&entry.in_queue);
+	journal_queue_wakeup_first(entry.is_ready);
 }
====================

(I didn't test it.)

>> 5. Can rlist_del be done along with fiber_wakeup()? Then you
>> wouldn't need is_woken maybe.
> 
> Looks like it can't.
> Say we have only one waiter, and we remove it from the list on wakeup.
> The list would become empty and there'd be no way to check whether the
> journal has any waiters, so we might reorder the entries (put new ones before
> the waiting one). This is not necessarily bad, because I put entries into the
> queue before txn_begin(), but someone may call journal_wait_queue() from inside
> a transaction, or right before txn_commit(). Then it might be bad to put other
> transactions before this one.

Order change is definitely not acceptable.

> So to remove is_woken we would have to introduce a queue_has_waiters flag for
> the sake of this single waiter.

It would rather become a counter - the number of waiters - because there
can be many. But yeah, I see the problem.

>>> +}
>>> +
>>> +/**
>>> + * Check whether anyone is waiting for the journal queue to empty. If there are
>>> + * other waiters we must go after them to preserve write order.
>>> + */
>>> +static inline bool
>>> +journal_queue_has_waiters(void)
>>> +{
>>> +    return !rlist_empty(&current_journal->waiters);
>>> +}
>>> +
>>> +/** Yield until there's some space in the journal queue. */
>>> +void
>>> +journal_wait_queue(void);
>>> +
>>> +/** Set maximal journal queue size in bytes. */
>>> +static inline void
>>> +journal_queue_set_max_size(struct journal *j, int64_t size)
>> 7. Why do we have journal parameter here, but don't have it in
>> the other functions? The same journal_queue_set_max_len.
> 
> This is my attempt to make sure only wal_writer's journal has a queue.
> I explicitly set queue_max_... parameters only for wal_writer's journal.
> And then there's an assert that journal_queue_set_...() is only called with
> the current journal.

Or the assertion could be done in the wal_set_queue_*() functions, to keep
the journal API consistent.

I just realized, journal can be easily unit-tested. It does not depend on
anything except the small/ and core/ libs. Although it seems like a lot of
work, so maybe not now. Probably later, for something more complex and harder
to test via functional tests. However, if you would write tests now, it would
be greatly appreciated.

>>> +{
>>> +    assert(j == current_journal);
>>> +    j->queue_max_size = size;
>>> +    if (journal_queue_has_waiters() && !journal_queue_is_full())
>>> +        journal_queue_wakeup(false);
>>> +}
>>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>>   static inline int
>>>   journal_write_async(struct journal_entry *entry)
>>>   {
>>> +    /*
>>> +     * It's the job of the caller to check whether the queue is full prior
>>> +     * to submitting the request.
>> 8. Maybe add an assert though.
> 
> I wanted to, but it's impossible.
> The queue may be full when all the waiters are forcefully woken up by a
> synchronous commit. And it's hard to tell whether it was a "force" wakeup
> or not. So let's just hope no one misuses this API.

Yeah, I see now.

> Or, even better, I can remove the is_ready field from queue entries and add a
> new field to the journal: queue_is_ready or something, in addition to
> queue_is_woken. Then every entry will check queue_is_ready instead of
> entry.is_ready, and it'll be possible to add an assert here:
> !journal_queue_is_full || journal_queue_is_ready
> Looks like this'll also allow us to extract queue_wakeup_(next)_force, like
> you suggested in paragraph 1.
> What do you think?

Sounds good, worth doing.

See 2 comments below.

>>> +     */
>>> +    journal_queue_on_append(entry);
>>> +
>>>       return current_journal->write_async(current_journal, entry);
>>>   }> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 553db76fc..7c2452d2b 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -967,6 +967,15 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>  		goto success;
>  	}
>  
> +	/*
> +	 * Do not spam WAL with excess write requests, let it process what's
> +	 * piled up first.
> +	 * This is done before opening the transaction to avoid problems with
> +	 * yielding inside it.
> +	 */
> +	if (journal_queue_is_full())
> +		journal_wait_queue();

1. I just noticed you do the waiting before starting the
transaction. In the case of Vinyl the transaction can yield, so
by the time you get to commit, the queue could be full.

Don't know what to do with this. We can't wait before
txn_commit_async() because it would kill the memtx transactions.

Maybe we could not care about this for now, because the overflow
will never exceed the number of appliers, which is tiny.

But when async transactions go to the public API, we will
face this issue anyway. I assume we will need to extract
txn_prepare() to the "public" part of txn.h and use it separately
from writing to the journal. So in our code it would look like
this:

    sync:  txn_begin() ... txn_commit()
    async: txn_begin() ... txn_prepare() journal_wait() txn_persist()

or something similar. But don't know for sure.

Summary: leave it as is if you don't want to split commit_async()
and commit() into parts now.

> +
>  	/**
>  	 * Explicitly begin the transaction so that we can
>  	 * control fiber->gc life cycle and, in case of apply
> diff --git a/src/box/journal.h b/src/box/journal.h
> index 5d8d5a726..d295dfa4b 100644
> --- a/src/box/journal.h
> +++ b/src/box/journal.h
> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>  static inline int
>  journal_write_async(struct journal_entry *entry)
>  {
> +	/*
> +	 * It's the job of the caller to check whether the queue is full prior
> +	 * to submitting the request.
> +	 */
> +	journal_queue_on_append(entry);
> +
>  	return current_journal->write_async(current_journal, entry);

2. What if write_async() is called by some applier when the queue is
not full, but also not empty? It seems it would bypass the existing
waiters and lead to a transaction order change. No?

I'm starting to think that we need to queue the journal_entry objects
right in the journal object. Then, if that queue is not empty,
journal_write_async() adds the entry to the queue and does not call
write_async().

Also it would be cool to add a test showing how the applier can reorder
WAL writes with the current patch.

