From mboxrd@z Thu Jan  1 00:00:00 1970
To: Vladislav Shpilevoy, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
References: <20210211121750.46298-1-sergepetrenko@tarantool.org>
 <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org>
Message-ID: <2e8b0be7-52ba-70b6-d1d9-cdf976106a2d@tarantool.org>
Date: Wed, 24 Feb 2021 22:32:00 +0300
In-Reply-To: <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v2] wal: introduce limits on simultaneous writes
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko

24.02.2021 01:19, Vladislav Shpilevoy writes:
> Hi! Thanks for the patch!

Thanks for the review! Please find my answers below and v3 of the patch
in your inbox.

>
>>>> diff --git a/src/box/journal.c b/src/box/journal.c
>>>> index cb320b557..49441e596 100644
>>>> --- a/src/box/journal.c
>>>> +++ b/src/box/journal.c
>>>> @@ -55,3 +55,66 @@ journal_entry_new(size_t n_rows, struct region *region,
>>>>                    complete_data);
>>>>       return entry;
>>>>   }
>>>> +
>>>> +struct journal_queue_entry {
>>>> +    /** The fiber waiting for queue space to free. */
>>>> +    struct fiber *fiber;
>>>> +    /** Whether the fiber should be woken up regardless of queue size. */
>>>> +    bool is_ready;
>>>> +    /** A link in the list of all waiting fibers. */
>>>> +    struct rlist in_queue;
>>>> +};
>>>> +
>>>> +/**
>>>> + * Wake up the next waiter in journal queue.
>>>> + */
>>>> +static inline void
>>>> +journal_queue_wakeup_next(struct rlist *link, bool force_ready)
>>> 1. The flag is known in all usage places at compilation time. Is it
>>> possible to split the function into force/normal versions? The same
>>> for journal_queue_wakeup() from which this runtime uncertainty arises.
>>
>> Actually, the parameter is not known at compile time when wakeup_next()
>> is called from journal_wait_queue(). For now wakeup_next() only has a
>> single check for force_ready, so moving the check outside would only
>> increase the number of branches.
>>
>> journal_queue_wakeup() is called only once per a whole queue wakeup, so
>> I suppose it doesn't hurt much that it has a compile-time known parameter.
>
> Is it called once? Then why does it have `if (current_journal->queue_is_woken)`
> check?

I was trying to say "it's called once per a bunch of wakeup_next() calls".
Just ignore this, it's irrelevant.

Actually, no, it may be called multiple times, from every
journal_async_complete(). But it is a no-op for each subsequent call
except the first one (while the queue is being cleared).

>
>>>> +{
>>>> +    /* Empty queue or last entry in queue. */
>>>> +    if (link == rlist_last(&current_journal->waiters)) {
>>> 2. I am not sure I understand what is happening here. Why is this
>>> function in one place called with the pointer at the list itself,
>>> and in another place with the pointer at one element?
>>
>> Well, ->next is the first list entry, right?
>
> Perhaps. TBH, I don't remember, and when I see such tricky things in
> the code, it takes time to understand them.
>
>> In queue_wakeup() I wake the first waiter up.
>>
>> Once any waiter gets woken up, it wakes up the next waiter,
>> which is ->next.
>>
>> That's why I have a common helper for these two cases.
>
> Ok, I see now. But it seems you could make it simpler, right?
>
> ====================
> @@ -69,10 +69,10 @@ struct journal_queue_entry {
>   * Wake up the next waiter in journal queue.
>   */
>  static inline void
> -journal_queue_wakeup_next(struct rlist *link, bool force_ready)
> +journal_queue_wakeup_first(bool force_ready)
>  {
>  	/* Empty queue or last entry in queue. */
> -	if (link == rlist_last(&current_journal->waiters)) {
> +	if (rlist_empty(&current_journal->waiters)) {
>  		current_journal->queue_is_woken = false;
>  		return;
>  	}
> @@ -97,7 +97,7 @@ journal_queue_wakeup(bool force_ready)
>  	if (current_journal->queue_is_woken)
>  		return;
>  	current_journal->queue_is_woken = true;
> -	journal_queue_wakeup_next(&current_journal->waiters, force_ready);
> +	journal_queue_wakeup_first(force_ready);
>  }
>
>  void
> @@ -114,7 +114,7 @@ journal_wait_queue(void)
>  	while (journal_queue_is_full() && !entry.is_ready)
>  		fiber_yield();
>
> -	journal_queue_wakeup_next(&entry.in_queue, entry.is_ready);
> +	assert(&entry.in_queue == rlist_first(&current_journal->waiters));
>  	rlist_del(&entry.in_queue);
> +	journal_queue_wakeup_first(entry.is_ready);
>  }
> ====================
>
> (I didn't test it.)

Yes, indeed. Thanks! Applied with minor changes.

>>> 5. Can rlist_del be done along with fiber_wakeup()? Then you
>>> wouldn't need is_woken maybe.
>>
>> Looks like it can't.
>> Say we have only one waiter, and we remove it from the list on wakeup.
>> The list would become empty and there'd be no way to check whether the
>> journal has any waiters, and we may reorder the entries (put new ones
>> before the waiting one). This is not necessarily bad, because I put
>> entries into the queue before txn_begin(), but someone may call
>> journal_wait_queue() from inside the transaction, or right before
>> txn_commit(). Then it might be bad to put other transactions before
>> this one.
>
> Order change is definitely not acceptable.
>
>> So while removing is_woken we would have to introduce a
>> queue_has_waiters flag for the sake of this single waiter.
>
> It would rather become a counter - the number of waiters. Because there
> can be many. But yeah, I see the problem.
>
>>>> +}
>>>> +
>>>> +/**
>>>> + * Check whether anyone is waiting for the journal queue to empty. If there are
>>>> + * other waiters we must go after them to preserve write order.
>>>> + */
>>>> +static inline bool
>>>> +journal_queue_has_waiters(void)
>>>> +{
>>>> +    return !rlist_empty(&current_journal->waiters);
>>>> +}
>>>> +
>>>> +/** Yield until there's some space in the journal queue. */
>>>> +void
>>>> +journal_wait_queue(void);
>>>> +
>>>> +/** Set maximal journal queue size in bytes. */
>>>> +static inline void
>>>> +journal_queue_set_max_size(struct journal *j, int64_t size)
>>> 7. Why do we have a journal parameter here, but don't have it in
>>> the other functions? The same for journal_queue_set_max_len.
>>
>> This is my attempt to make sure only wal_writer's journal has a queue.
>> I explicitly set queue_max_... parameters only for wal_writer's journal.
>> And then there's an assert that journal_queue_set_...() is only called
>> with the current journal.
>
> Or the assertion could be done in wal_set_queue_*() functions. To keep
> the journal API consistent.

Actually, struct journal has a ton of queue_* members now, so I'm
following your older advice and extracting everything related to queues
into struct journal_queue (see the sketch below).

>
> I just realized, journal can be easily unit-tested. It does not depend
> on anything except small/ and core/ libs. Although it seems like a lot
> of work, so maybe not now. Probably later, for something more complex
> and harder to test via functional tests. However, if you would write
> tests now, it would be greatly appreciated.
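Re the extraction mentioned above: roughly, struct journal_queue could
look like this. This is only a sketch assembled from the fields discussed
in this thread; the exact layout in v3 may differ:

====================
/** Everything related to the queue, extracted from struct journal. */
struct journal_queue {
	/** Maximal size of entries enqueued in the journal (in bytes). */
	int64_t max_size;
	/** Current approximate size of enqueued entries. */
	int64_t size;
	/** Maximal allowed number of enqueued entries. */
	int64_t max_len;
	/** Current number of enqueued entries. */
	int64_t len;
	/** The fibers waiting for some space to free in the queue. */
	struct rlist waiters;
	/** Whether the queue is currently being woken up. */
	bool is_woken;
	/** Whether waiters may proceed regardless of the size checks. */
	bool is_ready;
};
====================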
>
>>>> +{
>>>> +    assert(j == current_journal);
>>>> +    j->queue_max_size = size;
>>>> +    if (journal_queue_has_waiters() && !journal_queue_is_full())
>>>> +        journal_queue_wakeup(false);
>>>> +}
>>>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>>>   static inline int
>>>>   journal_write_async(struct journal_entry *entry)
>>>>   {
>>>> +    /*
>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>> +     * to submitting the request.
>>> 8. Maybe add an assert though.
>>
>> I wanted to, but it's impossible.
>> The queue may be full when all the waiters are forcefully woken up by a
>> synchronous commit. And it's hard to tell whether it was a "force"
>> wakeup or not. So let's just hope no one misuses this API.
>
> Yeah, I see now.
>
>> Or, even better, I can remove the is_ready field from queue entries and
>> add a new field to the journal: queue_is_ready or something, in
>> addition to queue_is_awake.
>> Then every entry will check queue_is_ready instead of entry.is_ready,
>> and it'll be possible to add an assert here:
>> !journal_queue_is_full || journal_queue_is_ready.
>> Looks like this'll also allow us to extract queue_wakeup_(next)_force,
>> like you suggested in paragraph 1.
>> What do you think?
>
> Sounds good, worth doing.

I introduced queue_is_ready and removed entry.is_ready. The code looks
cleaner now, and together with your suggestion regarding
journal_queue_wakeup_first(), it doesn't have parameters at all. It does
check queue_is_ready internally, but there's no point in separating
_force and normal versions: that would simply move the check outside the
function call. (A rough sketch of the result is at the end of this reply.)

>
> See 2 comments below.
>
>>>> +     */
>>>> +    journal_queue_on_append(entry);
>>>> +
>>>>       return current_journal->write_async(current_journal, entry);
>>>>   }
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 553db76fc..7c2452d2b 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -967,6 +967,15 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>>  		goto success;
>>  	}
>>
>> +	/*
>> +	 * Do not spam WAL with excess write requests, let it process what's
>> +	 * piled up first.
>> +	 * This is done before opening the transaction to avoid problems with
>> +	 * yielding inside it.
>> +	 */
>> +	if (journal_queue_is_full())
>> +		journal_wait_queue();
>
> 1. I just noticed you do the waiting before starting the
> transaction. In case of Vinyl the transaction can yield. So
> by the time you get to commit, the queue could be full.
>
> Don't know what to do with this. We can't wait before
> txn_commit_async() because it would kill the memtx transactions.
>
> Maybe we don't need to care now. Because overpopulation will never
> exceed the number of appliers, which is tiny.
>
> But when async transactions go to the public API, we
> will face this issue anyway. I assume we will need to extract
> txn_prepare to the "public" part of txn.h and use it separately
> from writing to the journal. So in our code it would look like
> this:
>
> sync: txn_begin() ... txn_commit()
> async: txn_begin() ... txn_prepare() journal_wait() txn_persist()
>
> or something similar. But I don't know for sure.
>
> Summary: leave it as is if you don't want to tear commit_async()
> and commit() up into parts now.

Let's leave it as is for now then.
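As for the sketch promised above: with entry.is_ready gone, the wakeup
helper becomes parameterless, roughly like this. Again, only an
illustration of the idea; the actual v3 code may differ in details:

====================
static inline void
journal_queue_wakeup_first(void)
{
	struct journal *j = current_journal;
	if (rlist_empty(&j->waiters)) {
		j->queue_is_woken = false;
		return;
	}
	/*
	 * queue_is_ready lets the waiters through regardless of the
	 * size checks, e.g. when a synchronous commit flushes
	 * everything. Otherwise wake the next waiter only while there
	 * is free space in the queue.
	 */
	if (!j->queue_is_ready && journal_queue_is_full())
		return;
	struct journal_queue_entry *e =
		rlist_first_entry(&j->waiters, struct journal_queue_entry,
				  in_queue);
	fiber_wakeup(e->fiber);
}
====================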
>
>> +
>>  	/**
>>  	 * Explicitly begin the transaction so that we can
>>  	 * control fiber->gc life cycle and, in case of apply
>> diff --git a/src/box/journal.h b/src/box/journal.h
>> index 5d8d5a726..d295dfa4b 100644
>> --- a/src/box/journal.h
>> +++ b/src/box/journal.h
>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>  static inline int
>>  journal_write_async(struct journal_entry *entry)
>>  {
>> +	/*
>> +	 * It's the job of the caller to check whether the queue is full prior
>> +	 * to submitting the request.
>> +	 */
>> +	journal_queue_on_append(entry);
>> +
>>  	return current_journal->write_async(current_journal, entry);
>
> 2. What if write_async() is called by some applier when the queue is
> not full, but also not empty? It seems it will bypass the existing
> waiters and lead to the transaction order change. No?

Yes, you're correct, and thanks for noticing this. This is fixed simply:

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 7c2452d2b..27ddd0f29 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -973,7 +973,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 	 * This is done before opening the transaction to avoid problems with
 	 * yielding inside it.
 	 */
-	if (journal_queue_is_full())
+	if (journal_queue_is_full() || journal_queue_has_waiters())
 		journal_wait_queue();
 
 	/**

With this fix applied, nothing else can go wrong here AFAICS.

> I start thinking that we need to queue the journal_entry objects right
> in the journal object. So if their queue is not empty,
> journal_write_async() adds the entry to the queue and does not call
> write_async().

Why?

> Also would be cool to add a test for how the applier can reorder WAL
> writes in the current patch.

-- 
Serge Petrenko