From mboxrd@z Thu Jan  1 00:00:00 1970
To: Vladislav Shpilevoy, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
References: <20210211121750.46298-1-sergepetrenko@tarantool.org>
 <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org>
Message-ID: <2e8b0be7-52ba-70b6-d1d9-cdf976106a2d@tarantool.org>
Date: Wed, 24 Feb 2021 22:32:00 +0300
In-Reply-To: <77af94ed-71eb-74e5-da73-7ae7286bfeb1@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v2] wal: introduce limits on simultaneous writes
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko

24.02.2021 01:19, Vladislav Shpilevoy writes:
> Hi! Thanks for the patch!

Thanks for the review! Please find my answers below and v3 of the patch
in your inbox.

>
>>>> diff --git a/src/box/journal.c b/src/box/journal.c
>>>> index cb320b557..49441e596 100644
>>>> --- a/src/box/journal.c
>>>> +++ b/src/box/journal.c
>>>> @@ -55,3 +55,66 @@ journal_entry_new(size_t n_rows, struct region *region,
>>>>                    complete_data);
>>>>       return entry;
>>>>   }
>>>> +
>>>> +struct journal_queue_entry {
>>>> +    /** The fiber waiting for queue space to free. */
>>>> +    struct fiber *fiber;
>>>> +    /** Whether the fiber should be woken up regardless of queue size. */
>>>> +    bool is_ready;
>>>> +    /** A link in the list of all waiting fibers. */
>>>> +    struct rlist in_queue;
>>>> +};
>>>> +
>>>> +/**
>>>> + * Wake up the next waiter in journal queue.
>>>> + */
>>>> +static inline void
>>>> +journal_queue_wakeup_next(struct rlist *link, bool force_ready)
>>> 1. The flag is known in all usage places at compilation time. Is it
>>> possible to split the function into force/normal versions? The same
>>> for journal_queue_wakeup() from which this runtime uncertainty arises.
>>
>> Actually, the parameter is not known at compile time when wakeup_next()
>> is called from journal_wait_queue(). For now wakeup_next() only has a
>> single check for force_ready, so moving the check outside would only
>> increase the number of branches.
>>
>> journal_queue_wakeup() is called only once per a whole queue wakeup, so
>> I suppose it doesn't hurt much that it has a compile-time known parameter.
>
> Is it called once? Then why does it have `if (current_journal->queue_is_woken)`
> check?

I was trying to say "it's called once per a bunch of wakeup_next() calls".
Just ignore this, it's irrelevant.

Actually, no, it may be called multiple times, from every
journal_async_complete(). But it is a no-op for each subsequent call
except the first one (while the queue is being cleared).

>
>>>> +{
>>>> +    /* Empty queue or last entry in queue. */
>>>> +    if (link == rlist_last(&current_journal->waiters)) {
>>> 2. I am not sure I understand what is happening here. Why is this
>>> function in one place called with the pointer at the list itself,
>>> and in another place with the pointer at one element?
>>
>> Well, ->next is the first list entry, right?
>
> Perhaps. TBH, I don't remember, and when I see such tricky things in
> the code, it takes time to understand them.
>
>> In queue_wakeup() I wake the first waiter up.
>>
>> Once any waiter gets woken up, it wakes up the next waiter,
>> which is ->next.
>>
>> That's why I have a common helper for these two cases.
>
> Ok, I see now. But it seems you could make it simpler, right?
>
> ====================
> @@ -69,10 +69,10 @@ struct journal_queue_entry {
>   * Wake up the next waiter in journal queue.
>   */
>  static inline void
> -journal_queue_wakeup_next(struct rlist *link, bool force_ready)
> +journal_queue_wakeup_first(bool force_ready)
>  {
>  	/* Empty queue or last entry in queue. */
> -	if (link == rlist_last(&current_journal->waiters)) {
> +	if (rlist_empty(&current_journal->waiters)) {
>  		current_journal->queue_is_woken = false;
>  		return;
>  	}
> @@ -97,7 +97,7 @@ journal_queue_wakeup(bool force_ready)
>  	if (current_journal->queue_is_woken)
>  		return;
>  	current_journal->queue_is_woken = true;
> -	journal_queue_wakeup_next(&current_journal->waiters, force_ready);
> +	journal_queue_wakeup_first(force_ready);
>  }
>
>  void
> @@ -114,7 +114,7 @@ journal_wait_queue(void)
>  	while (journal_queue_is_full() && !entry.is_ready)
>  		fiber_yield();
>
> -	journal_queue_wakeup_next(&entry.in_queue, entry.is_ready);
> +	assert(&entry.in_queue == rlist_first(&current_journal->waiters));
>  	rlist_del(&entry.in_queue);
> +	journal_queue_wakeup_first(entry.is_ready);
>  }
> ====================
>
> (I didn't test it.)

Yes, indeed. Thanks! Applied with minor changes.

>>> 5. Can rlist_del be done along with fiber_wakeup()? Then you
>>> wouldn't need is_woken maybe.
>>
>> Looks like it can't.
>> Say we have only one waiter, and we remove it from the list on wakeup.
>> The list would become empty and there'd be no way to check whether the
>> journal has any waiters, and we may reorder the entries (put new ones
>> before the waiting one). This is not necessarily bad, because I put
>> entries into the queue before txn_begin(), but someone may call
>> journal_wait_queue() from inside the transaction, or right before
>> txn_commit(). Then it might be bad to put other transactions before
>> this one.
>
> Order change is definitely not acceptable.
>
>> So while removing is_woken we would have to introduce a
>> queue_has_waiters flag for the sake of this single waiter.
>
> It would rather become a counter - the number of waiters. Because there
> can be many. But yeah, I see the problem.
>
>>>> +}
>>>> +
>>>> +/**
>>>> + * Check whether anyone is waiting for the journal queue to empty. If there are
>>>> + * other waiters we must go after them to preserve write order.
>>>> + */
>>>> +static inline bool
>>>> +journal_queue_has_waiters(void)
>>>> +{
>>>> +    return !rlist_empty(&current_journal->waiters);
>>>> +}
>>>> +
>>>> +/** Yield until there's some space in the journal queue. */
>>>> +void
>>>> +journal_wait_queue(void);
>>>> +
>>>> +/** Set maximal journal queue size in bytes. */
>>>> +static inline void
>>>> +journal_queue_set_max_size(struct journal *j, int64_t size)
>>> 7. Why do we have a journal parameter here, but don't have it in
>>> the other functions? The same for journal_queue_set_max_len.
>>
>> This is my attempt to make sure only wal_writer's journal has a queue.
>> I explicitly set queue_max_... parameters only for wal_writer's journal.
>> And then there's an assert that journal_queue_set_...() is only called
>> with the current journal.
>
> Or the assertion could be done in wal_set_queue_*() functions. To keep
> the journal API consistent.

Actually, struct journal has a ton of queue_* members now, so I'm
following your older advice and extracting everything related to queues
into struct journal_queue (see the sketch below).

>
> I just realized, journal can be easily unit-tested. It does not depend
> on anything except small/ and core/ libs. Although it seems like a lot
> of work, so maybe not now. Probably later, for something more complex
> and harder to test via functional tests. However, if you would write
> tests now, it would be greatly appreciated.
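Re the extraction mentioned above: roughly, struct journal_queue could
look like this. This is only a sketch assembled from the fields discussed
in this thread; the exact layout in v3 may differ:

====================
/** Everything related to the queue, extracted from struct journal. */
struct journal_queue {
	/** Maximal size of entries enqueued in the journal (in bytes). */
	int64_t max_size;
	/** Current approximate size of enqueued entries. */
	int64_t size;
	/** Maximal allowed number of enqueued entries. */
	int64_t max_len;
	/** Current number of enqueued entries. */
	int64_t len;
	/** The fibers waiting for some space to free in the queue. */
	struct rlist waiters;
	/** Whether the queue is currently being woken up. */
	bool is_woken;
	/** Whether waiters may proceed regardless of the size checks. */
	bool is_ready;
};
====================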
>
>>>> +{
>>>> +    assert(j == current_journal);
>>>> +    j->queue_max_size = size;
>>>> +    if (journal_queue_has_waiters() && !journal_queue_is_full())
>>>> +        journal_queue_wakeup(false);
>>>> +}
>>>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>>>   static inline int
>>>>   journal_write_async(struct journal_entry *entry)
>>>>   {
>>>> +    /*
>>>> +     * It's the job of the caller to check whether the queue is full prior
>>>> +     * to submitting the request.
>>> 8. Maybe add an assert though.
>>
>> I wanted to, but it's impossible.
>> The queue may be full when all the waiters are forcefully woken up by a
>> synchronous commit. And it's hard to tell whether it was a "force"
>> wakeup or not. So let's just hope no one misuses this API.
>
> Yeah, I see now.
>
>> Or, even better, I can remove the is_ready field from queue entries and
>> add a new field to the journal: queue_is_ready or something, in
>> addition to queue_is_awake.
>> Then every entry will check queue_is_ready instead of entry.is_ready,
>> and it'll be possible to add an assert here:
>> !journal_queue_is_full || journal_queue_is_ready.
>> Looks like this'll also allow us to extract queue_wakeup_(next)_force,
>> like you suggested in paragraph 1.
>> What do you think?
>
> Sounds good, worth doing.

I introduced queue_is_ready and removed entry.is_ready. The code looks
cleaner now, and together with your suggestion regarding
journal_queue_wakeup_first(), it doesn't have parameters at all. It does
check queue_is_ready internally, but there's no point in separating
_force and normal versions: that would simply move the check outside the
function call. (A rough sketch of the result is at the end of this reply.)

>
> See 2 comments below.
>
>>>> +     */
>>>> +    journal_queue_on_append(entry);
>>>> +
>>>>       return current_journal->write_async(current_journal, entry);
>>>>   }
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 553db76fc..7c2452d2b 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -967,6 +967,15 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>>  		goto success;
>>  	}
>>
>> +	/*
>> +	 * Do not spam WAL with excess write requests, let it process what's
>> +	 * piled up first.
>> +	 * This is done before opening the transaction to avoid problems with
>> +	 * yielding inside it.
>> +	 */
>> +	if (journal_queue_is_full())
>> +		journal_wait_queue();
>
> 1. I just noticed you do the waiting before starting the
> transaction. In case of Vinyl the transaction can yield. So
> by the time you get to commit, the queue could be full.
>
> Don't know what to do with this. We can't wait before
> txn_commit_async() because it would kill the memtx transactions.
>
> Maybe we don't need to care now. Because overpopulation will never
> exceed the number of appliers, which is tiny.
>
> But when async transactions go to the public API, we
> will face this issue anyway. I assume we will need to extract
> txn_prepare to the "public" part of txn.h and use it separately
> from writing to the journal. So in our code it would look like
> this:
>
> sync: txn_begin() ... txn_commit()
> async: txn_begin() ... txn_prepare() journal_wait() txn_persist()
>
> or something similar. But I don't know for sure.
>
> Summary: leave it as is if you don't want to tear commit_async()
> and commit() up into parts now.

Let's leave it as is for now then.
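As for the sketch promised above: with entry.is_ready gone, the wakeup
helper becomes parameterless, roughly like this. Again, only an
illustration of the idea; the actual v3 code may differ in details:

====================
static inline void
journal_queue_wakeup_first(void)
{
	struct journal *j = current_journal;
	if (rlist_empty(&j->waiters)) {
		j->queue_is_woken = false;
		return;
	}
	/*
	 * queue_is_ready lets the waiters through regardless of the
	 * size checks, e.g. when a synchronous commit flushes
	 * everything. Otherwise wake the next waiter only while there
	 * is free space in the queue.
	 */
	if (!j->queue_is_ready && journal_queue_is_full())
		return;
	struct journal_queue_entry *e =
		rlist_first_entry(&j->waiters, struct journal_queue_entry,
				  in_queue);
	fiber_wakeup(e->fiber);
}
====================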
>
>> +
>>  	/**
>>  	 * Explicitly begin the transaction so that we can
>>  	 * control fiber->gc life cycle and, in case of apply
>> diff --git a/src/box/journal.h b/src/box/journal.h
>> index 5d8d5a726..d295dfa4b 100644
>> --- a/src/box/journal.h
>> +++ b/src/box/journal.h
>> @@ -159,6 +264,12 @@ journal_write(struct journal_entry *entry)
>>  static inline int
>>  journal_write_async(struct journal_entry *entry)
>>  {
>> +	/*
>> +	 * It's the job of the caller to check whether the queue is full prior
>> +	 * to submitting the request.
>> +	 */
>> +	journal_queue_on_append(entry);
>> +
>>  	return current_journal->write_async(current_journal, entry);
>
> 2. What if write_async() is called by some applier when the queue is
> not full, but also not empty? It seems it will bypass the existing
> waiters and lead to the transaction order change. No?

Yes, you're correct, and thanks for noticing this. This is fixed simply:

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 7c2452d2b..27ddd0f29 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -973,7 +973,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 	 * This is done before opening the transaction to avoid problems with
 	 * yielding inside it.
 	 */
-	if (journal_queue_is_full())
+	if (journal_queue_is_full() || journal_queue_has_waiters())
 		journal_wait_queue();
 
 	/**

With this fix applied, nothing else can go wrong here AFAICS.

> I start thinking that we need to queue the journal_entry objects right
> in the journal object. So if their queue is not empty,
> journal_write_async() adds the entry to the queue and does not call
> write_async().

Why?

> Also would be cool to add a test for how the applier can reorder WAL
> writes in the current patch.

-- 
Serge Petrenko