[Tarantool-patches] [PATCH v27 2/3] qsync: order access to the limbo terms

Serge Petrenko sergepetrenko at tarantool.org
Wed Jan 12 17:01:00 MSK 2022



11.01.2022 23:39, Cyrill Gorcunov пишет:
> On Mon, Jan 10, 2022 at 05:28:43PM +0300, Serge Petrenko wrote:
>>     Hi! Thanks for the patch!
>>      
>>     box_issue_promote() and box_issue_demote() need fine-grained locking
>>     anyway.
>>     Otherwise it’s possible that promote() is already issued, but not yet
>>     written to WAL, and some
>>     outdated request is applied by applier at that exact moment.
> True. And in previous series Vlad has asked to not move in code which is
> not covered by tests. So I think this is a task for the next part. Currently
> we cover only the race between appliers.

Let's ask Vlad, then.

I feel like we should fix this now, not waiting for a full fine-grained 
locking
patch.

First of all, this is a known bug (and fine-grained locking was meant to
cover everything we don't know of, just in case).

Besides, simply locking issue_promote/issue_demote should be
much easier than implementing the fine-grained locking patch.


>
>>      
>>     You should take the lock before the WAL write, and release it only after
>>     txn_limbo_apply.
>>      
>>     No need to guard every limbo function there is, but we have to guard
>>     everything that
>>     writes PROMOTE/DEMOTE.
> ...
>>       @@ -216,7 +225,7 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
>>         * @a replica_id.
>>         */
>>        static inline uint64_t
>>       -txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t
>>       replica_id)
>>       +txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id)
>>        {
>>
>>      
>>     You’ve forgot to lock the latch here, I guess.
> I did it on a purpose. As you remember we've faced many problems when tried
> to implement fine-grained locking inside limbo code. So I dropped this idea
> eventually and I think we could start with explicit locks to cover the applier
> race and then walk via small steps trying to cover the rest.

Ok, then return `const ` to the function declaration, please.

>
>>       +/**
>>       + * Initiate execution of a synchronous replication request.
>>       + */
>>       +static inline void
>>       +txn_limbo_begin(struct txn_limbo *limbo)
>>       +{
>>       + limbo->promote_latch_cnt++;
>>       + latch_lock(&limbo->promote_latch);
>>
>>      
>>     I suppose you should decrease the latch_cnt right after acquiring the
>>     lock.
>>      
>>     Otherwise you count the sole «limbo user» together with «limbo waiters».
> Yes, this will represent accumulated value. To be honest I never saw such
> approach in any other code (ie increment/lock/decrement) but I think this
> is fine for fibres, will do.

It just looks strange to me that `synchro.queue.waiters` will be 
non-zero when
someone simply uses the limbo.

They are `waiters`, not `users` or something else.

>
> 	Cyrill

-- 
Serge Petrenko



More information about the Tarantool-patches mailing list