[Tarantool-patches] [PATCH v27 2/3] qsync: order access to the limbo terms
Serge Petrenko
sergepetrenko at tarantool.org
Wed Jan 12 17:01:00 MSK 2022
11.01.2022 23:39, Cyrill Gorcunov пишет:
> On Mon, Jan 10, 2022 at 05:28:43PM +0300, Serge Petrenko wrote:
>> Hi! Thanks for the patch!
>>
>> box_issue_promote() and box_issue_demote() need fine-grained locking
>> anyway.
>> Otherwise it’s possible that promote() is already issued, but not yet
>> written to WAL, and some
>> outdated request is applied by applier at that exact moment.
> True. And in previous series Vlad has asked to not move in code which is
> not covered by tests. So I think this is a task for the next part. Currently
> we cover only the race between appliers.
Let's ask Vlad, then.
I feel like we should fix this now, not waiting for a full fine-grained
locking
patch.
First of all, this is a known bug (and fine-grained locking was meant to
cover everything we don't know of, just in case).
Besides, simply locking issue_promote/issue_demote should be
much easier than implementing the fine-grained locking patch.
>
>>
>> You should take the lock before the WAL write, and release it only after
>> txn_limbo_apply.
>>
>> No need to guard every limbo function there is, but we have to guard
>> everything that
>> writes PROMOTE/DEMOTE.
> ...
>> @@ -216,7 +225,7 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
>> * @a replica_id.
>> */
>> static inline uint64_t
>> -txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t
>> replica_id)
>> +txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id)
>> {
>>
>>
>> You’ve forgot to lock the latch here, I guess.
> I did it on a purpose. As you remember we've faced many problems when tried
> to implement fine-grained locking inside limbo code. So I dropped this idea
> eventually and I think we could start with explicit locks to cover the applier
> race and then walk via small steps trying to cover the rest.
Ok, then return `const ` to the function declaration, please.
>
>> +/**
>> + * Initiate execution of a synchronous replication request.
>> + */
>> +static inline void
>> +txn_limbo_begin(struct txn_limbo *limbo)
>> +{
>> + limbo->promote_latch_cnt++;
>> + latch_lock(&limbo->promote_latch);
>>
>>
>> I suppose you should decrease the latch_cnt right after acquiring the
>> lock.
>>
>> Otherwise you count the sole «limbo user» together with «limbo waiters».
> Yes, this will represent accumulated value. To be honest I never saw such
> approach in any other code (ie increment/lock/decrement) but I think this
> is fine for fibres, will do.
It just looks strange to me that `synchro.queue.waiters` will be
non-zero when
someone simply uses the limbo.
They are `waiters`, not `users` or something else.
>
> Cyrill
--
Serge Petrenko
More information about the Tarantool-patches
mailing list