[Tarantool-patches] [PATCH v23 3/3] test: add gh-6036-qsync-order test

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Fri Oct 22 01:06:07 MSK 2021


>>> Actually you do need to count writes here.
>>> The wait_cond for ERRINJ_WAL_WRITE_COUNT == write_cnt + 3
>>> is needed to make sure you receive (and thus try to process)
>>> insert {3} **before** the replica is re-enabled.
>>>
>>> Otherwise we can't be sure that the test is correct. You may simply
>>> perform a select before insert{3} has reached the replica.
>> You know, I spent a few hours trying to pass the test waiting for
>> ERRINJ_WAL_WRITE_COUNT == write_cnt + 3 and finally realized that
>> it seems this is what happens: replica1 is no longer a leader,
>> and when this record reaches our replica3 node we NOPify it, then
>> we run
>>
>> apply_row
>>    if (request.type == IPROTO_NOP)
>>      return process_nop()
>>
>> thus this record does not even reach the journal, and that is
>> why waiting for write_cnt + 3 lasts forever. Unless I am missing
>> something obvious.
> 
> Unfortunately, this is not the case. A NOP entry still reaches WAL.
> That's why we need NOP entries: they reside in WAL but do nothing.
> That's for the sake of the vclock bump. Otherwise we could skip such entries
> completely, without nopifying them.
> 
> So, even if the entry is nopified, it would enter WAL sooner or later.
> 
> I just realised what the problem is: the entry is waiting on a limbo latch
> inside the NOPify procedure. That's why it never reaches the journal
> (until we re-enable replica3, at least).
> 
> I don't know how to wait for this entry's arrival then.
> The current test version looks OK to me.
> 
> Vlad, do you have any ideas here?

I think it might be worth adding an errinj for the number of blocked
fibers waiting on the limbo latch. Could even expose that to box.info.qsync,
seems like useful info. Would help to measure contention.
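
A minimal sketch of the idea, as a standalone toy that uses pthreads
instead of fibers. All names here (toy_latch, waiters, run_demo) are
hypothetical and are not the existing latch, limbo, or errinj API. It
only shows how a waiter counter would let a test wait for "the entry
has arrived and is blocked on the latch" instead of waiting for a WAL
write that cannot happen yet. Build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Toy stand-in for the limbo latch; NOT Tarantool's struct latch. */
struct toy_latch {
	pthread_mutex_t mutex;
	/* Number of threads currently blocked in toy_latch_lock(). */
	atomic_int waiters;
};

struct demo_result {
	int waiters_seen;
	int writes_while_blocked;
	int writes_after_unlock;
};

static struct toy_latch latch;
/* Plays the role of ERRINJ_WAL_WRITE_COUNT in this toy. */
static atomic_int wal_write_count;

static void
toy_latch_lock(struct toy_latch *l)
{
	/* Fast path: the latch is free, nobody has to wait. */
	if (pthread_mutex_trylock(&l->mutex) == 0)
		return;
	/* Slow path: stay counted as a waiter while blocked. */
	atomic_fetch_add(&l->waiters, 1);
	pthread_mutex_lock(&l->mutex);
	atomic_fetch_sub(&l->waiters, 1);
}

static void
toy_latch_unlock(struct toy_latch *l)
{
	pthread_mutex_unlock(&l->mutex);
}

/* The "applier": it must take the latch before it can write the row. */
static void *
applier_f(void *arg)
{
	(void)arg;
	toy_latch_lock(&latch);
	atomic_fetch_add(&wal_write_count, 1);
	toy_latch_unlock(&latch);
	return NULL;
}

struct demo_result
run_demo(void)
{
	struct demo_result res;
	pthread_t applier;

	pthread_mutex_init(&latch.mutex, NULL);
	atomic_store(&latch.waiters, 0);
	atomic_store(&wal_write_count, 0);

	/* The node holds the latch, like replica3 while it is "disabled". */
	toy_latch_lock(&latch);
	int rc = pthread_create(&applier, NULL, applier_f, NULL);
	assert(rc == 0);
	(void)rc;

	/* Analogue of wait_cond: wait for the waiter, not for a WAL write. */
	while (atomic_load(&latch.waiters) != 1)
		sched_yield();
	res.waiters_seen = atomic_load(&latch.waiters);
	res.writes_while_blocked = atomic_load(&wal_write_count);

	/* "Re-enable" the node: the blocked row can reach the journal now. */
	toy_latch_unlock(&latch);
	pthread_join(applier, NULL);
	res.writes_after_unlock = atomic_load(&wal_write_count);

	pthread_mutex_destroy(&latch.mutex);
	return res;
}
```

In a test this would presumably turn into a wait_cond on the new counter
becoming 1 before the select, in place of the wait on
ERRINJ_WAL_WRITE_COUNT == write_cnt + 3 that never completes.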

