[Tarantool-patches] [PATCH v10 3/4] limbo: filter incoming synchro requests

Thu Aug 12 19:59:31 MSK 2021

On 10.08.2021 17:36, Cyrill Gorcunov wrote:
> On Tue, Aug 10, 2021 at 03:31:04PM +0300, Vladislav Shpilevoy wrote:
>>>
>>>   | I could add some operations here but not sure if it is worth it.
>>>
>>> Let me state it clearly then - I could add this assert() if you insist
>>> but I think we already spread too many assertions all over the code,
>>> and if it is possible I would be glad not to add new ones. After all
>>> either we should add this assert() to each filter chain or not add
>>> at all, otherwise there will be kind of code imbalance.
>>
>> What is wrong with the assertions that you don't like adding them?
>> You add panics quite often, and they cost some perf. But asserts
>> just help to catch bugs and cost nothing in Release build.
> 
> I personally think that either some particular condition is critical
> so that you can't continue execution if it failed and because of this
> it must be tested even in release builds. And here panic() is needed.
> Or it is not critical and we don't need assert(). In particular for filtering
> case if we occasionally called it where we should not then it might trigger a
> false positive error breaking the replication but not corrupting data,
> and in such case it is ok and no assertion is needed. In reverse case,
> say enabling filtering in wrong place would cause data corruption then
> we need a panic not assert. So I don't see much point in assert calls
> at all. Surely I can add it if you prefer. I simply don't like them.
> 
> You know, we've been talking with Serge today about enabling filtering
> all the time because this looks pretty fishy that I do turn it on/off.
> So I'm working on removing this code and the question with assert will
> disappear on its own.

Assertions help to catch tons of rubbish during tests, and they do so
quite often. Just grep for 'assert' in our GitHub tickets. So please
add them in all the non-trivial places. It is easier to drop trivial
ones on a review than to try to spot places lacking the asserts.
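
To illustrate the difference discussed above, here is a minimal sketch,
not the patch code. All names in it (toy_limbo, toy_panic_if,
toy_filter_confirm, and 0 standing in for the nil owner id) are
hypothetical. The assert() line disappears when the file is built with
-DNDEBUG, while the explicit check stays in every build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the nil replica id. */
enum { TOY_ID_NIL = 0 };

/* Hypothetical, heavily simplified limbo; not struct txn_limbo. */
struct toy_limbo {
	uint32_t owner_id;
	int queue_len;
};

static bool
toy_limbo_is_empty(const struct toy_limbo *limbo)
{
	return limbo->queue_len == 0;
}

/* Always-on check: present in Release builds, costs a branch. */
static void
toy_panic_if(bool cond, const char *msg)
{
	if (cond) {
		fprintf(stderr, "panic: %s\n", msg);
		abort();
	}
}

/* Returns 0 when the request may pass, -1 when it is rejected. */
static int
toy_filter_confirm(const struct toy_limbo *limbo)
{
	/* Debug-only invariant: compiled out under -DNDEBUG. */
	assert(limbo->owner_id != TOY_ID_NIL || toy_limbo_is_empty(limbo));
	/* Condition considered fatal in any build. */
	toy_panic_if(limbo->queue_len < 0, "negative limbo queue length");
	return toy_limbo_is_empty(limbo) ? -1 : 0;
}
```

In a Debug build a limbo with transactions but no owner trips the
assert() in tests. In a Release build only the toy_panic_if() branch
remains.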

>>>>>>> +static int
>>>>>>> +filter_confirm_rollback(struct txn_limbo *limbo,
>>>>>>> +			const struct synchro_request *req)
>>>>>>> +{
>>>>>>> +	/*
>>>>>>> +	 * When limbo is empty we have nothing to
>>>>>>> +	 * confirm/commit and if this request comes
>>>>>>> +	 * in it means the split brain has happened.
>>>>>>> +	 */
>>>>>>> +	if (!txn_limbo_is_empty(limbo))
>>>>>>> +		return 0;
>>>>>>
>>>>>> 9. What if rollback is for LSN > limbo's last LSN? It
>>>>>> also means nothing to do. The same for confirm LSN < limbo's
>>>>>> first LSN.
>>>>>
>>>>> static void
>>>>> txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
>>>>> {
>>>>> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
>>>>>
>>>>> txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>>>>> {
>>>>> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
>>>>>
>>>>> Currently we're allowed to process empty limbo if only owner is not nil,
>>>>> I think I should add this case here.
>>>>
>>>> My question is not about the owner ID. I asked what if rollback/confirm
>>>> try to cover a range not present in the limbo while it is not empty. If
>>>> it is not empty, it has an owner obviously. But it does not matter.
>>>> What if it has an owner, has transactions, but you got ROLLBACK/CONFIRM
>>>> for data out of the LSN range present in the limbo?
>>>
>>> Since the terms are matching I think such scenario should be fine, right?
>>> IOW, some old replica has been stopped for some reason and been living out
>>> of quorum for some time thus such requests should be considered as OK to
>>> pass and when filter accepts them they will reach txn_limbo_read_confirm
>>> or txn_limbo_read_rollback where they will be simply ignored as far as I
>>> understand. IOW, such requests are valid, no?
>>
>> If a replica is outdated, it should not matter. It will receive the needed
>> data in order anyway. Like if the data was just sent. Hence, it seems
>> irrelevant whether it is outdated. And still looks the same as the thing
>> you are trying to filter out (when the limbo is empty = confirm/rollback
>> do not cover anything too).
> 
> Wait, Vlad, I don't understand. When packet comes in we verify for terms
> matching, if it doesn't match then we drop the request with error. Now
> assume the term is valid and we get confirm/rollback over already processed
> entry. Initially I thought it is an error due to split-brain because there
> is no data in limbo which we can compare against. Then I looked into
> txn_limbo_read_confirm and the code silently passes if queue is empty
> so I presumed that I simply need to convert the assert() above into
> the real verification condition. And after your reply I am confused again.
> 
> Assume I'm a replica and have no data in limbo, if I obtain some
> confirm/rollback it means the master node did some transactions behind my
> back so I should refuse to proceed and refetch all data again, right?
> 
> Another scenario is that I'm the leader node, I sent some transactions,
> then gathered the quorum and made the limbo empty, at some moment the
> replica will send me confirm packet back and I should simply advance
> the vclock and ignore this packet, correct?

You have an answer in your question - why would a valid replica send
you a confirm on its own? Only your own confirms are valid since you
are the leader from now on.

If you are talking about the replica sending you your own confirm -
it can't happen. Your own data is not sent back to you. Sometimes it
can be delivered indirectly, but it is simply filtered out as already
applied in the applier and never reaches spaces, limbo, wal or anything
else.
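
The filtering mentioned above can be pictured roughly like this. It is
a simplified sketch under assumptions: toy_vclock, toy_row and
toy_applier_accept are hypothetical names, and the real applier does
more than a single LSN comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TOY_VCLOCK_MAX = 32 };

/* Hypothetical vclock: the last applied LSN per replica id. */
struct toy_vclock {
	int64_t lsn[TOY_VCLOCK_MAX];
};

/* Hypothetical row header: who produced the row and its LSN. */
struct toy_row {
	uint32_t replica_id;
	int64_t lsn;
};

/*
 * Returns true if the row is new and gets applied, false if it
 * is dropped as already applied. A row that originated on this
 * node and comes back indirectly always falls into the second
 * case, because its LSN is already covered by the local vclock.
 */
static bool
toy_applier_accept(struct toy_vclock *vclock, const struct toy_row *row)
{
	assert(row->replica_id < TOY_VCLOCK_MAX);
	if (row->lsn <= vclock->lsn[row->replica_id])
		return false;
	vclock->lsn[row->replica_id] = row->lsn;
	return true;
}
```

With such a check, a dropped row stops at the applier and never gets
as far as the limbo.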