[Tarantool-patches] [PATCH v10 3/4] limbo: filter incoming synchro requests

Cyrill Gorcunov gorcunov at gmail.com
Tue Aug 10 17:36:02 MSK 2021


On Tue, Aug 10, 2021 at 03:31:04PM +0300, Vladislav Shpilevoy wrote:
> > 
> >   | I could add some operations here but not sure if it worth it.
> > 
> > Let me state it clearly then - I could add this assert() if you insist
> > but I think we already spread too many assertions all over the code,
> > and if it is possible I would be glad not to add new ones. After all
> > either we should add this assert() to each filter chain or not add
> > at all, otherwise there will be kind of code imbalance.
> 
> What is wrong with the assertions that you don't like adding them?
> You add panics quite often, and they cost some perf. But asserts
> just help to catch bugs and cost nothing in Release build.

I personally think that a condition is either critical, so that you can't
continue execution if it fails, and then it must be tested even in release
builds; here panic() is needed. Or it is not critical and we don't need
assert(). In particular, for the filtering case, if we accidentally called it
where we should not, it might trigger a false positive error, breaking
replication but not corrupting data; in such a case that is OK and no
assertion is needed. In the reverse case, say enabling filtering in the wrong
place would cause data corruption, we need a panic, not an assert. So I don't
see much point in assert() calls at all. Sure, I can add it if you prefer.
I simply don't like them.
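To illustrate the distinction, here is a minimal sketch. It is not code from
the patch: panic_unless(), apply_entry_sketch() and filter_request_sketch()
are hypothetical names made up for this example. The only hard fact used is
that a standard assert() is compiled out when NDEBUG is defined, so it checks
nothing in a release build, while an explicit check runs in every build.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a panic() helper: the condition is
 * checked in every build type and the process aborts on failure.
 * A plain assert(cond) would vanish when NDEBUG is defined.
 */
#define panic_unless(cond, msg) do {				\
	if (!(cond)) {						\
		fprintf(stderr, "panic: %s\n", (msg));		\
		abort();					\
	}							\
} while (0)

/*
 * Critical invariant: continuing would corrupt data, so the
 * check must survive in release builds.
 */
static void
apply_entry_sketch(const int *entry)
{
	panic_unless(entry != NULL, "entry must not be NULL");
	/* ... apply the entry ... */
}

/*
 * Non-critical misuse: report an error the caller can handle.
 * Returns 0 when the request passes and -1 when it is rejected.
 */
static int
filter_request_sketch(int filtering_enabled, int request_is_valid)
{
	if (!filtering_enabled)
		return 0;
	return request_is_valid ? 0 : -1;
}
```

With this split, a wrongly enabled filter yields at worst a rejected request
(-1), which breaks replication but leaves the data intact.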

You know, Serge and I talked today about enabling filtering all the time,
because it looks pretty fishy that I turn it on and off. So I'm working on
removing this code, and the question of the assert will disappear on its
own.

> >>>>> +static int
> >>>>> +filter_confirm_rollback(struct txn_limbo *limbo,
> >>>>> +			const struct synchro_request *req)
> >>>>> +{
> >>>>> +	/*
> >>>>> +	 * When limbo is empty we have nothing to
> >>>>> +	 * confirm/commit and if this request comes
> >>>>> +	 * in it means the split brain has happened.
> >>>>> +	 */
> >>>>> +	if (!txn_limbo_is_empty(limbo))
> >>>>> +		return 0;
> >>>>
> >>>> 9. What if rollback is for LSN > limbo's last LSN? It
> >>>> also means nothing to do. The same for confirm LSN < limbo's
> >>>> first LSN.
> >>>
> >>> static void
> >>> txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
> >>> {
> >>> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
> >>>
> >>> txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
> >>> {
> >>> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
> >>>
> >>> Currently we're allowed to process empty limbo if only owner is not nil,
> >>> I think I should add this case here.
> >>
> >> My question is not about the owner ID. I asked what if rollback/confirm
> >> try to cover a range not present in the limbo while it is not empty. If
> >> it is not empty, it has an owner obviously. But it does not matter.
> >> What if it has an owner, has transactions, but you got ROLLBACK/CONFIRM
> >> for data out of the LSN range present in the limbo?
> > 
> > Since the terms are matching I think such a scenario should be fine, right?
> > IOW, some old replica has been stopped for some reason and been living out
> > of quorum for some time thus such requests should be considered as OK to
> > pass and when filter accepts them they will reach txn_limbo_read_confirm
> > or txn_limbo_read_rollback where they will be simply ignored as far as I
> > understand. IOW, such requests are valid, no?
> 
> If a replica is outdated, it should not matter. It will receive the needed
> data in order anyway. Like if the data was just sent. Hence, it seems
> irrelevant whether it is outdated. And still looks the same as the thing
> you are trying to filter out (when the limbo is empty = confirm/rollback
> do not cover anything too).

Wait, Vlad, I don't understand. When a packet comes in we verify that the
terms match; if they don't, we drop the request with an error. Now assume
the term is valid and we get a confirm/rollback over an already processed
entry. Initially I thought it was an error due to split-brain, because there
is no data in the limbo which we can compare against. Then I looked into
txn_limbo_read_confirm, and the code silently passes if the queue is empty,
so I presumed that I simply need to convert the assert() above into
a real verification condition. And after your reply I am confused again.
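To make the cases under discussion concrete, here is a rough sketch that only
classifies an incoming CONFIRM/ROLLBACK and does not decide which classes
are errors, since that is exactly the open question. Everything in it is
hypothetical: struct limbo_sketch is a simplified stand-in, not the real
struct txn_limbo, and the range rule (CONFIRM covers entries up to the LSN,
ROLLBACK covers entries from the LSN on) is an assumption taken from the
question about "LSN > limbo's last LSN" and "LSN < limbo's first LSN".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the limbo state. */
struct limbo_sketch {
	uint64_t term;		/* Term the limbo currently follows. */
	bool is_empty;		/* True when nothing is pending. */
	int64_t first_lsn;	/* LSN of the oldest pending entry. */
	int64_t last_lsn;	/* LSN of the newest pending entry. */
};

enum req_type_sketch { REQ_CONFIRM, REQ_ROLLBACK };

enum req_class_sketch {
	REQ_TERM_MISMATCH,	/* Terms differ: dropped with an error. */
	REQ_EMPTY_LIMBO,	/* Nothing pending to confirm/rollback. */
	REQ_OUT_OF_RANGE,	/* Limbo not empty, but LSN covers nothing. */
	REQ_COVERS_ENTRIES,	/* At least one pending entry is covered. */
};

static enum req_class_sketch
classify_request_sketch(const struct limbo_sketch *limbo,
			enum req_type_sketch type, uint64_t term,
			int64_t lsn)
{
	/* Debug-only sanity check of the sketch's own invariant. */
	assert(limbo->is_empty || limbo->first_lsn <= limbo->last_lsn);
	if (term != limbo->term)
		return REQ_TERM_MISMATCH;
	if (limbo->is_empty)
		return REQ_EMPTY_LIMBO;
	/* Assumed: CONFIRM covers LSNs <= lsn, ROLLBACK covers LSNs >= lsn. */
	if (type == REQ_CONFIRM && lsn < limbo->first_lsn)
		return REQ_OUT_OF_RANGE;
	if (type == REQ_ROLLBACK && lsn > limbo->last_lsn)
		return REQ_OUT_OF_RANGE;
	return REQ_COVERS_ENTRIES;
}
```

In these terms, the filter in the patch treats REQ_EMPTY_LIMBO as split-brain,
while the question from the review is whether REQ_OUT_OF_RANGE should be
handled the same way.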

Assume I'm a replica and have no data in the limbo. If I receive a
confirm/rollback, it means the master node did some transactions behind my
back, so I should refuse to proceed and refetch all the data again, right?

Another scenario: I'm the leader node, I sent some transactions, then
gathered the quorum and made the limbo empty. At some moment the replica
will send a confirm packet back to me, and I should simply advance
the vclock and ignore this packet, correct?

