[Tarantool-patches] [PATCH v10 3/4] limbo: filter incoming synchro requests

Sun Aug 8 14:43:14 MSK 2021

Hi! Thanks for working on this!

>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 9db286ae2..f64b6fa35 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -514,6 +515,11 @@ applier_fetch_snapshot(struct applier *applier)
>>>  	struct ev_io *coio = &applier->io;
>>>  	struct xrow_header row;
>>>  
>>> +	txn_limbo_filter_disable(&txn_limbo);
>>> +	auto filter_guard = make_scoped_guard([&]{
>>> +		txn_limbo_filter_enable(&txn_limbo);
>>> +	});
>>
>> 3. Why do you need to enable/disable the filter here? Shouldn't snapshot
>> contain only valid data? Moreover, AFAIU it can't contain any limbo
>> rows at all. The limbo snapshot is sent separately, but the data flow
>> does not have anything except pure data. The same for the
>> join.
> 
> The idea is that snapshot/recovery has valid data which forms the initial
> limbo state, against which we will then apply filtering.

You didn't really answer the question. Why do you need the filtering
here if all the data is correct anyway? Will it all work if I just
drop this filter disable from here?

>> And how is it related to applier_register below? It does not download
>> any data at all, does it?
> 
> After the register stage is complete we catch up with the latest not yet
> downloaded data (final join stage), where we still assume that the data
> received is valid and do not verify it.

Register just makes the master give you a unique ID. It does not send
any data like joins do, AFAIR. Does it work if you drop the filter disable
from here?

> Actually this is a good question. I have to recheck this because in the
> previous series, when I ran join/recovery with filtering enabled, I sometimes
> hit issues where the filter didn't pass. Give me some time, maybe we will
> sort all this out and manage to keep filtering enabled all the time.
> 
>>> +
>>> +/**
>>> + * Common chain for any incoming packet.
>>> + */
>>> +static int
>>> +filter_in(struct txn_limbo *limbo, const struct synchro_request *req)
>>> +{
>>> +	(void)limbo;
>>
>> 6. So you have the filtering enabled dynamically in the limbo, but
>> you do not use the limbo here? Why? Maybe at least add an assertion
>> that the filter is enabled?
> 
> All chains have the same interface; it just happens that the common
> filter doesn't need to use the limbo. I could add some operations here,
> but I'm not sure it is worth it. As far as I can see, leaving unused args
> is pretty fine in our code base.

You didn't answer the second question:

	Maybe at least add an assertion that the filter is enabled?

>>> +/**
>>> + * Filter CONFIRM and ROLLBACK packets.
>>> + */
>>> +static int
>>> +filter_confirm_rollback(struct txn_limbo *limbo,
>>> +			const struct synchro_request *req)
>>> +{
>>> +	/*
>>> +	 * When limbo is empty we have nothing to
>>> +	 * confirm/commit and if this request comes
>>> +	 * in it means the split brain has happened.
>>> +	 */
>>> +	if (!txn_limbo_is_empty(limbo))
>>> +		return 0;
>>
>> 9. What if rollback is for LSN > limbo's last LSN? It
>> also means nothing to do. The same for confirm LSN < limbo's
>> first LSN.
> 
> static void
> txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
> {
> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
> 
> txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
> {
> -->	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
> 
> Currently we're allowed to process empty limbo if only owner is not nil,
> I think I should add this case here.

My question is not about the owner ID. I asked what if rollback/confirm
try to cover a range not present in the limbo while it is not empty. If
it is not empty, it has an owner obviously. But it does not matter.
What if it has an owner, has transactions, but you got ROLLBACK/CONFIRM
for data out of the LSN range present in the limbo?
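
To make the case concrete, here is a rough sketch of the check I mean. It
assumes CONFIRM covers entries with lsn <= its LSN and ROLLBACK covers
entries with lsn >= its LSN. `struct lsn_range`, `enum req_kind` and
`request_covers_queue()` are hypothetical names for illustration only, they
are not part of the patch or of the limbo API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the LSN range queued in a non-empty limbo. */
struct lsn_range {
	int64_t first_lsn;
	int64_t last_lsn;
};

/* Hypothetical request kinds, not the real IPROTO_* constants. */
enum req_kind {
	REQ_CONFIRM,
	REQ_ROLLBACK,
};

/*
 * Return true when the request touches at least one queued entry.
 * Assumption: CONFIRM applies to entries with lsn <= req_lsn,
 * ROLLBACK applies to entries with lsn >= req_lsn.
 */
static bool
request_covers_queue(const struct lsn_range *range, enum req_kind kind,
		     int64_t req_lsn)
{
	assert(range->first_lsn <= range->last_lsn);
	switch (kind) {
	case REQ_CONFIRM:
		/* Nothing to confirm if it ends before the first entry. */
		return req_lsn >= range->first_lsn;
	case REQ_ROLLBACK:
		/* Nothing to roll back if it starts after the last entry. */
		return req_lsn <= range->last_lsn;
	}
	return false;
}
```

With a queue holding LSNs [10; 20], CONFIRM for 5 and ROLLBACK for 25 both
have nothing to process even though the limbo is not empty. The empty-limbo
check in the patch does not catch either of them.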

>>> +		/*
>>> +		 * Some entries are present in the limbo,
>>> +		 * we need to make sure the @a promote_lsn
>>> +		 * lays inside limbo [first; last] range.
>>> +		 * So that the promote request has some
>>> +		 * queued data to process, otherwise it
>>> +		 * means the request comes from split
>>> +		 * brained node.
>>> +		 */
>>> +		struct txn_limbo_entry *first, *last;
>>> +
>>> +		first = txn_limbo_first_entry(limbo);
>>> +		last = txn_limbo_last_entry(limbo);
>>> +
>>> +		if (first->lsn < promote_lsn ||
>>> +		    last->lsn > promote_lsn) {
>>
>> 11. This seems to be broken. In the comment you said the
>> error is when
>>
>> 	promote < first or promote > last
>>
>> And here in the condition you return an error when
>>
>> 	promote > first or promote < last
>>
>> Why?
> 
> Good catch, typo. Actually I've updated this hunk locally
> but didn't push it out. We need "first <= promote <= last"

Is it covered with a test?
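
For reference, a minimal sketch of the "first <= promote <= last" condition
together with the boundary cases I would expect a test to exercise. The
helper names are made up for the example and plain integers stand in for the
limbo entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when promote_lsn lies inside the
 * [first_lsn; last_lsn] range queued in the limbo.
 */
static bool
promote_lsn_in_range(int64_t first_lsn, int64_t last_lsn,
		     int64_t promote_lsn)
{
	return first_lsn <= promote_lsn && promote_lsn <= last_lsn;
}

/* Boundary cases: both edges are valid, one step outside is not. */
static void
promote_range_selfcheck(void)
{
	assert(promote_lsn_in_range(10, 20, 10));
	assert(promote_lsn_in_range(10, 20, 15));
	assert(promote_lsn_in_range(10, 20, 20));
	assert(!promote_lsn_in_range(10, 20, 9));
	assert(!promote_lsn_in_range(10, 20, 21));
}
```

The condition in the hunk above rejects a promote LSN strictly inside the
range, so a test with a single in-range PROMOTE would already have caught it.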

>>> +static int (*filter_req[FILTER_MAX])
>>> +(struct txn_limbo *limbo, const struct synchro_request *req) = {
>>> +	[FILTER_IN]		= filter_in,
>>> +	[FILTER_CONFIRM]	= filter_confirm_rollback,
>>> +	[FILTER_ROLLBACK]	= filter_confirm_rollback,
>>> +	[FILTER_PROMOTE]	= filter_promote_demote,
>>> +	[FILTER_DEMOTE]		= filter_promote_demote,
>>
>> 12. What is this? Wouldn't it be much much much simpler to just
>> call filter_in() always + make a switch case for the request type +
>> call the needed functions?
>>
>> What is worse, you already have the switch-case anyway, but you
>> also added some strange loop, masks, and virtual functions ... .
>> I don't think I could make it more complex even if I wanted to,
>> sorry. Why so complicated?
> 
> It might look easier, but it won't allow extending the filtering in the
> future without rewriting too much.

I propose to think about that when we add more packet types. Also, in your
code you will need to extend both switch-case and your masks. While if we
had only the switch-case, you would only need to update the switch-case.
So it means less work. 'Switch-case' vs 'switch-case + virtual functions and
the loop'.

> I'm pretty sure this number of packet types
> is not final and we will have more. Using bitmap routing you can easily
> hook in any call sequence you need, while explicit if/else chains or direct
> calls via case-by-request-type won't allow that. So no, this is
> not complicated at all but rather close to real packet filtering code.

You literally filter 4 packet types. Please, do not overcomplicate it. It
is not the kernel, not some network filter for all kinds of protocols. It is
just 4 packet types. Which you **already** walk in switch-case + call the
virtual functions in a loop. While I propose to keep only the switch-case.
Even its extension will look simpler than what you have now. Because you
will also need to patch the switch-case.

Just compare:

	filters = [
		...
	]

	switch {
	case ...:
	case ...:
	}

	while (...) {
		...
	}

vs

	switch {
	case ...:
	case ...:
	}

You have the first version, and will need to update
the masks, the switch-case and still have the loop
with a fullscan.

In the second version you would only have a
switch-case. Doesn't it look simpler?

> Anyway, which form would you prefer?
> 
> txn_limbo_filter_locked() {
> 	int rc = filter_in();
> 	if (rc != 0)
> 		return -1;
> 
> 	switch (req->type) {
> 	case IPROTO_CONFIRM:
> 	case IPROTO_ROLLBACK:
> 		rc = filter_confirm_rollback();
> 		break;
> 	...
> 	}
> 
> 	return rc;
> }
> 
> Kind of this?

Yes! It is much simpler and still easy to extend. Please,
just try and you will see how much simpler it is.
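
A compilable toy version of that shape, in case it helps. Everything
prefixed with `sketch_`/`SKETCH_` is hypothetical: the real code would switch
on the IPROTO_* request types and call the filter functions from the patch,
and the stub bodies here only exist so the example builds:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request types standing in for the IPROTO_* ones. */
enum sketch_req_type {
	SKETCH_CONFIRM,
	SKETCH_ROLLBACK,
	SKETCH_PROMOTE,
	SKETCH_DEMOTE,
};

struct sketch_req {
	enum sketch_req_type type;
	long lsn;
};

/* Stub filters: return 0 to accept the request, -1 to reject it. */
static int
sketch_filter_in(const struct sketch_req *req)
{
	return req->lsn >= 0 ? 0 : -1;
}

static int
sketch_filter_confirm_rollback(const struct sketch_req *req)
{
	(void)req;
	return 0;
}

static int
sketch_filter_promote_demote(const struct sketch_req *req)
{
	(void)req;
	return 0;
}

/* Common check first, then one switch-case by request type. */
static int
sketch_filter(const struct sketch_req *req)
{
	assert(req != NULL);
	if (sketch_filter_in(req) != 0)
		return -1;
	switch (req->type) {
	case SKETCH_CONFIRM:
	case SKETCH_ROLLBACK:
		return sketch_filter_confirm_rollback(req);
	case SKETCH_PROMOTE:
	case SKETCH_DEMOTE:
		return sketch_filter_promote_demote(req);
	}
	/* Unknown type: reject. */
	return -1;
}
```

Adding a new packet type here means adding one `case` and one function, with
no table, mask, or loop to keep in sync.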