[Tarantool-patches] [PATCH v9 4/5] limbo: filter incoming synchro requests

Cyrill Gorcunov gorcunov at gmail.com
Tue Aug 3 16:49:59 MSK 2021


On Tue, Aug 03, 2021 at 01:51:44PM +0300, Serge Petrenko wrote:
> 
> > +		if (req->lsn != 0 ||
> > +		    !iproto_type_is_promote_request(req->type)) {
> > +			say_info("RAFT: rejecting %s request from "
> > +				 "instance id %u for term %llu. "
> > +				 "req->replica_id = 0 but lsn %lld.",
> > +				 iproto_type_name(req->type),
> > +				 req->origin_id, (long long)req->term,
> > +				 (long long)req->lsn);
> > +
> > +			diag_set(ClientError, ER_UNSUPPORTED,
> > +				 "Replication",
> > +				 "empty replica_id with nonzero LSN");
> > +			return -1;
> > +		}
> > +	}
> 
> I agree with Vlad. This may be moved to filter_confirm_rollback.

You know, as I replied to Vlad, I think we should do a common
attibutes verification in early stage, because not only due to
network error but for security reasons as well, ie say some fuzzer
in  network could feed us with screwed data, and when we sort packets
via chains to process we could do an early verification first.

But I won't insist, surely I can move it to promote chain.

> > +
> > +	/*
> > +	 * Explicit split brain situation. Promote
> > +	 * comes in with an old LSN which we've already
> > +	 * processed.
> > +	 */
> > +	if (limbo->confirmed_lsn > promote_lsn) {
> > +		/*
> > +		 * If limbo is empty we're migrating
> > +		 * the owner.
> > +		 */
> > +		if (txn_limbo_is_empty(limbo))
> > +			return 0;
> 
> I don't understand this part. Are you sure this check is needed?
> We're always migrating the owner with a promote.

Serge, I think this code hunk is due to my misunderstanding of
owner migration, should remove it I think, thanks!

> > +
> > +	if (txn_limbo_is_empty(limbo)) {
> > +		/*
> > +		 * Transactions are already rolled back
> > +		 * since the limbo is empty.
> > +		 */
> > +		say_info("RAFT: rejecting %s request from "
> > +			 "instance id %u for term %llu. "
> > +			 "confirmed_lsn %lld < promote_lsn %lld "
> > +			 "and empty limbo.",
> > +			 iproto_type_name(req->type),
> > +			 req->origin_id, (long long)req->term,
> > +			 (long long)limbo->confirmed_lsn,
> > +			 (long long)promote_lsn);
> > +
> > +		diag_set(ClientError, ER_UNSUPPORTED,
> > +			 "Replication",
> > +			 "forward promote LSN "
> > +			 "(empty limbo, split brain)");
> > +		return -1;
> 
> I think it'd be better to have a separate error code for this purpose.
> Say, ER_SPLITBRAIN or something.
> Then applier would have more control over what to do when such an error
> is raised. Say, never reconnect. (I doesn't reconnect on ER_UNSUPPORTED,
> I believe, but a distinct error is still better).

Split-bain is one particular case, since more cases why we refuse to proceed
may appear in future maybe ER_REJECT with rejection error (just like netfilter
does?)

> > +	} else {
> > +		/*
> > +		 * Some entries are present in the limbo,
> > +		 * and if first entry's LSN is greater than
> > +		 * requested then old data either commited
> > +		 * or rolled back, so can't continue.
> > +		 */
> > +		struct txn_limbo_entry *first;
> > +
> > +		first = txn_limbo_first_entry(limbo);
> > +		if (first->lsn > promote_lsn) {
> 
> This should only happen when confirmed_lsn > promote_lsn, shouldn't it?
> If yes, than you've already handled it above.

Nope. This is the case where the data is still in limbo and we've not
yet commited it but master already commited/rolledback it already.
If only I'm not misssing something obvious.

> > +
> > +static int (*filter_req[FILTER_MAX])
> > +(struct txn_limbo *limbo, const struct synchro_request *req) = {
> > +	[FILTER_IN]		= filter_in,
> > +	[FILTER_CONFIRM]	= filter_confirm_rollback,
> > +	[FILTER_ROLLBACK]	= filter_confirm_rollback,
> > +	[FILTER_PROMOTE]	= filter_promote,
> > +	[FILTER_DEMOTE]		= filter_demote,
> 
> Demote should be filtered the same way as promote, they are
> basically the same requests with same meaning.
> 
> demote is just a promote for replica id 0, because we couldn't
> do promote replica id 0.

Aha, thanks, will update.

> > -	if (req->replica_id == REPLICA_ID_NIL) {
> > -		/*
> > -		 * The limbo was empty on the instance issuing the request.
> > -		 * This means this instance must empty its limbo as well.
> > -		 */
> > -		assert(lsn == 0 && iproto_type_is_promote_request(req->type));
> > -	} else if (req->replica_id != limbo->owner_id) {
> > +	if (req->replica_id != limbo->owner_id) {
> >   		/*
> >   		 * Ignore CONFIRM/ROLLBACK messages for a foreign master.
> >   		 * These are most likely outdated messages for already confirmed
> 
> We should error when request -> replica_id != limbo->owner_id.
> For every entry type: promote/demote/confirm/rollback.
> 
> req->replica_id != limbo->owner_id means that the remote instance has
> taken some actions in the past (say, confirmed something) of which we
> didn't know until now. This is basically a splitbrain again.

Good point. I think we should put it into filter-in chain then.

	Cyrill


More information about the Tarantool-patches mailing list