[Tarantool-patches] [PATCH v2 2/2] box: fix an assertion failure in box.ctl.promote()

Serge Petrenko sergepetrenko at tarantool.org
Wed May 26 11:25:29 MSK 2021



26.05.2021 09:14, Cyrill Gorcunov пишет:
> On Tue, May 25, 2021 at 01:39:29PM +0300, Serge Petrenko wrote:

Hi! Thanks for the review!

>> box.ctl.promote() used to assume that the last synchronous entry is
>> already written to WAL by the time it's called. This is not the case
>> when promote is executed on the limbo owner. The last synchronous entry
>> might still be en route to WAL.
> Typo "en" -> "in".

No, "en route" means в пути/по пути.

>
>> @@ -1618,14 +1618,29 @@ box_promote(void)
>>   				 txn_limbo.owner_id);
>>   			return -1;
>>   		}
>> +		if (txn_limbo_is_empty(&txn_limbo)) {
>> +			wait_lsn = txn_limbo.confirmed_lsn;
>> +			goto promote;
>> +		}
>>   	}
>>   
>> -	/*
>> -	 * promote() is a no-op on the limbo owner, so all the rows
>> -	 * in the limbo must've come through the applier meaning they already
>> -	 * have an lsn assigned, even if their WAL write hasn't finished yet.
>> -	 */
>> -	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>> +	struct txn_limbo_entry *last_entry;
>> +	last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
>> +	/* Wait for the last entries WAL write. */
>> +	if (last_entry->lsn < 0) {
>> +		if (wal_sync(NULL) < 0)
>> +			return -1;
>> +		if (txn_limbo_is_empty(&txn_limbo)) {
>> +			wait_lsn = txn_limbo.confirmed_lsn;
>> +			goto promote;
>> +		}
>> +		if (last_entry != txn_limbo_last_synchro_entry(&txn_limbo)) {
> This is a bit dangerous. We cache a pointer and then go to fiber_yield,
> which switches context, at this moment the pointer become dangling one
> and we simply can't be sure if it _were_ reused. IOW, Serge are we
> 100% sure that the same pointer with same address but with new data
> won't appear here as last entry in limbo?

I agree this solution is not perfect.

An alternative would be to do the following:
1) Check that the limbo owner hasn't changed
2) Check that the last entry has positive lsn (e.g. it's not a new entry 
which
     wasn't yet written to WAL). And that this lsn is equal to the lsn 
of our entry.

But what if our entry was confirmed and destroyed during wal_sync()? We 
can't compare
other entries lsn with this ones.
>
>> +			diag_set(ClientError, ER_QUORUM_WAIT, quorum,
>> +				 "new synchronous transactions appeared");
>> +			return -1;
>> +		}
>> +	}
>> +	wait_lsn = last_entry->lsn;
>>   	assert(wait_lsn > 0);

-- 
Serge Petrenko



More information about the Tarantool-patches mailing list