[Tarantool-patches] [PATCH v2 2/2] box: fix an assertion failure in box.ctl.promote()

Cyrill Gorcunov gorcunov at gmail.com
Wed May 26 09:14:24 MSK 2021


On Tue, May 25, 2021 at 01:39:29PM +0300, Serge Petrenko wrote:
> box.ctl.promote() used to assume that the last synchronous entry is
> already written to WAL by the time it's called. This is not the case
> when promote is executed on the limbo owner. The last synchronous entry
> might still be en route to WAL.

Typo "en" -> "in".

> @@ -1618,14 +1618,29 @@ box_promote(void)
>  				 txn_limbo.owner_id);
>  			return -1;
>  		}
> +		if (txn_limbo_is_empty(&txn_limbo)) {
> +			wait_lsn = txn_limbo.confirmed_lsn;
> +			goto promote;
> +		}
>  	}
>  
> -	/*
> -	 * promote() is a no-op on the limbo owner, so all the rows
> -	 * in the limbo must've come through the applier meaning they already
> -	 * have an lsn assigned, even if their WAL write hasn't finished yet.
> -	 */
> -	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
> +	struct txn_limbo_entry *last_entry;
> +	last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
> +	/* Wait for the last entries WAL write. */
> +	if (last_entry->lsn < 0) {
> +		if (wal_sync(NULL) < 0)
> +			return -1;
> +		if (txn_limbo_is_empty(&txn_limbo)) {
> +			wait_lsn = txn_limbo.confirmed_lsn;
> +			goto promote;
> +		}
> +		if (last_entry != txn_limbo_last_synchro_entry(&txn_limbo)) {

This is a bit dangerous. We cache a pointer and then go to fiber_yield,
which switches context, at this moment the pointer become dangling one
and we simply can't be sure if it _were_ reused. IOW, Serge are we
100% sure that the same pointer with same address but with new data
won't appear here as last entry in limbo?

> +			diag_set(ClientError, ER_QUORUM_WAIT, quorum,
> +				 "new synchronous transactions appeared");
> +			return -1;
> +		}
> +	}
> +	wait_lsn = last_entry->lsn;
>  	assert(wait_lsn > 0);


More information about the Tarantool-patches mailing list