[Tarantool-patches] [PATCH v2 2/2] box: fix an assertion failure in box.ctl.promote()
Serge Petrenko
sergepetrenko at tarantool.org
Thu May 27 13:53:54 MSK 2021
On 26.05.2021 21:46, Vladislav Shpilevoy wrote:
>>>> @@ -1618,14 +1618,29 @@ box_promote(void)
>>>> txn_limbo.owner_id);
>>>> return -1;
>>>> }
>>>> + if (txn_limbo_is_empty(&txn_limbo)) {
>>>> + wait_lsn = txn_limbo.confirmed_lsn;
>>>> + goto promote;
>>>> + }
>>>> }
>>>> - /*
>>>> - * promote() is a no-op on the limbo owner, so all the rows
>>>> - * in the limbo must've come through the applier meaning they already
>>>> - * have an lsn assigned, even if their WAL write hasn't finished yet.
>>>> - */
>>>> - wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>>>> + struct txn_limbo_entry *last_entry;
>>>> + last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
>>>> + /* Wait for the last entries WAL write. */
>>>> + if (last_entry->lsn < 0) {
>>>> + if (wal_sync(NULL) < 0)
>>>> + return -1;
>>>> + if (txn_limbo_is_empty(&txn_limbo)) {
>>>> + wait_lsn = txn_limbo.confirmed_lsn;
>>>> + goto promote;
>>>> + }
>>>> + if (last_entry != txn_limbo_last_synchro_entry(&txn_limbo)) {
>>> This is a bit dangerous. We cache a pointer and then go to fiber_yield,
>>> which switches context; at that moment the pointer becomes a dangling one
>>> and we simply can't be sure whether it _was_ reused. IOW, Serge, are we
>>> 100% sure that the same pointer with the same address but with new data
>>> won't appear here as the last entry in the limbo?
>> I agree this solution is not perfect.
>>
>> An alternative would be to do the following:
>> 1) Check that the limbo owner hasn't changed
>> 2) Check that the last entry has a positive lsn (i.e. it's not a new entry which
>> wasn't yet written to WAL), and that this lsn is equal to the lsn of our entry.
>>
>> But what if our entry was confirmed and destroyed during wal_sync()? We can't compare
>> other entries' lsns with this one's.
> As decided in the chat, you can use txn->id. It is unique until
> restart and should help to detect if the last transaction has
> changed.
Yep, thanks for the suggestion!
Here's the diff:
=============================================================
diff --git a/src/box/box.cc b/src/box/box.cc
index 3d9cd0e57..3baae6afe 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1628,13 +1628,19 @@ box_promote(void)
last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
/* Wait for the last entries WAL write. */
if (last_entry->lsn < 0) {
+ int64_t tid = last_entry->txn->id;
if (wal_sync(NULL) < 0)
return -1;
+ if (former_leader_id != txn_limbo.owner_id) {
+ diag_set(ClientError, ER_INTERFERING_PROMOTE,
+ txn_limbo.owner_id);
+ return -1;
+ }
if (txn_limbo_is_empty(&txn_limbo)) {
wait_lsn = txn_limbo.confirmed_lsn;
goto promote;
}
- if (last_entry != txn_limbo_last_synchro_entry(&txn_limbo)) {
+ if (tid != txn_limbo_last_synchro_entry(&txn_limbo)->txn->id) {
diag_set(ClientError, ER_QUORUM_WAIT, quorum,
"new synchronous transactions appeared");
return -1;
=============================================================
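To illustrate why the saved id is safer than the cached pointer, here is a minimal standalone sketch. It is not Tarantool code: struct entry, the one-slot pool and yield_and_replace() are hypothetical stand-ins, chosen so that a freed entry's address is always reused, which is the worst case discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a limbo entry, not the real txn_limbo_entry. */
struct entry {
	int64_t id;  /* unique and growing, playing the role of txn->id */
	int64_t lsn; /* -1 until the WAL write completes */
};

/* A one-slot pool: a freed entry's address is always handed out again. */
static struct entry pool_slot;
static bool pool_busy = false;
static int64_t next_id = 1;

static struct entry *
entry_new(void)
{
	assert(!pool_busy);
	pool_busy = true;
	pool_slot.id = next_id++;
	pool_slot.lsn = -1;
	return &pool_slot;
}

static void
entry_delete(struct entry *e)
{
	assert(e == &pool_slot && pool_busy);
	pool_busy = false;
}

/*
 * Assumed scenario for a yield: the cached entry is confirmed and
 * freed, then a new synchronous entry is queued at the same address.
 */
static struct entry *
yield_and_replace(struct entry *last)
{
	entry_delete(last);
	return entry_new();
}

/* Old check: compare the cached pointer with the current last entry. */
bool
pointer_check_detects_change(void)
{
	struct entry *cached = entry_new();
	struct entry *current = yield_and_replace(cached);
	bool changed = cached != current;
	entry_delete(current);
	return changed;
}

/* New check: save the id before the yield and compare ids afterwards. */
bool
id_check_detects_change(void)
{
	struct entry *cached = entry_new();
	int64_t tid = cached->id;
	struct entry *current = yield_and_replace(cached);
	bool changed = tid != current->id;
	entry_delete(current);
	return changed;
}
```

In this sketch pointer_check_detects_change() returns false, because the new entry lands at the old address, while id_check_detects_change() returns true, because the id is never reused.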
--
Serge Petrenko