Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 3/3] box: fix an assertion failure in box.ctl.promote()
Date: Tue, 25 May 2021 13:39:03 +0300	[thread overview]
Message-ID: <0776f6e0-7a78-ad10-f5f8-8a08f84cd665@tarantool.org> (raw)
In-Reply-To: <1b0f0a3d-59ba-38ba-7f7e-f214664c8976@tarantool.org>



23.05.2021 15:18, Vladislav Shpilevoy пишет:
> Hi! Thanks for the patch!
>
> I see some of CI jobs have failed the new test:
> https://github.com/tarantool/tarantool/runs/2620153809

Thanks for the review!
CI issues fixed in v2.

>
> See 4 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index c10e0d8bf..1b1e7eec0 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1442,17 +1442,22 @@ box_quorum_on_ack_f(struct trigger *trigger, void *event)
>>   }
>>   
>>   /**
>> - * Wait until at least @a quorum of nodes confirm @a target_lsn from the node
>> - * with id @a lead_id.
>> + * Wait until at least @a quorum of nodes confirm the last available synchronous
>> + * entry from the node with id @a lead_id.
>>    */
>>   static int
>> -box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
>> +box_wait_quorum(uint32_t lead_id, struct txn_limbo_entry **entry, int quorum,
>>   		double timeout)
> 1. Maybe try to leave this function not depending on the limbo
> and its entries? It was supposed to wait for replication of just
> LSN, not necessarily a synchronous transaction.

Ok. This made the code simpler IMO, but fine. I see your point.

>
>>   {
>>   	struct box_quorum_trigger t;
>>   	memset(&t, 0, sizeof(t));
>>   	vclock_create(&t.vclock);
>>   
>> +	*entry = txn_limbo_wait_lsn_assigned(&txn_limbo);
>> +	if (*entry == NULL)
>> +		return -1;
>> +	int64_t target_lsn = (*entry)->lsn;
>> +
>>   	/* Take this node into account immediately. */
>>   	int ack_count = vclock_get(box_vclock, lead_id) >= target_lsn;
>>   	replicaset_foreach(replica) {
>> @@ -1622,22 +1627,17 @@ box_promote(void)
>>   		}
>>   	}
>>   
>> -	/*
>> -	 * promote() is a no-op on the limbo owner, so all the rows
>> -	 * in the limbo must've come through the applier meaning they already
>> -	 * have an lsn assigned, even if their WAL write hasn't finished yet.
>> -	 */
>> -	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>> -	assert(wait_lsn > 0);
>> -
>> -	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
>> +	struct txn_limbo_entry *last_entry;
>> +	rc = box_wait_quorum(former_leader_id,&last_entry, quorum,
> 2. Missing whitespace after the first argument.

Thanks

>
>>   			     replication_synchro_timeout);
>>   	if (rc == 0) {
>> +		wait_lsn = last_entry->lsn;
>>   		if (quorum < replication_synchro_quorum) {
>>   			diag_set(ClientError, ER_QUORUM_WAIT, quorum,
>>   				 "quorum was increased while waiting");
>>   			rc = -1;
>> -		} else if (wait_lsn < txn_limbo_last_synchro_entry(&txn_limbo)->lsn) {
>> +		} else if (last_entry !=
>> +			   txn_limbo_last_synchro_entry(&txn_limbo)) {
>>   			diag_set(ClientError, ER_QUORUM_WAIT, quorum,
>>   				 "new synchronous transactions appeared");
>>   			rc = -1;
> 3. Could all the 3 commits be replaced with calling wal_sync() in the
> beginning of the promote() if we see the last LSN is unknown? After
> wal_sync() several outcomes are possible:
>
> 	- All was rolled back, and the limbo is empty;
> 	- The last transaction is different after sync - it means
> 	  it was added during promote() which is an error like in
> 	  the code above;
> 	- The transaction in the end of the limbo is the same.
>
> In the last case you work like before - box_wait_quorum() with the
> known LSN. Will it work?

Yes, indeed. Thanks for pointing this out!
I took this approach in v2.

>
>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>> index f287369a2..406f2de89 100644
>> --- a/src/box/txn_limbo.c
>> +++ b/src/box/txn_limbo.c
>> @@ -69,6 +69,48 @@ txn_limbo_last_synchro_entry(struct txn_limbo *limbo)
>>   	return NULL;
>>   }
>>   
>> +static int
>> +txn_limbo_wait_lsn_assigned_f(struct trigger *trig, void *event)
>> +{
>> +	(void)event;
>> +	struct fiber *fiber = trig->data;
>> +	fiber_wakeup(fiber);
>> +	return 0;
>> +}
>> +
>> +struct txn_limbo_entry *
>> +txn_limbo_wait_lsn_assigned(struct txn_limbo *limbo)
>> +{
>> +	assert(!txn_limbo_is_empty(limbo));
>> +	struct txn_limbo_entry *entry = txn_limbo_last_synchro_entry(limbo);
>> +	if (entry->lsn >= 0)
>> +		return entry;
>> +
>> +	struct trigger write_trigger, rollback_trigger;
>> +	trigger_create(&write_trigger, txn_limbo_wait_lsn_assigned_f, fiber(),
>> +		       NULL);
>> +	trigger_create(&rollback_trigger, txn_limbo_wait_lsn_assigned_f,
>> +		       fiber(), NULL);
>> +	txn_on_wal_write(entry->txn, &write_trigger);
>> +	txn_on_rollback(entry->txn, &rollback_trigger);
>> +	do {
>> +		fiber_yield();
>> +		if (fiber_is_cancelled()) {
>> +			diag_set(FiberIsCancelled);
>> +			entry = NULL;
>> +			break;
>> +		}
>> +		if (entry->txn->signature < 0) {
>> +			diag_set(ClientError, ER_SYNC_ROLLBACK);
>> +			entry = NULL;
>> +			break;
>> +		}
>> +	} while (entry->lsn == -1);
>> +	trigger_clear(&write_trigger);
>> +	trigger_clear(&rollback_trigger);
> 4. Why do you need the LSN assigned in the on_wal_write trigger in
> the previous commit? I can't see where do you use it here.

I check for entry->lsn right after the waiter is woken.
It's woken from on_wal_write trigger, and the trigger also
wakes up the fiber which used to assign lsn.

The waiter is woken up first, so I had to assign lsn from
the trigger itself.

This is irrelevant now, though, since I've adopted your proposal
with wal_sync().


-- 
Serge Petrenko


      reply	other threads:[~2021-05-25 10:39 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-20  9:02 [Tarantool-patches] [PATCH 0/3] fix " Serge Petrenko via Tarantool-patches
2021-05-20  9:02 ` [Tarantool-patches] [PATCH 1/3] box: make txn reference the limbo entry Serge Petrenko via Tarantool-patches
2021-05-20  9:02 ` [Tarantool-patches] [PATCH 2/3] txn_limbo: move lsn assignment to journal completion callback Serge Petrenko via Tarantool-patches
2021-05-20  9:02 ` [Tarantool-patches] [PATCH 3/3] box: fix an assertion failure in box.ctl.promote() Serge Petrenko via Tarantool-patches
2021-05-23 12:18   ` Vladislav Shpilevoy via Tarantool-patches
2021-05-25 10:39     ` Serge Petrenko via Tarantool-patches [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0776f6e0-7a78-ad10-f5f8-8a08f84cd665@tarantool.org \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=sergepetrenko@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 3/3] box: fix an assertion failure in box.ctl.promote()' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox