From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 4FEAB6EC58; Tue, 25 May 2021 13:39:06 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 4FEAB6EC58 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1621939146; bh=459CI10Pp+1AgulTAytcYH4Mn8LUF2Q5m/zfxHuY9kQ=; h=To:Cc:References:Date:In-Reply-To:Subject:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe: From:Reply-To:From; b=ZVG3UMVFQb32tlhcItf/BZgEnjUbIGQSkLsX36gbsEK8RwUdCk8L8ig60yZmWWxZf VpoZrxe376ElWsoXSddTog8XxA1LT675RuZjtQk9QN1YxjkKSbHc+5cP0sAuDXzExc W2OB/ZC8JkKNYrf0wWtE2MaDn8GIlHzm9pDaHTnk= Received: from smtp37.i.mail.ru (smtp37.i.mail.ru [94.100.177.97]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 46C306EC58 for ; Tue, 25 May 2021 13:39:05 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 46C306EC58 Received: by smtp37.i.mail.ru with esmtpa (envelope-from ) id 1llUSm-0004mq-AG; Tue, 25 May 2021 13:39:04 +0300 To: Vladislav Shpilevoy , gorcunov@gmail.com Cc: tarantool-patches@dev.tarantool.org References: <1b0f0a3d-59ba-38ba-7f7e-f214664c8976@tarantool.org> Message-ID: <0776f6e0-7a78-ad10-f5f8-8a08f84cd665@tarantool.org> Date: Tue, 25 May 2021 13:39:03 +0300 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.10.2 MIME-Version: 1.0 In-Reply-To: <1b0f0a3d-59ba-38ba-7f7e-f214664c8976@tarantool.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-GB X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD91B019B01C53E51AF9055931F25A1C7413A6B5BA1B6C38DA100894C459B0CD1B94B6A2EDE8752A664D2053665BF695DD30BC1D0E047EE27F58F23EFBB696E3500 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7F5F08398AF01CA1FEA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637A975286280E7A0BD8638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D8D5CAC3F4E2EF172C28010CB05E7E3F5B117882F4460429724CE54428C33FAD305F5C1EE8F4F765FCF1175FABE1C0F9B6A471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352033AC447995A7AD182CC0D3CB04F14752D2E47CDBA5A96583BA9C0B312567BB231DD303D21008E29813377AFFFEAFD269A417C69337E82CC2E827F84554CEF50127C277FBC8AE2E8BA83251EDC214901ED5E8D9A59859A8B67ECBC18655D52CDF089D37D7C0E48F6C5571747095F342E88FB05168BE4CE3AF X-C1DE0DAB: 0D63561A33F958A5221F354EF9A51EDDD0FFB838FAA46CE25E82A365E0504138D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA753C350047980234DB410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D342B8615F5CFAD9D02F6BEBCCAEDCBAF4D7E37929FB5CA396DB6652B74F09603907F322E6C0404CF201D7E09C32AA3244CA49CFFCFF8A2CCBBEDC64F702C8B1D5430363D8B7DA7DD44FACE5A9C96DEB163 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojV3CWresp27e4GCAbBNK1vw== X-Mailru-Sender: 3B9A0136629DC9125D61937A2360A446F0B7B8E406293BC5C533A3C7A1FDD52632871ABCB566BFD0424AE0EB1F3D1D21E2978F233C3FAE6EE63DB1732555E4A8EE80603BA4A5B0BC112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH 3/3] box: fix an assertion failure in box.ctl.promote() X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Serge Petrenko via Tarantool-patches Reply-To: Serge Petrenko Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" 23.05.2021 15:18, Vladislav Shpilevoy пишет: > Hi! Thanks for the patch! > > I see some of CI jobs have failed the new test: > https://github.com/tarantool/tarantool/runs/2620153809 Thanks for the review! CI issues fixed in v2. > > See 4 comments below. > >> diff --git a/src/box/box.cc b/src/box/box.cc >> index c10e0d8bf..1b1e7eec0 100644 >> --- a/src/box/box.cc >> +++ b/src/box/box.cc >> @@ -1442,17 +1442,22 @@ box_quorum_on_ack_f(struct trigger *trigger, void *event) >> } >> >> /** >> - * Wait until at least @a quorum of nodes confirm @a target_lsn from the node >> - * with id @a lead_id. >> + * Wait until at least @a quorum of nodes confirm the last available synchronous >> + * entry from the node with id @a lead_id. >> */ >> static int >> -box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum, >> +box_wait_quorum(uint32_t lead_id, struct txn_limbo_entry **entry, int quorum, >> double timeout) > 1. Maybe try to leave this function not depending on the limbo > and its entries? It was supposed to wait for replication of just > LSN, not necessarily a synchronous transaction. Ok. This made the code simpler IMO, but fine. I see your point. > >> { >> struct box_quorum_trigger t; >> memset(&t, 0, sizeof(t)); >> vclock_create(&t.vclock); >> >> + *entry = txn_limbo_wait_lsn_assigned(&txn_limbo); >> + if (*entry == NULL) >> + return -1; >> + int64_t target_lsn = (*entry)->lsn; >> + >> /* Take this node into account immediately. */ >> int ack_count = vclock_get(box_vclock, lead_id) >= target_lsn; >> replicaset_foreach(replica) { >> @@ -1622,22 +1627,17 @@ box_promote(void) >> } >> } >> >> - /* >> - * promote() is a no-op on the limbo owner, so all the rows >> - * in the limbo must've come through the applier meaning they already >> - * have an lsn assigned, even if their WAL write hasn't finished yet. >> - */ >> - wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn; >> - assert(wait_lsn > 0); >> - >> - rc = box_wait_quorum(former_leader_id, wait_lsn, quorum, >> + struct txn_limbo_entry *last_entry; >> + rc = box_wait_quorum(former_leader_id,&last_entry, quorum, > 2. Missing whitespace after the first argument. Thanks > >> replication_synchro_timeout); >> if (rc == 0) { >> + wait_lsn = last_entry->lsn; >> if (quorum < replication_synchro_quorum) { >> diag_set(ClientError, ER_QUORUM_WAIT, quorum, >> "quorum was increased while waiting"); >> rc = -1; >> - } else if (wait_lsn < txn_limbo_last_synchro_entry(&txn_limbo)->lsn) { >> + } else if (last_entry != >> + txn_limbo_last_synchro_entry(&txn_limbo)) { >> diag_set(ClientError, ER_QUORUM_WAIT, quorum, >> "new synchronous transactions appeared"); >> rc = -1; > 3. Could all the 3 commits be replaced with calling wal_sync() in the > beginning of the promote() if we see the last LSN is unknown? After > wal_sync() several outcomes are possible: > > - All was rolled back, and the limbo is empty; > - The last transaction is different after sync - it means > it was added during promote() which is an error like in > the code above; > - The transaction in the end of the limbo is the same. > > In the last case you work like before - box_wait_quorum() with the > known LSN. Will it work? Yes, indeed. Thanks for pointing this out! I took this approach in v2. > >> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c >> index f287369a2..406f2de89 100644 >> --- a/src/box/txn_limbo.c >> +++ b/src/box/txn_limbo.c >> @@ -69,6 +69,48 @@ txn_limbo_last_synchro_entry(struct txn_limbo *limbo) >> return NULL; >> } >> >> +static int >> +txn_limbo_wait_lsn_assigned_f(struct trigger *trig, void *event) >> +{ >> + (void)event; >> + struct fiber *fiber = trig->data; >> + fiber_wakeup(fiber); >> + return 0; >> +} >> + >> +struct txn_limbo_entry * >> +txn_limbo_wait_lsn_assigned(struct txn_limbo *limbo) >> +{ >> + assert(!txn_limbo_is_empty(limbo)); >> + struct txn_limbo_entry *entry = txn_limbo_last_synchro_entry(limbo); >> + if (entry->lsn >= 0) >> + return entry; >> + >> + struct trigger write_trigger, rollback_trigger; >> + trigger_create(&write_trigger, txn_limbo_wait_lsn_assigned_f, fiber(), >> + NULL); >> + trigger_create(&rollback_trigger, txn_limbo_wait_lsn_assigned_f, >> + fiber(), NULL); >> + txn_on_wal_write(entry->txn, &write_trigger); >> + txn_on_rollback(entry->txn, &rollback_trigger); >> + do { >> + fiber_yield(); >> + if (fiber_is_cancelled()) { >> + diag_set(FiberIsCancelled); >> + entry = NULL; >> + break; >> + } >> + if (entry->txn->signature < 0) { >> + diag_set(ClientError, ER_SYNC_ROLLBACK); >> + entry = NULL; >> + break; >> + } >> + } while (entry->lsn == -1); >> + trigger_clear(&write_trigger); >> + trigger_clear(&rollback_trigger); > 4. Why do you need the LSN assigned in the on_wal_write trigger in > the previous commit? I can't see where do you use it here. I check for entry->lsn right after the waiter is woken. It's woken from on_wal_write trigger, and the trigger also wakes up the fiber which used to assign lsn. The waiter is woken up first, so I had to assign lsn from the trigger itself. This is irrelevant now, though, since I've adopted your proposal with wal_sync(). -- Serge Petrenko