To: Vladislav Shpilevoy, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Date: Sat, 27 Mar 2021 19:52:26 +0300
In-Reply-To: <2ab0844e-84b4-6701-15ab-652ab6f18075@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v2 1/7] replication: fix a hang on final join retry
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko

On 26.03.2021 23:44, Vladislav Shpilevoy wrote:
> Hi! Thanks for working on this!
>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 5a88a013e..326cf18d2 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -566,9 +566,16 @@ applier_register(struct applier *applier)
>>      row.type = IPROTO_REGISTER;
>>      coio_write_xrow(coio, &row);
>>
>> -    applier_set_state(applier, APPLIER_REGISTER);
>> +    /*
>> +     * Register may serve as a retry for final join. Set corresponding
>> +     * states to unblock anyone who's waiting for final join to start or
>> +     * end.
>> +     */
>> +    applier_set_state(applier, was_anon ? APPLIER_REGISTER :
>> +                      APPLIER_FINAL_JOIN);
>>      applier_wait_register(applier, 0);
>> -    applier_set_state(applier, APPLIER_REGISTERED);
>> +    applier_set_state(applier, was_anon ? APPLIER_REGISTERED :
>> +                      APPLIER_JOINED);
>>      applier_set_state(applier, APPLIER_READY);
> Hm. I don't understand. Transition from anon to non-anon leads to
> re-creation of all appliers. It calls box_sync_replication() and
> creates new struct applier objects. How is it possible that during one
> life of a reader fiber it manages to see 2 states and is not terminated?

You're correct. It isn't possible for a single applier to see both
states, anon and non-anon.

The flag is still needed though, for the case when a normal replica
receives some transient error during final join. In this case the
applier reconnects and we get to the next applier loop iteration.
First it checks whether REPLICASET_UUID is nil. It isn't, because
initial join succeeded. Then it checks whether instance_id is 0. It is,
because final join failed. The applier now assumes that the replica was
anonymous and tries to register.

The hang I'm talking about is in `bootstrap_from_master()`.
It waits until the applier enters the APPLIER_JOINED state, which never
happened on this path before this patch.

So, `was_anon` comes into play only when final join fails and is retried.

>
> Also could you please provide a test? Maybe it would be easier to see
> what is happening then.

Ok. I'm not sure this test is needed, because the scenario is already
tested implicitly by the gh-5566-final-join-synchro test.

A test would be as follows:

master:
    box.cfg{listen=3301, replication_synchro_quorum=10}
    box.space._cluster:alter{is_sync=true}
    box.schema.user.grant("guest", "replication")

replica:
    box.cfg{replication=3301}

master:
wait until the replica receives ER_SYNC_QUORUM_TIMEOUT, and then:
    box.cfg{replication_synchro_quorum=1}

This test passes on the branch, meaning the replica's box.cfg completes
successfully, but it would hang indefinitely without this commit.

--
Serge Petrenko
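
P.S. In case it helps to see the retry path in one place, here is a toy
model of the decision described above. It is only a sketch under my own
assumptions: `Replica`, `Step`, `State`, `next_step()`,
`register_states()` and `bootstrap_unblocked()` are hypothetical names
made up for illustration and are not the real code in
src/box/applier.cc. The `with_fix` parameter is likewise just a switch
for comparing behaviour before and after the patch.

```cpp
#include <cassert>
#include <vector>

// Toy model only: these names are illustrative, not the applier.cc API.
enum class State { FinalJoin, Joined, Register, Registered, Ready };
enum class Step { Join, Register, Subscribe };

struct Replica {
    bool has_replicaset_uuid; // becomes true once initial join succeeds
    unsigned instance_id;     // stays 0 until final join or register succeeds
    bool was_anon;            // true only for a genuinely anonymous replica
};

// What the next applier loop iteration does after a reconnect.
Step next_step(const Replica &r)
{
    if (!r.has_replicaset_uuid)
        return Step::Join;     // nothing received yet: full join
    if (r.instance_id == 0)
        return Step::Register; // looks anonymous, or final join failed
    return Step::Subscribe;
}

// States published while a REGISTER request is being served.
std::vector<State> register_states(const Replica &r, bool with_fix)
{
    if (with_fix && !r.was_anon)
        return {State::FinalJoin, State::Joined, State::Ready};
    return {State::Register, State::Registered, State::Ready};
}

// Models the bootstrap wait: it is released only by the Joined state.
bool bootstrap_unblocked(const std::vector<State> &seen)
{
    for (State s : seen) {
        if (s == State::Joined)
            return true;
    }
    return false;
}
```

For a replica with a replicaset UUID, `instance_id == 0` and
`was_anon == false`, `next_step()` picks `Step::Register`. Without the
fix the published states never include `Joined`, so the bootstrap wait
in this model is never released. With the fix it is.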