From mboxrd@z Thu Jan 1 00:00:00 1970
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko
To: Vladislav Shpilevoy, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
Date: Tue, 20 Apr 2021 20:37:27 +0300
Message-ID: <99a07dad-e38a-623e-303a-ecf412582be5@tarantool.org>
In-Reply-To: <858d30b2-f988-9fd0-ee75-3281721e54b1@tarantool.org>
References: <858d30b2-f988-9fd0-ee75-3281721e54b1@tarantool.org>
List-Id: Tarantool development patches
Sender: "Tarantool-patches"

20.04.2021 12:25, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 01:34, Vladislav Shpilevoy wrote:
>> Hi! Thanks for working on this!
>>
>> It seems starting from this commit the election stress test
>> hangs on my machine in 100% cases. I didn't have time to
>> investigate why yet.
> Yes, you're correct. I also see this. It's not 100% cases though.
>
> On my machine the test doesn't hang at all (at least the first 20 runs)
> until commit "txn_limbo: filter rows based on known peer terms"
>
> Starting with commit "txn_limbo: filter rows based on known peer terms"
> one or two of the 20 runs hang and get restarted.
>
> I need some time to investigate this. Will return once I have some
> results.
>

Ok, seems like the case is closed.

So, here are a couple of facts that led to the test hang:

1) The instance may still write CONFIRM for its own transactions after restart.
   It may do so even before receiving a CONFIRM from some remote instance, which
   took ownership of the limbo later.

   This fact alone would be ok, but:

   a) the instance doesn't count its own WAL write as the first ack after restart,
      so if quorum is M it waits for M+1 acks from remote instances before writing
      CONFIRM
   b) the instance writes CONFIRM unconditionally even before getting in sync
      with other replicas, which could have already written CONFIRM for its rows.
      (this may be fine).

   There's an issue related to the cause, but it needs some reformulation:
   https://github.com/tarantool/tarantool/issues/5856

2) Any failure in txn_commit_try_async is treated as a WAL write error by mistake,
   and the actual reason for rollback is lost. I've opened a ticket for this:
   https://github.com/tarantool/tarantool/issues/6027

   ER_WAL_IO is unrecoverable and breaks the connection between master and replica.
   (We might make it recoverable as well? Why not retry the WAL write after some time?
    It may work out this time).

3) NOPs are added to txn_limbo when it isn't empty.

And here's what happened when the test hung:

1) Some instance used to be the leader and got restarted before
   writing CONFIRM for its own transactions
2) Once the instance got restarted, its relays were faster than
   its appliers, meaning it first gathered 2 acks for the old
   transaction, wrote CONFIRM right away, and only later received
   CONFIRM from a remote instance
3) This instance was elected the leader once again. Once this
   happened, the other 2 instances started accepting rows from this
   instance
4) The first row the remote instances got was this CONFIRM, which the
   instance wrote after restart
5) The instance was considered outdated, because while it was an
   elected leader, it hadn't yet sent PROMOTE to the other
   instances (PROMOTE comes right after that notorious CONFIRM)
6) Like any row from an outdated instance, CONFIRM was replaced
   with a NOP
7) The other instances tried to insert that NOP into their limbos, which
   weren't empty due to the nature of the test (and would have been
   emptied by PROMOTE). Insertion failed with
   ER_UNCOMMITTED_FOREIGN_SYNC_TXNS
8) ER_UNCOMMITTED_FOREIGN_SYNC_TXNS was replaced with ER_WAL_IO by
   the applier's on_rollback trigger. This is an unrecoverable error,
   so both remote instances' appliers broke the connection to
   the leader.
9) Now there's an infinite loop of elections. This node never
   votes for any of the remote nodes, because they are behind it.

To fix this, I've allowed transactions that consist solely of NOPs
to pass through the limbo without waiting, even when it's non-empty.

The test's now rock-solid on my machine. 0 failures in 100 runs.
(with 1 worker, to be honest, but that's still better than a couple of failures
in 20 runs with 1 worker).

I've sent the new patch as [PATCH v4 14/12] in reply to this series.
Please take a look.

-- 
Serge Petrenko
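
For illustration only: a toy Python model of the behaviour described above. It is not the actual patch and not Tarantool code. The names Limbo, Txn, submit, outcome and allow_nop_bypass are made up for this sketch, and the rule for when ER_UNCOMMITTED_FOREIGN_SYNC_TXNS fires (non-empty limbo plus a transaction from a non-owner) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical row kinds; real rows carry far more information.
NOP = "NOP"
DML = "DML"


class UncommittedForeignSyncTxns(Exception):
    """Stands in for ER_UNCOMMITTED_FOREIGN_SYNC_TXNS in this toy model."""


@dataclass
class Txn:
    origin_id: int
    rows: List[str]

    def is_nop_only(self) -> bool:
        return all(row == NOP for row in self.rows)


@dataclass
class Limbo:
    owner_id: int
    queue: List[Txn] = field(default_factory=list)

    def submit(self, txn: Txn, allow_nop_bypass: bool) -> str:
        # The fix from the thread: a NOP-only transaction skips the limbo
        # entirely, even when the limbo is non-empty.
        if allow_nop_bypass and txn.is_nop_only():
            return "applied"
        # Assumption: a non-empty limbo rejects transactions that come
        # from an instance other than its owner.
        if self.queue and txn.origin_id != self.owner_id:
            raise UncommittedForeignSyncTxns("limbo is not empty")
        # Ownership transfer (PROMOTE) is deliberately not modelled.
        self.queue.append(txn)
        return "queued"


def outcome(limbo: Limbo, txn: Txn, allow_nop_bypass: bool) -> str:
    """Return "applied", "queued" or "rejected" instead of raising."""
    try:
        return limbo.submit(txn, allow_nop_bypass)
    except UncommittedForeignSyncTxns:
        return "rejected"


# Step 7 of the scenario: a NOP arrives while the limbo still holds a
# pending transaction from another instance.
pending = Txn(origin_id=1, rows=[DML])
nop_txn = Txn(origin_id=2, rows=[NOP])
before_fix = outcome(Limbo(owner_id=1, queue=[pending]), nop_txn, False)
after_fix = outcome(Limbo(owner_id=1, queue=[pending]), nop_txn, True)
print(before_fix, after_fix)
```

With the bypass disabled the NOP is rejected, which corresponds to the failure in step 7. With it enabled the NOP is applied and the pending transaction stays in the queue until it is confirmed or rolled back by other means.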