To: v.shpilevoy@tarantool.org, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Message-ID: <5c6f77b1-5407-f0c9-e600-fef52862c0b4@tarantool.org>
Date: Sat, 27 Mar 2021 21:30:01 +0300
Subject: [Tarantool-patches] [PATCH v2 3.5/7] applier: fix not releasing the latch on apply_synchro_row() fail
From: Serge Petrenko via Tarantool-patches
Reply-To: Serge Petrenko

Once apply_synchro_row() failed, applier_apply_tx() would simply raise
an error without unlocking the replica latch. This led to all the
appliers hanging indefinitely while trying to lock the latch for this
replica.

In scope of #5566
---
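A note for reviewers: below is a minimal, self-contained sketch of the
locking discipline the fix restores. It is NOT the real applier.cc
code; apply_tx_sketch(), apply_synchro_row_sketch() and the toy latch
are hypothetical stand-ins, and the real applier_apply_tx() takes
different arguments and does far more work.

#include <stdio.h>

/* Toy stand-ins for Tarantool's struct latch and latch_lock()/
 * latch_unlock(); they only model the lock state. */
struct latch { int locked; };

static void latch_lock(struct latch *l)   { l->locked = 1; }
static void latch_unlock(struct latch *l) { l->locked = 0; }

/* Pretend applying the synchro row hits a WAL write error. */
static int apply_synchro_row_sketch(void) { return -1; }

/*
 * Sketch of the fixed control flow: every error path funnels through
 * "finish", so the latch taken at the top is always released. The
 * buggy version performed a non-local exit (diag_raise()) instead of
 * "goto finish", leaving the latch locked and blocking every other
 * applier of the same replica forever.
 */
static int apply_tx_sketch(struct latch *replica_latch)
{
	int rc = 0;
	latch_lock(replica_latch);
	if ((rc = apply_synchro_row_sketch()) != 0)
		goto finish;	/* was: diag_raise(), skipping the unlock */
	/* ... plain transactions are applied here, also erroring out
	 * via "goto finish" ... */
finish:
	latch_unlock(replica_latch);
	return rc;
}

int main(void)
{
	struct latch l = {0};
	int rc = apply_tx_sketch(&l);
	printf("rc=%d, latch still locked=%d\n", rc, l.locked); /* -1, 0 */
	return 0;
}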
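And, condensed from the new test below, the reproduction scenario in
rough strokes (a sketch assuming a master-replica pair with a
synchronous space 'sync' already set up; the real test adds all the
test-run plumbing, waits and cleanup):

-- On the master: make a synchronous transaction wait for an
-- unreachable quorum, with a timeout long enough to act on the
-- replica in the meantime.
box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000}
require('fiber').new(box.space.sync.insert, box.space.sync, {1})

-- On the replica: make the next WAL write fail, so the upcoming
-- synchro row cannot be applied.
box.error.injection.set('ERRINJ_WAL_IO', true)

-- Back on the master: let the transaction time out, which writes and
-- replicates a ROLLBACK. Before the fix the replica's applier failed
-- while still holding the latch, hanging every other applier.
box.cfg{replication_synchro_timeout = 0.01}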
 src/box/applier.cc                            |   4 +-
 test/replication/hang_on_synchro_fail.result  | 130 ++++++++++++++++++
 .../replication/hang_on_synchro_fail.test.lua |  57 ++++++++
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 5 files changed, 191 insertions(+), 3 deletions(-)
 create mode 100644 test/replication/hang_on_synchro_fail.result
 create mode 100644 test/replication/hang_on_synchro_fail.test.lua

diff --git a/src/box/applier.cc b/src/box/applier.cc
index e6d9673dd..41abe64f9 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -1055,8 +1055,8 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
		 * each other.
		 */
		assert(first_row == last_row);
-		if (apply_synchro_row(first_row) != 0)
-			diag_raise();
+		if ((rc = apply_synchro_row(first_row)) != 0)
+			goto finish;
	} else if ((rc = apply_plain_tx(rows, replication_skip_conflict,
					true)) != 0) {
		goto finish;
diff --git a/test/replication/hang_on_synchro_fail.result b/test/replication/hang_on_synchro_fail.result
new file mode 100644
index 000000000..9f6fac00b
--- /dev/null
+++ b/test/replication/hang_on_synchro_fail.result
@@ -0,0 +1,130 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+--
+-- All appliers could hang after failing to apply a synchronous message: either
+-- CONFIRM or ROLLBACK.
+--
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+_ = box.schema.space.create('sync', {is_sync=true})
+ | ---
+ | ...
+_ = box.space.sync:create_index('pk')
+ | ---
+ | ...
+
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+box.cfg{replication_synchro_quorum=3}
+ | ---
+ | ...
+-- A huge timeout so that we can perform some actions on a replica before
+-- writing ROLLBACK.
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+box.cfg{replication_synchro_timeout=1000}
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+_ = fiber.new(box.space.sync.insert, box.space.sync, {1})
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+
+box.error.injection.set('ERRINJ_WAL_IO', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+box.cfg{replication_synchro_timeout=0.01}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+
+test_run:wait_upstream(1, {status='stopped',\
+                           message_re='Failed to write to disk'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_IO', false)
+ | ---
+ | - ok
+ | ...
+
+-- Applier is killed due to a failed WAL write, so restart replication to
+-- check whether it hangs or not. Actually this single applier would fail an
+-- assertion rather than hang, but all the other appliers, if any, would hang.
+old_repl = box.cfg.replication
+ | ---
+ | ...
+box.cfg{replication=""}
+ | ---
+ | ...
+box.cfg{replication=old_repl}
+ | ---
+ | ...
+
+test_run:wait_upstream(1, {status='follow'})
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum=old_synchro_quorum,\
+        replication_synchro_timeout=old_synchro_timeout}
+ | ---
+ | ...
+box.space.sync:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
+
diff --git a/test/replication/hang_on_synchro_fail.test.lua b/test/replication/hang_on_synchro_fail.test.lua
new file mode 100644
index 000000000..6c3b09fab
--- /dev/null
+++ b/test/replication/hang_on_synchro_fail.test.lua
@@ -0,0 +1,57 @@
+test_run = require('test_run').new()
+fiber = require('fiber')
+--
+-- All appliers could hang after failing to apply a synchronous message: either
+-- CONFIRM or ROLLBACK.
+--
+box.schema.user.grant('guest', 'replication')
+
+_ = box.schema.space.create('sync', {is_sync=true})
+_ = box.space.sync:create_index('pk')
+
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+box.cfg{replication_synchro_quorum=3}
+-- A huge timeout so that we can perform some actions on a replica before
+-- writing ROLLBACK.
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+box.cfg{replication_synchro_timeout=1000}
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+_ = fiber.new(box.space.sync.insert, box.space.sync, {1})
+test_run:wait_lsn('replica', 'default')
+
+test_run:switch('replica')
+
+box.error.injection.set('ERRINJ_WAL_IO', true)
+
+test_run:switch('default')
+
+box.cfg{replication_synchro_timeout=0.01}
+
+test_run:switch('replica')
+
+test_run:wait_upstream(1, {status='stopped',\
+                           message_re='Failed to write to disk'})
+box.error.injection.set('ERRINJ_WAL_IO', false)
+
+-- Applier is killed due to a failed WAL write, so restart replication to
+-- check whether it hangs or not. Actually this single applier would fail an
+-- assertion rather than hang, but all the other appliers, if any, would hang.
+old_repl = box.cfg.replication
+box.cfg{replication=""}
+box.cfg{replication=old_repl}
+
+test_run:wait_upstream(1, {status='follow'})
+
+-- Cleanup.
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{replication_synchro_quorum=old_synchro_quorum,\
+        replication_synchro_timeout=old_synchro_timeout}
+box.space.sync:drop()
+box.schema.user.revoke('guest', 'replication')
+
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 7e7004592..c1c329438 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -22,6 +22,7 @@
     "status.test.lua": {},
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
+    "hang_on_synchro_fail.test.lua": {},
     "rebootstrap.test.lua": {},
     "wal_rw_stress.test.lua": {},
     "force_recovery.test.lua": {},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index dcd711a2a..fc161700a 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.24.3 (Apple Git-128)