From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id DC8252C3CE for ; Mon, 29 Apr 2019 22:58:18 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id gv5jvvaHSMnn for ; Mon, 29 Apr 2019 22:58:18 -0400 (EDT) Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 2D80E2C2B0 for ; Mon, 29 Apr 2019 22:58:18 -0400 (EDT) Date: Tue, 30 Apr 2019 05:58:04 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH v2] Use SIGKILL to stop server replica Message-ID: <20190430025804.477ggvot3y33jk3c@tkn_work_nb> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-Help: List-Unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-Subscribe: List-Owner: List-post: List-Archive: To: avtikhon Cc: tarantool-patches@freelists.org The change of the test case itself is okay for me, but I have some comments around. See below. WBR, Alexander Turenko. On Mon, Apr 29, 2019 at 10:07:05PM +0300, avtikhon wrote: > Used the signal option set to SIGKILL to stop server replica > routine to be able to stop the replica imediately to imitate Typo: imediately -> immediately. > the replica crash and, then, wake up. > Just 'stop server replica' (SIGTERM) is not sufficient to stop > a tarantool instance when ERRINJ_WAL_DELAY is set, because > "tarantool" thread wait for paused "wal" thread infinitely. > Changed server stop routine to to kill routine to be able Typo: to to -> to. > to use SIGKILL instead of SIGTERM to the replica server. In > this way the server replica will be killed immediately and > *.xlog files will be removed as it has to be. I would clarify that here you solves two problems: the incorrect test case and flaky failures. Also it worth to mention an issue about the flaky failures. > Close #4162 And, I guess, #4034. > Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill > Issue: https://github.com/tarantool/tarantool/issues/4162 The patch contains many changes that are not related to the test case itself. Please, factor them out or, better, just drop (if they don't affect a behaviour of the test). This will make the patch more clear for future readers. And, to be honest, I don't think that there is any real difference between 10 seconds and 60 seconds for our test cases: each of them should be quite short even when a system is under heavy load. Don't sure about a swapping system, though; but unlikely we can do something with timeouts tweaking for this case in general. If these changes are really matter it worth to look at a reason and fix separately. So we'll sure we don't just pollute git blame output. > -test_run:cmd("switch replica") > +-- Imitate the replica crash and, then, wake up. > +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop > +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because > +-- "tarantool" thread wait for paused "wal" thread infinitely. > +test_run:cmd("kill server replica") Please, update according to test-run changes: stop server replica with signal=SIGKILL (right?). > -- Check that garbage collection removed the snapshot once > -- the replica released the corresponding checkpoint. > -wait_gc(1) or box.info.gc() > -wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until > +wait_gc(1) > +wait_xlog(1) > -- we test garbage collection. See, it was the comment: > -- Make sure the replica will not receive data until > -- we test garbage collection. Now only the second part remains :)