From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtp51.i.mail.ru (smtp51.i.mail.ru [94.100.177.111])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id 7ECEE430407
	for ; Sun, 16 Aug 2020 23:01:37 +0300 (MSK)
From: "Alexander V. Tikhonov"
Date: Sun, 16 Aug 2020 23:01:34 +0300
Message-Id: <27c2f93a6602f14b882484b4f0c7ec4b8748c371.1597607968.git.avtikhon@tarantool.org>
Subject: [Tarantool-patches] [PATCH v1] test: fix issue on first replica in drop_cluster()
List-Id: Tarantool development patches
To: Kirill Yukhin, Alexander Turenko
Cc: tarantool-patches@dev.tarantool.org

The test replication/box_set_replication_stress.test.lua was found to
fail intermittently in the drop_cluster() routine, like:

    --- replication/box_set_replication_stress.result Fri Aug 14 18:28:41 2020
    +++ var/004_replication/box_set_replication_stress.result Sat Aug 15 15:19:44 2020
    @@ -34,5 +34,3 @@
     -- Cleanup.
     test_run:drop_cluster(SERVERS)
    - | ---
    - | ...

The drop_cluster() routine from the test-run repository failed in the
stop() routine of the lib/tarantool_server.py:TarantoolServer class. It
failed to stop the first replica, which the test uses to switch
replication on and off 1000 times. This happened because stop() sends
SIGTERM by default, which could not kill the first replica in some
situations: both replica processes were alive and tried to read and
write data on their sockets, but the sockets of the first replica were
already unreachable while the second replica was still alive. In this
situation SIGTERM was not enough to stop the first replica, and test-run
hung in wait_stop() of the lib/tarantool_server.py:TarantoolServer class
until test-run stopped the test by its general timeout of 2 minutes.

To fix the issue, the only possible way was to use SIGKILL instead of
SIGTERM, so that the process is killed without waiting for its sockets
to close. SIGKILL could be used by default in the drop_cluster()
routine, but it seems that such a change would make it harder to detect
other issues in other tests. So it was decided to use SIGKILL only in
this test, as an additional option to the "stop server" test-run call.

Closes #5244
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-5244-replication-box-stress-drop-replica
Issue: https://github.com/tarantool/tarantool/issues/5244

 .../replication/box_set_replication_stress.result | 15 ++++++++++++++-
 .../box_set_replication_stress.test.lua           |  5 ++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/test/replication/box_set_replication_stress.result b/test/replication/box_set_replication_stress.result
index e683c0643..225f33ecb 100644
--- a/test/replication/box_set_replication_stress.result
+++ b/test/replication/box_set_replication_stress.result
@@ -33,6 +33,19 @@ test_run:cmd("switch default")
  | ...
 
 -- Cleanup.
-test_run:drop_cluster(SERVERS)
+test_run:cmd('stop server master_quorum1 with signal=SIGKILL')
  | ---
+ | - true
+ | ...
+test_run:cmd('delete server master_quorum1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master_quorum2 with signal=SIGKILL')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master_quorum2')
+ | ---
+ | - true
  | ...
diff --git a/test/replication/box_set_replication_stress.test.lua b/test/replication/box_set_replication_stress.test.lua
index 407e91e0f..88652b0b4 100644
--- a/test/replication/box_set_replication_stress.test.lua
+++ b/test/replication/box_set_replication_stress.test.lua
@@ -14,4 +14,7 @@ end
 test_run:cmd("switch default")
 
 -- Cleanup.
-test_run:drop_cluster(SERVERS)
+test_run:cmd('stop server master_quorum1 with signal=SIGKILL')
+test_run:cmd('delete server master_quorum1')
+test_run:cmd('stop server master_quorum2 with signal=SIGKILL')
+test_run:cmd('delete server master_quorum2')
-- 
2.17.1
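
For readers less familiar with the signal semantics behind this change, here is a minimal Python sketch of the difference between the two signals. It is not test-run code: CHILD_CODE, stop_process() and demo() are hypothetical names made up for illustration, and a child that simply ignores SIGTERM is an assumed stand-in for the replica that did not react to it (the real replica was stuck for a different reason, described in the commit message). The sketch is POSIX-only.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a server that does not exit on SIGTERM.
# Assumption: ignoring the signal is only a convenient way to model a
# process that SIGTERM cannot stop.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_process(proc, sig, timeout=1.0):
    """Send `sig` to `proc`; return True if it exited within `timeout` seconds."""
    proc.send_signal(sig)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return True


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE
    )
    # Wait until the child has installed its SIGTERM disposition.
    proc.stdout.readline()
    # SIGTERM can be ignored or left unhandled by the target process.
    stopped_by_term = stop_process(proc, signal.SIGTERM)
    # SIGKILL cannot be caught, blocked or ignored.
    stopped_by_kill = stop_process(proc, signal.SIGKILL)
    proc.stdout.close()
    return stopped_by_term, stopped_by_kill, proc.returncode


if __name__ == "__main__":
    print(demo())
```

A waiter without a timeout after the SIGTERM step would block for as long as the child lives, which is roughly the situation described for wait_stop() above. The trade-off noted in the commit message still applies: a killed process gets no chance to shut down cleanly, which is why SIGKILL is limited to this one test.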