From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp54.i.mail.ru (smtp54.i.mail.ru [217.69.128.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 476D3430409 for ; Wed, 26 Aug 2020 17:45:40 +0300 (MSK) Date: Wed, 26 Aug 2020 17:45:38 +0300 From: Sergey Bronnikov Message-ID: <20200826144538.GC47610@pony.bronevichok.ru> References: <3c7b2274-8443-2be7-c181-2c7026ab0fec@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <3c7b2274-8443-2be7-c181-2c7026ab0fec@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tarantool-patches@dev.tarantool.org Vlad, thanks for review! Patch updated in a branch. To make sure patch doesn't make test flaky I run test 100 times using 10 workers in parallel without problems. ../../test/test-run.py --builddir=/home/s.bronnikov/tarantool/build --vardir=/home/s.bronnikov/tarantool /build/test/var -j 10 $(yes replication/qsync_random_leader.test.lua | head -n 100) On 00:01 Tue 21 Jul , Vladislav Shpilevoy wrote: > Thanks for the patch! > > See 11 comments below. > > > diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua > > new file mode 100644 > > index 000000000..383aa5272 > > --- /dev/null > > +++ b/test/replication/qsync.lua > > @@ -0,0 +1,62 @@ > > +#!/usr/bin/env tarantool > > + > > +-- get instance name from filename (qsync1.lua => qsync1) > > +local INSTANCE_ID = string.match(arg[0], "%d") > > + > > +local SOCKET_DIR = require('fio').cwd() > > + > > +local TIMEOUT = tonumber(arg[1]) > > + > > +local function instance_uri(instance_id) > > + return SOCKET_DIR..'/qsync'..instance_id..'.sock'; > > +end > > + > > +-- start console first > > +require('console').listen(os.getenv('ADMIN')) > > + > > +box.cfg({ > > + listen = instance_uri(INSTANCE_ID); > > + replication_timeout = TIMEOUT; > > 1. Why do you need the custom replication_timeout? It is actually a copy-paste from original cluster initialization script, removed replication_timeout. > > + replication_sync_lag = 0.01; > > 2. Why do you need the lag setting? the same as above > > + replication_connect_quorum = 3; > > + replication = { > > + instance_uri(1); > > + instance_uri(2); > > + instance_uri(3); > > + instance_uri(4); > > + instance_uri(5); > > + instance_uri(6); > > + instance_uri(7); > > + instance_uri(8); > > + instance_uri(9); > > + instance_uri(10); > > + instance_uri(11); > > + instance_uri(12); > > + instance_uri(13); > > + instance_uri(14); > > + instance_uri(15); > > + instance_uri(16); > > + instance_uri(17); > > + instance_uri(18); > > + instance_uri(19); > > + instance_uri(20); > > + instance_uri(21); > > + instance_uri(22); > > + instance_uri(23); > > + instance_uri(24); > > + instance_uri(25); > > + instance_uri(26); > > + instance_uri(27); > > + instance_uri(28); > > + instance_uri(29); > > + instance_uri(30); > > + instance_uri(31); > > 3. Seems like in the test you use only 3 instances, not 32. Also the > quorum is set to 3. in updated test 5 instances are in use, others removed in initialization script > > + }; > > +}) > > + > > +box.once("bootstrap", function() > > + local test_run = require('test_run').new() > > + box.schema.user.grant("guest", 'replication') > > + box.schema.space.create('test', {engine = test_run:get_cfg('engine')}) > > + box.space.test:create_index('primary') > > 4. Where do you use this space? space has been renamed to "sync" and used it in a test > > +end) > > diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result > > new file mode 100644 > > index 000000000..cb1b5e232 > > --- /dev/null > > +++ b/test/replication/qsync_random_leader.result > > @@ -0,0 +1,123 @@ > > +-- test-run result file version 2 > > +os = require('os') > > + | --- > > + | ... > > +env = require('test_run') > > + | --- > > + | ... > > +math = require('math') > > + | --- > > + | ... > > +fiber = require('fiber') > > + | --- > > + | ... > > +test_run = env.new() > > + | --- > > + | ... > > +engine = test_run:get_cfg('engine') > > + | --- > > + | ... > > + > > +NUM_INSTANCES = 3 > > + | --- > > + | ... > > +BROKEN_QUORUM = NUM_INSTANCES + 1 > > + | --- > > + | ... > > + > > +SERVERS = {} > > + | --- > > + | ... > > +test_run:cmd("setopt delimiter ';'") > > + | --- > > + | - true > > + | ... > > +for i=1,NUM_INSTANCES do > > + SERVERS[i] = 'qsync' .. i > > +end; > > + | --- > > + | ... > > +test_run:cmd("setopt delimiter ''"); > > 5. Please, lets be consistent and use either \ or the delimiter. Currently > it is irrational - you use \ for big code blocks, and a custom delimiter for > tiny blocks which could even be one line. Personally, I would use \ > everywhere. replaced constructions with delimiters to multiline statements > > + | --- > > + | - true > > + | ... > > +SERVERS -- print instance names > > + | --- > > + | - - qsync1 > > + | - qsync2 > > + | - qsync3 > > + | ... > > + > > +random = function(excluded_num, min, max) \ > > 6. Would be better to align all \ by 80 in this file. Makes easier to add > new longer lines in future without moving all the old \. Done. > > + math.randomseed(os.time()) \ > > + local r = math.random(min, max) \ > > + if (r == excluded_num) then \ > > + return random(excluded_num, min, max) \ > > + end \ > > + return r \ > > +end > > + | --- > > + | ... > > + > > +test_run:create_cluster(SERVERS, "replication", {args="0.1"}) > > + | --- > > + | ... > > +test_run:wait_fullmesh(SERVERS) > > + | --- > > + | ... > > +current_leader_id = 1 > > + | --- > > + | ... > > +test_run:switch(SERVERS[current_leader_id]) > > + | --- > > + | - true > > + | ... > > +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1} > > 7. The timeout is tiny. It will lead to flakiness sooner or later, 100%. increased to 1 sec > > + | --- > > + | ... > > +_ = box.schema.space.create('sync', {is_sync=true}) > > + | --- > > + | ... > > +_ = box.space.sync:create_index('pk') > > + | --- > > + | ... > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > + > > +-- Testcase body. > > +for i=1,10 do \ > > + new_leader_id = random(current_leader_id, 1, #SERVERS) \ > > + test_run:switch(SERVERS[new_leader_id]) \ > > + box.cfg{read_only=false} \ > > + fiber = require('fiber') \ > > + f1 = fiber.create(function() box.space.sync:delete{} end) \ > > 8. Delete without a key will fail. You would notice it if you > would check results of the DML operations. Please, do that via pcall. > replaced delete{} with truncate{} > > + f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \ > > 9. You have \ exactly to avoid such long lines. splitted for shorter lines > > > + f1.status() \ > > + f2.status() \ > > 10. Output is not printed inside one statement. This whole cycle is > one statement because of \, so these status() calls are useless. removed > > + test_run:switch('default') \ > > + test_run:switch(SERVERS[current_leader_id]) \ > > + box.cfg{read_only=true} \ > > + test_run:switch('default') \ > > + current_leader_id = new_leader_id \ > > + fiber.sleep(0.1) \ > > 11. Why do you need this fiber.sleep()? I don't remember the reason to add it, but test works fine without it. So I removed it in updated patch. > > +end > > + | --- > > + | ... > > + > > +-- Teardown. > > +test_run:switch(SERVERS[current_leader_id]) > > + | --- > > + | - true > > + | ... > > +box.space.sync:drop() > > + | --- > > + | ... > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +test_run:drop_cluster(SERVERS) > > + | --- > > + | ... -- sergeyb@