From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp1.mail.ru (smtp1.mail.ru [94.100.179.111]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id D5EC543040F for ; Tue, 1 Sep 2020 16:03:17 +0300 (MSK) References: <3c7b2274-8443-2be7-c181-2c7026ab0fec@tarantool.org> <20200826144538.GC47610@pony.bronevichok.ru> <1598525382.271317018@f511.i.mail.ru> <20200831100539.GA94343@pony.bronevichok.ru> From: Serge Petrenko Message-ID: <9b889490-07da-9afc-cf2e-16d6700d255c@tarantool.org> Date: Tue, 1 Sep 2020 16:03:16 +0300 MIME-Version: 1.0 In-Reply-To: <20200831100539.GA94343@pony.bronevichok.ru> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Transfer-Encoding: 8bit Content-Language: en-GB Subject: Re: [Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Sergey Bronnikov Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy 31.08.2020 13:05, Sergey Bronnikov пишет: > Sergey, thanks for review! > See my comments inline, test updated in a branch. Hi! Thanks for the fixes. See my comments below. > > On 13:49 Thu 27 Aug , Serge Petrenko wrote: >> Hi! Thanks for the fixes! >> I’m pasting parts of the patch below to comment on. >> >> +box.cfg({ >> +    listen = instance_uri(INSTANCE_ID); >> +    replication_connect_quorum = 3; >> +    replication = { >> +        instance_uri(1); >> +        instance_uri(2); >> +        instance_uri(3); >> +        instance_uri(4); >> +        instance_uri(5); >> +    }; >> +}) >> >> >> * You should either omit `replication_connect_quorum` at all, or set it to 5. >> Omitting it will have the same effect. >> I think you meant `replication_synchro_quorum` here, then it makes sense >> to set it to 3. Also   `replication_synchro_timeout` should be set here, >> I’ll mention it  ​​​​​​​ again below. > removed replication_connect_quorum at all + +box.once("bootstrap", function() +    local test_run = require('test_run').new() +    box.cfg{replication_synchro_timeout = 1000, replication_synchro_quorum = 5} +    box.cfg{read_only = false} +    box.schema.user.grant("guest", 'replication') +end) Since you moved space creation out of `box.once` call, you don't need test_run here. >> + >> +NUM_INSTANCES = 5 >> +BROKEN_QUORUM = NUM_INSTANCES + 1 >> + >> >> >> * BROKEN_QUORUM assigned but never used. > removed > >> >> + >> +test_run:create_cluster(SERVERS, "replication", {args="0.1"}) >> +test_run:wait_fullmesh(SERVERS) >> >> >> * You’re passing some argument to qsync1, … qsync5 instances, but you never use  it. > removed argument > >> >> +current_leader_id = 1 >> +test_run:switch(SERVERS[current_leader_id]) >> +box.cfg{replication_synchro_timeout=1} >> >> >> * You should set `replication_synchro_timeout` on every instance, not only >> on qsync1 so   you better move this box.cfg call to the instance file. >> Besides, the timeout should be bigger (much bigger), like Vlad said. >> We typically use 30 seconds for various replication timeouts. >> It’s fairly common when a test is stable on your machine, but is flaky on >> testing machines. > increased timeout up to 1000 and set synchro settings in a qsync.lua file > >> +test_run:switch('default') >> + >> +-- Testcase body. >> +for i=1,10 do                                                                  \ >> +    new_leader_id = random(current_leader_id, 1, #SERVERS)                     \ >> +    test_run:switch(SERVERS[new_leader_id])                                    \ >> +    box.cfg{read_only=false}                                                   \ >> +    f1 = fiber.create(function()                                               \ >> +        pcall(box.space.sync:truncate{})                                       \ >> +    end)                                                                       \ >> >> >> * Why put truncate call in a separate fiber? >> >> Why use truncate at all? You may just replace all your `insert` calls below >> with `replace`, and then truncate won’t be needed. This is up to you >> though. > As far as I remember original idea was to cleanup before next insert. > I rewrote a body of test and keep all inserted values in a space and > after loop check amount of inserted values. > >> >> +    f2 = fiber.create(function()                                               \ >> +        for i=1,10000 do box.space.sync:insert{i} end                          \ >> +    end)                                                                       \ >> >> * So you’re testing a case when a leader has some unconfirmed transactions in >> limbo and then a leader change happens. Then you need to call >> `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise >> the new leader fails to insert its data, but the test doesn’t show this, because you >> don’t check fiber state or `insert()` return values. > added clear_synchro_queue() call to updated test > >> +    test_run:switch('default')                                                 \ >> +    test_run:switch(SERVERS[current_leader_id])                                \ >> >> >> * The 2 lines above are useless. > removed in updated test: > > -- Testcase body. > for i=1,100 do \ > test_run:eval(SERVERS[current_leader_id], \ > "box.cfg{replication_synchro_quorum=6}") \ > test_run:eval(SERVERS[current_leader_id], \ > string.format("box.space.sync:insert{%d}", i)) \ > new_leader_id = random(current_leader_id, 1, #SERVERS) \ > test_run:eval(SERVERS[new_leader_id], \ > "box.cfg{replication_synchro_quorum=3}") \ > test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()") \ > replica = random(new_leader_id, 1, #SERVERS) \ > test_run:eval(SERVERS[replica], \ > string.format("box.space.sync:get{%d}", i)) \ It  is not checked whether get returns any data or not. You only make a call but never check the return value. It'd also be nice to check whether the insert on the old leader (the one with quorum =  6) succeeds after the new leader executes `clear_synchro_queue`. > test_run:switch('default') \ > current_leader_id = new_leader_id \ > end > > test_run:switch('qsync1') > box.space.sync:count() -- 100 > >> >> +    box.cfg{read_only=true}                                                    \ >> +    test_run:switch('default')                                                 \ >> +    current_leader_id = new_leader_id                                          \ >> +end >> + >> +-- Teardown. >> -- >> Serge Petrenko -- Serge Petrenko