From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp55.i.mail.ru (smtp55.i.mail.ru [217.69.128.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 599D343040C for ; Mon, 31 Aug 2020 13:05:41 +0300 (MSK) Date: Mon, 31 Aug 2020 13:05:39 +0300 From: Sergey Bronnikov Message-ID: <20200831100539.GA94343@pony.bronevichok.ru> References: <3c7b2274-8443-2be7-c181-2c7026ab0fec@tarantool.org> <20200826144538.GC47610@pony.bronevichok.ru> <1598525382.271317018@f511.i.mail.ru> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1598525382.271317018@f511.i.mail.ru> Subject: Re: [Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Serge Petrenko Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy Sergey, thanks for review! See my comments inline, test updated in a branch. On 13:49 Thu 27 Aug , Serge Petrenko wrote: > > Hi! Thanks for the fixes! > I’m pasting parts of the patch below to comment on. > > +box.cfg({ > +    listen = instance_uri(INSTANCE_ID); > +    replication_connect_quorum = 3; > +    replication = { > +        instance_uri(1); > +        instance_uri(2); > +        instance_uri(3); > +        instance_uri(4); > +        instance_uri(5); > +    }; > +}) >   >   > * You should either omit `replication_connect_quorum` at all, or set it to 5. > Omitting it will have the same effect. > I think you meant `replication_synchro_quorum` here, then it makes sense > to set it to 3. Also   `replication_synchro_timeout` should be set here, > I’ll mention it  ​​​​​​​ again below. removed replication_connect_quorum at all > + > +NUM_INSTANCES = 5 > +BROKEN_QUORUM = NUM_INSTANCES + 1 > + >   >   > * BROKEN_QUORUM assigned but never used. removed >   > + > +test_run:create_cluster(SERVERS, "replication", {args="0.1"}) > +test_run:wait_fullmesh(SERVERS) >   >   > * You’re passing some argument to qsync1, … qsync5 instances, but you never use  it. removed argument >   > +current_leader_id = 1 > +test_run:switch(SERVERS[current_leader_id]) > +box.cfg{replication_synchro_timeout=1} >   >   > * You should set `replication_synchro_timeout` on every instance, not only > on qsync1 so   you better move this box.cfg call to the instance file. > Besides, the timeout should be bigger (much bigger), like Vlad said. > We typically use 30 seconds for various replication timeouts. > It’s fairly common when a test is stable on your machine, but is flaky on > testing machines. increased timeout up to 1000 and set synchro settings in a qsync.lua file > +test_run:switch('default') > + > +-- Testcase body. > +for i=1,10 do                                                                  \ > +    new_leader_id = random(current_leader_id, 1, #SERVERS)                     \ > +    test_run:switch(SERVERS[new_leader_id])                                    \ > +    box.cfg{read_only=false}                                                   \ > +    f1 = fiber.create(function()                                               \ > +        pcall(box.space.sync:truncate{})                                       \ > +    end)                                                                       \ >   >   > * Why put truncate call in a separate fiber? > > Why use truncate at all? You may just replace all your `insert` calls below > with `replace`, and then truncate won’t be needed. This is up to you > though. As far as I remember original idea was to cleanup before next insert. I rewrote a body of test and keep all inserted values in a space and after loop check amount of inserted values. >   > +    f2 = fiber.create(function()                                               \ > +        for i=1,10000 do box.space.sync:insert{i} end                          \ > +    end)                                                                       \ >   > * So you’re testing a case when a leader has some unconfirmed transactions in > limbo and then a leader change happens. Then you need to call > `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise > the new leader fails to insert its data, but the test doesn’t show this, because you > don’t check fiber state or `insert()` return values. added clear_synchro_queue() call to updated test > +    test_run:switch('default')                                                 \ > +    test_run:switch(SERVERS[current_leader_id])                                \ >   >   > * The 2 lines above are useless. removed in updated test: -- Testcase body. for i=1,100 do \ test_run:eval(SERVERS[current_leader_id], \ "box.cfg{replication_synchro_quorum=6}") \ test_run:eval(SERVERS[current_leader_id], \ string.format("box.space.sync:insert{%d}", i)) \ new_leader_id = random(current_leader_id, 1, #SERVERS) \ test_run:eval(SERVERS[new_leader_id], \ "box.cfg{replication_synchro_quorum=3}") \ test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()") \ replica = random(new_leader_id, 1, #SERVERS) \ test_run:eval(SERVERS[replica], \ string.format("box.space.sync:get{%d}", i)) \ test_run:switch('default') \ current_leader_id = new_leader_id \ end test_run:switch('qsync1') box.space.sync:count() -- 100 >   > +    box.cfg{read_only=true}                                                    \ > +    test_run:switch('default')                                                 \ > +    current_leader_id = new_leader_id                                          \ > +end > + > +-- Teardown. > -- > Serge Petrenko -- sergeyb@