From: Vladislav Shpilevoy
To: Sergey Bronnikov, tarantool-patches@dev.tarantool.org
Date: Wed, 25 Nov 2020 23:02:23 +0100
Message-ID: <966e3ab6-036f-bb70-d35a-ab9a37d6ed28@tarantool.org>
In-Reply-To: <7164d167-4386-4ea7-ae33-e19ca1f8b326@tarantool.org>
References: <9eb76fb56017daca32e3fbc5b5415fc144eb0699.1605629206.git.sergeyb@tarantool.org>
 <3496e5a3-fc28-880e-d71f-d9746cc56412@tarantool.org>
 <7164d167-4386-4ea7-ae33-e19ca1f8b326@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH 3/4] replication: add test with random leaders promotion and demotion
List-Id: Tarantool development patches

Hi! Thanks for the patch!

See 3 comments below.

> diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> new file mode 100644
> index 000000000..2d242e802
> --- /dev/null
> +++ b/test/replication/qsync_random_leader.result
> @@ -0,0 +1,137 @@
> +-- test-run result file version 2
> +os = require('os')
> + | ---
> + | ...
> +env = require('test_run')
> + | ---
> + | ...
> +math = require('math')
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = env.new()
> + | ---
> + | ...
> +netbox = require('net.box')
> + | ---
> + | ...
> +
> +NUM_INSTANCES = 5
> + | ---
> + | ...
> +SERVERS = {}
> + | ---
> + | ...
> +for i=1,NUM_INSTANCES do \

1. Here is some more whitespace: '    ', '    ', '    ', '    '. It
should be enough to align the continuation backslashes in this loop and
in the loops below.

> +    SERVERS[i] = 'qsync' .. i \
> +end
> + | ---
> + | ...
> +SERVERS -- print instance names
> + | ---
> + | - - qsync1
> + |   - qsync2
> + |   - qsync3
> + |   - qsync4
> + |   - qsync5
> + | ...
> +
> +math.randomseed(os.time())
> + | ---
> + | ...
> +function random(excluded_num, total) \
> +    local r = math.random(1, total) \
> +    if (r == excluded_num) then \

2. In Lua we never wrap conditions into parentheses.

> +        return random(excluded_num, total) \
> +    end \
> +    return r \
> +end
> + | ---
> + | ...
> +
> +-- Set 'broken' quorum on current leader.
> +-- Write value on current leader.
> +-- Pick a random replica in a cluster.
> +-- Set 'good' quorum on it and promote to a leader.
> +-- Make sure value is there and on an old leader.
> +
> +-- Testcase setup.
> +test_run:create_cluster(SERVERS)
> + | ---
> + | ...
> +test_run:wait_fullmesh(SERVERS)
> + | ---
> + | ...
> +test_run:switch('qsync1')
> + | ---
> + | - true
> + | ...
> +_ = box.schema.space.create('sync', {is_sync=true, engine = test_run:get_cfg('engine')})
> + | ---
> + | ...
> +_ = box.space.sync:create_index('primary')
> + | ---
> + | ...
> +box.schema.user.grant('guest', 'write', 'space', 'sync')
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +current_leader_id = 1
> + | ---
> + | ...
> +test_run:eval(SERVERS[current_leader_id], "box.ctl.clear_synchro_queue()")
> + | ---
> + | - []
> + | ...
> +
> +SOCKET_DIR = require('fio').cwd()
> + | ---
> + | ...
> +
> +-- Testcase body.
> +for i=1,30 do \
> +    test_run:eval(SERVERS[current_leader_id], \
> +        "box.cfg{replication_synchro_quorum=6, replication_synchro_timeout=1000}") \
> +    c = netbox.connect(SOCKET_DIR..'/'..SERVERS[current_leader_id]..'.sock') \
> +    fiber.create(function() c.space.sync:insert{i} end) \
> +    new_leader_id = random(current_leader_id, #SERVERS) \
> +    test_run:eval(SERVERS[new_leader_id], \
> +        "box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.01}") \
> +    test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()") \

3. There is a subtle issue here which, I suspect, makes this test check
something other than what it is supposed to check.

When you do c.space.sync:insert{i} in a new fiber, you don't wait until
the insertion is replicated to the new leader. So it can happen that by
the time you call box.ctl.clear_synchro_queue() on the new leader, its
queue is empty, or contains old records from previous iterations rather
than the value from the current iteration.

If the test is supposed to commit the old leader's record on the new
leader in the same iteration, then it is likely working by luck.

I suggest you turn on the MVCC engine to disable dirty reads, to make
sure you don't mistake a dirty read for a committed read. And rely
entirely on LSNs to wait for replication - we have the wait_lsn and
get_lsn test_run methods for that.

In fact, it is always better to rely on LSNs, because dirty reads may be
turned off by default someday, and then we would need to rewrite such
tests.

> +    c:close() \
> +    replica = random(new_leader_id, #SERVERS) \
> +    test_run:wait_cond(function() return test_run:eval(SERVERS[replica], \
> +        string.format("box.space.sync:get{%d}", i))[1] ~= nil end) \
> +    test_run:wait_cond(function() return test_run:eval(SERVERS[current_leader_id], \
> +        string.format("box.space.sync:get{%d}", i))[1] ~= nil end) \
> +    new_leader_id = random(current_leader_id, #SERVERS) \
> +    current_leader_id = new_leader_id \
> +end
> + | ---
> + | ...
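To make the LSN suggestion from comment 3 concrete, the middle of the
loop could look roughly like this. This is only an untested sketch; it
assumes test_run:wait_lsn(waiter, master) blocks until 'waiter' has
applied everything 'master' has written:

```lua
-- Sketch only: wait for replication by LSN before touching the queue.
fiber.create(function() c.space.sync:insert{i} end)
new_leader_id = random(current_leader_id, #SERVERS)
-- Wait until the new leader has received the old leader's row, so that
-- clear_synchro_queue() below sees this iteration's value in the queue.
test_run:wait_lsn(SERVERS[new_leader_id], SERVERS[current_leader_id])
test_run:eval(SERVERS[new_leader_id],
    "box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.01}")
test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()")
```

With that in place the wait_cond() polling below becomes a plain
correctness check rather than the synchronization mechanism.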
> +
> +for i=1,NUM_INSTANCES do \
> +    instance = string.format('qsync%d', i) \
> +    test_run:wait_cond(function() return test_run:eval(instance, \
> +        "box.space.sync:count()")[1] == 30 end) \
> +end
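By the way, regarding comment 2 - the helper could be written without
the parentheses, and with a loop instead of recursion, roughly like this
(an untested sketch):

```lua
-- Pick a random instance id in [1, total], excluding excluded_num.
function random(excluded_num, total)
    local r = math.random(1, total)
    while r == excluded_num do
        r = math.random(1, total)
    end
    return r
end
```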