[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion
Sergey Bronnikov
sergeyb at tarantool.org
Wed Aug 26 17:45:38 MSK 2020
Vlad, thanks for review!
Patch updated in a branch.
To make sure patch doesn't make test flaky I run test 100 times using 10
workers in parallel without problems.
../../test/test-run.py --builddir=/home/s.bronnikov/tarantool/build --vardir=/home/s.bronnikov/tarantool
/build/test/var -j 10 $(yes replication/qsync_random_leader.test.lua | head -n 100)
On 00:01 Tue 21 Jul , Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> See 11 comments below.
>
> > diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua
> > new file mode 100644
> > index 000000000..383aa5272
> > --- /dev/null
> > +++ b/test/replication/qsync.lua
> > @@ -0,0 +1,62 @@
> > +#!/usr/bin/env tarantool
> > +
> > +-- get instance name from filename (qsync1.lua => qsync1)
> > +local INSTANCE_ID = string.match(arg[0], "%d")
> > +
> > +local SOCKET_DIR = require('fio').cwd()
> > +
> > +local TIMEOUT = tonumber(arg[1])
> > +
> > +local function instance_uri(instance_id)
> > + return SOCKET_DIR..'/qsync'..instance_id..'.sock';
> > +end
> > +
> > +-- start console first
> > +require('console').listen(os.getenv('ADMIN'))
> > +
> > +box.cfg({
> > + listen = instance_uri(INSTANCE_ID);
> > + replication_timeout = TIMEOUT;
>
> 1. Why do you need the custom replication_timeout?
It is actually a copy-paste from original cluster initialization script,
removed replication_timeout.
> > + replication_sync_lag = 0.01;
>
> 2. Why do you need the lag setting?
the same as above
> > + replication_connect_quorum = 3;
> > + replication = {
> > + instance_uri(1);
> > + instance_uri(2);
> > + instance_uri(3);
> > + instance_uri(4);
> > + instance_uri(5);
> > + instance_uri(6);
> > + instance_uri(7);
> > + instance_uri(8);
> > + instance_uri(9);
> > + instance_uri(10);
> > + instance_uri(11);
> > + instance_uri(12);
> > + instance_uri(13);
> > + instance_uri(14);
> > + instance_uri(15);
> > + instance_uri(16);
> > + instance_uri(17);
> > + instance_uri(18);
> > + instance_uri(19);
> > + instance_uri(20);
> > + instance_uri(21);
> > + instance_uri(22);
> > + instance_uri(23);
> > + instance_uri(24);
> > + instance_uri(25);
> > + instance_uri(26);
> > + instance_uri(27);
> > + instance_uri(28);
> > + instance_uri(29);
> > + instance_uri(30);
> > + instance_uri(31);
>
> 3. Seems like in the test you use only 3 instances, not 32. Also the
> quorum is set to 3.
in updated test 5 instances are in use, others removed in initialization script
> > + };
> > +})
> > +
> > +box.once("bootstrap", function()
> > + local test_run = require('test_run').new()
> > + box.schema.user.grant("guest", 'replication')
> > + box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
> > + box.space.test:create_index('primary')
>
> 4. Where do you use this space?
space has been renamed to "sync" and used it in a test
> > +end)
> > diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> > new file mode 100644
> > index 000000000..cb1b5e232
> > --- /dev/null
> > +++ b/test/replication/qsync_random_leader.result
> > @@ -0,0 +1,123 @@
> > +-- test-run result file version 2
> > +os = require('os')
> > + | ---
> > + | ...
> > +env = require('test_run')
> > + | ---
> > + | ...
> > +math = require('math')
> > + | ---
> > + | ...
> > +fiber = require('fiber')
> > + | ---
> > + | ...
> > +test_run = env.new()
> > + | ---
> > + | ...
> > +engine = test_run:get_cfg('engine')
> > + | ---
> > + | ...
> > +
> > +NUM_INSTANCES = 3
> > + | ---
> > + | ...
> > +BROKEN_QUORUM = NUM_INSTANCES + 1
> > + | ---
> > + | ...
> > +
> > +SERVERS = {}
> > + | ---
> > + | ...
> > +test_run:cmd("setopt delimiter ';'")
> > + | ---
> > + | - true
> > + | ...
> > +for i=1,NUM_INSTANCES do
> > + SERVERS[i] = 'qsync' .. i
> > +end;
> > + | ---
> > + | ...
> > +test_run:cmd("setopt delimiter ''");
>
> 5. Please, lets be consistent and use either \ or the delimiter. Currently
> it is irrational - you use \ for big code blocks, and a custom delimiter for
> tiny blocks which could even be one line. Personally, I would use \
> everywhere.
replaced constructions with delimiters to multiline statements
> > + | ---
> > + | - true
> > + | ...
> > +SERVERS -- print instance names
> > + | ---
> > + | - - qsync1
> > + | - qsync2
> > + | - qsync3
> > + | ...
> > +
> > +random = function(excluded_num, min, max) \
>
> 6. Would be better to align all \ by 80 in this file. Makes easier to add
> new longer lines in future without moving all the old \.
Done.
> > + math.randomseed(os.time()) \
> > + local r = math.random(min, max) \
> > + if (r == excluded_num) then \
> > + return random(excluded_num, min, max) \
> > + end \
> > + return r \
> > +end
> > + | ---
> > + | ...
> > +
> > +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
> > + | ---
> > + | ...
> > +test_run:wait_fullmesh(SERVERS)
> > + | ---
> > + | ...
> > +current_leader_id = 1
> > + | ---
> > + | ...
> > +test_run:switch(SERVERS[current_leader_id])
> > + | ---
> > + | - true
> > + | ...
> > +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1}
>
> 7. The timeout is tiny. It will lead to flakiness sooner or later, 100%.
increased to 1 sec
> > + | ---
> > + | ...
> > +_ = box.schema.space.create('sync', {is_sync=true})
> > + | ---
> > + | ...
> > +_ = box.space.sync:create_index('pk')
> > + | ---
> > + | ...
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +
> > +-- Testcase body.
> > +for i=1,10 do \
> > + new_leader_id = random(current_leader_id, 1, #SERVERS) \
> > + test_run:switch(SERVERS[new_leader_id]) \
> > + box.cfg{read_only=false} \
> > + fiber = require('fiber') \
> > + f1 = fiber.create(function() box.space.sync:delete{} end) \
>
> 8. Delete without a key will fail. You would notice it if you
> would check results of the DML operations. Please, do that via pcall.
>
replaced delete{} with truncate{}
> > + f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \
>
> 9. You have \ exactly to avoid such long lines.
splitted for shorter lines
>
> > + f1.status() \
> > + f2.status() \
>
> 10. Output is not printed inside one statement. This whole cycle is
> one statement because of \, so these status() calls are useless.
removed
> > + test_run:switch('default') \
> > + test_run:switch(SERVERS[current_leader_id]) \
> > + box.cfg{read_only=true} \
> > + test_run:switch('default') \
> > + current_leader_id = new_leader_id \
> > + fiber.sleep(0.1) \
>
> 11. Why do you need this fiber.sleep()?
I don't remember the reason to add it, but test works fine without it.
So I removed it in updated patch.
> > +end
> > + | ---
> > + | ...
> > +
> > +-- Teardown.
> > +test_run:switch(SERVERS[current_leader_id])
> > + | ---
> > + | - true
> > + | ...
> > +box.space.sync:drop()
> > + | ---
> > + | ...
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +test_run:drop_cluster(SERVERS)
> > + | ---
> > + | ...
--
sergeyb@
More information about the Tarantool-patches
mailing list