[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion

Sergey Bronnikov sergeyb at tarantool.org
Wed Aug 26 17:45:38 MSK 2020


Vlad, thanks for the review!
The patch is updated in the branch.

To make sure the patch doesn't make the test flaky, I ran the test 100 times
using 10 parallel workers without problems:
../../test/test-run.py --builddir=/home/s.bronnikov/tarantool/build \
    --vardir=/home/s.bronnikov/tarantool/build/test/var -j 10 \
    $(yes replication/qsync_random_leader.test.lua | head -n 100)

On 00:01 Tue 21 Jul, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> See 11 comments below.
> 
> > diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua
> > new file mode 100644
> > index 000000000..383aa5272
> > --- /dev/null
> > +++ b/test/replication/qsync.lua
> > @@ -0,0 +1,62 @@
> > +#!/usr/bin/env tarantool
> > +
> > +-- get instance name from filename (qsync1.lua => qsync1)
> > +local INSTANCE_ID = string.match(arg[0], "%d")
> > +
> > +local SOCKET_DIR = require('fio').cwd()
> > +
> > +local TIMEOUT = tonumber(arg[1])
> > +
> > +local function instance_uri(instance_id)
> > +    return SOCKET_DIR..'/qsync'..instance_id..'.sock';
> > +end
> > +
> > +-- start console first
> > +require('console').listen(os.getenv('ADMIN'))
> > +
> > +box.cfg({
> > +    listen = instance_uri(INSTANCE_ID);
> > +    replication_timeout = TIMEOUT;
> 
> 1. Why do you need the custom replication_timeout?

It was actually a copy-paste from the original cluster initialization script.
I removed replication_timeout.

> > +    replication_sync_lag = 0.01;
> 
> 2. Why do you need the lag setting?

Same as above: it was copied from the original script, so I removed it.

> > +    replication_connect_quorum = 3;
> > +    replication = {
> > +        instance_uri(1);
> > +        instance_uri(2);
> > +        instance_uri(3);
> > +        instance_uri(4);
> > +        instance_uri(5);
> > +        instance_uri(6);
> > +        instance_uri(7);
> > +        instance_uri(8);
> > +        instance_uri(9);
> > +        instance_uri(10);
> > +        instance_uri(11);
> > +        instance_uri(12);
> > +        instance_uri(13);
> > +        instance_uri(14);
> > +        instance_uri(15);
> > +        instance_uri(16);
> > +        instance_uri(17);
> > +        instance_uri(18);
> > +        instance_uri(19);
> > +        instance_uri(20);
> > +        instance_uri(21);
> > +        instance_uri(22);
> > +        instance_uri(23);
> > +        instance_uri(24);
> > +        instance_uri(25);
> > +        instance_uri(26);
> > +        instance_uri(27);
> > +        instance_uri(28);
> > +        instance_uri(29);
> > +        instance_uri(30);
> > +        instance_uri(31);
> 
> 3. Seems like in the test you use only 3 instances, not 32. Also the
> quorum is set to 3.

The updated test uses 5 instances; the others are removed from the
initialization script.
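
For illustration, a replication list of this kind could also be generated
instead of enumerated by hand. This is only a sketch in plain Lua:
build_replication() and the count of 5 are assumptions made for the example,
not the code in the branch. The URI format follows instance_uri() from the
quoted script.

```lua
-- Same URI format as instance_uri() in qsync.lua, with the socket
-- directory passed explicitly so the sketch runs outside Tarantool.
local function instance_uri(socket_dir, instance_id)
    return socket_dir .. '/qsync' .. instance_id .. '.sock'
end

-- Hypothetical helper: build the URI list for `count` instances.
local function build_replication(socket_dir, count)
    local uris = {}
    for i = 1, count do
        uris[i] = instance_uri(socket_dir, i)
    end
    return uris
end

-- In the instance script the result would go into box.cfg, e.g.
-- box.cfg({ replication = build_replication(require('fio').cwd(), 5) })
local uris = build_replication('/tmp', 5)
print(#uris, uris[1], uris[#uris])
```

This keeps the instance count in one place, so changing the cluster size
does not require editing the list.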

> > +    };
> > +})
> > +
> > +box.once("bootstrap", function()
> > +    local test_run = require('test_run').new()
> > +    box.schema.user.grant("guest", 'replication')
> > +    box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
> > +    box.space.test:create_index('primary')
> 
> 4. Where do you use this space?

The space has been renamed to "sync" and is now used in the test.

> > +end)
> > diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> > new file mode 100644
> > index 000000000..cb1b5e232
> > --- /dev/null
> > +++ b/test/replication/qsync_random_leader.result
> > @@ -0,0 +1,123 @@
> > +-- test-run result file version 2
> > +os = require('os')
> > + | ---
> > + | ...
> > +env = require('test_run')
> > + | ---
> > + | ...
> > +math = require('math')
> > + | ---
> > + | ...
> > +fiber = require('fiber')
> > + | ---
> > + | ...
> > +test_run = env.new()
> > + | ---
> > + | ...
> > +engine = test_run:get_cfg('engine')
> > + | ---
> > + | ...
> > +
> > +NUM_INSTANCES = 3
> > + | ---
> > + | ...
> > +BROKEN_QUORUM = NUM_INSTANCES + 1
> > + | ---
> > + | ...
> > +
> > +SERVERS = {}
> > + | ---
> > + | ...
> > +test_run:cmd("setopt delimiter ';'")
> > + | ---
> > + | - true
> > + | ...
> > +for i=1,NUM_INSTANCES do
> > +    SERVERS[i] = 'qsync' .. i
> > +end;
> > + | ---
> > + | ...
> > +test_run:cmd("setopt delimiter ''");
> 
> 5. Please, lets be consistent and use either \ or the delimiter. Currently
> it is irrational - you use \ for big code blocks, and a custom delimiter for
> tiny blocks which could even be one line. Personally, I would use \
> everywhere.

Replaced the delimiter constructions with multiline statements using \.

> > + | ---
> > + | - true
> > + | ...
> > +SERVERS -- print instance names
> > + | ---
> > + | - - qsync1
> > + |   - qsync2
> > + |   - qsync3
> > + | ...
> > +
> > +random = function(excluded_num, min, max)       \
> 
> 6. Would be better to align all \ by 80 in this file. Makes easier to add
> new longer lines in future without moving all the old \.

Done.

> > +    math.randomseed(os.time())                  \
> > +    local r = math.random(min, max)             \
> > +    if (r == excluded_num) then                 \
> > +        return random(excluded_num, min, max)   \
> > +    end                                         \
> > +    return r                                    \
> > +end
> > + | ---
> > + | ...
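
A side note on the helper quoted above: it reseeds the generator with
os.time() on every call, so a retry within the same second draws the same
value again. A possible alternative, as a sketch only, is to seed once and
shift the drawn value past the excluded one. random_except() is a
hypothetical name, and the sketch assumes min <= excluded_num <= max and
max > min.

```lua
-- Seed once, outside the helper (assumption: done at the start of the test).
math.randomseed(os.time())

-- Hypothetical variant of random(): returns an integer in [min, max]
-- that differs from excluded_num, without recursion or reseeding.
local function random_except(excluded_num, min, max)
    -- Draw from a range that is one element shorter...
    local r = math.random(min, max - 1)
    -- ...and skip over the excluded value.
    if r >= excluded_num then
        r = r + 1
    end
    return r
end

print(random_except(1, 1, 5))
```

In the test it would be called the same way as the original, for example
new_leader_id = random_except(current_leader_id, 1, #SERVERS).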
> > +
> > +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
> > + | ---
> > + | ...
> > +test_run:wait_fullmesh(SERVERS)
> > + | ---
> > + | ...
> > +current_leader_id = 1
> > + | ---
> > + | ...
> > +test_run:switch(SERVERS[current_leader_id])
> > + | ---
> > + | - true
> > + | ...
> > +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1}
> 
> 7. The timeout is tiny. It will lead to flakiness sooner or later, 100%.

Increased to 1 second.

> > + | ---
> > + | ...
> > +_ = box.schema.space.create('sync', {is_sync=true})
> > + | ---
> > + | ...
> > +_ = box.space.sync:create_index('pk')
> > + | ---
> > + | ...
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +
> > +-- Testcase body.
> > +for i=1,10 do                                                 \
> > +    new_leader_id = random(current_leader_id, 1, #SERVERS)    \
> > +    test_run:switch(SERVERS[new_leader_id])                   \
> > +    box.cfg{read_only=false}                                  \
> > +    fiber = require('fiber')                                  \
> > +    f1 = fiber.create(function() box.space.sync:delete{} end) \
> 
> 8. Delete without a key will fail. You would notice it if you
> would check results of the DML operations. Please, do that via pcall.
> 

Replaced delete{} with truncate{}.
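
Checking DML results via pcall, as suggested in comment 8, could look
roughly like the sketch below. It is plain Lua with a stub table standing in
for box.space.sync so that it runs outside Tarantool. The stub and the
try_dml() helper are assumptions for the example, not part of the patch.

```lua
-- Stub standing in for box.space.sync (assumption, illustration only).
local space = {}
function space:insert(tuple)
    if tuple[1] == nil then
        error('key is missing')
    end
    return tuple
end

-- Hypothetical helper: run a space method under pcall.
-- Returns true and the result, or false and the error.
local function try_dml(method, ...)
    return pcall(space[method], space, ...)
end

local ok, res = try_dml('insert', {1})
print(ok, res[1])

local ok2, err = try_dml('insert', {})
print(ok2, err)
```

In the test itself the same pattern would be
ok, err = pcall(box.space.sync.insert, box.space.sync, {i}),
so that a failed operation shows up in the result file.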

> > +    f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \
> 
> 9. You have \ exactly to avoid such long lines.

Split into shorter lines.

> 
> > +    f1.status()                                               \
> > +    f2.status()                                               \
> 
> 10. Output is not printed inside one statement. This whole cycle is
> one statement because of \, so these status() calls are useless.

Removed.

> > +    test_run:switch('default')                                \
> > +    test_run:switch(SERVERS[current_leader_id])               \
> > +    box.cfg{read_only=true}                                   \
> > +    test_run:switch('default')                                \
> > +    current_leader_id = new_leader_id                         \
> > +    fiber.sleep(0.1)                                          \
> 
> 11. Why do you need this fiber.sleep()?

I don't remember why I added it, but the test works fine without it,
so I removed it in the updated patch.

> > +end
> > + | ---
> > + | ...
> > +
> > +-- Teardown.
> > +test_run:switch(SERVERS[current_leader_id])
> > + | ---
> > + | - true
> > + | ...
> > +box.space.sync:drop()
> > + | ---
> > + | ...
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +test_run:drop_cluster(SERVERS)
> > + | ---
> > + | ...

-- 
sergeyb@

