[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Tue Jul 21 01:01:05 MSK 2020


Thanks for the patch!

See 11 comments below.

> diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua
> new file mode 100644
> index 000000000..383aa5272
> --- /dev/null
> +++ b/test/replication/qsync.lua
> @@ -0,0 +1,62 @@
> +#!/usr/bin/env tarantool
> +
> +-- get instance name from filename (qsync1.lua => qsync1)
> +local INSTANCE_ID = string.match(arg[0], "%d")
> +
> +local SOCKET_DIR = require('fio').cwd()
> +
> +local TIMEOUT = tonumber(arg[1])
> +
> +local function instance_uri(instance_id)
> +    return SOCKET_DIR..'/qsync'..instance_id..'.sock';
> +end
> +
> +-- start console first
> +require('console').listen(os.getenv('ADMIN'))
> +
> +box.cfg({
> +    listen = instance_uri(INSTANCE_ID);
> +    replication_timeout = TIMEOUT;

1. Why do you need the custom replication_timeout?

> +    replication_sync_lag = 0.01;

2. Why do you need the lag setting?

> +    replication_connect_quorum = 3;
> +    replication = {
> +        instance_uri(1);
> +        instance_uri(2);
> +        instance_uri(3);
> +        instance_uri(4);
> +        instance_uri(5);
> +        instance_uri(6);
> +        instance_uri(7);
> +        instance_uri(8);
> +        instance_uri(9);
> +        instance_uri(10);
> +        instance_uri(11);
> +        instance_uri(12);
> +        instance_uri(13);
> +        instance_uri(14);
> +        instance_uri(15);
> +        instance_uri(16);
> +        instance_uri(17);
> +        instance_uri(18);
> +        instance_uri(19);
> +        instance_uri(20);
> +        instance_uri(21);
> +        instance_uri(22);
> +        instance_uri(23);
> +        instance_uri(24);
> +        instance_uri(25);
> +        instance_uri(26);
> +        instance_uri(27);
> +        instance_uri(28);
> +        instance_uri(29);
> +        instance_uri(30);
> +        instance_uri(31);

3. Seems like in the test you use only 3 instances, not 32. Also the
quorum is set to 3.

> +    };
> +})
> +
> +box.once("bootstrap", function()
> +    local test_run = require('test_run').new()
> +    box.schema.user.grant("guest", 'replication')
> +    box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
> +    box.space.test:create_index('primary')

4. Where do you use this space?

> +end)
> diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> new file mode 100644
> index 000000000..cb1b5e232
> --- /dev/null
> +++ b/test/replication/qsync_random_leader.result
> @@ -0,0 +1,123 @@
> +-- test-run result file version 2
> +os = require('os')
> + | ---
> + | ...
> +env = require('test_run')
> + | ---
> + | ...
> +math = require('math')
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = env.new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +NUM_INSTANCES = 3
> + | ---
> + | ...
> +BROKEN_QUORUM = NUM_INSTANCES + 1
> + | ---
> + | ...
> +
> +SERVERS = {}
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ';'")
> + | ---
> + | - true
> + | ...
> +for i=1,NUM_INSTANCES do
> +    SERVERS[i] = 'qsync' .. i
> +end;
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ''");

5. Please, lets be consistent and use either \ or the delimiter. Currently
it is irrational - you use \ for big code blocks, and a custom delimiter for
tiny blocks which could even be one line. Personally, I would use \
everywhere.

> + | ---
> + | - true
> + | ...
> +SERVERS -- print instance names
> + | ---
> + | - - qsync1
> + |   - qsync2
> + |   - qsync3
> + | ...
> +
> +random = function(excluded_num, min, max)       \

6. Would be better to align all \ by 80 in this file. Makes easier to add
new longer lines in future without moving all the old \.

> +    math.randomseed(os.time())                  \
> +    local r = math.random(min, max)             \
> +    if (r == excluded_num) then                 \
> +        return random(excluded_num, min, max)   \
> +    end                                         \
> +    return r                                    \
> +end
> + | ---
> + | ...
> +
> +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
> + | ---
> + | ...
> +test_run:wait_fullmesh(SERVERS)
> + | ---
> + | ...
> +current_leader_id = 1
> + | ---
> + | ...
> +test_run:switch(SERVERS[current_leader_id])
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1}

7. The timeout is tiny. It will lead to flakiness sooner or later, 100%.

> + | ---
> + | ...
> +_ = box.schema.space.create('sync', {is_sync=true})
> + | ---
> + | ...
> +_ = box.space.sync:create_index('pk')
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +
> +-- Testcase body.
> +for i=1,10 do                                                 \
> +    new_leader_id = random(current_leader_id, 1, #SERVERS)    \
> +    test_run:switch(SERVERS[new_leader_id])                   \
> +    box.cfg{read_only=false}                                  \
> +    fiber = require('fiber')                                  \
> +    f1 = fiber.create(function() box.space.sync:delete{} end) \

8. Delete without a key will fail. You would notice it if you
would check results of the DML operations. Please, do that via pcall.

> +    f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \

9. You have \ exactly to avoid such long lines.

> +    f1.status()                                               \
> +    f2.status()                                               \

10. Output is not printed inside one statement. This whole cycle is
one statement because of \, so these status() calls are useless.

> +    test_run:switch('default')                                \
> +    test_run:switch(SERVERS[current_leader_id])               \
> +    box.cfg{read_only=true}                                   \
> +    test_run:switch('default')                                \
> +    current_leader_id = new_leader_id                         \
> +    fiber.sleep(0.1)                                          \

11. Why do you need this fiber.sleep()?

> +end
> + | ---
> + | ...
> +
> +-- Teardown.
> +test_run:switch(SERVERS[current_leader_id])
> + | ---
> + | - true
> + | ...
> +box.space.sync:drop()
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:drop_cluster(SERVERS)
> + | ---
> + | ...


More information about the Tarantool-patches mailing list