[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Tue Jul 21 01:01:05 MSK 2020
Thanks for the patch!
See 11 comments below.
> diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua
> new file mode 100644
> index 000000000..383aa5272
> --- /dev/null
> +++ b/test/replication/qsync.lua
> @@ -0,0 +1,62 @@
> +#!/usr/bin/env tarantool
> +
> +-- get instance name from filename (qsync1.lua => qsync1)
> +local INSTANCE_ID = string.match(arg[0], "%d")
> +
> +local SOCKET_DIR = require('fio').cwd()
> +
> +local TIMEOUT = tonumber(arg[1])
> +
> +local function instance_uri(instance_id)
> + return SOCKET_DIR..'/qsync'..instance_id..'.sock';
> +end
> +
> +-- start console first
> +require('console').listen(os.getenv('ADMIN'))
> +
> +box.cfg({
> + listen = instance_uri(INSTANCE_ID);
> + replication_timeout = TIMEOUT;
1. Why do you need the custom replication_timeout?
> + replication_sync_lag = 0.01;
2. Why do you need the lag setting?
> + replication_connect_quorum = 3;
> + replication = {
> + instance_uri(1);
> + instance_uri(2);
> + instance_uri(3);
> + instance_uri(4);
> + instance_uri(5);
> + instance_uri(6);
> + instance_uri(7);
> + instance_uri(8);
> + instance_uri(9);
> + instance_uri(10);
> + instance_uri(11);
> + instance_uri(12);
> + instance_uri(13);
> + instance_uri(14);
> + instance_uri(15);
> + instance_uri(16);
> + instance_uri(17);
> + instance_uri(18);
> + instance_uri(19);
> + instance_uri(20);
> + instance_uri(21);
> + instance_uri(22);
> + instance_uri(23);
> + instance_uri(24);
> + instance_uri(25);
> + instance_uri(26);
> + instance_uri(27);
> + instance_uri(28);
> + instance_uri(29);
> + instance_uri(30);
> + instance_uri(31);
3. It seems the test uses only 3 instances, not 31. Also, the
quorum is set to 3.
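For instance, the replication list could be generated from the instance count instead of being spelled out by hand. Below is a plain-Lua sketch; `NUM_INSTANCES` and the fixed `SOCKET_DIR` are assumptions for illustration (the patch takes the directory from `require('fio').cwd()`).

```lua
-- Hypothetical values: qsync.lua takes the directory from
-- require('fio').cwd(); NUM_INSTANCES would match the test (3).
local SOCKET_DIR = '/tmp'
local NUM_INSTANCES = 3

local function instance_uri(instance_id)
    return SOCKET_DIR .. '/qsync' .. instance_id .. '.sock'
end

-- Build the replication list from the instance count, so it always
-- matches the number of instances the test actually starts.
local replication = {}
for i = 1, NUM_INSTANCES do
    replication[i] = instance_uri(i)
end
```

The resulting table would then be passed as `replication = replication` in `box.cfg()`.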
> + };
> +})
> +
> +box.once("bootstrap", function()
> + local test_run = require('test_run').new()
> + box.schema.user.grant("guest", 'replication')
> + box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
> + box.space.test:create_index('primary')
4. Where do you use this space?
> +end)
> diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> new file mode 100644
> index 000000000..cb1b5e232
> --- /dev/null
> +++ b/test/replication/qsync_random_leader.result
> @@ -0,0 +1,123 @@
> +-- test-run result file version 2
> +os = require('os')
> + | ---
> + | ...
> +env = require('test_run')
> + | ---
> + | ...
> +math = require('math')
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = env.new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +NUM_INSTANCES = 3
> + | ---
> + | ...
> +BROKEN_QUORUM = NUM_INSTANCES + 1
> + | ---
> + | ...
> +
> +SERVERS = {}
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ';'")
> + | ---
> + | - true
> + | ...
> +for i=1,NUM_INSTANCES do
> + SERVERS[i] = 'qsync' .. i
> +end;
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ''");
5. Please, let's be consistent and use either \ or the delimiter. Currently
the choice looks arbitrary: you use \ for big code blocks and a custom
delimiter for tiny blocks that could even fit on one line. Personally, I
would use \ everywhere.
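For example, the SERVERS loop is short enough to fit on a single line, which needs neither \ nor the delimiter. A sketch of that suggestion (the one-line form is illustrative, not taken from the patch):

```lua
-- Globals, matching the style of the test file.
NUM_INSTANCES = 3
SERVERS = {}
-- One statement on one line: no delimiter switching, no \ continuation.
for i = 1, NUM_INSTANCES do SERVERS[i] = 'qsync' .. i end
```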
> + | ---
> + | - true
> + | ...
> +SERVERS -- print instance names
> + | ---
> + | - - qsync1
> + | - qsync2
> + | - qsync3
> + | ...
> +
> +random = function(excluded_num, min, max) \
6. It would be better to align all the \ at column 80 in this file. That makes
it easier to add longer lines in the future without moving all the old \.
> + math.randomseed(os.time()) \
> + local r = math.random(min, max) \
> + if (r == excluded_num) then \
> + return random(excluded_num, min, max) \
> + end \
> + return r \
> +end
> + | ---
> + | ...
> +
> +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
> + | ---
> + | ...
> +test_run:wait_fullmesh(SERVERS)
> + | ---
> + | ...
> +current_leader_id = 1
> + | ---
> + | ...
> +test_run:switch(SERVERS[current_leader_id])
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1}
7. The timeout is tiny. It will make the test flaky sooner or later, 100%.
> + | ---
> + | ...
> +_ = box.schema.space.create('sync', {is_sync=true})
> + | ---
> + | ...
> +_ = box.space.sync:create_index('pk')
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +
> +-- Testcase body.
> +for i=1,10 do \
> + new_leader_id = random(current_leader_id, 1, #SERVERS) \
> + test_run:switch(SERVERS[new_leader_id]) \
> + box.cfg{read_only=false} \
> + fiber = require('fiber') \
> + f1 = fiber.create(function() box.space.sync:delete{} end) \
8. Delete without a key will fail. You would notice it if you
checked the results of the DML operations. Please, do that via pcall.
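To illustrate the pcall approach, here is a minimal plain-Lua sketch. The `space` table is a hypothetical in-memory stand-in for `box.space.sync` (so the snippet runs without Tarantool), its error strings are made up, and `run_checked` is a hypothetical helper; in the test, the same pattern would wrap the real `box.space.sync` calls inside the fiber functions.

```lua
-- Hypothetical stand-in for box.space.sync; not Tarantool's API.
local space = {rows = {}}

function space:insert(tuple)
    local key = tuple[1]
    if self.rows[key] ~= nil then
        error('duplicate key (stub error)')
    end
    self.rows[key] = tuple
    return tuple
end

function space:delete(key)
    -- Mirrors the problem from the review: a delete needs a key.
    if key == nil or key[1] == nil then
        error('delete requires a key (stub error)')
    end
    local old = self.rows[key[1]]
    self.rows[key[1]] = nil
    return old
end

-- Run each operation under pcall and collect errors instead of
-- silently ignoring them.
local function run_checked(ops)
    local errors = {}
    for _, op in ipairs(ops) do
        local ok, err = pcall(op)
        if not ok then
            table.insert(errors, tostring(err))
        end
    end
    return errors
end

local errs = run_checked({
    function() return space:insert{1} end,
    function() return space:delete{} end,  -- fails: no key
    function() return space:delete{1} end, -- succeeds
})
print(#errs) -- 1
```

Checking that the returned error list is empty after each iteration would make a failing DML operation visible immediately.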
> + f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \
9. You use \ precisely to avoid such long lines.
> + f1.status() \
> + f2.status() \
10. Output is not printed in the middle of a statement. This whole loop is
one statement because of \, so these status() calls are useless.
> + test_run:switch('default') \
> + test_run:switch(SERVERS[current_leader_id]) \
> + box.cfg{read_only=true} \
> + test_run:switch('default') \
> + current_leader_id = new_leader_id \
> + fiber.sleep(0.1) \
11. Why do you need this fiber.sleep()?
> +end
> + | ---
> + | ...
> +
> +-- Teardown.
> +test_run:switch(SERVERS[current_leader_id])
> + | ---
> + | - true
> + | ...
> +box.space.sync:drop()
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:drop_cluster(SERVERS)
> + | ---
> + | ...