[Tarantool-patches] [PATCH 3/4] replication: add test with random leaders promotion and demotion

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu Nov 26 01:02:23 MSK 2020


Hi! Thanks for the patch!

See 3 comments below.

> diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result
> new file mode 100644
> index 000000000..2d242e802
> --- /dev/null
> +++ b/test/replication/qsync_random_leader.result
> @@ -0,0 +1,137 @@
> +-- test-run result file version 2
> +os = require('os')
> + | ---
> + | ...
> +env = require('test_run')
> + | ---
> + | ...
> +math = require('math')
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = env.new()
> + | ---
> + | ...
> +netbox = require('net.box')
> + | ---
> + | ...
> +
> +NUM_INSTANCES = 5
> + | ---
> + | ...
> +SERVERS = {}
> + | ---
> + | ...
> +for i=1,NUM_INSTANCES do                                                       \

1. The trailing backslashes are not aligned in one column. Here are
some more whitespaces: '                              ',
'               ', '               ', '               '. Should be
enough to align this loop and the loops below.
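
I.e., pad the lines so the backslashes line up, something like this
(just an illustration, the exact column is up to you):

for i=1,NUM_INSTANCES do                                                       \
    SERVERS[i] = 'qsync' .. i                                                  \
end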

> +    SERVERS[i] = 'qsync' .. i                                                  \
> +end
> + | ---
> + | ...
> +SERVERS -- print instance names
> + | ---
> + | - - qsync1
> + |   - qsync2
> + |   - qsync3
> + |   - qsync4
> + |   - qsync5
> + | ...
> +
> +math.randomseed(os.time())
> + | ---
> + | ...
> +function random(excluded_num, total)                                           \
> +    local r = math.random(1, total)                                            \
> +    if (r == excluded_num) then                                                \

2. In Lua we never wrap conditions in parentheses.
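
I.e., just:

    if r == excluded_num then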

> +        return random(excluded_num, total)                                     \
> +    end                                                                        \
> +    return r                                                                   \
> +end
> + | ---
> + | ...
> +
> +-- Set 'broken' quorum on current leader.
> +-- Write value on current leader.
> +-- Pick a random replica in a cluster.
> +-- Set 'good' quorum on it and promote to a leader.
> +-- Make sure value is there and on an old leader.
> +
> +-- Testcase setup.
> +test_run:create_cluster(SERVERS)
> + | ---
> + | ...
> +test_run:wait_fullmesh(SERVERS)
> + | ---
> + | ...
> +test_run:switch('qsync1')
> + | ---
> + | - true
> + | ...
> +_ = box.schema.space.create('sync', {is_sync=true, engine = test_run:get_cfg('engine')})
> + | ---
> + | ...
> +_ = box.space.sync:create_index('primary')
> + | ---
> + | ...
> +box.schema.user.grant('guest', 'write', 'space', 'sync')
> + | ---
> + | ...
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +current_leader_id = 1
> + | ---
> + | ...
> +test_run:eval(SERVERS[current_leader_id], "box.ctl.clear_synchro_queue()")
> + | ---
> + | - []
> + | ...
> +
> +SOCKET_DIR = require('fio').cwd()
> + | ---
> + | ...
> +
> +-- Testcase body.
> +for i=1,30 do                                                                  \
> +    test_run:eval(SERVERS[current_leader_id],                                  \
> +        "box.cfg{replication_synchro_quorum=6, replication_synchro_timeout=1000}") \
> +    c = netbox.connect(SOCKET_DIR..'/'..SERVERS[current_leader_id]..'.sock')   \
> +    fiber.create(function() c.space.sync:insert{i} end)                        \
> +    new_leader_id = random(current_leader_id, #SERVERS)                        \
> +    test_run:eval(SERVERS[new_leader_id],                                      \
> +        "box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.01}") \
> +    test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()")     \

3. There is a small issue which, I suspect, makes this code test
something other than what it is supposed to test. When you do
c.space.sync:insert{i} in a new fiber, you don't wait until the
insertion is replicated to the new leader.

So it can happen that when you call box.ctl.clear_synchro_queue() on
the new leader, the queue there is empty, or contains only old records
from previous iterations, not the value from the current iteration.

If the test is supposed to commit the old leader's record on the new
leader in the same iteration, then it is likely working by luck.

I suggest turning on the MVCC engine to disable dirty reads, so that
a dirty read can't be mistaken for a committed one, and relying
entirely on LSNs to wait for replication. We have the wait_lsn and
get_lsn test_run methods for that. In fact, it is always better to
rely on LSNs, because dirty reads may be turned off by default
someday, and then we would have to rewrite such tests.
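
I mean something like this in the loop body, before clearing the queue
on the new leader (an untested sketch; it also assumes the instance
files set memtx_use_mvcc_engine = true in box.cfg, so the get() calls
below can't return dirty data):

    new_leader_id = random(current_leader_id, #SERVERS)                        \
    -- Block until the new leader's vclock catches up with the old            \
    -- leader's LSN, so the pending insert is guaranteed to be in the         \
    -- new leader's queue before it is cleared.                               \
    test_run:wait_lsn(SERVERS[new_leader_id], SERVERS[current_leader_id])      \
    test_run:eval(SERVERS[new_leader_id],                                      \
        "box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.01}") \
    test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()")     \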

> +    c:close()                                                                  \
> +    replica = random(new_leader_id, #SERVERS)                                  \
> +    test_run:wait_cond(function() return test_run:eval(SERVERS[replica],       \
> +                       string.format("box.space.sync:get{%d}", i))[1] ~= nil end)  \
> +    test_run:wait_cond(function() return test_run:eval(SERVERS[current_leader_id], \
> +                       string.format("box.space.sync:get{%d}", i))[1] ~= nil end)  \
> +    new_leader_id = random(current_leader_id, #SERVERS)                        \
> +    current_leader_id = new_leader_id                                          \
> +end
> + | ---
> + | ...
> +
> +for i=1,NUM_INSTANCES do                                                       \
> +    instance = string.format('qsync%d', i)                                     \
> +    test_run:wait_cond(function() return test_run:eval(instance,               \
> +                       "box.space.sync:count()")[1] == 30 end)                 \
> +end

