[Tarantool-patches] [PATCH v5] applier: filter incoming synchro packets via transaction initiator
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Tue Jun 8 21:52:46 MSK 2021
Hi! Good job on the patch!
See 7 comments below.
On 08.06.2021 20:08, Cyrill Gorcunov wrote:
> Currently we use synchro packets filtration based on their contents,
> in particular by their xrow->replica_id value. Still there was a
> question if we can optimize this moment and rather filter our all
1. our -> out.
> packets coming from non-leader replica.
> > diff --git a/test/replication/gh-6035-applier-filter.result b/test/replication/gh-6035-applier-filter.result
> new file mode 100644
> index 000000000..cf032c332
> --- /dev/null
> +++ b/test/replication/gh-6035-applier-filter.result
> @@ -0,0 +1,147 @@
> +-- test-run result file version 2
> +--
> +-- gh-6035: verify synchronous rows filtration in applier,
> +-- we need to be sure that filtering synchronous rows is
> +-- done via transaction initiator not sender (iow via
> +-- xrow->replica_id).
> +--
> +
> +local SOCKET_DIR = require('fio').cwd()
> + | ---
> + | ...
> +local function unix_socket(name) \
> + return SOCKET_DIR .. "/" .. name .. '.sock' \
> +end
2. Both this function and SOCKET_DIR are not needed here. Because you
declared them as 'local', and they go out of scope on the next
line, and are deleted. In test-run's non-tap tests each line is a
visibility scope. Like in the console.
I deleted them from the test's code and the test passed.
It works below because unix_socket() is already declared as global in
your scripts calling box.cfg.
Because it is global, you get luacheck warnings:
Checking test/replication/gh6035node.lua 6 warnings
test/replication/gh6035node.lua:4:10: (W111) setting non-standard global variable unix_socket
test/replication/gh6035node.lua:12:18: (W113) accessing undefined variable unix_socket
test/replication/gh6035node.lua:16:18: (W113) accessing undefined variable unix_socket
test/replication/gh6035node.lua:18:13: (W113) accessing undefined variable unix_socket
test/replication/gh6035node.lua:19:13: (W113) accessing undefined variable unix_socket
test/replication/gh6035node.lua:27:13: (W113) accessing undefined variable unix_socket
Maybe you could make the test not change box.cfg.replication?
You could build the master <-> replica1 -> replica2 topology
from the beginning.
You start them as read-write, no elections. Replica1 would register
on master. Replica2 would register on replcia1.
Then you make master 'manual', 'replica1' voter, and 'replica2'
voter. Then you promote the master and do the main part of the test.
No box.cfg.replication was altered.
> + | ---
> + | ...
> +
> +env = require('test_run')
> + | ---
> + | ...
> +test_run = env.new()
3. You don't need the env. You can do
test_run = require('test_run').new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
4. You don't use the engine. Please, drop it, and make the test
run only once, without 2 engines.
> + | ---
> + | ...
> +
> +test_run:cmd('create server gh6035master with script="replication/gh6035master.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server gh6035master with args="gh6035master"')
5. Filename is already passed as arg[0] into the script. Look at
autobootstrap.lua files to see how they extract the instance name
from the file name.
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('create server gh6035replica1 with script="replication/gh6035replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server gh6035replica1 with args="gh6035replica1"')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('create server gh6035replica2 with script="replication/gh6035replica2.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server gh6035replica2 with args="gh6035replica2"')
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- The gh6035master node connected to gh6035replica1 and gh6035replica2,
> +-- where each of gh6035replicaX connected to the gh6035master only.
> +--
> +-- Lets reroute gh6035replica2 to gh6035replica1 so gh6035replica1 gonna
> +-- be sending data to gh6035replica2 as a transit hop from gh6035master.
> +test_run:cmd('switch gh6035replica2')
6. There is a shortcut 'test_run:switch()'.
I also propose to drop gh6035 prefix from the instance names (but keep
it in the script file names). It is hard to read otherwise.
Another proposal: change gh6035 to gh-6035 in the file names. Otherwise
they are too far from the test itself in the file explorer.
> + | ---
> + | - true
> + | ...
> +box.cfg{replication = {unix_socket("gh6035replica1")}}
> + | ---
> + | ...
> +
> +--
> +-- Make the gh6035master to be RAFT leader.
> +test_run:cmd('switch gh6035master')
> + | ---
> + | - true
> + | ...
> +box.cfg{ \
> + replication = { \
> + unix_socket("gh6035master"), \
> + unix_socket("gh6035replica1"), \
> + }, \
> + replication_synchro_quorum = 2, \
> + election_mode = 'manual', \
> +}
> + | ---
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +_ = box.schema.space.create("sync", {is_sync = true})
> + | ---
> + | ...
> +_ = box.space.sync:create_index("pk")
> + | ---
> + | ...
> +box.space.sync:insert{1}
> + | ---
> + | - [1]
> + | ...
> +
> +-- The first hop is gh6035replica1.
> +test_run:cmd('switch gh6035replica1')
> + | ---
> + | - true
> + | ...
> +box.space.sync:select{}
> + | ---
> + | - - [1]
> + | ...
> +
> +-- And the second hop is gh6035replica2 where
> +-- gh6035replica1 replicated the data to us.
> +test_run:cmd('switch gh6035replica2')
> + | ---
> + | - true
> + | ...
> +box.space.sync:select{}
> + | ---
> + | - - [1]
7. You can't be sure it is replicated to replica2 yet. If
replica1's disk is slow, it might need more time to write
it to WAL. You need to use test_run:wait_lsn() to be sure
that replica2 got the transaction. Wait_lsn() must make
replica2 wait for all data from the master node, not from
replica2. Because replica2 didn't write anything own.
More information about the Tarantool-patches
mailing list