From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp38.i.mail.ru (smtp38.i.mail.ru [94.100.177.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id C3F69445320 for ; Tue, 21 Jul 2020 01:01:07 +0300 (MSK) References: From: Vladislav Shpilevoy Message-ID: <3c7b2274-8443-2be7-c181-2c7026ab0fec@tarantool.org> Date: Tue, 21 Jul 2020 00:01:05 +0200 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Subject: Re: [Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: sergeyb@tarantool.org, tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org, gorcunov@gmail.com, lvasiliev@tarantool.org Thanks for the patch! See 11 comments below. > diff --git a/test/replication/qsync.lua b/test/replication/qsync.lua > new file mode 100644 > index 000000000..383aa5272 > --- /dev/null > +++ b/test/replication/qsync.lua > @@ -0,0 +1,62 @@ > +#!/usr/bin/env tarantool > + > +-- get instance name from filename (qsync1.lua => qsync1) > +local INSTANCE_ID = string.match(arg[0], "%d") > + > +local SOCKET_DIR = require('fio').cwd() > + > +local TIMEOUT = tonumber(arg[1]) > + > +local function instance_uri(instance_id) > + return SOCKET_DIR..'/qsync'..instance_id..'.sock'; > +end > + > +-- start console first > +require('console').listen(os.getenv('ADMIN')) > + > +box.cfg({ > + listen = instance_uri(INSTANCE_ID); > + replication_timeout = TIMEOUT; 1. Why do you need the custom replication_timeout? > + replication_sync_lag = 0.01; 2. Why do you need the lag setting? > + replication_connect_quorum = 3; > + replication = { > + instance_uri(1); > + instance_uri(2); > + instance_uri(3); > + instance_uri(4); > + instance_uri(5); > + instance_uri(6); > + instance_uri(7); > + instance_uri(8); > + instance_uri(9); > + instance_uri(10); > + instance_uri(11); > + instance_uri(12); > + instance_uri(13); > + instance_uri(14); > + instance_uri(15); > + instance_uri(16); > + instance_uri(17); > + instance_uri(18); > + instance_uri(19); > + instance_uri(20); > + instance_uri(21); > + instance_uri(22); > + instance_uri(23); > + instance_uri(24); > + instance_uri(25); > + instance_uri(26); > + instance_uri(27); > + instance_uri(28); > + instance_uri(29); > + instance_uri(30); > + instance_uri(31); 3. Seems like in the test you use only 3 instances, not 32. Also the quorum is set to 3. > + }; > +}) > + > +box.once("bootstrap", function() > + local test_run = require('test_run').new() > + box.schema.user.grant("guest", 'replication') > + box.schema.space.create('test', {engine = test_run:get_cfg('engine')}) > + box.space.test:create_index('primary') 4. Where do you use this space? > +end) > diff --git a/test/replication/qsync_random_leader.result b/test/replication/qsync_random_leader.result > new file mode 100644 > index 000000000..cb1b5e232 > --- /dev/null > +++ b/test/replication/qsync_random_leader.result > @@ -0,0 +1,123 @@ > +-- test-run result file version 2 > +os = require('os') > + | --- > + | ... > +env = require('test_run') > + | --- > + | ... > +math = require('math') > + | --- > + | ... > +fiber = require('fiber') > + | --- > + | ... > +test_run = env.new() > + | --- > + | ... > +engine = test_run:get_cfg('engine') > + | --- > + | ... > + > +NUM_INSTANCES = 3 > + | --- > + | ... > +BROKEN_QUORUM = NUM_INSTANCES + 1 > + | --- > + | ... > + > +SERVERS = {} > + | --- > + | ... > +test_run:cmd("setopt delimiter ';'") > + | --- > + | - true > + | ... > +for i=1,NUM_INSTANCES do > + SERVERS[i] = 'qsync' .. i > +end; > + | --- > + | ... > +test_run:cmd("setopt delimiter ''"); 5. Please, lets be consistent and use either \ or the delimiter. Currently it is irrational - you use \ for big code blocks, and a custom delimiter for tiny blocks which could even be one line. Personally, I would use \ everywhere. > + | --- > + | - true > + | ... > +SERVERS -- print instance names > + | --- > + | - - qsync1 > + | - qsync2 > + | - qsync3 > + | ... > + > +random = function(excluded_num, min, max) \ 6. Would be better to align all \ by 80 in this file. Makes easier to add new longer lines in future without moving all the old \. > + math.randomseed(os.time()) \ > + local r = math.random(min, max) \ > + if (r == excluded_num) then \ > + return random(excluded_num, min, max) \ > + end \ > + return r \ > +end > + | --- > + | ... > + > +test_run:create_cluster(SERVERS, "replication", {args="0.1"}) > + | --- > + | ... > +test_run:wait_fullmesh(SERVERS) > + | --- > + | ... > +current_leader_id = 1 > + | --- > + | ... > +test_run:switch(SERVERS[current_leader_id]) > + | --- > + | - true > + | ... > +box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=0.1} 7. The timeout is tiny. It will lead to flakiness sooner or later, 100%. > + | --- > + | ... > +_ = box.schema.space.create('sync', {is_sync=true}) > + | --- > + | ... > +_ = box.space.sync:create_index('pk') > + | --- > + | ... > +test_run:switch('default') > + | --- > + | - true > + | ... > + > +-- Testcase body. > +for i=1,10 do \ > + new_leader_id = random(current_leader_id, 1, #SERVERS) \ > + test_run:switch(SERVERS[new_leader_id]) \ > + box.cfg{read_only=false} \ > + fiber = require('fiber') \ > + f1 = fiber.create(function() box.space.sync:delete{} end) \ 8. Delete without a key will fail. You would notice it if you would check results of the DML operations. Please, do that via pcall. > + f2 = fiber.create(function() for i=1,10000 do box.space.sync:insert{i} end end) \ 9. You have \ exactly to avoid such long lines. > + f1.status() \ > + f2.status() \ 10. Output is not printed inside one statement. This whole cycle is one statement because of \, so these status() calls are useless. > + test_run:switch('default') \ > + test_run:switch(SERVERS[current_leader_id]) \ > + box.cfg{read_only=true} \ > + test_run:switch('default') \ > + current_leader_id = new_leader_id \ > + fiber.sleep(0.1) \ 11. Why do you need this fiber.sleep()? > +end > + | --- > + | ... > + > +-- Teardown. > +test_run:switch(SERVERS[current_leader_id]) > + | --- > + | - true > + | ... > +box.space.sync:drop() > + | --- > + | ... > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:drop_cluster(SERVERS) > + | --- > + | ...