From: Cyrill Gorcunov via Tarantool-patches
Reply-To: Cyrill Gorcunov
To: tml
Cc: Vladislav Shpilevoy
Date: Sat, 27 Mar 2021 14:13:07 +0300
Message-Id: <20210327111310.37504-1-gorcunov@gmail.com>
Subject: [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed

Take a look please.
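For reviewers, a minimal usage sketch of the option as it is exercised by the
tests below (the 3600 value is only an illustration; patch 1 is the
authoritative description of the semantics):

    -- Hold the xlog cleanup (garbage collection) fiber after restart,
    -- giving the replicas registered in _cluster up to an hour to
    -- subscribe before old xlogs may be removed.
    box.cfg{wal_cleanup_delay = 3600}

    -- The new box.info.gc() field reports whether cleanup is held back.
    box.info.gc().is_paused

    -- A zero value disables the delay, as the rejoin tests do explicitly.
    box.cfg{wal_cleanup_delay = 0}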
v2:
  - rebase code to the fresh master branch
  - keep the wal_cleanup_delay option name
  - pass wal_cleanup_delay as an option to gc_init, so it won't depend on the
    cfg engine
  - add a comment about gc_delay_unref in plain bootstrap mode
  - allow setting wal_cleanup_delay dynamically
  - update the comment in gc_wait_cleanup and call it conditionally
  - declare wal_cleanup_delay as a double
  - rename gc.cleanup_is_paused to gc.is_paused and update the output
  - do not show the ref counter in box.info.gc() output
  - update documentation
  - move gc_delay_unref inside the relay_subscribe call, which runs in the tx
    context (instead of the relay's context)
  - update tests:
    - add a comment on why we need a temporary space on the replica node
    - use explicit insert/snapshot operations
    - shrink the number of inserts/snapshots to speed up testing
    - use "restart" instead of a stop/start pair
    - use the wait_log helper instead of our own function
    - add an is_paused test

v3:
  - fix the changelog
  - rework box_check_wal_cleanup_delay: the replication_anon setting is
    considered only in box_set_wal_cleanup_delay, i.e. when the config is
    checked and parsed; moreover, its setup is ordered to run after the
    "replication_anon" option processing
  - the delay cycle now uses a deadline instead of a per-cycle calculation
  - use the `double` type for the timestamp
  - test update:
    - verify the `.is_paused` value
    - minimize the number of inserts
    - no need for a temporary space, a regular space works as well
    - add comments on why we should restart the master node

v4:
  - drop the argument from gc_init(): since we configure the delay value from
    the load_cfg.lua script, there is no need to read the delay early; simply
    start gc paused and unpause it on demand
  - move the unpause message to the main wait cycle
  - test update:
    - verify tests and fix replication/replica_rejoin, since it waited for
      xlogs to be cleaned up too early
    - use 10 seconds for XlogGapError instead of 0.1 second; this is a common
      deadline value

v5:
  - define limits for `wal_cleanup_delay`: it should be either 0 or in the
    range [0.001; TIMEOUT_INFINITY].
    This is done so that an fp epsilon is not treated as a meaningful value
  - fix the comment about why an anon replica does not use the delay
  - rework the delayed cleanup cycle
  - test update:
    - update vinyl/replica_rejoin -- we need to disable the cleanup delay
      explicitly
    - update replication/replica_rejoin for the same reason
    - drop unneeded test_run:switch() calls
    - add a testcase where the timeout is decreased and the cleanup fiber is
      kicked to run even with a stuck replica

v6:
  - test update:
    - simplify replica_rejoin.lua to drop data that is not needed
    - update the main test to check that a _cluster cleanup triggers the
      fiber to run

issue https://github.com/tarantool/tarantool/issues/5806
branch gorcunov/gh-5806-xlog-gc-6

Cyrill Gorcunov (3):
  gc/xlog: delay xlog cleanup until relays are subscribed
  test: add a test for wal_cleanup_delay option
  test: box-tap/gc -- add test for is_paused field

 .../unreleased/add-wal_cleanup_delay.md       |   5 +
 src/box/box.cc                                |  41 ++
 src/box/box.h                                 |   1 +
 src/box/gc.c                                  |  95 ++-
 src/box/gc.h                                  |  36 ++
 src/box/lua/cfg.cc                            |   9 +
 src/box/lua/info.c                            |   4 +
 src/box/lua/load_cfg.lua                      |   5 +
 src/box/relay.cc                              |   1 +
 src/box/replication.cc                        |   2 +
 test/app-tap/init_script.result               |   1 +
 test/box-tap/gc.test.lua                      |   3 +-
 test/box/admin.result                         |   2 +
 test/box/cfg.result                           |   4 +
 test/replication/gh-5806-master.lua           |   8 +
 test/replication/gh-5806-xlog-cleanup.result  | 558 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 234 ++++++++
 test/replication/replica_rejoin.lua           |  11 +
 test/replication/replica_rejoin.result        |  26 +-
 test/replication/replica_rejoin.test.lua      |  19 +-
 test/vinyl/replica_rejoin.lua                 |   5 +-
 test/vinyl/replica_rejoin.result              |  13 +
 test/vinyl/replica_rejoin.test.lua            |   8 +
 23 files changed, 1074 insertions(+), 17 deletions(-)
 create mode 100644 changelogs/unreleased/add-wal_cleanup_delay.md
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua
 create mode 100644 test/replication/replica_rejoin.lua

base-commit: 234472522a924ecf62e27c27e1e29b8803a677cc
--

Here is a summary diff against v5:

diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
index 523d400a7..da09daf17 100644
--- a/test/replication/gh-5806-xlog-cleanup.result
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -29,7 +29,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
  | ---
  | - true
  | ...
@@ -68,7 +68,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -132,7 +132,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
  |
 box.space.test:insert({2})
  | ---
@@ -207,7 +207,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -243,7 +243,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -288,7 +288,7 @@ box.snapshot()
  | - ok
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  |
 box.space.test:insert({2})
  | ---
@@ -303,7 +303,7 @@ assert(box.info.gc().is_paused == true)
  | - true
  | ...
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -354,7 +354,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -376,7 +376,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -398,7 +398,7 @@ test_run:cmd('delete server replica')
  | - true
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  |
 assert(box.info.gc().is_paused == true)
  | ---
@@ -433,3 +433,126 @@ test_run:cmd('delete server master')
  | ---
  | - true
  | ...
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+             script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+ | ---
+ | ...
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+ | ---
+ | ...
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+ | ---
+ | ...
+assert(master_custer[1][2] == master_uuid)
+ | ---
+ | - true
+ | ...
+assert(master_custer[2][2] == replica_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server master with args="3600"')
+ |
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+ | ---
+ | ...
+assert(replica_uuid == deleted_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
index f16be758a..b65563e7f 100644
--- a/test/replication/gh-5806-xlog-cleanup.test.lua
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -19,7 +19,7 @@ engine = test_run:get_cfg('engine')
 --
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
 
@@ -36,7 +36,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- On replica we create an own space which allows us to
@@ -70,7 +70,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == false)
@@ -105,7 +105,7 @@ test_run:cmd('delete server replica')
 --
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
 
@@ -119,7 +119,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('replica')
 box.cfg{checkpoint_count = 1}
@@ -134,12 +134,12 @@ test_run:cmd('stop server replica')
 box.space.test:insert({1})
 box.snapshot()
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == true)
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- Make sure no error happened.
@@ -160,7 +160,7 @@ test_run:cmd('delete server replica')
 -- cleanup fiber work again.
 --
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -168,14 +168,14 @@ box.schema.user.grant('guest', 'replication')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('master')
 test_run:cmd('stop server replica')
 test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 assert(box.info.gc().is_paused == true)
 
 test_run:switch('master')
@@ -186,3 +186,49 @@ test_run:switch('default')
 test_run:cmd('stop server master')
 test_run:cmd('cleanup server master')
 test_run:cmd('delete server master')
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+             script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('default')
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_custer[1][2] == master_uuid)
+assert(master_custer[2][2] == replica_uuid)
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:switch('master')
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+assert(replica_uuid == deleted_uuid)
+
+test_run:switch('master')
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/replica_rejoin.lua b/test/replication/replica_rejoin.lua
index 76f6e5b75..9c743c52b 100644
--- a/test/replication/replica_rejoin.lua
+++ b/test/replication/replica_rejoin.lua
@@ -1,22 +1,11 @@
 #!/usr/bin/env tarantool
 
-local repl_include_self = arg[1] and arg[1] == 'true' or false
-local repl_list
-
-if repl_include_self then
-    repl_list = {os.getenv("MASTER"), os.getenv("LISTEN")}
-else
-    repl_list = os.getenv("MASTER")
-end
-
 -- Start the console first to allow test-run to attach even before
 -- box.cfg is finished.
 require('console').listen(os.getenv('ADMIN'))
 
 box.cfg({
     listen = os.getenv("LISTEN"),
-    replication = repl_list,
-    memtx_memory = 107374182,
-    replication_timeout = 0.1,
+    replication = {os.getenv("MASTER"), os.getenv("LISTEN")},
     wal_cleanup_delay = 0,
 })
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 074cc3e67..843333a19 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -47,7 +47,7 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
 ---
 - true
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -124,7 +124,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 ...
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -229,7 +229,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 box.cfg{checkpoint_count = checkpoint_count}
 ---
 ...
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 ---
 - true
 ...
@@ -271,7 +271,7 @@ test_run:cleanup_cluster()
 box.space.test:truncate()
 ---
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 223316d86..c3ba9bf3f 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -24,7 +24,7 @@ _ = box.space.test:insert{3}
 
 -- Join a replica, then stop it.
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
@@ -53,7 +53,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
@@ -88,7 +88,7 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 test_run:cmd("switch replica")
 test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
@@ -104,7 +104,7 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
 box.space.test:truncate()
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 -- Subscribe the master to the replica.
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil
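A short recap of the new Case 4 above, as a sketch only -- the exact commands
and expected output are in the test files in this diff:

    -- Master restarted with a delay while a registered replica is gone:
    -- cleanup stays paused.
    assert(box.info.gc().is_paused == true)

    -- Dropping the stale replica record (id 2 in the test) from _cluster
    -- kicks the cleanup fiber...
    box.space._cluster:delete(2)

    -- ...so it resumes without waiting for the full delay;
    -- box.info.gc().is_paused eventually becomes false.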