From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id C93896EC56; Sat, 20 Mar 2021 16:15:57 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org C93896EC56 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1616246157; bh=35SepIvlNoB2RX/sl0DRPkiWkpP8O9Bg3MhwQnY3Lr8=; h=To:Date:In-Reply-To:References:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=KpBTleAAfGmVY706pAIL3IjotiWFH4SB2+zVlPH1db1sG0WBHcKSNEly2YDIznxow J/OTH4I14S3+79SwbI4iFI74zvtMPOcnGUl1+19GXzdhWLW5v0RDY+RTpGgkhokwjY 6gjEPktWNr4UfQCXKqL7/aVAZrMyMVqJRA8SwJ2o= Received: from mail-lj1-f179.google.com (mail-lj1-f179.google.com [209.85.208.179]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id B9D9F6EC56 for ; Sat, 20 Mar 2021 16:15:38 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org B9D9F6EC56 Received: by mail-lj1-f179.google.com with SMTP id 184so15297297ljf.9 for ; Sat, 20 Mar 2021 06:15:38 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=3P7pH3EzvR4/DeHZLrtZIvoWrOTQ1xgZ7HPSqXIgl/s=; b=JuspYft039zN47ctenallmFJf5YlY8F4NPyOlefu15d5EtKoDQgTxy2Ufz69F7FG2w bN7foXG2LXPJJXkkyEMqF7HU2FCpqBglmIKIW00zAyLbhHGlvlWnAmy1w5MfNxkcHjhw 5ZlVIzMKlweXBnVO2V2KV444EAuil1a8H0/1MJBjG0QTN7Cj2rX7tiRU87SMTT80FW7R v38XNmg3iUh8K9GzZeIzUrS9AcJCdgxjLAR8skaHrt6/pSCsJUJgvo9ssXbDAVzXXAos pQP/owSqDhs+SkVH2Hc8c6m/6wYaFjmrqxjlXYE13oxtdhW8EGPHlJWdffeBfLlhdOer k03g== X-Gm-Message-State: AOAM533Ql2vDoHra0bFwkIzh93HBFVJNz1NMqTeu0IzNo/LpJk96gBym pb2yDeFcqGIvYzN+86Kpu+0= X-Google-Smtp-Source: ABdhPJz4+pfTYCPF9I4Nncgz6v1sbgz7jOa/YnbRXjgLqTF05iPxeHneDnj4gDrA3jUbSBBshLXKrg== X-Received: by 2002:a2e:87d5:: with SMTP id v21mr3725193ljj.195.1616246138303; Sat, 20 Mar 2021 06:15:38 -0700 (PDT) Received: from grain.localdomain ([5.18.171.94]) by smtp.gmail.com with ESMTPSA id h17sm935932lfc.289.2021.03.20.06.15.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 20 Mar 2021 06:15:37 -0700 (PDT) Received: by grain.localdomain (Postfix, from userid 1000) id BBB99560066; Sat, 20 Mar 2021 16:15:23 +0300 (MSK) To: tml Date: Sat, 20 Mar 2021 16:15:19 +0300 Message-Id: <20210320131521.1249747-2-gorcunov@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210320131521.1249747-1-gorcunov@gmail.com> References: <20210320131521.1249747-1-gorcunov@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Tarantool-patches] [PATCH v2 1/3] gc/xlog: delay xlog cleanup until relays are subscribed X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Cyrill Gorcunov via Tarantool-patches Reply-To: Cyrill Gorcunov Cc: Vladislav Shpilevoy , Mons Anderson Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" In case if replica managed to be far behind the master node (so there are a number of xlog files present after the last master's snapshot) then once master node get restarted it may clean up the xlogs needed by the replica to subscribe in a fast way and instead the replica will have to rejoin reading a number of data back. Lets try to address this by delaying xlog files cleanup until replicas are got subscribed and relays are up and running. For this sake we start with cleanup fiber spinning in nop cycle ("paused" mode) and use a delay counter to wait until relays decrement them. This implies that if `_cluster` system space is not empty upon restart and the registered replica somehow vanished completely and won't ever come back, then the node administrator has to drop this replica from `_cluster` manually. Note that this delayed cleanup start doesn't prevent WAL engine from removing old files if there is no space left on a storage device. The WAL will simply drop old data without a question. We need to take into account that some administrators might not need this functionality at all, for this sake we introduce "wal_cleanup_delay" configuration option which allows to enable or disable the delay. Closes #5806 Signed-off-by: Cyrill Gorcunov @TarantoolBot document Title: Add wal_cleanup_delay configuration parameter The `wal_cleanup_delay` option defines a delay in seconds before write ahead log files (`*.xlog`) are getting started to prune upon a node restart. This option is ignored in case if a node is running as an anonymous replica (`replication_anon = true`). Similarly if replication is unused or there is no plans to use replication at all then this option should not be considered. An initial problem to solve is the case where a node is operating so fast that its replicas do not manage to reach the node state and in case if the node is restarted at this moment (for various reasons, for example due to power outage) then `*.xlog` files might be pruned during restart. In result replicas will not find these files on the main node and have to reread all data back which is a very expensive procedure. Since replicas are tracked via `_cluster` system space this we use its content to count subscribed replicas and when all of them are up and running the cleanup procedure is automatically enabled even if `wal_cleanup_delay` is not expired. The `wal_cleanup_delay` should be set to: - `-1` to wait infinitely until all existing replicas are subscribed; - `0` to disable the cleanup delay; - `>0` to wait for specified number of seconds. By default it is set to `14400` seconds (ie `4` hours). In case if registered replica is lost forever and timeout is set to infinity then a preferred way to enable cleanup procedure is not setting up a small timeout value but rather to delete this replica from `_cluster` space manually. Note that the option does *not* prevent WAL engine from removing old `*.xlog` files if there is no space left on a storage device, WAL engine can remove them in a force way. Current state of `*.xlog` garbage collector can be found in `box.info.gc()` output. For example ``` Lua tarantool> box.info.gc() --- ... cleanup_is_paused: false ``` The `cleanup_is_paused` shows if cleanup fiber is paused or not. --- src/box/box.cc | 46 +++++++++++++++- src/box/box.h | 1 + src/box/gc.c | 97 +++++++++++++++++++++++++++++++-- src/box/gc.h | 38 ++++++++++++- src/box/lua/cfg.cc | 9 +++ src/box/lua/info.c | 4 ++ src/box/lua/load_cfg.lua | 4 ++ src/box/relay.cc | 1 + src/box/replication.cc | 2 + test/app-tap/init_script.result | 1 + test/box/admin.result | 2 + test/box/cfg.result | 4 ++ test/unit/snap_quorum_delay.cc | 2 +- 13 files changed, 203 insertions(+), 8 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index cc59564e1..b9fd7af00 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -771,6 +771,29 @@ box_check_wal_queue_max_size(void) return size; } +static double +box_check_wal_cleanup_delay(void) +{ + double value = cfg_geti("wal_cleanup_delay"); + if (value < 0) { + diag_set(ClientError, ER_CFG, "wal_cleanup_delay", + "the value must be >= 0"); + return -1; + } + + /* + * Anonymous replicas do not require + * delay the cleanup procedure since they + * are read only. + */ + if (cfg_geti("replication_anon") != 0) { + if (value != 0) + value = 0; + } + + return value; +} + static void box_check_readahead(int readahead) { @@ -918,6 +941,8 @@ box_check_config(void) box_check_wal_mode(cfg_gets("wal_mode")); if (box_check_wal_queue_max_size() < 0) diag_raise(); + if (box_check_wal_cleanup_delay() < 0) + diag_raise(); if (box_check_memory_quota("memtx_memory") < 0) diag_raise(); box_check_memtx_min_tuple_size(cfg_geti64("memtx_min_tuple_size")); @@ -1465,6 +1490,16 @@ box_set_wal_queue_max_size(void) return 0; } +int +box_set_wal_cleanup_delay(void) +{ + double delay = box_check_wal_cleanup_delay(); + if (delay < 0) + return -1; + gc_set_wal_cleanup_delay(delay); + return 0; +} + void box_set_vinyl_memory(void) { @@ -3000,7 +3035,7 @@ box_cfg_xc(void) rmean_box = rmean_new(iproto_type_strs, IPROTO_TYPE_STAT_MAX); rmean_error = rmean_new(rmean_error_strings, RMEAN_ERROR_LAST); - gc_init(); + gc_init(box_check_wal_cleanup_delay()); engine_init(); schema_init(); replication_init(); @@ -3076,6 +3111,15 @@ box_cfg_xc(void) } fiber_gc(); + /* + * Exclude self from GC delay because we care + * about remote replicas only, still for ref/unref + * balance we do reference self node initially and + * downgrade it to zero when there is no replication + * set at all. + */ + gc_delay_unref(); + bootstrap_journal_guard.is_active = false; assert(current_journal != &bootstrap_journal); diff --git a/src/box/box.h b/src/box/box.h index 65215b087..e2321b9b0 100644 --- a/src/box/box.h +++ b/src/box/box.h @@ -243,6 +243,7 @@ void box_set_checkpoint_count(void); void box_set_checkpoint_interval(void); void box_set_checkpoint_wal_threshold(void); int box_set_wal_queue_max_size(void); +int box_set_wal_cleanup_delay(void); void box_set_memtx_memory(void); void box_set_memtx_max_tuple_size(void); void box_set_vinyl_memory(void); diff --git a/src/box/gc.c b/src/box/gc.c index 9af4ef958..5418fd31d 100644 --- a/src/box/gc.c +++ b/src/box/gc.c @@ -46,6 +46,7 @@ #include #include +#include "cfg.h" #include "diag.h" #include "errcode.h" #include "fiber.h" @@ -102,11 +103,18 @@ gc_checkpoint_delete(struct gc_checkpoint *checkpoint) } void -gc_init(void) +gc_init(double wal_cleanup_delay) { /* Don't delete any files until recovery is complete. */ gc.min_checkpoint_count = INT_MAX; + gc.wal_cleanup_delay = wal_cleanup_delay; + gc.is_paused = wal_cleanup_delay > 0; + gc.delay_ref = 0; + + if (gc.is_paused) + say_info("wal/engine cleanup is paused"); + vclock_create(&gc.vclock); rlist_create(&gc.checkpoints); gc_tree_new(&gc.consumers); @@ -238,6 +246,47 @@ static int gc_cleanup_fiber_f(va_list ap) { (void)ap; + + /* + * Stage 1 (optional): in case if we're booting + * up with cleanup disabled lets do wait in a + * separate cycle to minimize branching on stage 2. + */ + if (gc.is_paused) { + ev_tstamp timeout = gc.wal_cleanup_delay; + while (!fiber_is_cancelled()) { + ev_tstamp clock_start = fiber_clock(); + if (fiber_yield_timeout(timeout)) { + say_info("wal/engine cleanup is resumed " + "due to timeout expiration"); + gc.is_paused = false; + gc.delay_ref = 0; + break; + } + + /* + * If a last reference is dropped + * we can exit out early. + */ + if (!gc.is_paused) + break; + + ev_tstamp elapsed = fiber_clock() - clock_start; + timeout = gc.wal_cleanup_delay - elapsed; + if (timeout <= 0) { + say_info("wal/engine cleanup is resumed " + "due to reconfigured timeout " + "expiration"); + gc.is_paused = false; + gc.delay_ref = 0; + break; + } + } + } + + /* + * Stage 2: a regular cleanup cycle. + */ while (!fiber_is_cancelled()) { int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed; if (delta == 0) { @@ -253,6 +302,43 @@ gc_cleanup_fiber_f(va_list ap) return 0; } +void +gc_set_wal_cleanup_delay(double wal_cleanup_delay) +{ + gc.wal_cleanup_delay = wal_cleanup_delay; + /* + * This routine may be called at arbitrary + * moment thus we must be sure the cleanup + * fiber is paused to not wake up it when + * it is already in a regular cleanup stage. + */ + if (gc.is_paused) + fiber_wakeup(gc.cleanup_fiber); +} + +void +gc_delay_ref(void) +{ + if (gc.is_paused) { + assert(gc.delay_ref >= 0); + gc.delay_ref++; + } +} + +void +gc_delay_unref(void) +{ + if (gc.is_paused) { + assert(gc.delay_ref > 0); + gc.delay_ref--; + if (gc.delay_ref == 0) { + say_info("wal/engine cleanup is resumed"); + gc.is_paused = false; + fiber_wakeup(gc.cleanup_fiber); + } + } +} + /** * Trigger asynchronous garbage collection. */ @@ -462,11 +548,12 @@ gc_checkpoint(void) * Wait for background garbage collection that might * have been triggered by this checkpoint to complete. * Strictly speaking, it isn't necessary, but it - * simplifies testing as it guarantees that by the - * time box.snapshot() returns, all outdated checkpoint - * files have been removed. + * simplifies testing. Same time if GC is paused and + * waiting for old XLOGs to be read by replicas the + * cleanup won't happen immediately after the checkpoint. */ - gc_wait_cleanup(); + if (!gc.is_paused) + gc_wait_cleanup(); return 0; } diff --git a/src/box/gc.h b/src/box/gc.h index 2a568c5f9..aa8f2070a 100644 --- a/src/box/gc.h +++ b/src/box/gc.h @@ -147,6 +147,24 @@ struct gc_state { * taken at that moment of time. */ int64_t cleanup_completed, cleanup_scheduled; + /** + * A counter to wait until all replicas are managed to + * subscribe so that we can enable cleanup fiber to + * remove old XLOGs. Otherwise some replicas might be + * far behind the master node and after the master + * node been restarted they will have to reread all + * data back due to XlogGapError, ie too early deleted + * XLOGs. + */ + int64_t delay_ref; + /** + * Delay timeout in seconds. + */ + double wal_cleanup_delay; + /** + * When set the cleanup fiber is paused. + */ + bool is_paused; /** * Set if there's a fiber making a checkpoint right now. */ @@ -198,7 +216,7 @@ gc_last_checkpoint(void) * Initialize the garbage collection state. */ void -gc_init(void); +gc_init(double wal_cleanup_delay); /** * Destroy the garbage collection state. @@ -206,6 +224,24 @@ gc_init(void); void gc_free(void); +/** + * Set a new delay value. + */ +void +gc_set_wal_cleanup_delay(double wal_cleanup_delay); + +/** + * Increment a reference to delay counter. + */ +void +gc_delay_ref(void); + +/** + * Decrement a reference from the delay counter. + */ +void +gc_delay_unref(void); + /** * Advance the garbage collector vclock to the given position. * Deactivate WAL consumers that need older data. diff --git a/src/box/lua/cfg.cc b/src/box/lua/cfg.cc index b37a93ed8..1142e2726 100644 --- a/src/box/lua/cfg.cc +++ b/src/box/lua/cfg.cc @@ -172,6 +172,14 @@ lbox_cfg_set_wal_queue_max_size(struct lua_State *L) return 0; } +static int +lbox_cfg_set_wal_cleanup_delay(struct lua_State *L) +{ + if (box_set_wal_cleanup_delay() < 0) + luaT_error(L); + return 0; +} + static int lbox_cfg_set_read_only(struct lua_State *L) { @@ -408,6 +416,7 @@ box_lua_cfg_init(struct lua_State *L) {"cfg_set_checkpoint_interval", lbox_cfg_set_checkpoint_interval}, {"cfg_set_checkpoint_wal_threshold", lbox_cfg_set_checkpoint_wal_threshold}, {"cfg_set_wal_queue_max_size", lbox_cfg_set_wal_queue_max_size}, + {"cfg_set_wal_cleanup_delay", lbox_cfg_set_wal_cleanup_delay}, {"cfg_set_read_only", lbox_cfg_set_read_only}, {"cfg_set_memtx_memory", lbox_cfg_set_memtx_memory}, {"cfg_set_memtx_max_tuple_size", lbox_cfg_set_memtx_max_tuple_size}, diff --git a/src/box/lua/info.c b/src/box/lua/info.c index c4c9fa0a0..fb39799a1 100644 --- a/src/box/lua/info.c +++ b/src/box/lua/info.c @@ -445,6 +445,10 @@ lbox_info_gc_call(struct lua_State *L) lua_pushboolean(L, gc.checkpoint_is_in_progress); lua_settable(L, -3); + lua_pushstring(L, "is_paused"); + lua_pushboolean(L, gc.is_paused); + lua_settable(L, -3); + lua_pushstring(L, "checkpoints"); lua_newtable(L); diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua index 9967f992e..5ed8c12f0 100644 --- a/src/box/lua/load_cfg.lua +++ b/src/box/lua/load_cfg.lua @@ -73,6 +73,7 @@ local default_cfg = { wal_max_size = 256 * 1024 * 1024, wal_dir_rescan_delay= 2, wal_queue_max_size = 16 * 1024 * 1024, + wal_cleanup_delay = 4 * 3600, force_recovery = false, replication = nil, instance_uuid = nil, @@ -155,6 +156,7 @@ local template_cfg = { wal_mode = 'string', wal_max_size = 'number', wal_dir_rescan_delay= 'number', + wal_cleanup_delay = 'number', force_recovery = 'boolean', replication = 'string, number, table', instance_uuid = 'string', @@ -289,6 +291,7 @@ local dynamic_cfg = { feedback_interval = ifdef_feedback_set_params, -- do nothing, affects new replicas, which query this value on start wal_dir_rescan_delay = function() end, + wal_cleanup_delay = private.cfg_set_wal_cleanup_delay, custom_proc_title = function() require('title').update(box.cfg.custom_proc_title) end, @@ -381,6 +384,7 @@ local dynamic_cfg_skip_at_load = { replication_skip_conflict = true, replication_anon = true, wal_dir_rescan_delay = true, + wal_cleanup_delay = true, custom_proc_title = true, force_recovery = true, instance_uuid = true, diff --git a/src/box/relay.cc b/src/box/relay.cc index 41f949e8e..6edee86bf 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -824,6 +824,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, tt_uuid_str(&replica->uuid)); if (replica->gc == NULL) diag_raise(); + gc_delay_unref(); } relay_start(relay, fd, sync, relay_send_row); diff --git a/src/box/replication.cc b/src/box/replication.cc index 1fa8843e7..aefb812b3 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id) tt_uuid_str(&replica->uuid)); } replicaset.replica_by_id[replica_id] = replica; + gc_delay_ref(); ++replicaset.registered_count; say_info("assigned id %d to replica %s", replica->id, tt_uuid_str(&replica->uuid)); @@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica) replicaset.replica_by_id[replica->id] = NULL; assert(replicaset.registered_count > 0); --replicaset.registered_count; + gc_delay_unref(); if (replica->id == instance_id) { /* See replica_check_id(). */ assert(replicaset.is_joining); diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result index 74cdb3efb..1fdd9a227 100644 --- a/test/app-tap/init_script.result +++ b/test/app-tap/init_script.result @@ -53,6 +53,7 @@ vinyl_run_count_per_level:2 vinyl_run_size_ratio:3.5 vinyl_timeout:60 vinyl_write_threads:4 +wal_cleanup_delay:14400 wal_dir:. wal_dir_rescan_delay:2 wal_max_size:268435456 diff --git a/test/box/admin.result b/test/box/admin.result index b619ae6cb..f8e8808e3 100644 --- a/test/box/admin.result +++ b/test/box/admin.result @@ -127,6 +127,8 @@ cfg_filter(box.cfg) - 60 - - vinyl_write_threads - 4 + - - wal_cleanup_delay + - 14400 - - wal_dir - - - wal_dir_rescan_delay diff --git a/test/box/cfg.result b/test/box/cfg.result index 2c3d5a981..693c1b521 100644 --- a/test/box/cfg.result +++ b/test/box/cfg.result @@ -115,6 +115,8 @@ cfg_filter(box.cfg) | - 60 | - - vinyl_write_threads | - 4 + | - - wal_cleanup_delay + | - 14400 | - - wal_dir | - | - - wal_dir_rescan_delay @@ -234,6 +236,8 @@ cfg_filter(box.cfg) | - 60 | - - vinyl_write_threads | - 4 + | - - wal_cleanup_delay + | - 14400 | - - wal_dir | - | - - wal_dir_rescan_delay diff --git a/test/unit/snap_quorum_delay.cc b/test/unit/snap_quorum_delay.cc index 803bbbea8..c4edbcbcb 100644 --- a/test/unit/snap_quorum_delay.cc +++ b/test/unit/snap_quorum_delay.cc @@ -236,7 +236,7 @@ main(void) { memory_init(); fiber_init(fiber_c_invoke); - gc_init(); + gc_init(0); txn_limbo_init(); instance_id = 1; -- 2.30.2