To: Cyrill Gorcunov, tml
References: <20210318184138.1077807-1-gorcunov@gmail.com> <20210318184138.1077807-2-gorcunov@gmail.com>
Date: Fri, 19 Mar 2021 00:04:09 +0100
In-Reply-To: <20210318184138.1077807-2-gorcunov@gmail.com>
Subject: Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
List-Id: Tarantool development patches
From: Vladislav Shpilevoy via Tarantool-patches
Reply-To: Vladislav Shpilevoy
Cc: Mons Anderson

Hi! Thanks for the patch!

Generally it looks fine, except for some details.

AFAIU, we go for the timeout option for the only reason that there
might be non-fullmesh topologies, where a replica is in _cluster but
does not have relays to other nodes. If we knew the topology, we could
tell which nodes will relay to which, and would be able to detect when
a replica needs to keep the logs and when it doesn't.

Not even the entire topology. Just the info from the adjacent nodes:
from our own box.cfg.replication, and from box.cfg.replication of
these nodes.

I still don't understand what was wrong with implementing some kind of
topology discovery for this 2-level topology tree. For instance, when
the applier on a replica is connected, the remote node sends us a flag
saying whether it is going to connect back. If the flag is false, we
don't keep the logs for that instance.

See 15 comments below.

> @TarantoolBot document
> Title: Add wal_cleanup_delay configuration parameter

1. There are 2 issues with the name.

Firstly, 'cleanup' conflicts with the existing 'gc' name, which is
already exposed in box.info.gc. It is not called box.info.cleanup, so
I would propose to use 'gc' here too.

I described the other issue in the first patchset's discussion. It is
not really a delay, because it is simply ignored when the replication
feels like it. It must either have a 'replication' prefix, to designate
that it is not just a fixed timeout for keeping the WALs and that it
depends on replication, or at least it must mention "max", to designate
that it is not an exact strict timeout for keeping the logs.

> The `wal_cleanup_delay` option defines a delay in seconds
> before write ahead log files (`*.xlog`) are getting started
> to prune upon a node restart.
>
> This option is isnored in case if a node is running as

2. isnored -> ignored.

> an anonymous replica (`replication_anon = true`). Similarly
> if replication is unused or there is no plans to use
> replication at all then this option should not be considered.
>
> An initial problem to solve is the case where a node is operating
> so fast that its replicas do not manage to reach the node state
> and in case if the node is restarted at this moment (for various
> reasons, for example due to power outage) then unfetched `*.xlog`
> files might be pruned during restart. In result replicas will not
> find these files on the main node and have to reread all data back
> which is a very expensive procedure.
>
> Since replicas are tracked via `_cluster` system space this we use
> its content to count subscribed replicas and when all of them are
> up and running the cleanup procedure is automatically enabled even
> if `wal_cleanup_delay` is not expired.
>
> The `wal_cleanup_delay` should be set to:
>
> - `-1` to wait infinitely until all existing replicas are subscribed;
> - `0` to disable the cleanup delay;
> - `>0` to wait for specified number of seconds.
>
> By default it is set to `14400` seconds (ie `4` hours).
>
> In case if registered replica is lost forever and timeout is set to
> infinity then a preferred way to enable cleanup procedure is not setting
> up a small timeout value but rather to delete this replica from `_cluster`
> space manually.
>
> Note that the option does *not* prevent WAL engine from removing
> old `*.xlog` files if there is no space left on a storage device,
> WAL engine can remove them in a force way.
>
> Current state of `*.xlog` garbage collector can be found in
> `box.info.gc()` output. For example
>
> ``` Lua
> tarantool> box.info.gc()
> ---
> ...
> cleanup_is_paused: false

3. 'gc.cleanup' is a tautology. If you want to expose the flag, it would
be better as simply 'is_paused', IMO.

> delay_ref: 0

4. Can't the user get the same by looking at the _cluster size minus the
box.info.gc().consumers count?

> ```
>
> The `cleanup_is_paused` shows if cleanup fiber is paused or not,
> and `delay_ref` represents number of replicas to be subscribed
> if the cleanup fiber is in `pause` mode.
> ---
>  src/box/box.cc                  | 11 +++++
>  src/box/gc.c                    | 75 +++++++++++++++++++++++++++++++--
>  src/box/gc.h                    | 30 +++++++++++++
>  src/box/lua/info.c              |  8 ++++
>  src/box/lua/load_cfg.lua        |  3 ++
>  src/box/relay.cc                | 20 +++++++++
>  src/box/replication.cc          |  2 +
>  test/app-tap/init_script.result |  1 +
>  test/box/admin.result           |  2 +
>  test/box/cfg.result             |  4 ++
>  10 files changed, 153 insertions(+), 3 deletions(-)
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index b4a1a5e07..ee017c2dc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
> 	return (enum wal_mode) mode;
> }
>
> +static void
> +box_check_wal_cleanup_delay(int tmo)
> +{
> +	if (tmo < 0 && tmo != -1) {
> +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> +			  "the value must be either >= 0 or -1");

5. As you could notice, we try not to use C++ exceptions in the new
code. Let's use diag_set(). For example, like the new option
wal_max_queue_size does.

6. Can you move cfg_geti() into this function? I see no reason why you
need to pass the value as a parameter. The function is used in a single
place.

> +	}
> +}
> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
> 		bootstrap(&instance_uuid, &replicaset_uuid,
> 			  &is_bootstrap_leader);
> 	}
> +	gc_delay_unref();

7. Why?

> 	fiber_gc();
>
> 	bootstrap_journal_guard.is_active = false;
> diff --git a/src/box/gc.c b/src/box/gc.c
> index 9af4ef958..595b98bdf 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -107,6 +108,25 @@ gc_init(void)
> 	/* Don't delete any files until recovery is complete. */
> 	gc.min_checkpoint_count = INT_MAX;
>
> +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");

8. Please pass the options as arguments of gc_init(). Having the global
options disseminated over the sources does not make it easier to
understand which subsystems depend on which options. That would also be
consistent with not using the cfg_get* functions almost anywhere except
a few big 'main' files, and one SQL file, which is also wrong, but is a
mess in many other ways as well.

> +	gc.delay_ref = 0;
> +
> +	if (cfg_geti("replication_anon") != 0) {

9. GC does not really need to know whether the instance is anonymous.
This can be controlled entirely by the wal_cleanup_delay cfg and by the
caller of gc_init(): box_cfg_xc(). Otherwise you make GC depend on
replication details in the code, which is not great. We must keep the
dependencies as low as possible.

In box_cfg_xc() you can get both wal_cleanup_delay and replication_anon
and check whether anon is set: if it is, you call gc_init(0), and if it
is false, you call gc_init(wal_cleanup_delay). GC then only knows about
some delay obtained from its owner.
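To make 8 and 9 a bit more concrete, roughly this is what I mean. It is
only a sketch - the names, types, and the exact call site are
approximate, and the rest of gc_init() stays as it is in your patch:

/* gc.c: the delay comes from the caller, GC knows nothing about cfg. */
void
gc_init(int wal_cleanup_delay)
{
	/* Don't delete any files until recovery is complete. */
	gc.min_checkpoint_count = INT_MAX;

	gc.wal_cleanup_delay = wal_cleanup_delay;
	gc.delay_ref = 0;
	if (wal_cleanup_delay == 0) {
		/* Disabled explicitly, or the instance is anonymous. */
		gc.cleanup_is_paused = false;
	} else {
		say_info("wal/engine cleanup is paused");
		gc.cleanup_is_paused = true;
	}
	/* ... the rest as in the patch ... */
}

/* box.cc, in box_cfg_xc(): the caller decides what the delay is. */
	int wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
	if (cfg_geti("replication_anon") != 0)
		wal_cleanup_delay = 0;
	gc_init(wal_cleanup_delay);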
> +		/*
> +		 * Anonymous replicas are read only
> +		 * so no need to keep XLOGs.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else if (gc.wal_cleanup_delay == 0) {
> +		/*
> +		 * The delay is disabled explicitly.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else {
> +		say_info("gc: wal/engine cleanup is paused");
> +		gc.cleanup_is_paused = true;
> +	}
> +
> 	vclock_create(&gc.vclock);
> 	rlist_create(&gc.checkpoints);
> 	gc_tree_new(&gc.consumers);
> @@ -238,6 +258,30 @@ static int
> gc_cleanup_fiber_f(va_list ap)
> {
> 	(void)ap;
> +
> +	/*
> +	 * Stage 1 (optional): in case if we're booting
> +	 * up with cleanup disabled lets do wait in a
> +	 * separate cycle to minimize branching on stage 2.
> +	 */
> +	if (gc.cleanup_is_paused) {
> +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;

10. Please call it 'timeout'. 'tmo' looks super weird and is not used
anywhere else in the code except this patch.

11. What if I see that the timeout is too big and I call
box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it on
restart. Can you support a dynamic update? It should be easy. You need
to store the GC deadline instead of the delay. Then, when the timeout
is changed, you can check whether the deadline is already reached with
the new value, and start GC earlier. If I increase the timeout, GC
could be paused again, or the update could simply be ignored, or
banned. But the decrease is more important, I think. See the sketch
after comment 14 below.

> +		while (!fiber_is_cancelled()) {
> +			if (fiber_yield_timeout(tmo)) {
> +				say_info("gc: wal/engine cleanup is resumed "
> +					 "due to timeout expiration");

12. The other GC log messages don't have the 'gc' prefix. I would
propose to follow that in the new messages too.

> +				gc.cleanup_is_paused = false;
> +				break;
> +			}
> +			if (!gc.cleanup_is_paused)
> +				break;
> +		}
> +	}
> +
> +	/*
> +	 * Stage 2: a regular cleanup cycle.
> +	 */
> 	while (!fiber_is_cancelled()) {
> 		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
> 		if (delta == 0) {
> @@ -253,6 +297,29 @@ gc_cleanup_fiber_f(va_list ap)
> 	return 0;
> }
>
> +void
> +gc_delay_ref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref >= 0);
> +		gc.delay_ref++;
> +	}
> +}
> +
> +void
> +gc_delay_unref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref > 0);
> +		gc.delay_ref--;

13. If GC is not paused, because it started earlier due to the timeout,
the user will keep seeing a non-zero 'delay_ref' in box.info even after
all the replicas are connected.

> +		if (gc.delay_ref == 0) {
> +			say_info("gc: wal/engine cleanup is resumed");
> +			gc.cleanup_is_paused = false;
> +			fiber_wakeup(gc.cleanup_fiber);
> +		}
> +	}
> +}
> +
> /**
>  * Trigger asynchronous garbage collection.
>  */
> @@ -278,9 +345,11 @@ gc_schedule_cleanup(void)
> static void
> gc_wait_cleanup(void)
> {
> -	int64_t scheduled = gc.cleanup_scheduled;
> -	while (gc.cleanup_completed < scheduled)
> -		fiber_cond_wait(&gc.cleanup_cond);
> +	if (!gc.cleanup_is_paused) {
> +		int64_t scheduled = gc.cleanup_scheduled;
> +		while (gc.cleanup_completed < scheduled)
> +			fiber_cond_wait(&gc.cleanup_cond);
> +	}

14. The function is called 'wait_cleanup', but it does not really wait
in case GC is paused. That looks wrong.
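Coming back to 11 - a sketch of the deadline-based approach. All the
names here (gc.cleanup_delay_deadline, gc.cleanup_start_time,
gc_set_wal_cleanup_delay()) are made up, it is not an exact
implementation:

/* Called from the dynamic box.cfg{wal_cleanup_delay = ...} handler. */
void
gc_set_wal_cleanup_delay(double delay)
{
	if (!gc.cleanup_is_paused)
		return;
	/* gc.cleanup_start_time is remembered once, in gc_init(). */
	gc.cleanup_delay_deadline = delay == -1 ? TIMEOUT_INFINITY :
		gc.cleanup_start_time + delay;
	/*
	 * Wake the cleanup fiber up - it recalculates its timeout
	 * from the deadline and resumes right away if the new
	 * deadline is already reached.
	 */
	fiber_wakeup(gc.cleanup_fiber);
}

Then stage 1 of gc_cleanup_fiber_f() waits until the deadline instead
of a fixed delay:

	while (!fiber_is_cancelled() && gc.cleanup_is_paused) {
		double timeout = gc.cleanup_delay_deadline - fiber_clock();
		if (timeout <= 0 || fiber_yield_timeout(timeout)) {
			say_info("wal/engine cleanup is resumed "
				 "due to timeout expiration");
			gc.cleanup_is_paused = false;
		}
	}

That way a decrease of the delay takes effect immediately, and an
increase can be either applied or rejected - whatever we decide.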
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 41f949e8e..5d09690ad 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
> 	}
> }
>
> +static void
> +relay_gc_delay_unref(struct cmsg *msg)
> +{
> +	(void)msg;
> +	gc_delay_unref();

15. You don't need a separate callback, and you don't need to call it
from the relay thread. Relay already works with GC - it calls
gc_consumer_register() before starting the thread. You can do the unref
in the same place. Starting from the consumer registration, the logs
are going to be kept anyway. It looks like it would be simpler, if it
works. See the sketch after the diff below.

> +}
> +
> /**
>  * A libev callback invoked when a relay client socket is ready
>  * for read. This currently only happens when the client closes
>  * its socket, and we get an EOF.
> @@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
> 	 */
> 	relay_send_heartbeat(relay);
>
> +	/*
> +	 * Now we can resume wal/engine gc as relay
> +	 * is up and running.
> +	 */
> +	if (!relay->replica->anon) {
> +		static const struct cmsg_hop gc_delay_route[] = {
> +			{relay_gc_delay_unref, NULL}
> +		};
> +		struct cmsg gc_delay_msg;
> +		cmsg_init(&gc_delay_msg, gc_delay_route);
> +		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
> +	}
> +
> 	/*
> 	 * Run the event loop until the connection is broken
> 	 * or an error occurs.
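And for 15, roughly what I mean, if I remember the relay code correctly
(only a sketch; the registration block is the one already existing in
relay_subscribe(), in the tx thread, before the relay thread is
started):

	if (replica->gc == NULL && !replica->anon) {
		replica->gc = gc_consumer_register(replica_clock, "replica %s",
						   tt_uuid_str(&replica->uuid));
		if (replica->gc == NULL)
			diag_raise();
	}
	/*
	 * The consumer is registered, so the logs for this replica
	 * are kept from now on anyway - the delay reference can be
	 * dropped right here, in tx, without a cross-thread message.
	 */
	if (!replica->anon)
		gc_delay_unref();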