From: Cyrill Gorcunov via Tarantool-patches
Date: Fri, 19 Mar 2021 14:03:52 +0300
To: Vladislav Shpilevoy
Cc: Mons Anderson, tml
Subject: Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
References: <20210318184138.1077807-1-gorcunov@gmail.com> <20210318184138.1077807-2-gorcunov@gmail.com>

On Fri, Mar 19, 2021 at 12:04:09AM +0100, Vladislav Shpilevoy wrote:
>
> AFAIU, we go for the timeout option for the only reason that
> there might be non-fullmesh topologies, when a replica is in
> _cluster, but does not have relays to other nodes.

Exactly.

> If we would know the topology, we could tell which nodes will
> relay to who, and would be able to detect when a replica needs
> to keep the logs, and when don't.

Not even the entire topology. Just the info from the adjacent nodes:
from our own box.cfg.replication, and from box.cfg.replication of
these nodes.
> I still don't understand what was wrong with implementing some kind
> of topology discovery for this 2-level topology tree. For instance,
> when applier on a replica is connected, the remote node sends us a
> flag whether it is going to connect back. If the flag is false, we
> don't keep the logs for that instance.

There is nothing wrong with it, and I think we should do it. Could you
please elaborate on the details? Do you mean extending the applier
protocol data, so that it would send not only the vclock but also a
flag telling whether it is going to set up a relay?

> See 13 comments below.
>
> > @TarantoolBot document
> > Title: Add wal_cleanup_delay configuration parameter
>
> 1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
> with the existing 'gc' name, which is already exposed in box.info.gc.
> It is not called box.info.cleanup, so I would propose to use 'gc'
> here too.
>
> Another issue I described in the first patchset's discussion. It is
> not really a delay, because it is simply ignored when the replication
> feels like it. It must either have a 'replication' prefix to designate
> that it is not just a fixed timeout to keep the WALs and that it
> depends on the replication, or at least it must mention "max", to
> designate that it is not the exact strict timeout for keeping the logs.

Vlad, personally I don't mind naming this option whatever you like:
just give me a name and I'll use it.

> > The `wal_cleanup_delay` option defines a delay in seconds
> > before write ahead log files (`*.xlog`) are getting started
> > to prune upon a node restart.
> >
> > This option is isnored in case if a node is running as
>
> 2. isnored -> ignored.

Ouch, I happened to miss this even though I ran aspell. Thanks!

> > Current state of `*.xlog` garbage collector can be found in
> > `box.info.gc()` output. For example
> >
> > ``` Lua
> > tarantool> box.info.gc()
> > ---
> > ...
> > cleanup_is_paused: false
>
> 3. 'gc.cleanup' is a tautology.
> If you want to expose the flag, it better be simply 'is_paused' IMO.

OK

> > delay_ref: 0
>
> 4. Can't the user get the same by looking at
>
> _cluster size - box.info.gc().consumers count
>
> ?

I would prefer to keep this as a separate output, simply because I need
an internal variable for the gc code itself, and reading it is way more
convenient than fetching the _cluster size, then the consumers, and
doing some math afterwards. Moreover, I need a way to find out the
_internal_ counter value for debugging purposes.

> > +static void
> > +box_check_wal_cleanup_delay(int tmo)
> > +{
> > +	if (tmo < 0 && tmo != -1) {
> > +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> > +			  "the value must be either >= 0 or -1");
>
> 3. As you could notice, we try not to use C++ exceptions in the new
> code. Lets use diag_set(). For example, like the new option
> wal_max_queue_size does.

OK

> 4. Can you move cfg_geti() into this function? I see no reason why
> do you need to pass it as a parameter. The function is used in
> a single place.

Sure

> > +	}
> > +}

> > @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
> >  		bootstrap(&instance_uuid, &replicaset_uuid,
> >  			  &is_bootstrap_leader);
> >  	}
> > +	gc_delay_unref();
>
> 5. Why?

For the case where you don't have replicas at all: you need to drop
your own reference from the counting, so that the counter shrinks to
zero and gc gets enabled.

> >  	fiber_gc();
> >
> >  	bootstrap_journal_guard.is_active = false;

> > diff --git a/src/box/gc.c b/src/box/gc.c
> > index 9af4ef958..595b98bdf 100644
> > --- a/src/box/gc.c
> > +++ b/src/box/gc.c
> > @@ -107,6 +108,25 @@ gc_init(void)
> >  	/* Don't delete any files until recovery is complete. */
> >  	gc.min_checkpoint_count = INT_MAX;
> >
> > +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
>
> 6. Please, pass the options as arguments of gc_init(). Having the
> global options disseminated over the sources does not simplify
> understanding of what subsystems on which options depend.
> That would be consistent with not using cfg_get* functions almost
> anywhere except a few big 'main' files, and one sql file, which
> is also wrong, but is a mess in many other ways as well.

OK

> > +	gc.delay_ref = 0;
> > +
> > +	if (cfg_geti("replication_anon") != 0) {
>
> 7. GC does not really need to know if the instance is anon. This
> could be both controlled by wal_cleanup_delay cfg and the caller
> of gc_init(): box_cfg_xc(). Otherwise you made GC depend on
> replication details in the code, which is not great. We must keep
> the dependencies as low as possible.
>
> In box_cfg_xc you can get both wal_cleanup_delay and replication_anon
> and check if the anon is set, then you call gc_init(0). If it is
> false, you call gc_init(wal_cleanup_delay). GC only knows about some
> delay obtained from its owner.

OK

> > +
> > +	/*
> > +	 * Stage 1 (optional): in case if we're booting
> > +	 * up with cleanup disabled lets do wait in a
> > +	 * separate cycle to minimize branching on stage 2.
> > +	 */
> > +	if (gc.cleanup_is_paused) {
> > +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> > +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;
>
> 8. Please, call it 'timeout'. 'tmo' looks super weird, and is never
> used anywhere except this patch.

TMO is a well-known abbreviation in the network stack, but sure, I
will rename it.

> 9. What if I see the timeout was too big, and I call
> box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it on
> restart. Can you support a dynamic update? It should be easy. You
> need to store the gc deadline instead of the delay. So when you change
> the timeout, you can see if with the new timeout the deadline is
> already reached, and then can start GC earlier.

Thanks, will make it dynamic.

> If I increase the timeout, GC could be paused again, or simply
> ignored, or banned. But the decrease is more important, as I think.
> > +		while (!fiber_is_cancelled()) {
> > +			if (fiber_yield_timeout(tmo)) {
> > +				say_info("gc: wal/engine cleanup is resumed "
> > +					 "due to timeout expiration");
>
> 10. Other gc logs don't have 'gc' prefix. I would propose to follow
> it in the new logs too.

The prefix makes grepping the logs easier, but sure, I will drop it.

> > +void
> > +gc_delay_unref(void)
> > +{
> > +	if (gc.cleanup_is_paused) {
> > +		assert(gc.delay_ref > 0);
> > +		gc.delay_ref--;
>
> 11. If it is not paused and GC started earlier due to a
> timeout, the user will see 'delay_ref' in box.info even
> after all the replicas are connected.

Yes, and that is going to be a sign that we exited due to the timeout.
Such output should not be treated as an error, but if you prefer I can
zap the counter for this case.

> > +	if (!gc.cleanup_is_paused) {
> > +		int64_t scheduled = gc.cleanup_scheduled;
> > +		while (gc.cleanup_completed < scheduled)
> > +			fiber_cond_wait(&gc.cleanup_cond);
> > +	}
>
> 12. The function is called 'wait_cleanup', but it does not really
> wait in case GC is paused. It looks wrong.

Hard to say. The paused state is rather an internal state, and when
someone waits for a cleanup to complete while cleanup is turned off, I
consider it pretty natural to exit immediately. I moved this "if" into
the function itself simply because we may reuse this helper in the
future, and we should not get stuck forever in "waiting" if cleanup is
disabled. I can move the "if" to the caller side though, if you prefer.

> > +static void
> > +relay_gc_delay_unref(struct cmsg *msg)
> > +{
> > +	(void)msg;
> > +	gc_delay_unref();
>
> 13. You don't need a separate callback, and don't need to call it
> from the relay thread. Relay already works with GC - it calls
> gc_consumer_register() before starting the thread. You can do the
> unref in the same place. Starting from the consumer registration
> the logs are going to be kept anyway.
>
> Looks like it would be simpler if it works.
Actually, initially I did it that way, right before creating the relay
thread. But I think this is not good, and here is why: when the relay
is starting, it may fail in a number of places (the thread itself is
not created; the thread is created but then fiber creation failed with
an exception), and I think we should decrement the reference only when
we are pretty sure that there won't be new errors inside the relay
cycle. What would happen if, say, the relay fiber triggers an error?
iproto will write an error, and as far as I understand the replica
will try to reconnect. Thus we should keep the logs until the relay is
subscribed, for sure.

	Cyrill