[Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sat Mar 20 01:17:22 MSK 2021
>> If we knew the topology, we could tell which nodes will relay
>> to whom, and would be able to detect when a replica needs to
>> keep the logs and when it doesn't. We would not even need the
>> entire topology. Just the info from the adjacent nodes: from our
>> own box.cfg.replication, and from box.cfg.replication of those nodes.
>>
>> I still don't understand what was wrong with implementing some kind
>> of topology discovery for this 2-level topology tree. For instance,
>> when the applier on a replica is connected, the remote node sends us
>> a flag saying whether it is going to connect back. If the flag is
>> false, we don't keep the logs for that instance.
>
> There is nothing wrong with it and I think we should do it. Could you
> please elaborate on the details? You mean to extend the applier
> protocol data, so that it would send not only the vclock but also a
> flag saying whether it is going to set up a relay?
Never mind, it won't work for certain topologies. The only way is to
have it in _cluster, and make _cluster a synchronous space. But this is
a big separate task, so a timeout is probably fine for now.
>> See 13 comments below.
>>
>>> @TarantoolBot document
>>> Title: Add wal_cleanup_delay configuration parameter
>>
>> 1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
>> with the existing 'gc' name, which is already exposed in box.info.gc.
>> It is not called box.info.cleanup, so I would propose to use 'gc'
>> here too.
>>
>> The other issue I described in the first patchset's discussion: it
>> is not really a delay, because it is simply ignored when the
>> replication feels like it. It must either have a 'replication'
>> prefix, to designate that it is not just a fixed timeout for keeping
>> the WALs and that it depends on the replication, or at least it must
>> mention 'max', to designate that it is not an exact strict timeout
>> for keeping the logs.
>
> Vlad, personally I don't mind naming this option whatever you like,
> just give me a name and I'll use it.
I still don't like the name, but I am outvoted. We won't even add
'max' to it. The current name stays, unfortunately.
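To make the 'max' argument concrete: the delay works as an upper bound.
The cleanup fiber resumes either when all registered replicas have
subscribed (delay_ref drops to zero) or when the configured time runs
out, whichever happens first. Roughly like this (just a sketch to
illustrate the point, not the patch itself; only cleanup_is_paused and
delay_ref come from the quoted hunks, the condition variable, the
deadline, and the wal_cleanup_delay variable are made-up names, and I
assume each unref wakes this fiber up):

	double deadline = ev_monotonic_now(loop()) + wal_cleanup_delay;
	while (gc.delay_ref > 0) {
		double timeout = deadline - ev_monotonic_now(loop());
		if (timeout <= 0 ||
		    fiber_cond_wait_timeout(&gc.delay_cond, timeout) != 0)
			break; /* The delay expired, stop keeping the logs. */
	}
	gc.cleanup_is_paused = false;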
>>> + }
>>> +}
>>> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
>>> bootstrap(&instance_uuid, &replicaset_uuid,
>>> &is_bootstrap_leader);
>>> }
>>> + gc_delay_unref();
>>
>> 5. Why?
>
> It is for the case where you don't have replicas at all: you need
> to unref yourself from the counting, so the counter shrinks to
> zero and GC is enabled.
Let's add a comment about this.
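Something like this maybe (just a draft of the comment, the exact
wording is up to you; I am assuming the instance takes one delay
reference for itself during gc initialization):

	/*
	 * The instance counts itself in the delay references. Drop
	 * that reference now that the configuration is done, so that
	 * on a node without replicas the counter reaches zero and the
	 * cleanup fiber does not wait for the timeout.
	 */
	gc_delay_unref();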
>>> +void
>>> +gc_delay_unref(void)
>>> +{
>>> + if (gc.cleanup_is_paused) {
>>> + assert(gc.delay_ref > 0);
>>> + gc.delay_ref--;
>>
>> 11. If it is not paused and GC resumed earlier due to a
>> timeout, the user will see a non-zero 'delay_ref' in box.info
>> even after all the replicas are connected.
>
> Yes, and this is going to be a sign that we exited due to a timeout.
> Such output should not be treated as an error, but if you prefer I
> can zap the counter for this case.
I would prefer not to have delay_ref in the public
monitoring at all, but you should ask Mons about it now.
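If you do decide to zap it, an else branch in the quoted function would
be enough. A sketch (the part that resumes cleanup once the counter
reaches zero is left out, same as in the cut-off hunk above):

	void
	gc_delay_unref(void)
	{
		if (gc.cleanup_is_paused) {
			assert(gc.delay_ref > 0);
			gc.delay_ref--;
		} else if (gc.delay_ref > 0) {
			/*
			 * Cleanup was already resumed by the timeout.
			 * Still decrement the counter so box.info does
			 * not keep showing a stale non-zero value after
			 * all replicas have subscribed.
			 */
			gc.delay_ref--;
		}
	}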
>>> + if (!gc.cleanup_is_paused) {
>>> + int64_t scheduled = gc.cleanup_scheduled;
>>> + while (gc.cleanup_completed < scheduled)
>>> + fiber_cond_wait(&gc.cleanup_cond);
>>> + }
>>
>> 12. The function is called 'wait_cleanup', but it does not really
>> wait in case GC is paused. It looks wrong.
>
> Hard to say. The paused state is rather an internal state, and when
> someone is waiting for the cleanup to complete but cleanup is turned
> off, I consider it pretty natural to exit immediately. I moved this
> "if" into the function itself simply because we may use this helper
> in the future, and we should not get stuck forever in "waiting" if
> cleanup is disabled.
>
> I can move this "if" to the caller side though if you prefer.
Probably this would be better, because the function is called
'wait cleanup', not 'try to wait or return immediately'.
The comment at the place where the function is called says it
guarantees that "by the time box.snapshot() returns, all outdated
checkpoint files have been removed". That comment is misleading now.
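I.e. keep only the waiting loop in the helper and let the caller skip
it (a sketch based on the quoted hunk; I am assuming the caller is the
checkpointing code, the exact place may differ):

	static void
	gc_wait_cleanup(void)
	{
		/* Always waits now; the caller decides whether to call it. */
		int64_t scheduled = gc.cleanup_scheduled;
		while (gc.cleanup_completed < scheduled)
			fiber_cond_wait(&gc.cleanup_cond);
	}

	/* At the call site, e.g. during box.snapshot(): */
	if (!gc.cleanup_is_paused)
		gc_wait_cleanup();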
>>> +static void
>>> +relay_gc_delay_unref(struct cmsg *msg)
>>> +{
>>> + (void)msg;
>>> + gc_delay_unref();
>>
>> 13. You don't need a separate callback, and don't need to call it
>> from the relay thread. Relay already works with GC - it calls
>> gc_consumer_register() before starting the thread. You can do the
>> unref in the same place. Starting from the consumer registration
>> the logs are going to be kept anyway.
>>
>> Looks like it would be simpler if it works.
>
> Actually, initially I did it this way, right before creating the
> relay thread. But I think this is not good, and here is why:
> when a relay is starting, it may fail in a number of places
> (the thread itself is not created; the thread is created but
> then fiber creation fails with an exception), and I think we
> should decrement the reference only when we are pretty sure that
> there won't be new errors inside the relay cycle.
Why? New errors won't break anything. They can also happen after
you do the unref in your patch.
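What I mean is something like this (a sketch from memory, the exact
context of relay_subscribe() may differ; the consumer registration
itself is already in the code, only the gc_delay_unref() call is new):

	if (replica->gc == NULL && !replica->anon) {
		replica->gc = gc_consumer_register(replica_clock,
						   "replica %s",
						   tt_uuid_str(&replica->uuid));
		if (replica->gc == NULL)
			diag_raise();
	}
	/*
	 * From this point the logs are pinned by the consumer anyway,
	 * so the subscribe delay reference can be dropped right here,
	 * in the tx thread, without a message from the relay thread.
	 */
	gc_delay_unref();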
> What would happen when, say, a relay fiber triggers an error?
> iproto will write an error and, as far as I understand, the replica
> will try to reconnect. Thus we should keep the logs until the relay
> is subscribed for sure.
A GC consumer keeps the logs while the replica tries to reconnect.
But what I don't understand now is how this works if the struct
replica objects create the consumer objects right in replica_set_id().
Look at relay_subscribe(). If the replica is not anon, then
'replica->id != REPLICA_ID_NIL' is true (there is an assertion for that).
It means replica_set_id() was already called, and that means replica->gc
is already not NULL. Therefore the check "replica->gc == NULL && !replica->anon"
is never true. Am I missing something?
Doesn't it mean that all the _cluster nodes have a GC consumer on the
current node, and nothing is deleted until all the nodes connect?
Regardless of your patch. I feel like I am missing something important,
because before the patch we somehow got xlog gap errors, which means
the consumers are dropped somewhere. Can you investigate?