[Tarantool-patches] [PATCH] Trigger on vclock change

Mon Nov 18 12:31:18 MSK 2019

* Georgy Kirichenko <georgy at tarantool.org> [19/11/16 23:37]:

> > What is wrong with GC and how exactly do you want to "fix" it?
> We have discussed some points with you verbally (about 3-4 month
> ago).  The main point is: the way of information processing is
> weird:

> 1. WAL has the full information about the wal directory (xlogs
>    and their boundaries)

This is not strictly necessary. It saves us one xdir_scan() in
xdir_collect_garbage(), this is perhaps the main historical
reason it's there.

We even have to make an effort to maintain this state in WAL:
- we call xdir_scan() in wal_enable()
- we call xdir_add_vclock() whenever we open/close the next xlog.

The second reason it was done in WAL was to not block tx
thread, but later we had latency spikes in WAL thread as well, so
we added XDIR_GC_ASYNC to fix these, and this second reason is a
non-reason any more.

Finally, the third reason WAL does it is wal_fallocate() function,
which removes files if we're out of space. Instead of going back
to GC subsystem and asking it to remove a file, the implementation
went the short route and removes the file directly in WAL
susbystem and notifies GC as a matter of fact.

As you can see, all these reasons are accidental. Technically any
subsystem (WAL, GC) can remove these files if we add xdir_scan()
to xdir_collect_garbage().

GC subsystem is responsible for all the old files, so it should be
dealing with them.

The fix is to add xdir_scan() to xdir_collect_garbage(), and
change wal_fallocate() to send a message to GC asking it to remove
some data, rather than kick the chair out of GC butt by calling
xdir_collect_garbage(XDIR_GC_REMOVE_ONE). One issue with fixing it
this way, is what would you do in wal_fallocate() after you send
the message? You will have to have wal_fallocate_soft(), which
sends the message asynchronously, to not stall WAL, and
wal_fallocate_hard(), which would stall WAL until there is
response from TX about extra space. A lot more work.

Even though WAL contains some of GC state, it's neither an owner
of it nor a consumer: it is only a producer of GC state, and
it updates GC state by sending notifications about the files that
it creates and closes. The consumers are engines, checkpoints,
backups, relays.

BTW, I don't think in-memory replication is a new consumer of GC state
- it doesn't act like a standard consumer:

 * a usual consumer may need multiple xlog files, because it can
   be at a position way behind the current xlog; in-memory
   replication is almost always pointing to the current xlog, 
   there may be rare cases when it depends on the previous xlogs
   when xlog size is small or there was a recent rotation.

 * in case of standard consumers, each consumer is at its own
   position, while for in-memory replication, all relays are more
   or less on the same position - at least it doesn't make any
   logical sense to advance each relay's position independently

I remember having suggested that, and I don't remember why using a
single consumer for all in-memory relays did not work out for you. 
The idea is that whenever a relay switches to the memory mode it
unsubscribes from GC, and whenever it is back to file mode, it is
subscribes to GC again. In order to avoid any races, in-memory-WAL
as a consumer keeps a reference to a few WALs.

The alternative is to move GC subsystem entirely to WAL. This
could perhaps also work and even be cleaner than centralizing GC
in TX. Either way I don't see it as a blocker for in-memory WAL -
I think in-memory WAL can work with GC being either in WAL or in
TX, it's just the messages that threads exchange become a bit more
complicated. 

> 2. WAL process the wal directory cleanup

As I wrote above, there are two reasons for this, both historical:
- we wanted to avoid TX stalls
- we have wal_fallocate(), a feature which was implemented
  "lazily" so it just removes the files under GCs feet and
  notifies GC after the fact.

GC, logically, controls the WAL dir, and WAL is only a producer of
WAL files.

> 3. We filter out all this information while relaying (as a relay
>    has only a stream of rows)

GC subscription is not interested in the stream of rows.
It is interested in a stream files. A file is represented in GC as a
vclock, and a row is identified by a vclock, but it doesn't mean
it's the same thing. 

This is why I really dislike your idea of calling gc_advance on
row events. 

> 4. We try to restore some of this information using on_close_log
>    recovery trigger.

No, it's not "restore" the information. It's pass the right event
about the consumer - the file event - to the GC.

> 5. We send recovered boundaries to TX and tx tread reconstruct
>    the relay order loosing really relay vclocks (as they mapped
>    to the local xlog history)

I don't quite get what you mean here? Could you elaborate?
I think there is no "reconstruction". There are two types of
events: the events updating replicaset_vclock, are needed for
replication monitoring, and they happen often. The action upon 
this event is very cheap - you simply
vclock_advance(replicaset_vclock).

The second type of event is when relay or backup or engine stops
using an xlog file. It is also represented by a vclock but it is
not as cheap to process as the first kind, because gc_advance() is
not cheap, it's rbtree search.

You keep trying to merge the two streams into a single stream, I
keep asking to keep the two streams separate. There is of course
the standard pluses and minuses of using a centralized "event bus"
for all these events - with a single bus, as you suggest, things
become simpler for the producer, but the consumers have to do more
work to filter out the unnecessary events. 

> 6. TX sends the oldest vclock back to wal

> 7. There is some issues with making a consumer inactive. For
>    instance if we deactivated a consumer could survive, for
>    instance if deleted xlog was already send by an applier but
>    not reported yet (I do not even know how it could be fixed in
>    the current design).

I don't want to argue whether it's weird or not, it's subjective. 
I agree GC state is distributed now, and it's better if it is
centralized.

This could be achieved by either moving WAL xdir state to tx, 
and making sure tx is controlling it, or by moving entire GC 
to WAL. Moving GC state to WAL seems a cleaner approach, but I'm
fine either way.

> Maybe it is working, but I afraid, this was done without any thinking about 
> the future development (I mean the synchronous replication). Let me explain 
> why.
> 1. WAL should receive all relay states as soon as possible.

Agree, but it is a different stream of events - it's sync
replication events. File events are routed to GC subsystem, sync
replication events are routed to RAFT subsystem in WAL. 

> 2. The set of relay vclocks is enough to perform garbage
>    collection (as we could form a vclock with is the lower bound
>    of the set)

This is thanks to the fact that each file is unequivocally defined
by its vclock boundaries, which is accidental.

> So I wish the garbage collection would be implemented using direct relay to 
> wal reporting. In this circumstances I was in need to implement a structure (I 
> named it as matrix clock - mclcok) which able to contain relay vclocks and 
> evaluate a vclock which is lower bound of n-members of the mclcock.
> The mclock could be used to get the n-majority vclock as wel as the lowest 
> boundary of all vclock alive.
> The mclock is already implemented as well as new gc design (wal knows about 
> all relay vclock and the first vclock locked by TX - checkpoint or join read 
> view).

The idea of vclock matrix is totally fine & dandy for bsync. Using it for
GC seems like a huge overkill.

As to the fact that you have more patches on branches, I think
it's better to finish in-memory-replication first - it's a huge
performance boost for replicated set-ups, and reduces the latency,
too.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com