[Tarantool-patches] [PATCH] Trigger on vclock change

Wed Jun 3 13:12:38 MSK 2020

Hi!

Thank you for bringing this up!

It was discussed back in 2019
https://lists.tarantool.org/pipermail/tarantool-patches/2019-November/012402.html
and before Kostja and Georgy got personal in their discussion,
they [kind of] agreed that the single-leader sync replication will
implement the required functionality.

As you know, we have the first version of synchro planned for this
quarter, and I believe this tracker will be resolved as part of it, so
there is no need to implement it a second time.

Please link it to the qsync tracker #4842 so we update/close it as
soon as we're done.

Regards,
Sergos

On 02 Jun 15:22, Maria Khaydich wrote:
> 
> Using a trigger on vclock change to determine the state would be CPU-consuming, so I'm currently reworking the previous patch so that we can yield from a fiber and wait for a specific lsn from a specific replica. A possible use case: committing a transaction and waiting for it to be applied on all replicas. The way I am going to implement it is pretty much how Kostja suggested:
>
> "...wait_lsn() could add the server_id, lsn that is being waited for to a sorted list, and whenever we update replicaset vclock for this lsn we also look at top of the list, if it is not empty, and if the current lsn is greater than the top, we could pop the value from the list and send a notification to the waiter".
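
A minimal sketch of the sorted waiter list described in the quote above, assuming one list per replica. All names here (lsn_waiter, waiter_list_add, waiter_list_advance) are hypothetical and not taken from the patch, and the `notified` flag only stands in for waking up the waiting fiber. The sketch also wakes a waiter when the current lsn is equal to the awaited one, which is an assumption on my side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical waiter record; not a Tarantool structure. */
struct lsn_waiter {
	int64_t lsn;             /* lsn this waiter is waiting for */
	bool notified;           /* stands in for a fiber wakeup */
	struct lsn_waiter *next; /* next waiter, with a larger or equal lsn */
};

/* Insert a waiter, keeping the list sorted by ascending lsn. */
void
waiter_list_add(struct lsn_waiter **head, struct lsn_waiter *w)
{
	assert(head != NULL && w != NULL);
	w->notified = false;
	while (*head != NULL && (*head)->lsn <= w->lsn)
		head = &(*head)->next;
	w->next = *head;
	*head = w;
}

/*
 * Called when the vclock component of this replica reaches `lsn`.
 * Pops every waiter at the top of the list whose awaited lsn has
 * been reached (assumption: "reached" means <=) and notifies it.
 * Returns the number of waiters notified.
 */
int
waiter_list_advance(struct lsn_waiter **head, int64_t lsn)
{
	int woken = 0;
	while (*head != NULL && (*head)->lsn <= lsn) {
		struct lsn_waiter *w = *head;
		*head = w->next;
		w->next = NULL;
		w->notified = true;
		woken++;
	}
	return woken;
}
```

Since the list is sorted, an update that wakes nobody costs a single comparison with the top element, which is the point of the suggestion compared to a trigger fired on every vclock change.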
>  
> Anyway, there are still some questions to discuss.
> 1. Do we need the wait_lsn_any() method mentioned here https://github.com/tarantool/tarantool/issues/3808 ? I don't see how this one can be useful.
> 2. What should be done in case of failure (reaching the timeout)? Simply returning an error seems like the best choice to me, so that the user can later decide what to do with this information.
>
> Another issue is that during the last discussion on the mailing list it was mentioned that we wouldn't need this feature at all if we had synchronous replication. Any thoughts on this matter?
> > 
> >>
> >>Monday, 18 November 2019, 12:31 +03:00 from Konstantin Osipov <kostja.osipov at gmail.com>:
> >> 
> >>* Georgy Kirichenko < georgy at tarantool.org > [19/11/16 23:37]:
> >>
> >>> > What is wrong with GC and how exactly do you want to "fix" it?
> >>> We have discussed some points with you verbally (about 3-4 months
> >>> ago). The main point is: the way of information processing is
> >>> weird:
> >>
> >>> 1. WAL has the full information about the wal directory (xlogs
> >>> and their boundaries)
> >>
> >>This is not strictly necessary. It saves us one xdir_scan() in
> >>xdir_collect_garbage(), this is perhaps the main historical
> >>reason it's there.
> >>
> >>We even have to make an effort to maintain this state in WAL:
> >>- we call xdir_scan() in wal_enable()
> >>- we call xdir_add_vclock() whenever we open/close the next xlog.
> >>
> >>The second reason it was done in WAL was to not block tx
> >>thread, but later we had latency spikes in WAL thread as well, so
> >>we added XDIR_GC_ASYNC to fix these, and this second reason is a
> >>non-reason any more.
> >>
> >>Finally, the third reason WAL does it is wal_fallocate() function,
> >>which removes files if we're out of space. Instead of going back
> >>to GC subsystem and asking it to remove a file, the implementation
> >>went the short route and removes the file directly in WAL
> >>subsystem and notifies GC after the fact.
> >>
> >>As you can see, all these reasons are accidental. Technically any
> >>subsystem (WAL, GC) can remove these files if we add xdir_scan()
> >>to xdir_collect_garbage().
> >>
> >>GC subsystem is responsible for all the old files, so it should be
> >>dealing with them.
> >>
> >>The fix is to add xdir_scan() to xdir_collect_garbage(), and
> >>change wal_fallocate() to send a message to GC asking it to remove
> >>some data, rather than kick the chair out of GC butt by calling
> >>xdir_collect_garbage(XDIR_GC_REMOVE_ONE). One issue with fixing it
> >>this way, is what would you do in wal_fallocate() after you send
> >>the message? You will have to have wal_fallocate_soft(), which
> >>sends the message asynchronously, to not stall WAL, and
> >>wal_fallocate_hard(), which would stall WAL until there is
> >>response from TX about extra space. A lot more work.
> >>
> >>Even though WAL contains some of GC state, it's neither an owner
> >>of it nor a consumer: it is only a producer of GC state, and
> >>it updates GC state by sending notifications about the files that
> >>it creates and closes. The consumers are engines, checkpoints,
> >>backups, relays.
> >>
> >>BTW, I don't think in-memory replication is a new consumer of GC state
> >>- it doesn't act like a standard consumer:
> >> 
> >> * a usual consumer may need multiple xlog files, because it can
> >>   be at a position way behind the current xlog; in-memory
> >>   replication is almost always pointing to the current xlog,
> >>   there may be rare cases when it depends on the previous xlogs
> >>   when xlog size is small or there was a recent rotation.
> >>
> >> * in case of standard consumers, each consumer is at its own
> >>   position, while for in-memory replication, all relays are more
> >>   or less on the same position - at least it doesn't make any
> >>   logical sense to advance each relay's position independently
> >>
> >>I remember having suggested that, and I don't remember why using a
> >>single consumer for all in-memory relays did not work out for you.
> >>The idea is that whenever a relay switches to the memory mode it
> >>unsubscribes from GC, and whenever it is back to file mode, it
> >>subscribes to GC again. In order to avoid any races, in-memory-WAL
> >>as a consumer keeps a reference to a few WALs.
> >>
> >>The alternative is to move GC subsystem entirely to WAL. This
> >>could perhaps also work and even be cleaner than centralizing GC
> >>in TX. Either way I don't see it as a blocker for in-memory WAL -
> >>I think in-memory WAL can work with GC being either in WAL or in
> >>TX, it's just the messages that threads exchange become a bit more
> >>complicated.
> >>
> >>> 2. WAL processes the wal directory cleanup
> >>
> >>As I wrote above, there are two reasons for this, both historical:
> >>- we wanted to avoid TX stalls
> >>- we have wal_fallocate(), a feature which was implemented
> >>  "lazily" so it just removes the files from under GC's feet and
> >>  notifies GC after the fact.
> >>
> >>GC, logically, controls the WAL dir, and WAL is only a producer of
> >>WAL files.
> >>  
> >>> 3. We filter out all this information while relaying (as a relay
> >>> has only a stream of rows)
> >>
> >>GC subscription is not interested in the stream of rows.
> >>It is interested in a stream of files. A file is represented in GC as a
> >>vclock, and a row is identified by a vclock, but it doesn't mean
> >>it's the same thing.
> >>
> >>This is why I really dislike your idea of calling gc_advance on
> >>row events.
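
A small sketch of the file-versus-row distinction made above, under simplifying assumptions: each vclock is reduced to a single scalar signature, and a file starting at signature S is taken to hold rows with signatures from S up to the start of the next file. The names xlog_file_for_row, file_consumer and consumer_on_row are hypothetical and do not exist in Tarantool. The point is only that many row positions map to the same file, so most row events carry no new information for GC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the index of the file that holds `row`, or -1 if the row
 * precedes the first file. `file_start` must be sorted ascending.
 */
int
xlog_file_for_row(const int64_t *file_start, int n_files, int64_t row)
{
	int lo = 0, hi = n_files; /* binary search over [lo, hi) */
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		if (file_start[mid] <= row)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* lo is the first file starting after `row`. */
	return lo - 1;
}

/* Hypothetical consumer state: the file it is currently reading. */
struct file_consumer {
	int file;
};

/*
 * Feed a row position to the consumer. Returns true only when the
 * consumer moved to a newer file, i.e. when there is a file event
 * worth reporting to GC.
 */
bool
consumer_on_row(struct file_consumer *c, const int64_t *file_start,
		int n_files, int64_t row)
{
	int file = xlog_file_for_row(file_start, n_files, row);
	if (file <= c->file)
		return false;
	c->file = file;
	return true;
}
```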
> >>
> >>> 4. We try to restore some of this information using on_close_log
> >>> recovery trigger.
> >>
> >>No, it doesn't "restore" the information. It passes the right event
> >>about the consumer - the file event - to the GC.
> >>
> >>> 5. We send recovered boundaries to TX and the tx thread reconstructs
> >>> the relay order, losing the real relay vclocks (as they are mapped
> >>> to the local xlog history)
> >>
> >>I don't quite get what you mean here? Could you elaborate?
> >>I think there is no "reconstruction". There are two types of
> >>events. The first type, the events updating replicaset_vclock, is
> >>needed for replication monitoring, and these events happen often.
> >>The action upon this event is very cheap - you simply
> >>vclock_advance(replicaset_vclock).
> >>
> >>The second type of event is when relay or backup or engine stops
> >>using an xlog file. It is also represented by a vclock but it is
> >>not as cheap to process as the first kind, because gc_advance() is
> >>not cheap, it's an rbtree search.
> >>
> >>You keep trying to merge the two streams into a single stream, I
> >>keep asking to keep the two streams separate. There are of course
> >>the standard pluses and minuses of using a centralized "event bus"
> >>for all these events - with a single bus, as you suggest, things
> >>become simpler for the producer, but the consumers have to do more
> >>work to filter out the unnecessary events.
> >>
> >>> 6. TX sends the oldest vclock back to wal
> >>
> >>
> >>
> >>> 7. There are some issues with making a consumer inactive. For
> >>> instance, a deactivated consumer could survive if a deleted
> >>> xlog was already sent by an applier but not reported yet
> >>> (I do not even know how it could be fixed in
> >>> the current design).
> >>
> >>I don't want to argue whether it's weird or not, it's subjective.
> >>I agree GC state is distributed now, and it's better if it is
> >>centralized.
> >>
> >>This could be achieved by either moving WAL xdir state to tx,
> >>and making sure tx is controlling it, or by moving entire GC
> >>to WAL. Moving GC state to WAL seems a cleaner approach, but I'm
> >>fine either way.
> >>
> >>> Maybe it is working, but I'm afraid this was done without any thought about
> >>> future development (I mean the synchronous replication). Let me explain
> >>> why.
> >>> 1. WAL should receive all relay states as soon as possible.
> >>
> >>Agree, but it is a different stream of events - it's sync
> >>replication events. File events are routed to GC subsystem, sync
> >>replication events are routed to RAFT subsystem in WAL.
> >>
> >>> 2. The set of relay vclocks is enough to perform garbage
> >>> collection (as we could form a vclock which is the lower bound
> >>> of the set)
> >>
> >>This is thanks to the fact that each file is unequivocally defined
> >>by its vclock boundaries, which is accidental.
> >>
> >>> So I wish the garbage collection would be implemented using direct relay to
> >>> wal reporting. In these circumstances I needed to implement a structure (I
> >>> named it matrix clock - mclock) which is able to contain relay vclocks and
> >>> evaluate a vclock which is the lower bound of n members of the mclock.
> >>> The mclock could be used to get the n-majority vclock as well as the lowest
> >>> boundary of all vclocks alive.
> >>> The mclock is already implemented, as well as the new gc design (wal knows about
> >>> all relay vclocks and the first vclock locked by TX - checkpoint or join read
> >>> view).
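
A rough sketch of the kind of query a matrix clock could answer, not the implementation from the branch mentioned above. It assumes a fixed number of vclock components and stores one row per relay. For each component it takes the k-th largest value among the rows, so at least k members have reached the resulting vclock. With k equal to the number of rows this is the component-wise minimum, i.e. the lower bound of all vclocks. The names relay_clock, mclock_lower_bound, MCLOCK_COMPONENTS and MCLOCK_MAX_MEMBERS are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { MCLOCK_COMPONENTS = 4, MCLOCK_MAX_MEMBERS = 32 };

/* One row of the matrix: the vclock reported by one relay. */
struct relay_clock {
	int64_t lsn[MCLOCK_COMPONENTS];
};

/* qsort comparator ordering int64_t values from largest to smallest. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	int64_t x = *(const int64_t *)a;
	int64_t y = *(const int64_t *)b;
	return (x < y) - (x > y);
}

/*
 * Fill `out` with the vclock that at least `k` of the `n_rows`
 * members have reached: for every component, the k-th largest lsn.
 */
void
mclock_lower_bound(const struct relay_clock *rows, int n_rows, int k,
		   int64_t out[MCLOCK_COMPONENTS])
{
	assert(n_rows >= 1 && n_rows <= MCLOCK_MAX_MEMBERS);
	assert(k >= 1 && k <= n_rows);
	int64_t column[MCLOCK_MAX_MEMBERS];
	for (int c = 0; c < MCLOCK_COMPONENTS; c++) {
		for (int r = 0; r < n_rows; r++)
			column[r] = rows[r].lsn[c];
		qsort(column, (size_t)n_rows, sizeof(column[0]), cmp_lsn_desc);
		out[c] = column[k - 1];
	}
}
```

Sorting every column on each query is the simplest thing that works for a sketch. A real structure would more likely keep each column ordered incrementally as relays report progress.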
> >>
> >>The idea of vclock matrix is totally fine & dandy for bsync. Using it for
> >>GC seems like a huge overkill.
> >>
> >>As to the fact that you have more patches on branches, I think
> >>it's better to finish in-memory-replication first - it's a huge
> >>performance boost for replicated set-ups, and reduces the latency,
> >>too.
> >>
> >>--
> >>Konstantin Osipov, Moscow, Russia
> >>https://scylladb.com 
> > 
> > 
> >--
> >Maria Khaydich
> >