From: Konstantin Osipov <kostja.osipov@gmail.com> To: Georgy Kirichenko <georgy@tarantool.org> Cc: tarantool-patches@dev.tarantool.org Subject: Re: [Tarantool-patches] [PATCH] Trigger on vclock change Date: Mon, 18 Nov 2019 12:31:18 +0300 [thread overview] Message-ID: <20191118093118.GF2498@atlas> (raw) In-Reply-To: <13345393.mysJK4fUO5@home.lan> * Georgy Kirichenko <georgy@tarantool.org> [19/11/16 23:37]: > > What is wrong with GC and how exactly do you want to "fix" it? > We have discussed some points with you verbally (about 3-4 month > ago). The main point is: the way of information processing is > weird: > 1. WAL has the full information about the wal directory (xlogs > and their boundaries) This is not strictly necessary. It saves us one xdir_scan() in xdir_collect_garbage(), this is perhaps the main historical reason it's there. We even have to make an effort to maintain this state in WAL: - we call xdir_scan() in wal_enable() - we call xdir_add_vclock() whenever we open/close the next xlog. The second reason it was done in WAL was to not block tx thread, but later we had latency spikes in WAL thread as well, so we added XDIR_GC_ASYNC to fix these, and this second reason is a non-reason any more. Finally, the third reason WAL does it is wal_fallocate() function, which removes files if we're out of space. Instead of going back to GC subsystem and asking it to remove a file, the implementation went the short route and removes the file directly in WAL susbystem and notifies GC as a matter of fact. As you can see, all these reasons are accidental. Technically any subsystem (WAL, GC) can remove these files if we add xdir_scan() to xdir_collect_garbage(). GC subsystem is responsible for all the old files, so it should be dealing with them. The fix is to add xdir_scan() to xdir_collect_garbage(), and change wal_fallocate() to send a message to GC asking it to remove some data, rather than kick the chair out of GC butt by calling xdir_collect_garbage(XDIR_GC_REMOVE_ONE). One issue with fixing it this way, is what would you do in wal_fallocate() after you send the message? You will have to have wal_fallocate_soft(), which sends the message asynchronously, to not stall WAL, and wal_fallocate_hard(), which would stall WAL until there is response from TX about extra space. A lot more work. Even though WAL contains some of GC state, it's neither an owner of it nor a consumer: it is only a producer of GC state, and it updates GC state by sending notifications about the files that it creates and closes. The consumers are engines, checkpoints, backups, relays. BTW, I don't think in-memory replication is a new consumer of GC state - it doesn't act like a standard consumer: * a usual consumer may need multiple xlog files, because it can be at a position way behind the current xlog; in-memory replication is almost always pointing to the current xlog, there may be rare cases when it depends on the previous xlogs when xlog size is small or there was a recent rotation. * in case of standard consumers, each consumer is at its own position, while for in-memory replication, all relays are more or less on the same position - at least it doesn't make any logical sense to advance each relay's position independently I remember having suggested that, and I don't remember why using a single consumer for all in-memory relays did not work out for you. The idea is that whenever a relay switches to the memory mode it unsubscribes from GC, and whenever it is back to file mode, it is subscribes to GC again. In order to avoid any races, in-memory-WAL as a consumer keeps a reference to a few WALs. The alternative is to move GC subsystem entirely to WAL. This could perhaps also work and even be cleaner than centralizing GC in TX. Either way I don't see it as a blocker for in-memory WAL - I think in-memory WAL can work with GC being either in WAL or in TX, it's just the messages that threads exchange become a bit more complicated. > 2. WAL process the wal directory cleanup As I wrote above, there are two reasons for this, both historical: - we wanted to avoid TX stalls - we have wal_fallocate(), a feature which was implemented "lazily" so it just removes the files under GCs feet and notifies GC after the fact. GC, logically, controls the WAL dir, and WAL is only a producer of WAL files. > 3. We filter out all this information while relaying (as a relay > has only a stream of rows) GC subscription is not interested in the stream of rows. It is interested in a stream files. A file is represented in GC as a vclock, and a row is identified by a vclock, but it doesn't mean it's the same thing. This is why I really dislike your idea of calling gc_advance on row events. > 4. We try to restore some of this information using on_close_log > recovery trigger. No, it's not "restore" the information. It's pass the right event about the consumer - the file event - to the GC. > 5. We send recovered boundaries to TX and tx tread reconstruct > the relay order loosing really relay vclocks (as they mapped > to the local xlog history) I don't quite get what you mean here? Could you elaborate? I think there is no "reconstruction". There are two types of events: the events updating replicaset_vclock, are needed for replication monitoring, and they happen often. The action upon this event is very cheap - you simply vclock_advance(replicaset_vclock). The second type of event is when relay or backup or engine stops using an xlog file. It is also represented by a vclock but it is not as cheap to process as the first kind, because gc_advance() is not cheap, it's rbtree search. You keep trying to merge the two streams into a single stream, I keep asking to keep the two streams separate. There is of course the standard pluses and minuses of using a centralized "event bus" for all these events - with a single bus, as you suggest, things become simpler for the producer, but the consumers have to do more work to filter out the unnecessary events. > 6. TX sends the oldest vclock back to wal > 7. There is some issues with making a consumer inactive. For > instance if we deactivated a consumer could survive, for > instance if deleted xlog was already send by an applier but > not reported yet (I do not even know how it could be fixed in > the current design). I don't want to argue whether it's weird or not, it's subjective. I agree GC state is distributed now, and it's better if it is centralized. This could be achieved by either moving WAL xdir state to tx, and making sure tx is controlling it, or by moving entire GC to WAL. Moving GC state to WAL seems a cleaner approach, but I'm fine either way. > Maybe it is working, but I afraid, this was done without any thinking about > the future development (I mean the synchronous replication). Let me explain > why. > 1. WAL should receive all relay states as soon as possible. Agree, but it is a different stream of events - it's sync replication events. File events are routed to GC subsystem, sync replication events are routed to RAFT subsystem in WAL. > 2. The set of relay vclocks is enough to perform garbage > collection (as we could form a vclock with is the lower bound > of the set) This is thanks to the fact that each file is unequivocally defined by its vclock boundaries, which is accidental. > So I wish the garbage collection would be implemented using direct relay to > wal reporting. In this circumstances I was in need to implement a structure (I > named it as matrix clock - mclcok) which able to contain relay vclocks and > evaluate a vclock which is lower bound of n-members of the mclcock. > The mclock could be used to get the n-majority vclock as wel as the lowest > boundary of all vclock alive. > The mclock is already implemented as well as new gc design (wal knows about > all relay vclock and the first vclock locked by TX - checkpoint or join read > view). The idea of vclock matrix is totally fine & dandy for bsync. Using it for GC seems like a huge overkill. As to the fact that you have more patches on branches, I think it's better to finish in-memory-replication first - it's a huge performance boost for replicated set-ups, and reduces the latency, too. -- Konstantin Osipov, Moscow, Russia https://scylladb.com
next prev parent reply other threads:[~2019-11-18 9:31 UTC|newest] Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-11-14 12:57 Maria 2019-11-14 13:44 ` Konstantin Osipov 2019-11-14 14:06 ` Georgy Kirichenko 2019-11-14 15:26 ` Konstantin Osipov 2019-11-14 17:13 ` Georgy Kirichenko 2019-11-14 17:33 ` Konstantin Osipov 2019-11-14 19:16 ` Georgy Kirichenko 2019-11-14 19:48 ` Konstantin Osipov 2019-11-14 20:01 ` Georgy Kirichenko 2019-11-15 1:57 ` Konstantin Osipov 2019-11-15 6:02 ` Georgy Kirichenko 2019-11-15 13:57 ` Konstantin Osipov 2019-11-15 19:57 ` Georgy Kirichenko 2019-11-16 10:37 ` Konstantin Osipov 2019-11-16 20:43 ` Georgy Kirichenko 2019-11-16 11:56 ` Konstantin Osipov 2019-11-16 20:34 ` Georgy Kirichenko 2019-11-18 9:31 ` Konstantin Osipov [this message] 2020-06-02 12:22 ` Maria Khaydich 2020-06-03 10:12 ` Sergey Ostanevich 2020-06-03 12:08 ` Alexander Turenko
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20191118093118.GF2498@atlas \ --to=kostja.osipov@gmail.com \ --cc=georgy@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH] Trigger on vclock change' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox