From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
From: Vladimir Davydov
Subject: [PATCH] rfc: vylog ups and downs
Date: Tue, 6 Aug 2019 11:04:34 +0300
Message-Id: <27436b40260197a4e3b4c27f0c061a24ea78a2ab.1565078606.git.vdavydov.dev@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
To: tarantool-patches@freelists.org
List-ID: 

As requested by Kostja, commit an RFC document with a brief history of the
vinyl metadata log infrastructure, the issues it was intended to solve, the
problems we are facing now, and possible ways to solve them.
---
https://github.com/tarantool/tarantool/tree/dv/vylog-rfc

 doc/rfc/vylog-ups-and-downs.md | 279 +++++++++++++++++++++++++++++++++
 1 file changed, 279 insertions(+)
 create mode 100644 doc/rfc/vylog-ups-and-downs.md

diff --git a/doc/rfc/vylog-ups-and-downs.md b/doc/rfc/vylog-ups-and-downs.md
new file mode 100644
index 00000000..c8b680d5
--- /dev/null
+++ b/doc/rfc/vylog-ups-and-downs.md
@@ -0,0 +1,279 @@

# VyLog: ups and downs

* **Status**: In progress
* **Start date**: 08-06-2019
* **Authors**: Vladimir Davydov @locker vdavydov.dev@gmail.com
* **Issues**: N/A

# Summary

VyLog, also known as the Vinyl metadata log, is the infrastructure needed to
match Vinyl data files (\*.run) to Tarantool indexes so that they can be
safely recovered after restart. This document says a few words about how we
came to the conclusion that we need such an entity, briefly describes the
VyLog implementation, mentions some problems we are facing now because of it,
and proposes a few possible ways to deal with them.

# Origins

Vinyl data files are stored under the directory specified by the
box.cfg.vinyl\_dir configuration option. Within this directory there are
sub-directories, one for each index, named space\_id/index\_id (e.g. 512/0),
that store \*.run and \*.index files: \*.run files contain rows sorted by the
index key, while \*.index files contain the information necessary to access
the rows quickly (page index, bloom filters). Both kinds of files have the
same format as xlog and snap files. The file names have the format %020d.run
and %020d.index, where the number (%d) grows incrementally and is the same
for related \*.run and \*.index files.
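
Purely for illustration, the naming scheme above can be sketched as follows.
The helper name, the vinyl\_dir value, and the file number are made up for the
example; this is a minimal sketch, not code from the Tarantool sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Compose "<vinyl_dir>/<space_id>/<index_id>/<20-digit file number>.<ext>". */
static void
vy_file_path(char *buf, size_t size, const char *vinyl_dir, uint32_t space_id,
	     uint32_t index_id, int64_t file_no, const char *ext)
{
	snprintf(buf, size, "%s/%u/%u/%020lld.%s", vinyl_dir,
		 (unsigned)space_id, (unsigned)index_id,
		 (long long)file_no, ext);
}

int
main(void)
{
	char path[1024];
	/* Hypothetical vinyl_dir, space id 512, index id 0, file number 12. */
	vy_file_path(path, sizeof(path), "/var/lib/tarantool/vinyl",
		     512, 0, 12, "run");
	/* Prints: /var/lib/tarantool/vinyl/512/0/00000000000000000012.run */
	puts(path);
	return 0;
}
```
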
Initially, we recovered data files simply by scanning the directory and adding
all the files found to the index, just like RocksDB and Cassandra do. However,
over time the complexity of such a solution grew to insurmountable levels,
because, in contrast to the aforementioned database engines, Vinyl had a much
more complex structure from the very beginning, as it splits files into key
ranges. With the appearance of the shared memory level and the concept of
slices, it became practically impossible to recover an index by looking solely
at its file directory. There were other problems, too. For instance, how do we
detect a situation where we failed to remove a data file due to a system
error? The file is not needed, but it is still in the directory and hence will
be read by the recovery procedure.

To address the recovery issue, it was initially suggested to store Vinyl
metadata in a system space, but an attempt to implement that failed for a
number of reasons described further in the text. Eventually, we came to an
agreement that the best we could do at the time was to store Vinyl metadata in
its own \*.vylog file.

# Internals

## File format and basic idea

\*.vylog files have the same format as \*.xlog files. Similarly to \*.xlog
files, they store operations performed on Vinyl indexes and files, such as
"create index", "drop index", "create key range", "prepare run file", "commit
run file". Each operation has parameters that identify the objects it deals
with. For instance, for an operation that adds a key range to an LSM tree, we
need to specify which key range and which LSM tree are involved. Objects are
identified by unique integer identifiers.

On recovery we first read the last \*.vylog file and build a vy\_recovery
structure, which stores information about all LSM trees, their key ranges, and
run files. The structure is used during recovery to build vy\_lsm objects:
whenever we encounter an index creation record in a \*.snap or \*.xlog file,
we look up the index in the vy\_recovery structure to restore the
corresponding vy\_lsm object.

## Relationship with checkpoints

When a checkpoint is created, the current vylog file is closed, its content is
read and "compressed" (operations that annihilate each other, such as "create
run" and "drop run", are discarded), and the result is written to a new vylog
file, which is then used for writing until the next checkpoint is created. So
\*.vylog files are named after checkpoint (\*.snap) files. In contrast to
\*.xlog files, we don't have separate \*.vysnap files: both checkpoint data
and records written after the checkpoint are stored in the same \*.vylog file
for simplicity. A special "snapshot" record is written to the \*.vylog right
after a checkpoint is created, so that we can replay a \*.vylog file up to a
checkpoint if necessary.

The close association with checkpoints stems from the necessity to send a
consistent view of the last checkpoint to a replica on initial join. It is
also used for backups: we only allow backing up files corresponding to a
particular checkpoint.
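
To make the rotation step above more concrete, here is a rough, purely
illustrative model of the "compression": a "create run" record whose run is
also dropped in the same log annihilates with the matching "drop run" record,
so neither survives rotation. The record layout and names below are
hypothetical and much simpler than the real vylog records.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of a vylog record: an operation plus an object id. */
enum vylog_op {
	VYLOG_CREATE_RUN,
	VYLOG_DROP_RUN,
};

struct vylog_record {
	enum vylog_op op;
	int64_t run_id;
};

/*
 * Decide whether record i survives rotation: a run that was both created and
 * dropped since the last checkpoint leaves no trace in the new file.
 */
static bool
vylog_keep_on_rotation(const struct vylog_record *recs, size_t count, size_t i)
{
	for (size_t j = 0; j < count; j++) {
		if (j == i || recs[j].run_id != recs[i].run_id)
			continue;
		if (recs[i].op == VYLOG_CREATE_RUN &&
		    recs[j].op == VYLOG_DROP_RUN)
			return false;
		if (recs[i].op == VYLOG_DROP_RUN &&
		    recs[j].op == VYLOG_CREATE_RUN)
			return false;
	}
	return true;
}

int
main(void)
{
	struct vylog_record log[] = {
		{ VYLOG_CREATE_RUN, 1 },
		{ VYLOG_CREATE_RUN, 2 },
		{ VYLOG_DROP_RUN, 1 }, /* annihilates with the first record */
	};
	size_t count = sizeof(log) / sizeof(log[0]);
	for (size_t i = 0; i < count; i++) {
		if (vylog_keep_on_rotation(log, count, i))
			printf("keep record for run %lld\n",
			       (long long)log[i].run_id);
	}
	return 0;
}
```
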
## Association with indexes

As mentioned earlier, when we encounter an xlog record creating an index on
recovery, we look up the corresponding LSM tree object in the vy\_recovery
structure loaded from the most recent \*.vylog file. Initially, we used the
space\_id/index\_id pair to identify indexes, but the problem is that an index
can be dropped and then recreated with the same space\_id/index\_id. That is,
we needed a way to differentiate incarnations of the same index object. So we
started assigning a unique identifier to each index object and storing it
persistently in the index options (in the \_index system space). It was
decided not to introduce a new counter, but to use the existing
box.info.signature for that. Everything worked out fine until we realized
there was an inherent race condition in that decision.

The problem is that two fibers may create Vinyl indexes concurrently. Since
the index creation operation yields, it may happen that both of them use the
same value of box.info.signature for their indexes, thus breaking recovery.
One way to fix that was to ditch box.info.signature and start using a
dedicated counter which would be incremented before any index creation
operation (like an auto-increment field). However, the decision to stick to
xlog signatures prevailed, so instead we decided to use the signature
associated with the xlog operation by the WAL. Since such signatures are
guaranteed to be unique (no two xlog rows are allowed to have the same
signature), no race is possible.

## DDL operations

Since VyLog stores information about all Vinyl indexes out there, we have to
log index creation and drop events in it, too. The difficulty here is that
those events are generated from commit triggers, after the corresponding
operation has been written to the WAL, when we are not allowed to fail or
yield. So we submit those events to the VyLog buffer without waiting for them
to complete (fire-and-forget). If a write fails, VyLog will retry flushing
those records along with the next write. If it fails to flush them to disk
before restart, we have to detect this on recovery and restore those records
on the fly, which greatly complicates the recovery procedure; see all the
juggling with the lsn in vy\_lsm\_recover.

# What went wrong

All VyLog issues can be divided into two sub-categories:

 1. As explained above, \*.vylog files are rotated along with checkpoints.
    Besides, we use vclock signatures to uniquely identify LSM trees in VyLog.
    This close relationship with checkpoints and xlogs forces us to add hacks
    to the engine-independent code from time to time. The last hack was added
    along with transactional DDL, when we had to patch alter.cc to pass
    correct signatures to Vinyl.

 2. Another issue with VyLog is its complexity, which makes its code difficult
    to understand, let alone modify. The two most notable examples are:
    "fire-and-forget" requests issued from commit triggers to log index
    creation and drop operations; matching different incarnations of the same
    index to vy\_lsm objects on recovery using vclock signatures.

Arguably, the first sub-category lists more serious issues than the second
one.

# What we can do about it

This section presents a few ways of alleviating the issues outlined in the
previous section.

## Incorporate VyLog into xlog

The idea is to store information about Vinyl data files in memtx, similarly to
how we store meta information in \_space, \_index, and other spaces. The
biggest challenge here is to preserve backward compatibility, as the old and
the new approaches would have to co-exist for some time, at least as far as
the recovery procedure is concerned. Putting that aside, we can write down the
pros and cons of this approach.

Pros:

 - The idea of storing all the meta information in system spaces instead of
   "re-inventing the wheel" in the shape of VyLog does indeed look tempting
   and elegant.

 - This approach would ease introspection, as one could simply look through
   these spaces to figure out what's going on with Vinyl internals.

 - It would also allow us to simplify Vinyl internals (i.e. deal with the
   second sub-category of issues presented in the previous section). For
   example, vylog-vs-xlog synchronization would come naturally, and we
   wouldn't need to juggle vclock signatures to achieve it.

 - Obviously, there wouldn't be any \*.vylog files should we take this path.
   The fewer kinds of files we have, the better.

Cons:

 - We would have to rethink the concept of \*.snap files. Currently, they are
   equivalent to memtx checkpoints, i.e. they store a consistent view of the
   database at the moment the checkpoint was taken. If Vinyl metadata were
   stored in memtx, we would have to log the creation of run files that store
   checkpoint data after we start writing a \*.snap file. This means that to
   get to a checkpoint, we would have to not only read the corresponding
   \*.snap file, but also apply a chunk of the next \*.xlog file. This was a
   show-stopper when we considered this approach in the first place.
 - We would probably have to scan the xlog file twice on recovery, because we
   need to recover Vinyl metadata before we start replaying DML records on
   Vinyl spaces. Rationale: we want to skip records that have already been
   dumped to disk. If we don't scan the xlog file twice, we can either apply
   all records, which would slow down the recovery procedure, or check the
   filesystem to figure out whether a run we are about to create on recovery
   was dumped to disk before restart. Checking the filesystem defeats the
   whole idea of logging, so we wind up either with slow recovery or with
   scanning xlogs twice. Both ways are ugly.

 - It's unclear what to do with rebootstrap. It's a procedure that allows a
   replica to initiate bootstrap from a remote master if it detects that the
   master has removed the xlogs it is subscribed to. During rebootstrap we
   don't write xlogs, but we may write Vinyl files and hence we may log
   something to VyLog. We discard the data from VyLog if rebootstrap fails.
   If we switched completely to xlog, we'd have to augment the xlog with the
   ability to discard certain records. Besides, we'd probably have to skip
   all DML records during rebootstrap and only write DDL and Vinyl-related
   data to xlogs. Otherwise, we'd somewhat slow down the rebootstrap
   procedure.

Overall, I don't like this approach. Even if we close our eyes to the
complexity of its implementation and the necessity to support the old approach
for an unknown period of time, there's a much more serious problem inherent to
it. While this approach does simplify Vinyl metadata logging in general, it
ties Vinyl metadata to generic concepts, such as checkpointing and signatures,
even more tightly than the current VyLog implementation does. This means that
should we change anything in either of the sub-systems, we'd have to think
about how it would reflect upon the other one.

## Relax VyLog dependency on checkpoints and xlogs

As the worst problems come from the fact that VyLog is too tightly connected
with checkpoints and xlogs, we could try to get rid of this dependency
altogether.

The first thing we need to do in this direction is to get rid of the vclock
signature propagation to VyLog, where it is then used to differentiate
incarnations of the same index. As mentioned above, we could instead store a
unique Vinyl id right in the index options (in the \_index space). The id
would be generated by applying an auto-increment operation to a system space
(\_schema, similarly to space\_id) and hence would be known before an index is
even created. As a result, we wouldn't have to pass the signature of the index
creation operation to VyLog on commit.

The only problem here is transactional DDL: an index build operation must
always go first in a transaction, as it may yield, while in order to build a
new index we would first need to get a unique id from another system space
(\_schema). It isn't quite clear how to solve this problem. Maybe we could
reuse the approach already used by sequences, i.e. skip generation of a row
for such auto-increment operations, or perhaps we could simply generate UUIDs
to uniquely identify indexes in engines.
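
Below is a minimal sketch of the id allocation idea described above, assuming
a persistent counter kept in the \_schema system space. All names are
hypothetical; in reality the increment would be a transactional update of a
\_schema tuple, and the resulting id would be saved in the index options in
the \_index space.

```c
#include <stdint.h>
#include <stdio.h>

/* Models the persistent counter tuple that would live in _schema. */
static int64_t vinyl_index_id_counter;

/*
 * Allocate a unique id for a new Vinyl index before it is created, so that
 * different incarnations of the same space_id/index_id pair can be told
 * apart without relying on WAL vclock signatures.
 */
static int64_t
vinyl_index_id_next(void)
{
	return ++vinyl_index_id_counter;
}

int
main(void)
{
	printf("first incarnation id: %lld\n",
	       (long long)vinyl_index_id_next());
	printf("second incarnation id: %lld\n",
	       (long long)vinyl_index_id_next());
	return 0;
}
```
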
The second step on the way to making VyLog an independent entity is untying
its log rotation from checkpointing. To achieve that, we have to reimplement
initial join and backups in such a way that they don't require a consistent
view of the last checkpoint (currently both operations can only deal with
checkpoints).

In the case of initial join, we could open an iterator over a space instead of
using the last snapshot. In fact, that's how it was initially implemented.
True, it would result in garbage accumulating on the master while initial join
is in progress. On the other hand, garbage accumulates anyway, because we have
to pin the \*.run files corresponding to the last checkpoint.

In the case of backup, we would have to make box.backup() work on the current
database state, i.e. consistently copy the current Vinyl state, the last
\*.snap file, and all the following \*.xlog files. This seems doable and
reasonable, as the limitation that one may only create a backup of a
checkpoint seems artificial. After all, the option to store a few last
checkpoints (\*.snap files) was dictated by the absence of backups. I am
inclined to think that we should only keep the last \*.snap file and the
actual state of the Vinyl database (i.e. no \*.run files should be pinned by
previous checkpoints).

## Make VyLog per index

On our way to untying VyLog from checkpointing and xlogs we could go even
further and maintain a separate VyLog in each index directory. This would
simplify the VyLog implementation, as we wouldn't have to deal with DDL
operations: upon creation of an index we would simply create a new directory
without logging anything. On recovery we would open the VyLog file stored in
the directory and load the index. Just like with the previous approach, we
would store the link to the index directory and/or VyLog file right in the
index options (the \_index system space).

This way VyLog files would only store information about \*.run files and their
relationship with key ranges. We wouldn't have to log anything from commit
triggers and hence wouldn't have to implement "fire-and-forget" style
operations. This would also make the whole system more robust, because it
would eliminate a single point of failure: if the VyLog file of one index got
corrupted, we could still recover the other indexes.

The downside of this approach is that we would have to scan the Vinyl
directory to find and delete orphan indexes. An index can become an orphan if
the instance crashes while performing an index creation operation. In this
case we need to delete the files written by the incomplete operation on
recovery. Currently, we find such orphan indexes by scanning the global VyLog
file. If there were no global VyLog file, the only option would be to scan the
directory where indexes are stored.

-- 
2.20.1