Date: Wed, 3 Jul 2019 22:21:53 +0300
From: Konstantin Osipov
Subject: Re: [PATCH 3/6] Don't take schema lock for checkpointing
Message-ID: <20190703192153.GF17318@atlas>
References: <2b12ec00f416bbb2ab214c5a0ddfc373375ed038.1561922496.git.vdavydov.dev@gmail.com>
In-Reply-To: <2b12ec00f416bbb2ab214c5a0ddfc373375ed038.1561922496.git.vdavydov.dev@gmail.com>
To: Vladimir Davydov
Cc: tarantool-patches@freelists.org

* Vladimir Davydov [19/07/01 10:04]:
> Memtx checkpointing proceeds as follows: first we open iterators over
> primary indexes of all spaces and save them to a list, then we start
> a thread that uses the iterators to dump space contents to a snap file.
> To avoid accessing a freed tuple, we put the small allocator into
> delayed free mode. However, this doesn't prevent an index from being
> dropped, so we also take the schema lock to lock out any DDL operation
> that could potentially destroy a space or an index. Note that vinyl
> doesn't need this lock, because it implements index reference counting
> under the hood.
>
> Actually, we don't really need to take a lock - instead we can simply
> postpone index destruction until checkpointing is complete, similarly
> to how we postpone destruction of individual tuples. We even have all
> the infrastructure for this - delayed garbage collection. So this
> patch tweaks it a bit to delay the actual index destruction until
> after checkpointing is complete.
>
> This is a step towards removal of the schema lock, which stands in
> the way of transactional DDL.

It looks like you're doing it this way because I once said I hate
reference counting. First, I don't mind having reference counting for
memtx index objects now that we've approached the transactional DDL
frontier. But even reference counting would be a bit cumbersome for
this.
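To make the mechanism the patch builds on concrete, here is a rough
sketch of delayed destruction during a checkpoint. All names here
(gc_item, gc_free, and so on) are hypothetical illustrations, not the
actual small/memtx API:

```c
/* A minimal sketch of delayed destruction during checkpointing.
 * All names are hypothetical; this is not the Tarantool code. */
#include <stdbool.h>
#include <stdlib.h>

struct gc_item {
	struct gc_item *next;
	void (*destroy)(struct gc_item *);
};

/* Objects freed while a checkpoint is in progress. */
static struct gc_item *gc_delayed = NULL;
static bool gc_checkpoint_in_progress = false;

/* Called instead of destroying an object outright. */
static void
gc_free(struct gc_item *item)
{
	if (gc_checkpoint_in_progress) {
		/* The snapshot thread may still read it: defer. */
		item->next = gc_delayed;
		gc_delayed = item;
	} else {
		item->destroy(item);
	}
}

/* Called once the snapshot thread has finished dumping. */
static void
gc_checkpoint_done(void)
{
	gc_checkpoint_in_progress = false;
	while (gc_delayed != NULL) {
		struct gc_item *item = gc_delayed;
		gc_delayed = item->next;
		item->destroy(item);
	}
}
```

The point of the patch is that an index, like a tuple, can simply pass
through such a list instead of being protected by the schema lock.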
Please take a look at how bps does it: it links all pages into a
fifo-like list, and a checkpoint simply sets a savepoint on that list.
Committing a checkpoint releases the savepoint and garbage collects
all objects that have been freed after the savepoint. SQL will need
multiple concurrent snapshots, so it will need multiple versions.

So please consider turning the algorithm you've just used into a
general-purpose one, so that any database object could add itself for
delayed destruction. The delayed destruction would take place
immediately if there are no savepoints, and otherwise as soon as the
savepoint covering the object - the one set before the object was
freed, up until the next savepoint - is released. Then we can move all
subsystems to this.

If you have a better general-purpose object garbage collection idea,
please share/implement it too.

-- 
Konstantin Osipov, Moscow, Russia
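P.S. To spell out the general-purpose scheme proposed above, a sketch
of a savepoint-based delayed-destruction list might look like the
following. Again, every name here is made up for illustration; this is
not the bps_tree code:

```c
/* Sketch of a general-purpose delayed-destruction list: freed objects
 * go onto a FIFO; readers pin positions with savepoints; destruction
 * only advances past positions no savepoint still pins.
 * All names are illustrative, not an existing API. */
#include <stdint.h>
#include <stdlib.h>

struct gc_obj {
	struct gc_obj *next;
	uint64_t seqno;              /* position at which it was freed */
	void (*destroy)(struct gc_obj *);
};

struct gc_savepoint {
	struct gc_savepoint *next;
	uint64_t seqno;              /* list position pinned by a reader */
};

struct gc_list {
	struct gc_obj *head, **tail; /* FIFO of freed-but-alive objects */
	struct gc_savepoint *svp;    /* active savepoints, oldest first */
	uint64_t seqno;              /* monotonically growing counter */
};

static void
gc_list_create(struct gc_list *gc)
{
	gc->head = NULL;
	gc->tail = &gc->head;
	gc->svp = NULL;
	gc->seqno = 0;
}

/* Destroy everything freed before the oldest remaining savepoint,
 * or everything, if no savepoints remain. */
static void
gc_list_collect(struct gc_list *gc)
{
	uint64_t barrier = gc->svp != NULL ? gc->svp->seqno : gc->seqno + 1;
	while (gc->head != NULL && gc->head->seqno < barrier) {
		struct gc_obj *obj = gc->head;
		gc->head = obj->next;
		obj->destroy(obj);
	}
	if (gc->head == NULL)
		gc->tail = &gc->head;
}

/* Add an object for delayed destruction. */
static void
gc_list_free(struct gc_list *gc, struct gc_obj *obj)
{
	obj->seqno = gc->seqno++;
	if (gc->svp == NULL) {
		obj->destroy(obj);   /* no readers: destroy at once */
		return;
	}
	obj->next = NULL;
	*gc->tail = obj;
	gc->tail = &obj->next;
}

/* A reader (e.g. a checkpoint) pins the current list position. */
static void
gc_savepoint(struct gc_list *gc, struct gc_savepoint *svp)
{
	svp->seqno = gc->seqno;
	struct gc_savepoint **pos = &gc->svp;
	while (*pos != NULL)         /* append: keep oldest-first order */
		pos = &(*pos)->next;
	svp->next = NULL;
	*pos = svp;
}

static void
gc_release(struct gc_list *gc, struct gc_savepoint *svp)
{
	struct gc_savepoint **pos = &gc->svp;
	while (*pos != svp)
		pos = &(*pos)->next;
	*pos = svp->next;
	gc_list_collect(gc);
}
```

With something like this, indexes, tuples, and later SQL snapshot
versions could all share one collection path instead of each subsystem
rolling its own.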