From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex)
	with ESMTP id 8C7012E562 for ; Sat, 25 May 2019 02:12:00 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id u51GZgvsXwaz for ; Sat, 25 May 2019 02:12:00 -0400 (EDT)
Received: from smtp63.i.mail.ru (smtp63.i.mail.ru [217.69.128.43])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex)
	with ESMTPS id 3ACAB2E55E for ; Sat, 25 May 2019 02:12:00 -0400 (EDT)
Received: by smtp63.i.mail.ru with esmtpa (envelope-from )
	id 1hUPuP-0001KI-Mk for tarantool-patches@freelists.org;
	Sat, 25 May 2019 09:11:58 +0300
Date: Sat, 25 May 2019 09:11:57 +0300
From: Konstantin Osipov
Subject: [tarantool-patches] Re: [PATCH 1/3] vinyl: fix secondary index divergence on update
Message-ID: <20190525061157.GB14501@atlas>
References: <8e4175c3f3b857097ccfd264608b046b71635e91.1558733443.git.vdavydov.dev@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <8e4175c3f3b857097ccfd264608b046b71635e91.1558733443.git.vdavydov.dev@gmail.com>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-Help:
List-Unsubscribe:
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches
List-Subscribe:
List-Owner:
List-post:
List-Archive:
To: tarantool-patches@freelists.org

* Vladimir Davydov [19/05/25 06:41]:

Vladimir, could you clarify your comments a bit?

> If an UPDATE request doesn't touch key parts of a secondary index, we
> don't need to write it to the index memory level or dump it to disk, as

We don't have a separate memtable for secondary keys.
Better say "we don't need to re-index it in the in-memory secondary index".

> this would only increase IO load. Historically, we use column mask set
> by the UPDATE operation to skip secondary indexes that are not affected
> by the operation on commit. However, there's a problem here: the column
> mask isn't precise - it may have a bit set even if the corresponding
> column doesn't really get updated, e.g. consider {'+', 2, 0}.

The column does get updated, but the update doesn't change its value.
Now I am beginning to make sense of it.

> Not taking
> this into account may result in appearance of phantom tuples on disk as
> the write iterator assumes that statements that have no effect aren't
> written to secondary indexes (this is needed to apply INSERT+DELETE
> "annihilation" optimization). We fixed that by clearing column mask bits
> in vy_tx_set in case we detect that the key isn't changed, for more
> details see #3607 and commit e72867cb9169 ("vinyl: fix appearance of
> phantom tuple in secondary index after update"). It was rather an ugly
> hack, but it worked.
>
> However, it turned out that apart from looking hackish this code has
> a nasty bug that may lead to tuples missing from secondary indexes.
> Consider the following example:
>
>     s = box.schema.space.create('test', {engine = 'vinyl'})
>     s:create_index('pk')
>     s:create_index('sk', {parts = {2, 'unsigned'}})
>     s:insert{1, 1, 1}
>
>     box.begin()
>     s:update(1, {{'=', 2, 2}})
>     s:update(1, {{'=', 3, 2}})
>     box.commit()
>
> The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
> secondary index write set. The second update replaces REPLACE{2,1} with
> DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
> REPLACE{2,1} in the write set, we assume that the update doesn't modify
> secondary index key parts and clear the column mask so as not to commit
> a pointless request, see vy_tx_set. As a result, we skip the first
> update too and get key {2,1} missing from the secondary index.
>
> Actually, it was a dumb idea to use column mask to skip statements in
> the first place, as there's a much easier way to filter out statements
> that have no effect for secondary indexes. The thing is every DELETE
> statement inserted into a secondary index write set acts as a "single
> DELETE", i.e. there's exactly one older statement it is supposed to
> purge. This is, because in contrast to the primary index we don't write
> DELETE statements blindly - we always look up the tuple overwritten in
> the primary index first. This means that REPLACE+DELETE for the same key
> is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
> can be treated as no-op, too, because secondary indexes don't store full
> tuples hence all REPLACE statements for the same key are equivalent.
> By marking such pair of statements as no-op in vy_tx_set, we guarantee
> that no-op statements don't make it to secondary index memory or disk
> levels.

Better say "mark both statements", not a pair, since they are not
present in the tx write list as a pair.

Could you also please explain why you decided to introduce a new flag,
and not use is_overwritten?

-- 
Konstantin Osipov, Moscow, Russia