From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp15.mail.ru (smtp15.mail.ru [94.100.176.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 6E2ED469719 for ; Sun, 16 Feb 2020 18:41:47 +0300 (MSK) References: <701f259d-b9fd-c688-9602-7281e0dc2f0d@tarantool.org> <4571f818-2211-5713-fc4a-b6539ba68253@tarantool.org> <20200215163440.GA19240@atlas> From: Vladislav Shpilevoy Message-ID: <2b8dee18-aaab-265e-6d38-4dd3a5d67653@tarantool.org> Date: Sun, 16 Feb 2020 16:41:45 +0100 MIME-Version: 1.0 In-Reply-To: <20200215163440.GA19240@atlas> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Subject: Re: [Tarantool-patches] [PATCH 0/4] Safe truncation and deletion List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Konstantin Osipov , Ilya Kosarev , tarantool-patches@dev.tarantool.org We discussed the ticket and the patches verbally. On 15/02/2020 17:34, Konstantin Osipov wrote: > * Vladislav Shpilevoy [20/02/15 19:24]: >> Here is what Kostja said, but somewhy without CCing the >> mailing list, and my answers to it inlined: >> >>> This is the first approximation of what I have proposed at best. >>> First of all, the mechanism should not be truncation specific. It should be usable by all subsystems that require emergency memory. >>> >>> Second, there is no reason to patch small. All the patch needs to do is something along these lines: >>> >>> >>> 1) in memtx_init, reserve the emergency slab. >>> >>> 2) in memtx_tuple_alloc, refuse with error if there is no emergency slab >>> >>> 3) in memtx_tuple_free, try to reserve emergency slab (replenish it) if it is empty >> >> All of this adds code and conditions to a hot path. The way with quota >> enabling/disabling works only in a rare case, when delete() fails due >> to OOM, or when truncate is called, which is not often. > > First of all, the way with quota simply doesn't work. You expect > the quota to shrink back, but it never does. So you simply run out > of all existing memory. This is solvable. Only normal slabs are never returned. 'Huge' slabs are. When quota is disabled, we can allocate all new data in 'huge' slabs. Then quota should shrink back when these slabs are freed. > If it worked, the fix would to reduce the amount of available > quota by 1 slab at start, and change quota before truncate, > and then return the quota back in its place. This does not really work as simple as you wanted. If truncate would not free anything, you won't be able to return the quota back. Because it is occupied by a tuple in _truncate. > My main point: under no circumstances tarantool should go beyond > the amount of memory set by quota. If we need to reserve > memory/quota for emergency, it's fine, let's do it at start, > but going beyond is not acceptable. Ok, we also discussed with you that we could allocate one special slab in the beginning, and use it only for special cases like _truncate tuples and :delete() reservations. But this also does not fix the issue completely, only makes it harder to stumble into it. Besides, this adds notable complexity to memtx tuple allocation and freeing. > Second, yes, it does add a branch to the hot path. Same as in > memtx_index_extent_reserve(). This will have no impact on performance profile - > feel free to check. I think, however, given the 10% performance > regression in tuple_format, your time optimizing performance will > be better spent elsewhere. Having one condition on a hot path is not a justification for adding more conditions and complexity. >>> 4) at start of truncate, or wherever we need emergency memory, release emergency slab. Simply return it to arena. >> >> Once you returned it, all the other operations will be able to take >> and fill it. Such as insertions. In the first truncate or delete >> didn't free anything, or freed not enough to fit a new truncate >> tuple here, there is no more a reserved slab for a next delete/truncate. >> So your proposal does not seem to change anything. > > Ehm, I did not fully explain the proposal. It also assumes there is > "emergency mode" flag set at start of truncate and cleared at end, > and if there is no reserve slab and there is an emergence flag > set, nothing can allocate memory. Yeah, the problem is the same as with reserving quota instead of a slab. If truncate/delete didn't free anything, the 'emergency' slab is already taken, and can't be freed. Here is a summary of what we discussed. Lets don't change anything about truncate(). Or make it work using deletes + periodically try to execute a real truncate again. Or drop _truncate system space. Or use 'huge' slabs idea for _truncate tuples. Talking of delete, it looks like the problem is related to a leak/bug somewhere in btree. (Probably this is it: https://github.com/tarantool/tarantool/issues/4780). The thing is that :delete() should never fail in memtx. It is guaranteed by memtx_index_extent_reserve() in memtx_space_replace_all_keys(). Indeed, consider an example. Assume, that we insert tuples into a tree. It is getting filled, and calls memtx_index_extent_reserve() with 16 blocks each time. When there is not enough memory for 16 blocks, it is assumed, that there is enough for 8 blocks. So memtx_index_extent_reserve() for REPLACE stops working, but memtx_index_extent_reserve() for DELETE works, because needs only 8 blocks. Moreover, even if DELETE didn't free anything, the tree is not changed, and these 8 blocks are still available for a next DELETE attempt. This was the initial idea with that reserve() call. A problem may be with a leak, or with the assumption, that not enough for 16 blocks != enough for 8 blocks. And it is possible to reach a situation, when both REPLACE and DELETE can't reserve anything. I think that firstly we need a stable reproducer. Because the tests on the branch seems to fail only at insertion. On deletion they fail artificially via errinj. Perhaps. Because I tried to delete that errinj, and nothing changed. It means, that they don't really depend on why memtx_index_extent_reserve() can fail. And this is bad. Once we have a reproducer, without errinj, we can see at the real reason why does not it work. Maybe it would be enough just to increase RESERVE_EXTENTS_BEFORE_REPLACE. So as after any insertion/update there would be still reserved at least RESERVE_EXTENTS_BEFORE_DELETE. Maybe we will find a leak in bps. A fair reproducer will show.