[Tarantool-patches] [PATCH 0/4] Safe truncation and deletion
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sun Feb 16 18:41:45 MSK 2020
We discussed the ticket and the patches verbally.
On 15/02/2020 17:34, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [20/02/15 19:24]:
>> Here is what Kostja said, but somewhy without CCing the
>> mailing list, and my answers to it inlined:
>>
>>> This is the first approximation of what I have proposed at best.
>>> First of all, the mechanism should not be truncation specific. It should be usable by all subsystems that require emergency memory.
>>>
>>> Second, there is no reason to patch small. All the patch needs to do is something along these lines:
>>>
>>>
>>> 1) in memtx_init, reserve the emergency slab.
>>>
>>> 2) in memtx_tuple_alloc, refuse with error if there is no emergency slab
>>>
>>> 3) in memtx_tuple_free, try to reserve emergency slab (replenish it) if it is empty
>>
>> All of this adds code and conditions to a hot path. The way with quota
>> enabling/disabling works only in a rare case, when delete() fails due
>> to OOM, or when truncate is called, which is not often.
>
> First of all, the way with quota simply doesn't work. You expect
> the quota to shrink back, but it never does. So you simply run out
> of all existing memory.
This is solvable. Only normal slabs are never returned. 'Huge' slabs
are. When quota is disabled, we can allocate all new data in 'huge'
slabs. Then quota should shrink back when these slabs are freed.
> If it worked, the fix would to reduce the amount of available
> quota by 1 slab at start, and change quota before truncate,
> and then return the quota back in its place.
This does not really work as simple as you wanted. If truncate would
not free anything, you won't be able to return the quota back. Because
it is occupied by a tuple in _truncate.
> My main point: under no circumstances tarantool should go beyond
> the amount of memory set by quota. If we need to reserve
> memory/quota for emergency, it's fine, let's do it at start,
> but going beyond is not acceptable.
Ok, we also discussed with you that we could allocate one special
slab in the beginning, and use it only for special cases like
_truncate tuples and :delete() reservations. But this also does
not fix the issue completely, only makes it harder to stumble into
it. Besides, this adds notable complexity to memtx tuple allocation
and freeing.
> Second, yes, it does add a branch to the hot path. Same as in
> memtx_index_extent_reserve(). This will have no impact on performance profile -
> feel free to check. I think, however, given the 10% performance
> regression in tuple_format, your time optimizing performance will
> be better spent elsewhere.
Having one condition on a hot path is not a justification for
adding more conditions and complexity.
>>> 4) at start of truncate, or wherever we need emergency memory, release emergency slab. Simply return it to arena.
>>
>> Once you returned it, all the other operations will be able to take
>> and fill it. Such as insertions. In the first truncate or delete
>> didn't free anything, or freed not enough to fit a new truncate
>> tuple here, there is no more a reserved slab for a next delete/truncate.
>> So your proposal does not seem to change anything.
>
> Ehm, I did not fully explain the proposal. It also assumes there is
> "emergency mode" flag set at start of truncate and cleared at end,
> and if there is no reserve slab and there is an emergence flag
> set, nothing can allocate memory.
Yeah, the problem is the same as with reserving quota instead of a
slab. If truncate/delete didn't free anything, the 'emergency'
slab is already taken, and can't be freed.
Here is a summary of what we discussed.
Lets don't change anything about truncate(). Or make it work using
deletes + periodically try to execute a real truncate again. Or
drop _truncate system space. Or use 'huge' slabs idea for _truncate
tuples.
Talking of delete, it looks like the problem is related to a
leak/bug somewhere in btree. (Probably this is it:
https://github.com/tarantool/tarantool/issues/4780).
The thing is that :delete() should never fail in memtx. It is
guaranteed by memtx_index_extent_reserve() in
memtx_space_replace_all_keys(). Indeed, consider an example. Assume,
that we insert tuples into a tree. It is getting filled, and calls
memtx_index_extent_reserve() with 16 blocks each time. When there is
not enough memory for 16 blocks, it is assumed, that there is enough
for 8 blocks.
So memtx_index_extent_reserve() for REPLACE stops working, but
memtx_index_extent_reserve() for DELETE works, because needs only 8
blocks. Moreover, even if DELETE didn't free anything, the tree is
not changed, and these 8 blocks are still available for a next DELETE
attempt.
This was the initial idea with that reserve() call. A problem may be
with a leak, or with the assumption, that not enough for 16 blocks !=
enough for 8 blocks. And it is possible to reach a situation, when both
REPLACE and DELETE can't reserve anything.
I think that firstly we need a stable reproducer. Because the tests
on the branch seems to fail only at insertion. On deletion they fail
artificially via errinj. Perhaps. Because I tried to delete that
errinj, and nothing changed. It means, that they don't really depend
on why memtx_index_extent_reserve() can fail. And this is bad.
Once we have a reproducer, without errinj, we can see at the real
reason why does not it work. Maybe it would be enough just to increase
RESERVE_EXTENTS_BEFORE_REPLACE. So as after any insertion/update there
would be still reserved at least RESERVE_EXTENTS_BEFORE_DELETE.
Maybe we will find a leak in bps. A fair reproducer will show.
More information about the Tarantool-patches
mailing list