From: Oleg Babin via Tarantool-patches <tarantool-patches@dev.tarantool.org> To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, tarantool-patches@dev.tarantool.org, yaroslav.dynnikov@tarantool.org Subject: Re: [Tarantool-patches] [PATCH 9/9] util: introduce binary heap data structure Date: Wed, 10 Feb 2021 12:01:11 +0300 [thread overview] Message-ID: <805dc520-0146-6fb0-29ec-f621bc8d2a8b@tarantool.org> (raw) In-Reply-To: <259e9595aefe7a28af13eb6dd336ea8f145c2112.1612914070.git.v.shpilevoy@tarantool.org> Thanks for your patch. Shouldn't it be added to storage "MODULE_INTERNALS" ? LGTM. One comment below. On 10/02/2021 02:46, Vladislav Shpilevoy wrote: > Lua does not have a built-in standard library for binary heaps > (also called priority queues). There is an implementation in > Tarantool core in libsalad, but it is in C. > > Heap is a perfect storage for the soon coming feature map-reduce. > In the map-reduce algorithm it will be necessary to be able to > lock an entire storage against any bucket moves for time <= > specified timeout. Number of map-reduce requests can be big, and > they can have different timeouts. > > So there is a pile of timeouts from different requests. It is > necessary to be able to quickly add new ones, be able to delete > random ones, and remove expired ones. > > One way would be a sorted array of the deadlines. Unfortunately, > it is super slow. O(N + log(N)) to add a new element (find place > for log(N) and move all next elements for N), O(N) to delete a > random one (move all next elements one cell left/right). > > Another way would be a sorted tree. But trees like RB or a dumb > binary tree require extra steps to keep them balanced and to have > access to the smallest element ASAP. > > The best way is the binary heap. It is perfectly balanced by > design meaning that all operations there have complexity at most > O(log(N)). It is possible to find the closest deadline for > constant time as it is the heap's top. > > This patch implements it. The heap is intrusive. It means it > stores index of each element right inside of the element as a > field 'index'. Having an index along with each element allows to > delete it from the heap for O(log(N)) without necessity to look > its place up first. > > Part of #147 > --- > test/unit-tap/heap.test.lua | 310 ++++++++++++++++++++++++++++++++++++ > test/unit-tap/suite.ini | 4 + > vshard/heap.lua | 226 ++++++++++++++++++++++++++ > 3 files changed, 540 insertions(+) > create mode 100755 test/unit-tap/heap.test.lua > create mode 100644 test/unit-tap/suite.ini > create mode 100644 vshard/heap.lua > > diff --git a/test/unit-tap/heap.test.lua b/test/unit-tap/heap.test.lua > new file mode 100755 > index 0000000..8c3819f > --- /dev/null > +++ b/test/unit-tap/heap.test.lua > @@ -0,0 +1,310 @@ > +#!/usr/bin/env tarantool > + > +local tap = require('tap') > +local test = tap.test("cfg") > +local heap = require('vshard.heap') > + Maybe it's better to use single brackets everywhere: test("cfg") -> test('cfg'). Or does such difference have some sense? > +-- > +-- Max number of heap to test. Number of iterations in the test > +-- grows as a factorial of this value. At 10 the test becomes > +-- too long already. > +-- > +local heap_size = 8 > + > +-- > +-- Type of the object stored in the intrusive heap. > +-- > +local function min_heap_cmp(l, r) > + return l.value < r.value > +end > + > +local function max_heap_cmp(l, r) > + return l.value > r.value > +end > + > +local function new_object(value) > + return {value = value} > +end > + > +local function heap_check_indexes(heap) > + local count = heap:count() > + local data = heap.data > + for i = 1, count do > + assert(data[i].index == i) > + end > +end > + > +local function reverse(values, i1, i2) > + while i1 < i2 do > + values[i1], values[i2] = values[i2], values[i1] > + i1 = i1 + 1 > + i2 = i2 - 1 > + end > +end > + > +-- > +-- Implementation of std::next_permutation() from C++. > +-- > +local function next_permutation(values) > + local count = #values > + if count <= 1 then > + return false > + end > + local i = count > + while true do > + local j = i > + i = i - 1 > + if values[i] < values[j] then > + local k = count > + while values[i] >= values[k] do > + k = k - 1 > + end > + values[i], values[k] = values[k], values[i] > + reverse(values, j, count) > + return true > + end > + if i == 1 then > + reverse(values, 1, count) > + return false > + end > + end > +end > + > +local function range(count) > + local res = {} > + for i = 1, count do > + res[i] = i > + end > + return res > +end > + > +-- > +-- Min heap fill and empty. > +-- > +local function test_min_heap_basic(test) > + test:plan(1) > + > + local h = heap.new(min_heap_cmp) > + assert(not h:pop()) > + assert(h:count() == 0) > + local values = {} > + for i = 1, heap_size do > + values[i] = new_object(i) > + end > + for counti = 1, heap_size do > + local indexes = range(counti) > + repeat > + for i = 1, counti do > + h:push(values[indexes[i]]) > + end > + heap_check_indexes(h) > + assert(h:count() == counti) > + for i = 1, counti do > + assert(h:top() == values[i]) > + assert(h:pop() == values[i]) > + heap_check_indexes(h) > + end > + assert(not h:pop()) > + assert(h:count() == 0) > + until not next_permutation(indexes) > + end > + > + test:ok(true, "no asserts") > +end > + > +-- > +-- Max heap fill and empty. > +-- > +local function test_max_heap_basic(test) > + test:plan(1) > + > + local h = heap.new(max_heap_cmp) > + assert(not h:pop()) > + assert(h:count() == 0) > + local values = {} > + for i = 1, heap_size do > + values[i] = new_object(heap_size - i + 1) > + end > + for counti = 1, heap_size do > + local indexes = range(counti) > + repeat > + for i = 1, counti do > + h:push(values[indexes[i]]) > + end > + heap_check_indexes(h) > + assert(h:count() == counti) > + for i = 1, counti do > + assert(h:top() == values[i]) > + assert(h:pop() == values[i]) > + heap_check_indexes(h) > + end > + assert(not h:pop()) > + assert(h:count() == 0) > + until not next_permutation(indexes) > + end > + > + test:ok(true, "no asserts") > +end > + > +-- > +-- Min heap update top element. > +-- > +local function test_min_heap_update_top(test) > + test:plan(1) > + > + local h = heap.new(min_heap_cmp) > + for counti = 1, heap_size do > + local indexes = range(counti) > + repeat > + local values = {} > + for i = 1, counti do > + values[i] = new_object(0) > + h:push(values[i]) > + end > + heap_check_indexes(h) > + for i = 1, counti do > + h:top().value = indexes[i] > + h:update_top() > + end > + heap_check_indexes(h) > + assert(h:count() == counti) > + for i = 1, counti do > + assert(h:top().value == i) > + assert(h:pop().value == i) > + heap_check_indexes(h) > + end > + assert(not h:pop()) > + assert(h:count() == 0) > + until not next_permutation(indexes) > + end > + > + test:ok(true, "no asserts") > +end > + > +-- > +-- Min heap update all elements in all possible positions. > +-- > +local function test_min_heap_update(test) > + test:plan(1) > + > + local h = heap.new(min_heap_cmp) > + for counti = 1, heap_size do > + for srci = 1, counti do > + local endv = srci * 10 + 5 > + for newv = 5, endv, 5 do > + local values = {} > + for i = 1, counti do > + values[i] = new_object(i * 10) > + h:push(values[i]) > + end > + heap_check_indexes(h) > + local obj = values[srci] > + obj.value = newv > + h:update(obj) > + assert(obj.index >= 1) > + assert(obj.index <= counti) > + local prev = -1 > + for i = 1, counti do > + obj = h:pop() > + assert(obj.index == -1) > + assert(obj.value >= prev) > + assert(obj.value >= 1) > + prev = obj.value > + obj.value = -1 > + heap_check_indexes(h) > + end > + assert(not h:pop()) > + assert(h:count() == 0) > + end > + end > + end > + > + test:ok(true, "no asserts") > +end > + > +-- > +-- Max heap delete all elements from all possible positions. > +-- > +local function test_max_heap_delete(test) > + test:plan(1) > + > + local h = heap.new(max_heap_cmp) > + local inf = heap_size + 1 > + for counti = 1, heap_size do > + for srci = 1, counti do > + local values = {} > + for i = 1, counti do > + values[i] = new_object(i) > + h:push(values[i]) > + end > + heap_check_indexes(h) > + local obj = values[srci] > + obj.value = inf > + h:remove(obj) > + assert(obj.index == -1) > + local prev = inf > + for i = 2, counti do > + obj = h:pop() > + assert(obj.index == -1) > + assert(obj.value < prev) > + assert(obj.value >= 1) > + prev = obj.value > + obj.value = -1 > + heap_check_indexes(h) > + end > + assert(not h:pop()) > + assert(h:count() == 0) > + end > + end > + > + test:ok(true, "no asserts") > +end > + > +local function test_min_heap_remove_top(test) > + test:plan(1) > + > + local h = heap.new(min_heap_cmp) > + for i = 1, heap_size do > + h:push(new_object(i)) > + end > + for i = 1, heap_size do > + assert(h:top().value == i) > + h:remove_top() > + end > + assert(h:count() == 0) > + > + test:ok(true, "no asserts") > +end > + > +local function test_max_heap_remove_try(test) > + test:plan(1) > + > + local h = heap.new(max_heap_cmp) > + local obj = new_object(1) > + assert(obj.index == nil) > + h:remove_try(obj) > + assert(h:count() == 0) > + > + h:push(obj) > + h:push(new_object(2)) > + assert(obj.index == 2) > + h:remove(obj) > + assert(obj.index == -1) > + h:remove_try(obj) > + assert(obj.index == -1) > + assert(h:count() == 1) > + > + test:ok(true, "no asserts") > +end > + > +test:plan(7) > + > +test:test('min_heap_basic', test_min_heap_basic) > +test:test('max_heap_basic', test_max_heap_basic) > +test:test('min_heap_update_top', test_min_heap_update_top) > +test:test('min heap update', test_min_heap_update) > +test:test('max heap delete', test_max_heap_delete) > +test:test('min heap remove top', test_min_heap_remove_top) > +test:test('max heap remove try', test_max_heap_remove_try) > + > +os.exit(test:check() and 0 or 1) > diff --git a/test/unit-tap/suite.ini b/test/unit-tap/suite.ini > new file mode 100644 > index 0000000..f365b69 > --- /dev/null > +++ b/test/unit-tap/suite.ini > @@ -0,0 +1,4 @@ > +[default] > +core = app > +description = Unit tests TAP > +is_parallel = True > diff --git a/vshard/heap.lua b/vshard/heap.lua > new file mode 100644 > index 0000000..78c600a > --- /dev/null > +++ b/vshard/heap.lua > @@ -0,0 +1,226 @@ > +local math_floor = math.floor > + > +-- > +-- Implementation of a typical algorithm of the binary heap. > +-- The heap is intrusive - it stores index of each element inside of it. It > +-- allows to update and delete elements in any place in the heap, not only top > +-- elements. > +-- > + > +local function heap_parent_index(index) > + return math_floor(index / 2) > +end > + > +local function heap_left_child_index(index) > + return index * 2 > +end > + > +-- > +-- Generate a new heap. > +-- > +-- The implementation is targeted on as few index accesses as possible. > +-- Everything what could be is stored as upvalue variables instead of as indexes > +-- in a table. What couldn't be an upvalue and is used in a function more than > +-- once is saved on the stack. > +-- > +local function heap_new(is_left_above) > + -- Having it as an upvalue allows not to do 'self.data' lookup in each > + -- function. > + local data = {} > + -- Saves #data calculation. In Lua it is not just reading a number. > + local count = 0 > + > + local function heap_update_index_up(idx) > + if idx == 1 then > + return false > + end > + > + local orig_idx = idx > + local value = data[idx] > + local pidx = heap_parent_index(idx) > + local parent = data[pidx] > + while is_left_above(value, parent) do > + data[idx] = parent > + parent.index = idx > + idx = pidx > + if idx == 1 then > + break > + end > + pidx = heap_parent_index(idx) > + parent = data[pidx] > + end > + > + if idx == orig_idx then > + return false > + end > + data[idx] = value > + value.index = idx > + return true > + end > + > + local function heap_update_index_down(idx) > + local left_idx = heap_left_child_index(idx) > + if left_idx > count then > + return false > + end > + > + local orig_idx = idx > + local left > + local right > + local right_idx = left_idx + 1 > + local top > + local top_idx > + local value = data[idx] > + repeat > + right_idx = left_idx + 1 > + if right_idx > count then > + top = data[left_idx] > + if is_left_above(value, top) then > + break > + end > + top_idx = left_idx > + else > + left = data[left_idx] > + right = data[right_idx] > + if is_left_above(left, right) then > + if is_left_above(value, left) then > + break > + end > + top_idx = left_idx > + top = left > + else > + if is_left_above(value, right) then > + break > + end > + top_idx = right_idx > + top = right > + end > + end > + > + data[idx] = top > + top.index = idx > + idx = top_idx > + left_idx = heap_left_child_index(idx) > + until left_idx > count > + > + if idx == orig_idx then > + return false > + end > + data[idx] = value > + value.index = idx > + return true > + end > + > + local function heap_update_index(idx) > + if not heap_update_index_up(idx) then > + heap_update_index_down(idx) > + end > + end > + > + local function heap_push(self, value) > + count = count + 1 > + data[count] = value > + value.index = count > + heap_update_index_up(count) > + end > + > + local function heap_update_top(self) > + heap_update_index_down(1) > + end > + > + local function heap_update(self, value) > + heap_update_index(value.index) > + end > + > + local function heap_remove_top(self) > + if count == 0 then > + return > + end > + data[1].index = -1 > + if count == 1 then > + data[1] = nil > + count = 0 > + return > + end > + local value = data[count] > + data[count] = nil > + data[1] = value > + value.index = 1 > + count = count - 1 > + heap_update_index_down(1) > + end > + > + local function heap_remove(self, value) > + local idx = value.index > + value.index = -1 > + if idx == count then > + data[count] = nil > + count = count - 1 > + return > + end > + value = data[count] > + data[idx] = value > + data[count] = nil > + value.index = idx > + count = count - 1 > + heap_update_index(idx) > + end > + > + local function heap_remove_try(self, value) > + local idx = value.index > + if idx and idx > 0 then > + heap_remove(self, value) > + end > + end > + > + local function heap_pop(self) > + if count == 0 then > + return > + end > + -- Some duplication from remove_top, but allows to save a few > + -- condition checks, index accesses, and a function call. > + local res = data[1] > + res.index = -1 > + if count == 1 then > + data[1] = nil > + count = 0 > + return res > + end > + local value = data[count] > + data[count] = nil > + data[1] = value > + value.index = 1 > + count = count - 1 > + heap_update_index_down(1) > + return res > + end > + > + local function heap_top(self) > + return data[1] > + end > + > + local function heap_count(self) > + return count > + end > + > + return setmetatable({ > + -- Expose the data. For testing. > + data = data, > + }, { > + __index = { > + push = heap_push, > + update_top = heap_update_top, > + remove_top = heap_remove_top, > + pop = heap_pop, > + update = heap_update, > + remove = heap_remove, > + remove_try = heap_remove_try, > + top = heap_top, > + count = heap_count, > + } > + }) > +end > + > +return { > + new = heap_new, > +}
next prev parent reply other threads:[~2021-02-10 9:01 UTC|newest] Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-02-09 23:46 [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 1/9] rlist: move rlist to a new module Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:57 ` Oleg Babin via Tarantool-patches 2021-02-11 6:50 ` Oleg Babin via Tarantool-patches 2021-02-12 0:09 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 2/9] Use fiber.clock() instead of .time() everywhere Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:57 ` Oleg Babin via Tarantool-patches 2021-02-10 22:33 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 3/9] test: introduce a helper to wait for bucket GC Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:57 ` Oleg Babin via Tarantool-patches 2021-02-10 22:33 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-11 6:50 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 4/9] storage: bucket_recv() should check rs lock Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:59 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 5/9] util: introduce yielding table functions Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:59 ` Oleg Babin via Tarantool-patches 2021-02-10 22:34 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-11 6:50 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 6/9] cfg: introduce 'deprecated option' feature Vladislav Shpilevoy via Tarantool-patches 2021-02-10 8:59 ` Oleg Babin via Tarantool-patches 2021-02-10 22:34 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-11 6:50 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 7/9] gc: introduce reactive garbage collector Vladislav Shpilevoy via Tarantool-patches 2021-02-10 9:00 ` Oleg Babin via Tarantool-patches 2021-02-10 22:35 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-11 6:50 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery Vladislav Shpilevoy via Tarantool-patches 2021-02-10 9:00 ` Oleg Babin via Tarantool-patches 2021-02-09 23:46 ` [Tarantool-patches] [PATCH 9/9] util: introduce binary heap data structure Vladislav Shpilevoy via Tarantool-patches 2021-02-10 9:01 ` Oleg Babin via Tarantool-patches [this message] 2021-02-10 22:36 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-11 6:51 ` Oleg Babin via Tarantool-patches 2021-02-12 0:09 ` Vladislav Shpilevoy via Tarantool-patches 2021-03-05 22:03 ` Vladislav Shpilevoy via Tarantool-patches 2021-02-09 23:51 ` [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches 2021-02-12 11:02 ` Oleg Babin via Tarantool-patches
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=805dc520-0146-6fb0-29ec-f621bc8d2a8b@tarantool.org \ --to=tarantool-patches@dev.tarantool.org \ --cc=olegrok@tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --cc=yaroslav.dynnikov@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH 9/9] util: introduce binary heap data structure' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox