Tarantool development patches archive
* [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map
@ 2021-02-23  0:15 Vladislav Shpilevoy via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout() Vladislav Shpilevoy via Tarantool-patches
                   ` (12 more replies)
  0 siblings, 13 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

This patchset is the second part of the series introducing a consistent
Map-Reduce API for vshard.

The core of the patchset is the last 3 patches: the new module 'storage.ref',
the new module 'storage.sched', and the router's new function `map_callrw()`.

The other patches are preparatory. They mostly introduce and unit-test helpers
used in the last patches.

For details see the commit messages.

Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-147-map-reduce-part2
Issue: https://github.com/tarantool/vshard/issues/147

Vladislav Shpilevoy (11):
  error: introduce vshard.error.timeout()
  storage: add helper for local functions invocation
  storage: cache bucket count
  registry: module for circular deps resolution
  util: introduce safe fiber_cond_wait()
  util: introduce fiber_is_self_canceled()
  storage: introduce bucket_generation_wait()
  storage: introduce bucket_are_all_rw()
  ref: introduce vshard.storage.ref module
  sched: introduce vshard.storage.sched module
  router: introduce map_callrw()

 test/reload_evolution/storage.result   |  66 +++
 test/reload_evolution/storage.test.lua |  28 ++
 test/router/map-reduce.result          | 636 +++++++++++++++++++++++++
 test/router/map-reduce.test.lua        | 258 ++++++++++
 test/router/router.result              |   9 +-
 test/router/sync.result                |  10 +-
 test/router/sync.test.lua              |   3 +-
 test/storage/ref.result                | 406 ++++++++++++++++
 test/storage/ref.test.lua              | 169 +++++++
 test/storage/scheduler.result          | 410 ++++++++++++++++
 test/storage/scheduler.test.lua        | 178 +++++++
 test/storage/storage.result            | 168 +++++++
 test/storage/storage.test.lua          |  68 +++
 test/unit-tap/ref.test.lua             | 205 ++++++++
 test/unit-tap/scheduler.test.lua       | 555 +++++++++++++++++++++
 test/unit/config.result                |  59 +++
 test/unit/config.test.lua              |  23 +
 test/unit/error.result                 |  22 +
 test/unit/error.test.lua               |   9 +
 test/unit/util.result                  | 110 +++++
 test/unit/util.test.lua                |  45 ++
 test/upgrade/upgrade.result            |   5 +-
 vshard/CMakeLists.txt                  |   3 +-
 vshard/cfg.lua                         |   8 +
 vshard/consts.lua                      |   6 +
 vshard/error.lua                       |  29 ++
 vshard/registry.lua                    |  67 +++
 vshard/replicaset.lua                  |  37 +-
 vshard/router/init.lua                 | 186 +++++++-
 vshard/storage/CMakeLists.txt          |   2 +-
 vshard/storage/init.lua                | 208 +++++++-
 vshard/storage/ref.lua                 | 397 +++++++++++++++
 vshard/storage/sched.lua               | 231 +++++++++
 vshard/util.lua                        |  43 ++
 34 files changed, 4627 insertions(+), 32 deletions(-)
 create mode 100644 test/router/map-reduce.result
 create mode 100644 test/router/map-reduce.test.lua
 create mode 100644 test/storage/ref.result
 create mode 100644 test/storage/ref.test.lua
 create mode 100644 test/storage/scheduler.result
 create mode 100644 test/storage/scheduler.test.lua
 create mode 100755 test/unit-tap/ref.test.lua
 create mode 100755 test/unit-tap/scheduler.test.lua
 create mode 100644 vshard/registry.lua
 create mode 100644 vshard/storage/ref.lua
 create mode 100644 vshard/storage/sched.lua

-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module Vladislav Shpilevoy via Tarantool-patches
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

The function returns a box.error.TIMEOUT error converted to the
format used by vshard.

It probably wouldn't be needed if only Tarantool >= 1.10 were
supported: error.make(box.error.new(box.error.TIMEOUT)) would be
tolerable then. But 1.9 is supposed to work as well, and creating
a timeout error on <= 1.9 requires a pcall(), which is long and
ugly.

vshard.error.timeout() provides a version-agnostic way of
returning timeout errors.
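
The pcall() trick mentioned above can be sketched in plain Lua. This
is only an illustration: raise_box_error() is a hypothetical stand-in
for box.error(), which can only raise an error and never returns one,
and wait_ready() is a made-up caller. The code, type, and message
values are the ones the tests in this patch expect.

```lua
-- Hypothetical stand-in for box.error(code): it raises, never returns.
local TIMEOUT = 78
local function raise_box_error(code)
    error({type = 'ClientError', code = code, message = 'Timeout exceeded'})
end

-- The pattern each call site had to repeat: raise the error under pcall()
-- only to capture the error object.
local function timeout_via_pcall()
    local _, err = pcall(raise_box_error, TIMEOUT)
    return err
end

-- Made-up caller: poll a predicate a few times, then give up with nil, err
-- in the same style as the functions touched by the patch.
local function wait_ready(is_ready, attempts)
    for _ = 1, attempts do
        if is_ready() then
            return true
        end
    end
    return nil, timeout_via_pcall()
end

local ok, err = wait_ready(function() return false end, 3)
print(ok, err.code, err.message)
```

With the helper from this patch, the body of timeout_via_pcall() is
what `lerror.timeout()` hides from the callers.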

The patch is motivated by timeout errors being used heavily in the
upcoming map-reduce patches.

Needed for #147
---
 test/router/sync.result   | 10 +++++++---
 test/router/sync.test.lua |  3 ++-
 test/unit/error.result    | 22 ++++++++++++++++++++++
 test/unit/error.test.lua  |  9 +++++++++
 vshard/error.lua          | 10 ++++++++++
 vshard/replicaset.lua     |  3 +--
 vshard/router/init.lua    |  6 ++----
 vshard/storage/init.lua   |  6 ++----
 8 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/test/router/sync.result b/test/router/sync.result
index 6f0821d..040d611 100644
--- a/test/router/sync.result
+++ b/test/router/sync.result
@@ -45,10 +45,14 @@ vshard.router.bootstrap()
 ---
 - true
 ...
-vshard.router.sync(-1)
+res, err = vshard.router.sync(-1)
 ---
-- null
-- Timeout exceeded
+...
+util.portable_error(err)
+---
+- type: ClientError
+  code: 78
+  message: Timeout exceeded
 ...
 res, err = vshard.router.sync(0)
 ---
diff --git a/test/router/sync.test.lua b/test/router/sync.test.lua
index 3150343..cb36b0e 100644
--- a/test/router/sync.test.lua
+++ b/test/router/sync.test.lua
@@ -15,7 +15,8 @@ util = require('util')
 
 vshard.router.bootstrap()
 
-vshard.router.sync(-1)
+res, err = vshard.router.sync(-1)
+util.portable_error(err)
 res, err = vshard.router.sync(0)
 util.portable_error(err)
 
diff --git a/test/unit/error.result b/test/unit/error.result
index 8552d91..738cfeb 100644
--- a/test/unit/error.result
+++ b/test/unit/error.result
@@ -97,3 +97,25 @@ util.portable_error(err)
   code: 32
   message: '[string "function raise_lua_err() assert(false) end "]:1: assertion failed!'
 ...
+--
+-- lerror.timeout() - portable alternative to box.error.new(box.error.TIMEOUT).
+--
+err = lerror.timeout()
+---
+...
+type(err)
+---
+- table
+...
+assert(err.code == box.error.TIMEOUT)
+---
+- true
+...
+err.type
+---
+- ClientError
+...
+err.message
+---
+- Timeout exceeded
+...
diff --git a/test/unit/error.test.lua b/test/unit/error.test.lua
index 859414e..0a51d33 100644
--- a/test/unit/error.test.lua
+++ b/test/unit/error.test.lua
@@ -36,3 +36,12 @@ function raise_lua_err() assert(false) end
 ok, err = pcall(raise_lua_err)
 err = lerror.make(err)
 util.portable_error(err)
+
+--
+-- lerror.timeout() - portable alternative to box.error.new(box.error.TIMEOUT).
+--
+err = lerror.timeout()
+type(err)
+assert(err.code == box.error.TIMEOUT)
+err.type
+err.message
diff --git a/vshard/error.lua b/vshard/error.lua
index 65da763..a6f46a9 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -212,10 +212,20 @@ local function make_alert(code, ...)
     return setmetatable(r, { __serialize = 'seq' })
 end
 
+--
+-- Create a timeout error object. box.error.new() can't be used because it is
+-- present only since 1.10.
+--
+local function make_timeout()
+    local _, err = pcall(box.error, box.error.TIMEOUT)
+    return make_error(err)
+end
+
 return {
     code = error_code,
     box = box_error,
     vshard = vshard_error,
     make = make_error,
     alert = make_alert,
+    timeout = make_timeout,
 }
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 9c792b3..7437e3b 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -401,8 +401,7 @@ local function replicaset_template_multicallro(prefer_replica, balance)
         local timeout = opts.timeout or consts.CALL_TIMEOUT_MAX
         local net_status, storage_status, retval, err, replica
         if timeout <= 0 then
-            net_status, err = pcall(box.error, box.error.TIMEOUT)
-            return nil, lerror.make(err)
+            return nil, lerror.timeout()
         end
         local end_time = fiber_clock() + timeout
         while not net_status and timeout > 0 do
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index eeb7515..97bcb0a 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -628,8 +628,7 @@ local function router_call_impl(router, bucket_id, mode, prefer_replica,
     if err then
         return nil, err
     else
-        local _, boxerror = pcall(box.error, box.error.TIMEOUT)
-        return nil, lerror.box(boxerror)
+        return nil, lerror.timeout()
     end
 end
 
@@ -1235,8 +1234,7 @@ local function router_sync(router, timeout)
     local opts = {timeout = timeout}
     for rs_uuid, replicaset in pairs(router.replicasets) do
         if timeout < 0 then
-            local ok, err = pcall(box.error, box.error.TIMEOUT)
-            return nil, err
+            return nil, lerror.timeout()
         end
         local status, err = replicaset:callrw('vshard.storage.sync', arg, opts)
         if not status then
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index a3e7008..e0ce31d 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -756,8 +756,7 @@ local function sync(timeout)
         lfiber.sleep(0.001)
     until fiber_clock() > tstart + timeout
     log.warn("Timed out during synchronizing replicaset")
-    local ok, err = pcall(box.error, box.error.TIMEOUT)
-    return nil, lerror.make(err)
+    return nil, lerror.timeout()
 end
 
 --------------------------------------------------------------------------------
@@ -1344,8 +1343,7 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
     while ref.rw ~= 0 do
         timeout = deadline - fiber_clock()
         if not M.bucket_rw_lock_is_ready_cond:wait(timeout) then
-            status, err = pcall(box.error, box.error.TIMEOUT)
-            return nil, lerror.make(err)
+            return nil, lerror.timeout()
         end
         lfiber.testcancel()
     end
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-03-04 21:02   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw() Vladislav Shpilevoy via Tarantool-patches
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

The 'vshard.storage.sched' module ensures that two incompatible
operations, storage refs and bucket moves, share storage time
fairly.

Storage refs are going to be used by map-reduce API to preserve
data consistency while map requests are in progress on all
storages.

It means storage refs will be used as commonly as bucket refs,
and should not block the rebalancer. However it is hard not to
block the rebalancer forever if there are always refs on the
storage.

With bucket refs it was easy: temporarily blocking one bucket is
not a big deal. So the rebalancer always has a higher priority
than bucket refs, and it still does not block requests for the
other buckets or read-only requests on the bucket being moved.

With storage refs, giving the rebalancer a higher priority would
make map-reduce requests fail in the entire cluster for the whole
time of rebalancing, which can take hours or even days. That
would not be acceptable.

The new module vshard.storage.sched shares time between moves and
storage refs fairly. Both get time to execute, in proportions
configured by the user. The proportions depend on how big a
bucket is and how long the map-reduce requests are expected to
be. Typically, the longer a request is, the smaller the quota it
should be given.

The patch introduces new storage options to configure the
scheduling.

Part of #147

@TarantoolBot document
Title: vshard.storage.cfg new options - sched_ref_quota and sched_move_quota

There are new options for `vshard.storage.cfg`: `sched_ref_quota`
and `sched_move_quota`. The options control how much time should
be given to storage refs and bucket moves - two incompatible but
important operations.

Storage refs are used by router's map-reduce API. Each map-reduce
call creates storage refs on all storages to prevent data
migration on them for the map execution.

Bucket moves are used by the rebalancer. Obviously, they are
incompatible with the storage refs.

If vshard always preferred one operation to the other, it would
lead to starvation of one of them. For example, if storage refs
were preferred, rebalancing might never run under constant
map-reduce load because there would always be refs. If bucket
moves were preferred, storage refs (and therefore map-reduce)
would stop for the entire rebalancing time, which can be quite
long (hours, days).

The new options control how much time each operation is given.

`sched_ref_quota` tells how many storage refs (therefore
map-reduce requests) can be executed on the storage in a row if
there are pending bucket moves, before they are blocked to let the
moves work. Default value is 300.

`sched_move_quota` controls the same, but vice-versa: how many
bucket moves can be done in a row if there are pending refs.
Default value is 1.

Map-reduce requests are expected to be much shorter than bucket
moves, so storage refs by default have a higher quota.
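
A minimal sketch of how the two options could appear in a storage
config, using the default values named above. Only the new fields are
shown; the rest of the config (`sharding` and so on) is omitted, and
`instance_uuid` is a placeholder.

```lua
-- Only the two new options are shown. A real config also needs `sharding`
-- and the other usual fields.
local cfg = {
    -- Up to 300 storage refs in a row while bucket moves are waiting.
    sched_ref_quota = 300,
    -- Up to 1 bucket move in a row while storage refs are waiting.
    sched_move_quota = 1,
}
-- Applied on a storage instance (needs Tarantool with vshard):
-- vshard.storage.cfg(cfg, instance_uuid)
```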

Here is how it works on an example. Assume map-reduces start.
They execute one after another, 150 requests in a row. Now the
rebalancer wakes up and wants to move some buckets. It joins the
queue and waits for the storage refs to be gone.

But the ref quota is not reached yet, so the storage can still
execute 150 more map-reduces, even with the bucket moves queued,
before new refs are blocked and the moves start.
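
The example can be replayed with a toy model in plain Lua. It is a
sketch of the quota rule only, not the code of vshard/storage/sched.lua:
all function names are made up, the model ignores operations in
progress and timeouts, and it assumes that admitting one kind of
operation resets the strike counter of the other kind.

```lua
-- Toy model of the quota rule; not the code of vshard/storage/sched.lua.
local function sched_new(ref_quota, move_quota)
    return {
        ref_quota = ref_quota, move_quota = move_quota,
        ref_strike = 0, move_strike = 0,
    }
end

-- Admit one storage ref. It is refused only when a move is waiting and the
-- refs have already used up their quota in a row.
local function sched_try_ref(s, move_is_pending)
    if move_is_pending and s.ref_strike >= s.ref_quota then
        return false
    end
    s.ref_strike = s.ref_strike + 1
    -- Assumption: a started ref ends the current series of moves.
    s.move_strike = 0
    return true
end

-- Admit one bucket move, symmetric to sched_try_ref().
local function sched_try_move(s, ref_is_pending)
    if ref_is_pending and s.move_strike >= s.move_quota then
        return false
    end
    s.move_strike = s.move_strike + 1
    s.ref_strike = 0
    return true
end

-- Defaults from the description: 300 refs or 1 move in a row.
local s = sched_new(300, 1)
-- 150 map-reduces run while no move is queued.
for _ = 1, 150 do assert(sched_try_ref(s, false)) end
-- The rebalancer queues a move; count how many more refs still get in.
local extra = 0
while sched_try_ref(s, true) do extra = extra + 1 end
-- Refs are blocked now, so the move starts.
local move_ok = sched_try_move(s, true)
-- A second move in a row is refused while refs are waiting.
local second_move_ok = sched_try_move(s, true)
print(extra, move_ok, second_move_ok)
```

In this model `extra` comes out as 150, the first move is admitted,
and the second one is refused because `sched_move_quota` is 1.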
---
 test/reload_evolution/storage.result |   2 +-
 test/storage/ref.result              |  19 +-
 test/storage/ref.test.lua            |   9 +-
 test/storage/scheduler.result        | 410 ++++++++++++++++++++
 test/storage/scheduler.test.lua      | 178 +++++++++
 test/unit-tap/ref.test.lua           |   7 +-
 test/unit-tap/scheduler.test.lua     | 555 +++++++++++++++++++++++++++
 test/unit/config.result              |  59 +++
 test/unit/config.test.lua            |  23 ++
 vshard/cfg.lua                       |   8 +
 vshard/consts.lua                    |   5 +
 vshard/storage/CMakeLists.txt        |   2 +-
 vshard/storage/init.lua              |  54 ++-
 vshard/storage/ref.lua               |  30 +-
 vshard/storage/sched.lua             | 231 +++++++++++
 15 files changed, 1567 insertions(+), 25 deletions(-)
 create mode 100644 test/storage/scheduler.result
 create mode 100644 test/storage/scheduler.test.lua
 create mode 100755 test/unit-tap/scheduler.test.lua
 create mode 100644 vshard/storage/sched.lua

diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result
index c4a0cdd..77010a2 100644
--- a/test/reload_evolution/storage.result
+++ b/test/reload_evolution/storage.result
@@ -258,7 +258,7 @@ ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],
 ...
 assert(not ok and err.message)
 ---
-- Storage is referenced
+- Timeout exceeded
 ...
 lref.del(0, 0)
 ---
diff --git a/test/storage/ref.result b/test/storage/ref.result
index d5f4166..59f07f4 100644
--- a/test/storage/ref.result
+++ b/test/storage/ref.result
@@ -84,18 +84,22 @@ big_timeout = 1000000
 small_timeout = 0.001
  | ---
  | ...
+
+timeout = 0.01
+ | ---
+ | ...
 lref.add(rid, sid, big_timeout)
  | ---
  | - true
  | ...
 -- Send fails.
 ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
  | ---
  | ...
 assert(not ok and err.message)
  | ---
- | - Storage is referenced
+ | - Timeout exceeded
  | ...
 lref.use(rid, sid)
  | ---
@@ -103,12 +107,12 @@ lref.use(rid, sid)
  | ...
 -- Still fails - use only makes ref undead until it is deleted explicitly.
 ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
  | ---
  | ...
 assert(not ok and err.message)
  | ---
- | - Storage is referenced
+ | - Timeout exceeded
  | ...
 
 _ = test_run:switch('storage_2_a')
@@ -118,13 +122,16 @@ _ = test_run:switch('storage_2_a')
 big_timeout = 1000000
  | ---
  | ...
+timeout = 0.01
+ | ---
+ | ...
 ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
  | ---
  | ...
 assert(not ok and err.message)
  | ---
- | - Storage is referenced
+ | - Timeout exceeded
  | ...
 
 --
diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
index b34a294..24303e2 100644
--- a/test/storage/ref.test.lua
+++ b/test/storage/ref.test.lua
@@ -35,22 +35,25 @@ sid = 0
 rid = 0
 big_timeout = 1000000
 small_timeout = 0.001
+
+timeout = 0.01
 lref.add(rid, sid, big_timeout)
 -- Send fails.
 ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
 assert(not ok and err.message)
 lref.use(rid, sid)
 -- Still fails - use only makes ref undead until it is deleted explicitly.
 ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
 assert(not ok and err.message)
 
 _ = test_run:switch('storage_2_a')
 -- Receive (from another replicaset) also fails.
 big_timeout = 1000000
+timeout = 0.01
 ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
-                                     {timeout = big_timeout})
+                                     {timeout = timeout})
 assert(not ok and err.message)
 
 --
diff --git a/test/storage/scheduler.result b/test/storage/scheduler.result
new file mode 100644
index 0000000..0f53e42
--- /dev/null
+++ b/test/storage/scheduler.result
@@ -0,0 +1,410 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+ | ---
+ | ...
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+ | ---
+ | ...
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+ | ---
+ | ...
+test_run:create_cluster(REPLICASET_2, 'storage')
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+ | ---
+ | ...
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+ | ---
+ | ...
+util.push_rs_filters(test_run)
+ | ---
+ | ...
+
+--
+-- gh-147: scheduler helps to share time fairly between incompatible but
+-- necessary operations - storage refs and bucket moves. Refs are used for the
+-- consistent map-reduce feature when the whole cluster can be scanned without
+-- being afraid that some data may slip through requests on behalf of the
+-- rebalancer.
+--
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1501, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+--
+-- Bucket_send() uses the scheduler.
+--
+lsched = require('vshard.storage.sched')
+ | ---
+ | ...
+assert(lsched.move_strike == 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.move_count == 0)
+ | ---
+ | - true
+ | ...
+big_timeout = 1000000
+ | ---
+ | ...
+big_timeout_opts = {timeout = big_timeout}
+ | ---
+ | ...
+vshard.storage.bucket_send(1, util.replicasets[2], big_timeout_opts)
+ | ---
+ | - true
+ | ...
+assert(lsched.move_strike == 1)
+ | ---
+ | - true
+ | ...
+assert(lsched.move_count == 0)
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+lsched = require('vshard.storage.sched')
+ | ---
+ | ...
+--
+-- Bucket_recv() uses the scheduler.
+--
+assert(lsched.move_strike == 1)
+ | ---
+ | - true
+ | ...
+assert(lsched.move_count == 0)
+ | ---
+ | - true
+ | ...
+
+--
+-- When move is in progress, it is properly accounted.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+big_timeout = 1000000
+ | ---
+ | ...
+big_timeout_opts = {timeout = big_timeout}
+ | ---
+ | ...
+ok, err = nil
+ | ---
+ | ...
+assert(lsched.move_strike == 1)
+ | ---
+ | - true
+ | ...
+_ = fiber.create(function()                                                     \
+    ok, err = vshard.storage.bucket_send(1, util.replicasets[1],                \
+                                         big_timeout_opts)                      \
+end)
+ | ---
+ | ...
+-- Strike increase does not mean the move finished. It means it was successfully
+-- scheduled.
+assert(lsched.move_strike == 2)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lsched.move_strike == 2 end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ref is not allowed during move.
+--
+small_timeout = 0.000001
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+ok, err = lref.add(0, 0, small_timeout)
+ | ---
+ | ...
+assert(not ok)
+ | ---
+ | - true
+ | ...
+err.message
+ | ---
+ | - Timeout exceeded
+ | ...
+-- Put it to wait until move is done.
+ok, err = nil
+ | ---
+ | ...
+_ = fiber.create(function() ok, err = lref.add(0, 0, big_timeout) end)
+ | ---
+ | ...
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return ok or err end)
+ | ---
+ | - true
+ | ...
+ok, err
+ | ---
+ | - true
+ | - null
+ | ...
+assert(lsched.move_count == 0)
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return ok or err end)
+ | ---
+ | - true
+ | ...
+ok, err
+ | ---
+ | - true
+ | - null
+ | ...
+assert(lsched.move_count == 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.ref_count == 1)
+ | ---
+ | - true
+ | ...
+lref.del(0, 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.ref_count == 0)
+ | ---
+ | - true
+ | ...
+
+--
+-- Refs can't block sends infinitely. The scheduler must be fair and share time
+-- between ref/move.
+--
+do_refs = true
+ | ---
+ | ...
+ref_worker_count = 10
+ | ---
+ | ...
+function ref_worker()                                                           \
+    while do_refs do                                                            \
+        lref.add(0, 0, big_timeout)                                             \
+        fiber.sleep(small_timeout)                                              \
+        lref.del(0, 0)                                                          \
+    end                                                                         \
+    ref_worker_count = ref_worker_count - 1                                     \
+end
+ | ---
+ | ...
+-- Simulate many fibers doing something with a ref being kept.
+for i = 1, ref_worker_count do fiber.create(ref_worker) end
+ | ---
+ | ...
+assert(lref.count > 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.ref_count > 0)
+ | ---
+ | - true
+ | ...
+-- Ensure it passes with default opts (when move is in great unfairness). It is
+-- important. Because moves are expected to be much longer than refs, and must
+-- not happen too often with ref load in progress. But still should eventually
+-- be processed.
+bucket_count = 100
+ | ---
+ | ...
+bucket_id = 1
+ | ---
+ | ...
+bucket_worker_count = 5
+ | ---
+ | ...
+function bucket_worker()                                                        \
+    while bucket_id <= bucket_count do                                          \
+        local id = bucket_id                                                    \
+        bucket_id = bucket_id + 1                                               \
+        assert(vshard.storage.bucket_send(id, util.replicasets[2]))             \
+    end                                                                         \
+    bucket_worker_count = bucket_worker_count - 1                               \
+end
+ | ---
+ | ...
+-- Simulate many rebalancer fibers like when max_sending is increased.
+for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
+ | ---
+ | ...
+test_run:wait_cond(function() return bucket_worker_count == 0 end)
+ | ---
+ | - true
+ | ...
+
+do_refs = false
+ | ---
+ | ...
+test_run:wait_cond(function() return ref_worker_count == 0 end)
+ | ---
+ | - true
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.ref_count == 0)
+ | ---
+ | - true
+ | ...
+
+for i = 1, bucket_count do wait_bucket_is_collected(i) end
+ | ---
+ | ...
+
+--
+-- Refs can't block recvs infinitely.
+--
+do_refs = true
+ | ---
+ | ...
+for i = 1, ref_worker_count do fiber.create(ref_worker) end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+bucket_count = 100
+ | ---
+ | ...
+bucket_id = 1
+ | ---
+ | ...
+bucket_worker_count = 5
+ | ---
+ | ...
+function bucket_worker()                                                        \
+    while bucket_id <= bucket_count do                                          \
+        local id = bucket_id                                                    \
+        bucket_id = bucket_id + 1                                               \
+        assert(vshard.storage.bucket_send(id, util.replicasets[1]))             \
+    end                                                                         \
+    bucket_worker_count = bucket_worker_count - 1                               \
+end
+ | ---
+ | ...
+for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
+ | ---
+ | ...
+test_run:wait_cond(function() return bucket_worker_count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+do_refs = false
+ | ---
+ | ...
+test_run:wait_cond(function() return ref_worker_count == 0 end)
+ | ---
+ | - true
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+assert(lsched.ref_count == 0)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+for i = 1, bucket_count do wait_bucket_is_collected(i) end
+ | ---
+ | ...
+
+_ = test_run:switch("default")
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_2)
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_1)
+ | ---
+ | ...
+_ = test_run:cmd('clear filter')
+ | ---
+ | ...
diff --git a/test/storage/scheduler.test.lua b/test/storage/scheduler.test.lua
new file mode 100644
index 0000000..8628f0e
--- /dev/null
+++ b/test/storage/scheduler.test.lua
@@ -0,0 +1,178 @@
+test_run = require('test_run').new()
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+test_run:create_cluster(REPLICASET_2, 'storage')
+util = require('util')
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+util.push_rs_filters(test_run)
+
+--
+-- gh-147: scheduler helps to share time fairly between incompatible but
+-- necessary operations - storage refs and bucket moves. Refs are used for the
+-- consistent map-reduce feature when the whole cluster can be scanned without
+-- being afraid that some data may slip through requests on behalf of the
+-- rebalancer.
+--
+
+_ = test_run:switch('storage_1_a')
+
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1, 1500)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1501, 1500)
+
+_ = test_run:switch('storage_1_a')
+--
+-- Bucket_send() uses the scheduler.
+--
+lsched = require('vshard.storage.sched')
+assert(lsched.move_strike == 0)
+assert(lsched.move_count == 0)
+big_timeout = 1000000
+big_timeout_opts = {timeout = big_timeout}
+vshard.storage.bucket_send(1, util.replicasets[2], big_timeout_opts)
+assert(lsched.move_strike == 1)
+assert(lsched.move_count == 0)
+wait_bucket_is_collected(1)
+
+_ = test_run:switch('storage_2_a')
+lsched = require('vshard.storage.sched')
+--
+-- Bucket_recv() uses the scheduler.
+--
+assert(lsched.move_strike == 1)
+assert(lsched.move_count == 0)
+
+--
+-- When move is in progress, it is properly accounted.
+--
+_ = test_run:switch('storage_1_a')
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+
+_ = test_run:switch('storage_2_a')
+big_timeout = 1000000
+big_timeout_opts = {timeout = big_timeout}
+ok, err = nil
+assert(lsched.move_strike == 1)
+_ = fiber.create(function()                                                     \
+    ok, err = vshard.storage.bucket_send(1, util.replicasets[1],                \
+                                         big_timeout_opts)                      \
+end)
+-- Strike increase does not mean the move finished. It means it was successfully
+-- scheduled.
+assert(lsched.move_strike == 2)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lsched.move_strike == 2 end)
+
+--
+-- Ref is not allowed during move.
+--
+small_timeout = 0.000001
+lref = require('vshard.storage.ref')
+ok, err = lref.add(0, 0, small_timeout)
+assert(not ok)
+err.message
+-- Put it to wait until move is done.
+ok, err = nil
+_ = fiber.create(function() ok, err = lref.add(0, 0, big_timeout) end)
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return ok or err end)
+ok, err
+assert(lsched.move_count == 0)
+wait_bucket_is_collected(1)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return ok or err end)
+ok, err
+assert(lsched.move_count == 0)
+assert(lsched.ref_count == 1)
+lref.del(0, 0)
+assert(lsched.ref_count == 0)
+
+--
+-- Refs can't block sends infinitely. The scheduler must be fair and share time
+-- between ref/move.
+--
+do_refs = true
+ref_worker_count = 10
+function ref_worker()                                                           \
+    while do_refs do                                                            \
+        lref.add(0, 0, big_timeout)                                             \
+        fiber.sleep(small_timeout)                                              \
+        lref.del(0, 0)                                                          \
+    end                                                                         \
+    ref_worker_count = ref_worker_count - 1                                     \
+end
+-- Simulate many fibers doing something with a ref being kept.
+for i = 1, ref_worker_count do fiber.create(ref_worker) end
+assert(lref.count > 0)
+assert(lsched.ref_count > 0)
+-- Ensure it passes with default opts (when moves are at a great disadvantage).
+-- It is important, because moves are expected to be much longer than refs, and
+-- must not happen too often with ref load in progress. But they still should
+-- eventually be processed.
+bucket_count = 100
+bucket_id = 1
+bucket_worker_count = 5
+function bucket_worker()                                                        \
+    while bucket_id <= bucket_count do                                          \
+        local id = bucket_id                                                    \
+        bucket_id = bucket_id + 1                                               \
+        assert(vshard.storage.bucket_send(id, util.replicasets[2]))             \
+    end                                                                         \
+    bucket_worker_count = bucket_worker_count - 1                               \
+end
+-- Simulate many rebalancer fibers like when max_sending is increased.
+for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
+test_run:wait_cond(function() return bucket_worker_count == 0 end)
+
+do_refs = false
+test_run:wait_cond(function() return ref_worker_count == 0 end)
+assert(lref.count == 0)
+assert(lsched.ref_count == 0)
+
+for i = 1, bucket_count do wait_bucket_is_collected(i) end
+
+--
+-- Refs can't block recvs infinitely.
+--
+do_refs = true
+for i = 1, ref_worker_count do fiber.create(ref_worker) end
+
+_ = test_run:switch('storage_2_a')
+bucket_count = 100
+bucket_id = 1
+bucket_worker_count = 5
+function bucket_worker()                                                        \
+    while bucket_id <= bucket_count do                                          \
+        local id = bucket_id                                                    \
+        bucket_id = bucket_id + 1                                               \
+        assert(vshard.storage.bucket_send(id, util.replicasets[1]))             \
+    end                                                                         \
+    bucket_worker_count = bucket_worker_count - 1                               \
+end
+for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
+test_run:wait_cond(function() return bucket_worker_count == 0 end)
+
+_ = test_run:switch('storage_1_a')
+do_refs = false
+test_run:wait_cond(function() return ref_worker_count == 0 end)
+assert(lref.count == 0)
+assert(lsched.ref_count == 0)
+
+_ = test_run:switch('storage_2_a')
+for i = 1, bucket_count do wait_bucket_is_collected(i) end
+
+_ = test_run:switch("default")
+test_run:drop_cluster(REPLICASET_2)
+test_run:drop_cluster(REPLICASET_1)
+_ = test_run:cmd('clear filter')
diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
index d987a63..ba95eee 100755
--- a/test/unit-tap/ref.test.lua
+++ b/test/unit-tap/ref.test.lua
@@ -5,6 +5,7 @@ local test = tap.test('cfg')
 local fiber = require('fiber')
 local lregistry = require('vshard.registry')
 local lref = require('vshard.storage.ref')
+require('vshard.storage.sched')
 
 local big_timeout = 1000000
 local small_timeout = 0.000001
@@ -19,9 +20,11 @@ local sid3 = 2
 --
 
 --
--- Refs used storage API to get bucket space state and wait on its changes. But
--- not important for these unit tests.
+-- Refs use the storage API to get bucket space state and wait on its changes,
+-- and the scheduler API to sync with bucket moves. Neither is important for
+-- these unit tests.
 --
+
 local function bucket_are_all_rw()
     return true
 end
diff --git a/test/unit-tap/scheduler.test.lua b/test/unit-tap/scheduler.test.lua
new file mode 100755
index 0000000..0af4f5e
--- /dev/null
+++ b/test/unit-tap/scheduler.test.lua
@@ -0,0 +1,555 @@
+#!/usr/bin/env tarantool
+
+local fiber = require('fiber')
+local tap = require('tap')
+local test = tap.test('cfg')
+local lregistry = require('vshard.registry')
+local lref = require('vshard.storage.ref')
+local lsched = require('vshard.storage.sched')
+
+local big_timeout = 1000000
+local small_timeout = 0.000001
+
+--
+-- gh-147: scheduler helps to share time fairly between incompatible but
+-- necessary operations - storage refs and bucket moves. Refs are used for the
+-- consistent map-reduce feature when the whole cluster can be scanned without
+-- being afraid that some data may slip through requests on behalf of the
+-- rebalancer.
+--
+
+box.cfg{
+    log = 'log.txt'
+}
+-- io.write = function(...) require('log').info(...) end
+
+--
+-- Storage registry is used by the ref module. The ref module is used in the
+-- tests in order to ensure the scheduler performs ref garbage collection.
+--
+local function bucket_are_all_rw()
+    return true
+end
+
+lregistry.storage = {
+    bucket_are_all_rw = bucket_are_all_rw,
+}
+
+local function fiber_csw()
+    return fiber.info()[fiber.self():id()].csw
+end
+
+local function fiber_set_joinable()
+    fiber.self():set_joinable(true)
+end
+
+local function test_basic(test)
+    test:plan(32)
+
+    local ref_strike = lsched.ref_strike
+    --
+    -- Simplest possible test - start and end a ref.
+    --
+    test:is(lsched.ref_start(big_timeout), big_timeout, 'start ref')
+    test:is(lsched.ref_count, 1, '1 ref')
+    test:is(lsched.ref_strike, ref_strike + 1, '+1 ref in a row')
+    lsched.ref_end(1)
+    test:is(lsched.ref_count, 0, '0 refs after end')
+    test:is(lsched.ref_strike, ref_strike + 1, 'strike is kept')
+
+    lsched.ref_start(big_timeout)
+    lsched.ref_end(1)
+    test:is(lsched.ref_strike, ref_strike + 2, 'strike grows')
+    test:is(lsched.ref_count, 0, 'count does not')
+
+    --
+    -- Move ends ref strike.
+    --
+    test:is(lsched.move_start(big_timeout), big_timeout, 'start move')
+    test:is(lsched.move_count, 1, '1 move')
+    test:is(lsched.move_strike, 1, '+1 move strike')
+    test:is(lsched.ref_strike, 0, 'ref strike is interrupted')
+
+    --
+    -- Ref times out if there is a move in progress.
+    --
+    local ok, err = lsched.ref_start(small_timeout)
+    test:ok(not ok and err, 'ref fails')
+    test:is(lsched.move_count, 1, 'still 1 move')
+    test:is(lsched.move_strike, 1, 'still 1 move strike')
+    test:is(lsched.ref_count, 0, 'could not add ref')
+    test:is(lsched.ref_queue, 0, 'empty ref queue')
+
+    --
+    -- Ref succeeds when move ends.
+    --
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    fiber.sleep(small_timeout)
+    lsched.move_end(1)
+    local new_timeout
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'correct timeout')
+    test:is(lsched.move_count, 0, 'no moves')
+    test:is(lsched.move_strike, 0, 'move strike ends')
+    test:is(lsched.ref_count, 1, '+1 ref')
+    test:is(lsched.ref_strike, 1, '+1 ref strike')
+
+    --
+    -- Move succeeds when ref ends.
+    --
+    f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    fiber.sleep(small_timeout)
+    lsched.ref_end(1)
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'correct timeout')
+    test:is(lsched.ref_count, 0, 'no refs')
+    test:is(lsched.ref_strike, 0, 'ref strike ends')
+    test:is(lsched.move_count, 1, '+1 move')
+    test:is(lsched.move_strike, 1, '+1 move strike')
+    lsched.move_end(1)
+
+    --
+    -- Move times out when there is a ref.
+    --
+    test:is(lsched.ref_start(big_timeout), big_timeout, '+ ref')
+    ok, err = lsched.move_start(small_timeout)
+    test:ok(not ok and err, 'move fails')
+    test:is(lsched.ref_count, 1, 'still 1 ref')
+    test:is(lsched.ref_strike, 1, 'still 1 ref strike')
+    test:is(lsched.move_count, 0, 'could not add move')
+    test:is(lsched.move_queue, 0, 'empty move queue')
+    lsched.ref_end(1)
+end
+
+local function test_negative_timeout(test)
+    test:plan(12)
+
+    --
+    -- Move works even with negative timeout if no refs.
+    --
+    test:is(lsched.move_start(-1), -1, 'timeout does not matter if no refs')
+    test:is(lsched.move_count, 1, '+1 move')
+
+    --
+    -- Ref fails immediately if timeout is negative and there are moves.
+    --
+    local csw = fiber_csw()
+    local ok, err = lsched.ref_start(-1)
+    test:ok(not ok and err, 'ref fails')
+    test:is(csw, fiber_csw(), 'no yields')
+    test:is(lsched.ref_count, 0, 'no refs')
+    test:is(lsched.ref_queue, 0, 'no ref queue')
+
+    --
+    -- Ref works even with negative timeout if no moves.
+    --
+    lsched.move_end(1)
+    test:is(lsched.ref_start(-1), -1, 'timeout does not matter if no moves')
+    test:is(lsched.ref_count, 1, '+1 ref')
+
+    --
+    -- Move fails immediately if timeout is negative and there are refs.
+    --
+    csw = fiber_csw()
+    ok, err = lsched.move_start(-1)
+    test:ok(not ok and err, 'move fails')
+    test:is(csw, fiber_csw(), 'no yields')
+    test:is(lsched.move_count, 0, 'no moves')
+    test:is(lsched.move_queue, 0, 'no move queue')
+    lsched.ref_end(1)
+end
+
+local function test_move_gc_ref(test)
+    test:plan(10)
+
+    --
+    -- Move deletes expired refs if it may help to start the move.
+    --
+    for sid = 1, 10 do
+        for rid = 1, 5 do
+            lref.add(rid, sid, small_timeout)
+        end
+    end
+    test:is(lsched.ref_count, 50, 'refs are in progress')
+    local ok, err = lsched.move_start(-1)
+    test:ok(not ok and err, 'move without timeout failed')
+
+    fiber.sleep(small_timeout)
+    test:is(lsched.move_start(-1), -1, 'succeeds even with negative timeout')
+    test:is(lsched.ref_count, 0, 'all refs are expired and deleted')
+    test:is(lref.count, 0, 'ref module knows about it')
+    test:is(lsched.move_count, 1, 'move is started')
+    lsched.move_end(1)
+
+    --
+    -- May need more than 1 GC step.
+    --
+    for sid = 1, 5 do
+        lref.add(0, sid, small_timeout)
+    end
+    for sid = 1, 5 do
+        lref.add(1, sid, small_timeout * 100)
+    end
+    local new_timeout = lsched.move_start(big_timeout)
+    test:ok(new_timeout < big_timeout, 'succeeds by doing 2 gc steps')
+    test:is(lsched.ref_count, 0, 'all refs are expired and deleted')
+    test:is(lref.count, 0, 'ref module knows about it')
+    test:is(lsched.move_count, 1, 'move is started')
+    lsched.move_end(1)
+end
+
+local function test_ref_strike(test)
+    test:plan(10)
+
+    local quota = lsched.ref_quota
+    --
+    -- Strike should stop new refs if they exceed the quota and there is a
+    -- pending move.
+    --
+    -- End ref strike if there was one.
+    lsched.move_start(small_timeout)
+    lsched.move_end(1)
+    -- Ref strike starts.
+    assert(lsched.ref_start(small_timeout))
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    test:is(lsched.move_queue, 1, 'move is queued')
+    --
+    -- New refs should work only until quota is reached, because there is a
+    -- pending move.
+    --
+    for i = 1, quota - 1 do
+        assert(lsched.ref_start(small_timeout))
+    end
+    local ok, err = lsched.ref_start(small_timeout)
+    test:ok(not ok and err, 'too long strike with move queue not empty')
+    test:is(lsched.ref_strike, quota, 'max strike is reached')
+    -- Even if the number of current refs decreases, new ones are still not
+    -- accepted, because there were too many in a row while a move was waiting.
+    lsched.ref_end(1)
+    ok, err = lsched.ref_start(small_timeout)
+    test:ok(not ok and err, 'still too long strike after one unref')
+    test:is(lsched.ref_strike, quota, 'strike is unchanged')
+
+    lsched.ref_end(quota - 1)
+    local new_timeout
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'move succeeded')
+    test:is(lsched.move_count, 1, '+1 move')
+    test:is(lsched.move_strike, 1, '+1 move strike')
+    test:is(lsched.ref_count, 0, 'no refs')
+    test:is(lsched.ref_strike, 0, 'no ref strike')
+    lsched.move_end(1)
+end
+
+local function test_move_strike(test)
+    test:plan(10)
+
+    local quota = lsched.move_quota
+    --
+    -- Strike should stop new moves if they exceed the quota and there is a
+    -- pending ref.
+    --
+    -- End move strike if there was one.
+    lsched.ref_start(small_timeout)
+    lsched.ref_end(1)
+    -- Move strike starts.
+    assert(lsched.move_start(small_timeout))
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    test:is(lsched.ref_queue, 1, 'ref is queued')
+    --
+    -- New moves should work only until quota is reached, because there is a
+    -- pending ref.
+    --
+    for i = 1, quota - 1 do
+        assert(lsched.move_start(small_timeout))
+    end
+    local ok, err = lsched.move_start(small_timeout)
+    test:ok(not ok and err, 'too long strike with ref queue not empty')
+    test:is(lsched.move_strike, quota, 'max strike is reached')
+    -- Even if the number of current moves decreases, new ones are still not
+    -- accepted, because there were too many in a row while a ref was waiting.
+    lsched.move_end(1)
+    ok, err = lsched.move_start(small_timeout)
+    test:ok(not ok and err, 'still too long strike after one move end')
+    test:is(lsched.move_strike, quota, 'strike is unchanged')
+
+    lsched.move_end(quota - 1)
+    local new_timeout
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'ref succeeded')
+    test:is(lsched.ref_count, 1, '+1 ref')
+    test:is(lsched.ref_strike, 1, '+1 ref strike')
+    test:is(lsched.move_count, 0, 'no moves')
+    test:is(lsched.move_strike, 0, 'no move strike')
+    lsched.ref_end(1)
+end
+
+local function test_ref_increase_quota(test)
+    test:plan(4)
+
+    local quota = lsched.ref_quota
+    --
+    -- Ref quota increase allows doing more refs even if there are pending
+    -- moves.
+    --
+    -- End ref strike if there was one.
+    lsched.move_start(big_timeout)
+    lsched.move_end(1)
+    -- Fill the quota.
+    for _ = 1, quota do
+        assert(lsched.ref_start(big_timeout))
+    end
+    -- Start move to block new refs by quota.
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    test:ok(not lsched.ref_start(small_timeout), 'can not add ref - full quota')
+
+    lsched.cfg({sched_ref_quota = quota + 1})
+    test:ok(lsched.ref_start(small_timeout), 'now can add - quota is extended')
+
+    -- Decrease quota - should not accept new refs again.
+    lsched.cfg{sched_ref_quota = quota}
+    test:ok(not lsched.ref_start(small_timeout), 'full quota again')
+
+    lsched.ref_end(quota + 1)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'move started')
+    lsched.move_end(1)
+end
+
+local function test_move_increase_quota(test)
+    test:plan(4)
+
+    local quota = lsched.move_quota
+    --
+    -- Move quota increase allows doing more moves even if there are pending
+    -- refs.
+    --
+    -- End move strike if there was one.
+    lsched.ref_start(big_timeout)
+    lsched.ref_end(1)
+    -- Fill the quota.
+    for _ = 1, quota do
+        assert(lsched.move_start(big_timeout))
+    end
+    -- Start ref to block new moves by quota.
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    test:ok(not lsched.move_start(small_timeout), 'can not add move - full quota')
+
+    lsched.cfg({sched_move_quota = quota + 1})
+    test:ok(lsched.move_start(small_timeout), 'now can add - quota is extended')
+
+    -- Decrease quota - should not accept new moves again.
+    lsched.cfg{sched_move_quota = quota}
+    test:ok(not lsched.move_start(small_timeout), 'full quota again')
+
+    lsched.move_end(quota + 1)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout < big_timeout, 'ref started')
+    lsched.ref_end(1)
+end
+
+local function test_ref_decrease_quota(test)
+    test:plan(4)
+
+    local old_quota = lsched.ref_quota
+    --
+    -- Quota decrease should not affect any existing operations or break
+    -- anything.
+    --
+    lsched.cfg({sched_ref_quota = 10})
+    for _ = 1, 5 do
+        assert(lsched.ref_start(big_timeout))
+    end
+    test:is(lsched.ref_count, 5, 'started refs below quota')
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    test:ok(lsched.ref_start(big_timeout), 'another ref after move queued')
+
+    lsched.cfg({sched_ref_quota = 2})
+    test:ok(not lsched.ref_start(small_timeout), 'quota decreased - can not '..
+            'start ref')
+
+    lsched.ref_end(6)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'move is started')
+    lsched.move_end(1)
+
+    lsched.cfg({sched_ref_quota = old_quota})
+end
+
+local function test_move_decrease_quota(test)
+    test:plan(4)
+
+    local old_quota = lsched.move_quota
+    --
+    -- Quota decrease should not affect any existing operations or break
+    -- anything.
+    --
+    lsched.cfg({sched_move_quota = 10})
+    for _ = 1, 5 do
+        assert(lsched.move_start(big_timeout))
+    end
+    test:is(lsched.move_count, 5, 'started moves below quota')
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    test:ok(lsched.move_start(big_timeout), 'another move after ref queued')
+
+    lsched.cfg({sched_move_quota = 2})
+    test:ok(not lsched.move_start(small_timeout), 'quota decreased - can not '..
+            'start move')
+
+    lsched.move_end(6)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'ref is started')
+    lsched.ref_end(1)
+
+    lsched.cfg({sched_move_quota = old_quota})
+end
+
+local function test_ref_zero_quota(test)
+    test:plan(6)
+
+    local old_quota = lsched.ref_quota
+    --
+    -- Zero quota is a valid value. Moreover, it is special. It means the
+    -- 0-quota operation should always be paused in favor of the other
+    -- operation.
+    --
+    lsched.cfg({sched_ref_quota = 0})
+    test:ok(lsched.ref_start(big_timeout), 'started ref with 0 quota')
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    test:ok(not lsched.ref_start(small_timeout), 'can not add more refs if '..
+            'move is queued - quota 0')
+
+    lsched.ref_end(1)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'move is started')
+
+    -- Ensure ref never starts if there are always moves, when quota is 0.
+    f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    local move_count = lsched.move_quota + 3
+    -- Start from 2 to account for the already existing move.
+    for _ = 2, move_count do
+        -- Start one new move.
+        assert(lsched.move_start(big_timeout))
+        -- Start second new move.
+        assert(lsched.move_start(big_timeout))
+        -- End first move.
+        lsched.move_end(1)
+        -- As a result, the moves always interleave - no time for refs at
+        -- all.
+    end
+    test:is(lsched.move_count, move_count, 'moves exceed quota')
+    test:ok(lsched.move_strike > move_count, 'strike is not interrupted')
+
+    lsched.move_end(move_count)
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'ref finally started')
+    lsched.ref_end(1)
+
+    lsched.cfg({sched_ref_quota = old_quota})
+end
+
+local function test_move_zero_quota(test)
+    test:plan(6)
+
+    local old_quota = lsched.move_quota
+    --
+    -- Zero quota is a valid value. Moreover, it is special. It means the
+    -- 0-quota operation should always be paused in favor of the other
+    -- operation.
+    --
+    lsched.cfg({sched_move_quota = 0})
+    test:ok(lsched.move_start(big_timeout), 'started move with 0 quota')
+
+    local f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.ref_start(big_timeout)
+    end)
+    test:ok(not lsched.move_start(small_timeout), 'can not add more moves if '..
+            'ref is queued - quota 0')
+
+    lsched.move_end(1)
+    local ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'ref is started')
+
+    -- Ensure move never starts if there are always refs, when quota is 0.
+    f = fiber.create(function()
+        fiber_set_joinable()
+        return lsched.move_start(big_timeout)
+    end)
+    local ref_count = lsched.ref_quota + 3
+    -- Start from 2 to account for the already existing ref.
+    for _ = 2, ref_count do
+        -- Start one new ref.
+        assert(lsched.ref_start(big_timeout))
+        -- Start second new ref.
+        assert(lsched.ref_start(big_timeout))
+        -- End first ref.
+        lsched.ref_end(1)
+        -- As a result, the refs always interleave - no time for moves at
+        -- all.
+    end
+    test:is(lsched.ref_count, ref_count, 'refs exceed quota')
+    test:ok(lsched.ref_strike > ref_count, 'strike is not interrupted')
+
+    lsched.ref_end(ref_count)
+    ok, new_timeout = f:join()
+    test:ok(ok and new_timeout, 'move finally started')
+    lsched.move_end(1)
+
+    lsched.cfg({sched_move_quota = old_quota})
+end
+
+test:plan(11)
+
+-- Change default values. Move is 1 by default, which would reduce the number of
+-- possible tests. Ref is decreased to speed the tests up.
+lsched.cfg({sched_ref_quota = 10, sched_move_quota = 5})
+
+test:test('basic', test_basic)
+test:test('negative timeout', test_negative_timeout)
+test:test('ref gc', test_move_gc_ref)
+test:test('ref strike', test_ref_strike)
+test:test('move strike', test_move_strike)
+test:test('ref add quota', test_ref_increase_quota)
+test:test('move add quota', test_move_increase_quota)
+test:test('ref decrease quota', test_ref_decrease_quota)
+test:test('move decrease quota', test_move_decrease_quota)
+test:test('ref zero quota', test_ref_zero_quota)
+test:test('move zero quota', test_move_zero_quota)
+
+os.exit(test:check() and 0 or 1)
diff --git a/test/unit/config.result b/test/unit/config.result
index e0b2482..9df3bf1 100644
--- a/test/unit/config.result
+++ b/test/unit/config.result
@@ -597,3 +597,62 @@ cfg.collect_bucket_garbage_interval = 100
 _ = lcfg.check(cfg)
 ---
 ...
+--
+-- gh-147: router map-reduce. It adds scheduler options on the storage.
+--
+cfg.sched_ref_quota = 100
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_ref_quota = 1
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_ref_quota = 0
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_ref_quota = -1
+---
+...
+util.check_error(lcfg.check, cfg)
+---
+- Scheduler storage ref quota must be non-negative number
+...
+cfg.sched_ref_quota = nil
+---
+...
+cfg.sched_move_quota = 100
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_move_quota = 1
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_move_quota = 0
+---
+...
+_ = lcfg.check(cfg)
+---
+...
+cfg.sched_move_quota = -1
+---
+...
+util.check_error(lcfg.check, cfg)
+---
+- Scheduler bucket move quota must be non-negative number
+...
+cfg.sched_move_quota = nil
+---
+...
diff --git a/test/unit/config.test.lua b/test/unit/config.test.lua
index a1c9f07..473e460 100644
--- a/test/unit/config.test.lua
+++ b/test/unit/config.test.lua
@@ -241,3 +241,26 @@ cfg.rebalancer_max_sending = nil
 --
 cfg.collect_bucket_garbage_interval = 100
 _ = lcfg.check(cfg)
+
+--
+-- gh-147: router map-reduce. It adds scheduler options on the storage.
+--
+cfg.sched_ref_quota = 100
+_ = lcfg.check(cfg)
+cfg.sched_ref_quota = 1
+_ = lcfg.check(cfg)
+cfg.sched_ref_quota = 0
+_ = lcfg.check(cfg)
+cfg.sched_ref_quota = -1
+util.check_error(lcfg.check, cfg)
+cfg.sched_ref_quota = nil
+
+cfg.sched_move_quota = 100
+_ = lcfg.check(cfg)
+cfg.sched_move_quota = 1
+_ = lcfg.check(cfg)
+cfg.sched_move_quota = 0
+_ = lcfg.check(cfg)
+cfg.sched_move_quota = -1
+util.check_error(lcfg.check, cfg)
+cfg.sched_move_quota = nil
diff --git a/vshard/cfg.lua b/vshard/cfg.lua
index 63d5414..30f8794 100644
--- a/vshard/cfg.lua
+++ b/vshard/cfg.lua
@@ -274,6 +274,14 @@ local cfg_template = {
         type = 'string', name = 'Discovery mode: on, off, once',
         is_optional = true, default = 'on', check = check_discovery_mode
     },
+    sched_ref_quota = {
+        name = 'Scheduler storage ref quota', type = 'non-negative number',
+        is_optional = true, default = consts.DEFAULT_SCHED_REF_QUOTA
+    },
+    sched_move_quota = {
+        name = 'Scheduler bucket move quota', type = 'non-negative number',
+        is_optional = true, default = consts.DEFAULT_SCHED_MOVE_QUOTA
+    },
 }
 
 --
diff --git a/vshard/consts.lua b/vshard/consts.lua
index 0ffe0e2..47a893b 100644
--- a/vshard/consts.lua
+++ b/vshard/consts.lua
@@ -41,6 +41,11 @@ return {
     GC_BACKOFF_INTERVAL = 5,
     RECOVERY_BACKOFF_INTERVAL = 5,
     COLLECT_LUA_GARBAGE_INTERVAL = 100;
+    DEFAULT_BUCKET_SEND_TIMEOUT = 10,
+    DEFAULT_BUCKET_RECV_TIMEOUT = 10,
+
+    DEFAULT_SCHED_REF_QUOTA = 300,
+    DEFAULT_SCHED_MOVE_QUOTA = 1,
 
     DISCOVERY_IDLE_INTERVAL = 10,
     DISCOVERY_WORK_INTERVAL = 1,
diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
index 7c1e97d..396664a 100644
--- a/vshard/storage/CMakeLists.txt
+++ b/vshard/storage/CMakeLists.txt
@@ -1,2 +1,2 @@
-install(FILES init.lua reload_evolution.lua ref.lua
+install(FILES init.lua reload_evolution.lua ref.lua sched.lua
         DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 2957f48..31f668f 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -17,7 +17,7 @@ if rawget(_G, MODULE_INTERNALS) then
         'vshard.replicaset', 'vshard.util',
         'vshard.storage.reload_evolution',
         'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
-        'vshard.heap', 'vshard.storage.ref',
+        'vshard.heap', 'vshard.storage.ref', 'vshard.storage.sched',
     }
     for _, module in pairs(vshard_modules) do
         package.loaded[module] = nil
@@ -32,6 +32,7 @@ local util = require('vshard.util')
 local lua_gc = require('vshard.lua_gc')
 local lregistry = require('vshard.registry')
 local lref = require('vshard.storage.ref')
+local lsched = require('vshard.storage.sched')
 local reload_evolution = require('vshard.storage.reload_evolution')
 local fiber_cond_wait = util.fiber_cond_wait
 local bucket_ref_new
@@ -1142,16 +1143,33 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
             return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
                                       from)
         end
-        if lref.count > 0 then
-            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
-        end
         if is_this_replicaset_locked() then
             return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
         end
         if not bucket_receiving_quota_add(-1) then
             return nil, lerror.vshard(lerror.code.TOO_MANY_RECEIVING)
         end
-        _bucket:insert({bucket_id, recvg, from})
+        local timeout = opts and opts.timeout or
+                        consts.DEFAULT_BUCKET_RECV_TIMEOUT
+        local ok, err = lsched.move_start(timeout)
+        if not ok then
+            return nil, err
+        end
+        assert(lref.count == 0)
+        -- The move is scheduled only for the time of the _bucket update.
+        -- The reason is that one bucket_send() calls bucket_recv() on the
+        -- remote storage multiple times. If the latter scheduled a new move
+        -- on each call, the scheduler could block it in favor of refs right
+        -- in the middle of bucket_send().
+        -- That would lead to a deadlock, because refs won't be able to start -
+        -- the bucket won't be writable.
+        -- This way still provides fair scheduling, but does not have the
+        -- described issue.
+        ok, err = pcall(_bucket.insert, _bucket, {bucket_id, recvg, from})
+        lsched.move_end(1)
+        if not ok then
+            return nil, lerror.make(err)
+        end
     elseif b.status ~= recvg then
         local msg = string.format("bucket state is changed: was receiving, "..
                                   "became %s", b.status)
@@ -1434,7 +1452,7 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
     ref.rw_lock = true
     exception_guard.ref = ref
     exception_guard.drop_rw_lock = true
-    local timeout = opts and opts.timeout or 10
+    local timeout = opts and opts.timeout or consts.DEFAULT_BUCKET_SEND_TIMEOUT
     local deadline = fiber_clock() + timeout
     while ref.rw ~= 0 do
         timeout = deadline - fiber_clock()
@@ -1446,9 +1464,6 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
 
     local _bucket = box.space._bucket
     local bucket = _bucket:get({bucket_id})
-    if lref.count > 0 then
-        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
-    end
     if is_this_replicaset_locked() then
         return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
     end
@@ -1468,7 +1483,25 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
     local idx = M.shard_index
     local bucket_generation = M.bucket_generation
     local sendg = consts.BUCKET.SENDING
-    _bucket:replace({bucket_id, sendg, destination})
+
+    local ok, err = lsched.move_start(timeout)
+    if not ok then
+        return nil, err
+    end
+    assert(lref.count == 0)
+    -- Move is scheduled only for the time of _bucket update because:
+    --
+    -- * it is consistent with bucket_recv() (see its comments);
+    --
+    -- * it gives the same effect as if the move was in the scheduler for the
+    --   whole bucket_send() time, because refs won't be able to start anyway -
+    --   the bucket is not writable.
+    ok, err = pcall(_bucket.replace, _bucket, {bucket_id, sendg, destination})
+    lsched.move_end(1)
+    if not ok then
+        return nil, lerror.make(err)
+    end
+
     -- From this moment the bucket is SENDING. Such a status is
     -- even stronger than the lock.
     ref.rw_lock = false
@@ -2542,6 +2575,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
         M.bucket_on_replace = bucket_generation_increment
     end
 
+    lsched.cfg(vshard_cfg)
     lreplicaset.rebind_replicasets(new_replicasets, M.replicasets)
     lreplicaset.outdate_replicasets(M.replicasets)
     M.replicasets = new_replicasets
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
index 7589cb9..2daad6b 100644
--- a/vshard/storage/ref.lua
+++ b/vshard/storage/ref.lua
@@ -33,6 +33,7 @@ local lregistry = require('vshard.registry')
 local fiber_clock = lfiber.clock
 local fiber_yield = lfiber.yield
 local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
+local TIMEOUT_INFINITY = lconsts.TIMEOUT_INFINITY
 local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
 
 --
@@ -88,6 +89,7 @@ local function ref_session_new(sid)
     -- Cache global session storages as upvalues to save on M indexing.
     local global_heap = M.session_heap
     local global_map = M.session_map
+    local sched = lregistry.storage_sched
 
     local function ref_session_discount(self, del_count)
         local new_count = M.count - del_count
@@ -97,6 +99,8 @@ local function ref_session_new(sid)
         new_count = count - del_count
         assert(new_count >= 0)
         count = new_count
+
+        sched.ref_end(del_count)
     end
 
     local function ref_session_update_deadline(self)
@@ -310,10 +314,17 @@ local function ref_add(rid, sid, timeout)
     local deadline = now + timeout
     local ok, err, session
     local storage = lregistry.storage
+    local sched = lregistry.storage_sched
+
+    timeout, err = sched.ref_start(timeout)
+    if not timeout then
+        return nil, err
+    end
+
     while not storage.bucket_are_all_rw() do
         ok, err = storage.bucket_generation_wait(timeout)
         if not ok then
-            return nil, err
+            goto fail_sched
         end
         now = fiber_clock()
         timeout = deadline - now
@@ -322,7 +333,13 @@ local function ref_add(rid, sid, timeout)
     if not session then
         session = ref_session_new(sid)
     end
-    return session:add(rid, deadline, now)
+    ok, err = session:add(rid, deadline, now)
+    if ok then
+        return true
+    end
+::fail_sched::
+    sched.ref_end(1)
+    return nil, err
 end
 
 local function ref_use(rid, sid)
@@ -341,6 +358,14 @@ local function ref_del(rid, sid)
     return session:del(rid)
 end
 
+local function ref_next_deadline()
+    local session = M.session_heap:top()
+    if not session then
+        return fiber_clock() + TIMEOUT_INFINITY
+    end
+    return session.deadline
+end
+
 local function ref_kill_session(sid)
     local session = M.session_map[sid]
     if session then
@@ -366,6 +391,7 @@ M.add = ref_add
 M.use = ref_use
 M.cfg = ref_cfg
 M.kill = ref_kill_session
+M.next_deadline = ref_next_deadline
 lregistry.storage_ref = M
 
 return M
diff --git a/vshard/storage/sched.lua b/vshard/storage/sched.lua
new file mode 100644
index 0000000..0ac71f4
--- /dev/null
+++ b/vshard/storage/sched.lua
@@ -0,0 +1,231 @@
+--
+-- The scheduler module ensures fair time sharing between two incompatible
+-- operations: storage refs and bucket moves.
+-- A storage ref is supposed to prevent all bucket moves and provide a safe
+-- environment for all kinds of requests on the entire dataset of all spaces
+-- stored on the instance.
+-- A bucket move, on the contrary, wants to make a part of the dataset
+-- temporarily unusable.
+-- Without a scheduler it would be possible to always keep at least one ref on
+-- the storage and block bucket moves forever. Or vice versa: rebalancing could
+-- block all incoming refs for the entire time of data migration, essentially
+-- making map-reduce unusable since it heavily depends on refs.
+--
+-- The scheduler divides storage time between refs and moves so that both can
+-- make progress without starving each other. The proportions depend on the
+-- configuration settings.
+--
+-- The idea of non-blockage is based on quotas and strikes. Move and ref both
+-- have quotas. When one op executes a full quota of requests in a row (makes a
+-- strike) while the other op has queued requests, the first op stops accepting
+-- new requests until the other op executes.
+--
+
+local MODULE_INTERNALS = '__module_vshard_storage_sched'
+-- Update when the behaviour of anything in the file changes, to allow reload.
+local MODULE_VERSION = 1
+
+local lfiber = require('fiber')
+local lerror = require('vshard.error')
+local lconsts = require('vshard.consts')
+local lregistry = require('vshard.registry')
+local lutil = require('vshard.util')
+local fiber_clock = lfiber.clock
+local fiber_cond_wait = lutil.fiber_cond_wait
+local fiber_is_self_canceled = lutil.fiber_is_self_canceled
+
+local M = rawget(_G, MODULE_INTERNALS)
+if not M then
+    M = {
+        ---------------- Common module attributes ----------------
+        module_version = MODULE_VERSION,
+        -- Scheduler condition is signaled every time anything significant
+        -- happens - count of an operation type drops to 0, or quota increased,
+        -- etc.
+        cond = lfiber.cond(),
+
+        -------------------------- Refs --------------------------
+        -- Number of ref requests waiting for start.
+        ref_queue = 0,
+        -- Number of ref requests being executed. It is the same as ref's module
+        -- counter, but is duplicated here for the sake of isolation and
+        -- symmetry with moves.
+        ref_count = 0,
+        -- Number of ref requests executed in a row. When it reaches the quota,
+        -- any queued move blocks new refs.
+        ref_strike = 0,
+        ref_quota = lconsts.DEFAULT_SCHED_REF_QUOTA,
+
+        ------------------------- Moves --------------------------
+        -- Number of move requests waiting for start.
+        move_queue = 0,
+        -- Number of move requests being executed.
+        move_count = 0,
+        -- Number of move requests executed in a row. When it reaches the quota,
+        -- any queued ref blocks new moves.
+        move_strike = 0,
+        move_quota = lconsts.DEFAULT_SCHED_MOVE_QUOTA,
+    }
+else
+    return M
+end
+
+local function sched_wait_anything(timeout)
+    return fiber_cond_wait(M.cond, timeout)
+end
+
+--
+-- Return the remaining timeout, recalculated if there was a yield. This spares
+-- the caller a clock read when there were no yields.
+--
+local function sched_ref_start(timeout)
+    local deadline = fiber_clock() + timeout
+    local ok, err
+    -- Fast path. Moves are extremely rare, so there is no need to inc-dec the
+    -- ref queue or to enter the wait loop.
+    if M.move_count == 0 and M.move_queue == 0 then
+        goto success
+    end
+
+    M.ref_queue = M.ref_queue + 1
+
+::retry::
+    if M.move_count > 0 then
+        goto wait_and_retry
+    end
+    -- Even if the move count is zero, the time usage must stay fair. This does
+    -- not matter if the moves have no quota at all: then they can be ignored
+    -- indefinitely until all refs end voluntarily.
+    if M.move_queue > 0 and M.ref_strike >= M.ref_quota and
+       M.move_quota > 0 then
+        goto wait_and_retry
+    end
+
+    M.ref_queue = M.ref_queue - 1
+
+::success::
+    M.ref_count = M.ref_count + 1
+    M.ref_strike = M.ref_strike + 1
+    M.move_strike = 0
+    do return timeout end
+
+::wait_and_retry::
+    ok, err = sched_wait_anything(timeout)
+    if not ok then
+        M.ref_queue = M.ref_queue - 1
+        return nil, err
+    end
+    timeout = deadline - fiber_clock()
+    goto retry
+end
+
+local function sched_ref_end(count)
+    count = M.ref_count - count
+    M.ref_count = count
+    if count == 0 and M.move_queue > 0 then
+        M.cond:broadcast()
+    end
+end
+
+--
+-- Return the remaining timeout, recalculated if there was a yield. This spares
+-- the caller a clock read when there were no yields.
+--
+local function sched_move_start(timeout)
+    local deadline = fiber_clock() + timeout
+    local ok, err, ref_deadline
+    local lref = lregistry.storage_ref
+    -- Fast path. Refs are not extremely rare *when used*, but they are not
+    -- expected to be used in a lot of installations. So most of the time the
+    -- moves should start right away.
+    if M.ref_count == 0 and M.ref_queue == 0 then
+        goto success
+    end
+
+    M.move_queue = M.move_queue + 1
+
+::retry::
+    if M.ref_count > 0 then
+        ref_deadline = lref.next_deadline()
+        if ref_deadline < deadline then
+            timeout = ref_deadline - fiber_clock()
+        end
+        ok, err = sched_wait_anything(timeout)
+        timeout = deadline - fiber_clock()
+        if ok then
+            goto retry
+        end
+        if fiber_is_self_canceled() then
+            goto fail
+        end
+        -- Even if the timeout has expired already (or was 0 from the
+        -- beginning), it is still possible the move can be started if all the
+        -- present refs are expired too and can be collected.
+        lref.gc()
+        -- GC could yield - need to refetch the clock again.
+        timeout = deadline - fiber_clock()
+        if M.ref_count > 0 then
+            if timeout < 0 then
+                goto fail
+            end
+            goto retry
+        end
+    end
+
+    if M.ref_queue > 0 and M.move_strike >= M.move_quota and
+       M.ref_quota > 0 then
+        ok, err = sched_wait_anything(timeout)
+        if not ok then
+            goto fail
+        end
+        timeout = deadline - fiber_clock()
+        goto retry
+    end
+
+    M.move_queue = M.move_queue - 1
+
+::success::
+    M.move_count = M.move_count + 1
+    M.move_strike = M.move_strike + 1
+    M.ref_strike = 0
+    do return timeout end
+
+::fail::
+    M.move_queue = M.move_queue - 1
+    return nil, err
+end
+
+local function sched_move_end(count)
+    count = M.move_count - count
+    M.move_count = count
+    if count == 0 and M.ref_queue > 0 then
+        M.cond:broadcast()
+    end
+end
+
+local function sched_cfg(cfg)
+    local new_ref_quota = cfg.sched_ref_quota
+    local new_move_quota = cfg.sched_move_quota
+
+    if new_ref_quota then
+        if new_ref_quota > M.ref_quota then
+            M.cond:broadcast()
+        end
+        M.ref_quota = new_ref_quota
+    end
+    if new_move_quota then
+        if new_move_quota > M.move_quota then
+            M.cond:broadcast()
+        end
+        M.move_quota = new_move_quota
+    end
+end
+
+M.ref_start = sched_ref_start
+M.ref_end = sched_ref_end
+M.move_start = sched_move_start
+M.move_end = sched_move_end
+M.cfg = sched_cfg
+lregistry.storage_sched = M
+
+return M
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout() Vladislav Shpilevoy via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation Vladislav Shpilevoy via Tarantool-patches
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Closes #147

@TarantoolBot document
Title: vshard.router.map_callrw()

`vshard.router.map_callrw()` implements consistent map-reduce over
the entire cluster. Consistency means that all the data was
accessible and did not move while the map requests were executing.

It is useful when a request needs to access potentially all the
data in the cluster, or simply a huge number of buckets scattered
over the instances, for which individual `vshard.router.call()`
requests would take too long.

`map_callrw()` takes the name of the function to call on the
storages, the arguments as an array, and an optional map of
options. The only supported option for now is `timeout`, which
applies to the entire call, not to the individual calls to each
storage.
```
vshard.router.map_callrw(func_name, args[, {timeout = <seconds>}])
```

The chosen function is called on the master node of each
replicaset with the given arguments.

In case of success `vshard.router.map_callrw()` returns a map with
replicaset UUIDs as keys and results of the user's function as
values, like this:
```
{uuid1 = {res1}, uuid2 = {res2}, ...}
```
If the function returned `nil` or `box.NULL` from one of the
storages, it won't be present in the result map.

In case of failure it returns `nil`, an error object, and an
optional UUID of the replicaset where the error happened. The UUID
may be absent if the error is not related to a concrete replicaset.

For instance, the method fails if not all buckets were found even
if all replicasets were scanned successfully.

Handling the result looks like this:
```Lua
res, err, uuid = vshard.router.map_callrw(...)
if not res then
    -- Error.
    -- 'err' - error object. 'uuid' - optional UUID of replicaset
    -- where the error happened.
    ...
else
    -- Success.
    for uuid, value in pairs(res) do
        ...
    end
end
```

Map-Reduce in vshard works in 3 stages: Ref, Map, Reduce. Ref is
an internal stage which is supposed to ensure data consistency
during user's function execution on all nodes.

Reduce is not performed by vshard. It is what user's code does
with results of `map_callrw()`.
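
As an illustration of the Reduce stage, here is a minimal sketch of a
user-side reducer. It assumes the stored function returns one number per
storage. The name `reduce_sum` and the sample table are hypothetical; the
sample is written by hand in the shape of a successful `map_callrw()` result
and is not real cluster output.
```Lua
-- Hypothetical reducer: sums the numeric result returned by every replicaset
-- and counts how many replicasets are present in the result map.
local function reduce_sum(res)
    local total, replied = 0, 0
    for _, value in pairs(res) do
        -- Each value is an array holding the single result of the function.
        total = total + value[1]
        replied = replied + 1
    end
    return total, replied
end

-- Hand-written table shaped like a successful map_callrw() result.
local sample = {uuid1 = {10}, uuid2 = {32}}
local total, replied = reduce_sum(sample)
print(total, replied)
```
Counting the replies is a cheap way to notice storages whose function returned
`nil` or `box.NULL`, since those are absent from the result map.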

Consistency, as it is defined for map-reduce, is not compatible
with rebalancing, because any bucket move would make the sender
and receiver nodes 'inconsistent': it would not be possible to
call a function on them which could simply access all the data
without doing `vshard.storage.bucket_ref()`.

This makes the Ref stage very intricate, as it must work together
with the rebalancer to ensure that neither of them blocks the other.

For this, the storage has a scheduler specifically for bucket moves
and storage refs, which shares storage time between them fairly.
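
The admission rule the scheduler applies to a new ref can be modelled without
fibers. The sketch below is a simplified, non-blocking model and not the
module's code: the field names mirror the scheduler state in
`vshard/storage/sched.lua`, while `can_start_ref` and the sample state are
hypothetical.
```Lua
-- Simplified model: may a new ref start right now?
local function can_start_ref(s)
    -- A running move always blocks refs.
    if s.move_count > 0 then
        return false
    end
    -- A queued move blocks refs once the refs have used up their quota,
    -- unless the moves have no quota at all.
    if s.move_queue > 0 and s.ref_strike >= s.ref_quota and
       s.move_quota > 0 then
        return false
    end
    return true
end

-- One move is queued, refs have a quota of 2.
local state = {
    move_count = 0, move_queue = 1, move_quota = 1,
    ref_strike = 0, ref_quota = 2,
}
local admitted = 0
while can_start_ref(state) do
    state.ref_strike = state.ref_strike + 1
    admitted = admitted + 1
end
print(admitted)
```
With these numbers two refs are admitted in a row and the third one has to
wait for the queued move. The rule for moves is symmetric.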

What counts as fair depends on how long and how frequent the moves
and refs are. This can be configured using the storage options
`sched_move_quota` and `sched_ref_quota`. See more details about
them in the corresponding doc section.
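
A configuration fragment might look like the sketch below. The option names
come from this patch; the numeric values are assumptions chosen only for
illustration, and the rest of the storage configuration is elided.
```Lua
-- Hypothetical quota values, for illustration only.
local cfg = {
    -- ... sharding, bucket_count and the other usual options ...
    sched_ref_quota = 300,
    sched_move_quota = 1,
}
-- The table is then passed to vshard.storage.cfg() as usual.
```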

The scheduler configuration may affect map-reduce requests if they
are used a lot during rebalancing.

Keep in mind that it is not a good idea to use too big timeouts
for `map_callrw()`, because the router will try to block the
bucket moves for the given timeout on all storages. If something
goes wrong, the block will remain for the entire timeout. This
means, in particular, that a timeout longer than, say, minutes is
a very bad idea unless it is for tests only.
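
One way to follow this advice is to cap the timeout in a small wrapper. This
is only a sketch: `MAX_MAP_TIMEOUT`, `map_callrw_capped` and
`fake_map_callrw` are hypothetical names, and the fake function stands in for
`vshard.router.map_callrw` so that the example runs without a cluster.
```Lua
-- Hypothetical upper bound for map-reduce timeouts, in seconds.
local MAX_MAP_TIMEOUT = 30

-- Calls the given map function with a timeout never above the cap.
local function map_callrw_capped(map_callrw, func_name, args, timeout)
    if timeout == nil or timeout > MAX_MAP_TIMEOUT then
        timeout = MAX_MAP_TIMEOUT
    end
    return map_callrw(func_name, args, {timeout = timeout})
end

-- Stand-in for vshard.router.map_callrw: echoes the timeout it received.
local function fake_map_callrw(_, _, opts)
    return {uuid1 = {opts.timeout}}
end

local res = map_callrw_capped(fake_map_callrw, 'echo', {1}, 1000000)
print(res.uuid1[1])
```
In real code the first argument would be `vshard.router.map_callrw` itself.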

Also it is important to remember that `map_callrw()` does not
work on replicas. It works only on masters. This makes it unusable
if at least one replicaset has its master node down.
---
 test/router/map-reduce.result   | 636 ++++++++++++++++++++++++++++++++
 test/router/map-reduce.test.lua | 258 +++++++++++++
 test/router/router.result       |   9 +-
 test/upgrade/upgrade.result     |   5 +-
 vshard/replicaset.lua           |  34 ++
 vshard/router/init.lua          | 180 +++++++++
 vshard/storage/init.lua         |  47 +++
 7 files changed, 1164 insertions(+), 5 deletions(-)
 create mode 100644 test/router/map-reduce.result
 create mode 100644 test/router/map-reduce.test.lua

diff --git a/test/router/map-reduce.result b/test/router/map-reduce.result
new file mode 100644
index 0000000..1e8995a
--- /dev/null
+++ b/test/router/map-reduce.result
@@ -0,0 +1,636 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+ | ---
+ | ...
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+ | ---
+ | ...
+test_run:create_cluster(REPLICASET_1, 'router')
+ | ---
+ | ...
+test_run:create_cluster(REPLICASET_2, 'router')
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+ | ---
+ | ...
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+ | ---
+ | ...
+util.push_rs_filters(test_run)
+ | ---
+ | ...
+_ = test_run:cmd("create server router_1 with script='router/router_1.lua'")
+ | ---
+ | ...
+_ = test_run:cmd("start server router_1")
+ | ---
+ | ...
+
+_ = test_run:switch("router_1")
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+
+--
+-- gh-147: consistent map-reduce.
+--
+big_timeout = 1000000
+ | ---
+ | ...
+big_timeout_opts = {timeout = big_timeout}
+ | ---
+ | ...
+vshard.router.cfg(cfg)
+ | ---
+ | ...
+vshard.router.bootstrap(big_timeout_opts)
+ | ---
+ | - true
+ | ...
+-- Trivial basic sanity test. Multireturn is not supported, should be truncated.
+vshard.router.map_callrw('echo', {1, 2, 3}, big_timeout_opts)
+ | ---
+ | - <replicaset_2>:
+ |   - 1
+ |   <replicaset_1>:
+ |   - 1
+ | ...
+
+--
+-- Fail during connecting to storages. For the succeeded storages the router
+-- tries to send unref.
+--
+timeout = 0.001
+ | ---
+ | ...
+timeout_opts = {timeout = timeout}
+ | ---
+ | ...
+
+test_run:cmd('stop server storage_1_a')
+ | ---
+ | - true
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+-- Even if ref was sent successfully to storage_2_a, it was deleted before
+-- router returned an error.
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+-- Wait because unref is sent asynchronously. Could arrive not immediately.
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+test_run:cmd('start server storage_1_a')
+ | ---
+ | - true
+ | ...
+-- Works again - router waited for connection being established.
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | - <replicaset_2>:
+ |   - 1
+ |   <replicaset_1>:
+ |   - 1
+ | ...
+
+--
+-- Do all the same but with another storage being stopped. The same test is done
+-- again because can't tell at which of the tests to where the router will go
+-- first.
+--
+test_run:cmd('stop server storage_2_a')
+ | ---
+ | - true
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+test_run:cmd('start server storage_2_a')
+ | ---
+ | - true
+ | ...
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | - <replicaset_2>:
+ |   - 1
+ |   <replicaset_1>:
+ |   - 1
+ | ...
+
+--
+-- Fail at ref stage handling. Unrefs are sent to cancel those refs which
+-- succeeded. To simulate a ref fail make the router think there is a moving
+-- bucket.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lsched = require('vshard.storage.sched')
+ | ---
+ | ...
+big_timeout = 1000000
+ | ---
+ | ...
+lsched.move_start(big_timeout)
+ | ---
+ | - 1000000
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+lsched = require('vshard.storage.sched')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Do all the same with another storage being busy with a 'move'.
+--
+big_timeout = 1000000
+ | ---
+ | ...
+lsched.move_start(big_timeout)
+ | ---
+ | - 1000000
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+lsched.move_end(1)
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+lsched.move_end(1)
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | - <replicaset_2>:
+ |   - 1
+ |   <replicaset_1>:
+ |   - 1
+ | ...
+
+--
+-- Ref can fail earlier than by a timeout. Router still should broadcast unrefs
+-- correctly. To simulate ref fail add a duplicate manually.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+box.schema.user.grant('storage', 'super')
+ | ---
+ | ...
+router_sid = nil
+ | ---
+ | ...
+function save_router_sid()                                                      \
+    router_sid = box.session.id()                                               \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+box.schema.user.grant('storage', 'super')
+ | ---
+ | ...
+router_sid = nil
+ | ---
+ | ...
+function save_router_sid()                                                      \
+    router_sid = box.session.id()                                               \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+vshard.router.map_callrw('save_router_sid', {}, big_timeout_opts)
+ | ---
+ | - []
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref.add(1, router_sid, big_timeout)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+vshard.router.internal.ref_id = 1
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - 'Can not add a storage ref: duplicate ref'
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+_ = lref.del(1, router_sid)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+lref.add(1, router_sid, big_timeout)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+vshard.router.internal.ref_id = 1
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - 'Can not add a storage ref: duplicate ref'
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+_ = lref.del(1, router_sid)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Fail if some buckets are not visible. Even if all the known replicasets were
+-- scanned. It means consistency violation.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+bucket_id = box.space._bucket.index.pk:min().id
+ | ---
+ | ...
+vshard.storage.bucket_force_drop(bucket_id)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - 1 buckets are not discovered
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+vshard.storage.bucket_force_create(bucket_id)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+bucket_id = box.space._bucket.index.pk:min().id
+ | ---
+ | ...
+vshard.storage.bucket_force_drop(bucket_id)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - 1 buckets are not discovered
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+vshard.storage.bucket_force_create(bucket_id)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Storage map unit tests.
+--
+
+-- Map fails not being able to use the ref.
+ok, err = vshard.storage._call('storage_map', 0, 'echo', {1})
+ | ---
+ | ...
+ok, err.message
+ | ---
+ | - null
+ | - 'Can not use a storage ref: no session'
+ | ...
+
+-- Map fails and clears the ref when the user function fails.
+vshard.storage._call('storage_ref', 0, big_timeout)
+ | ---
+ | - 1500
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+ok, err = vshard.storage._call('storage_map', 0, 'raise_client_error', {})
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Unknown error
+ | ...
+
+-- Map fails gracefully when couldn't delete the ref.
+vshard.storage._call('storage_ref', 0, big_timeout)
+ | ---
+ | - 1500
+ | ...
+ok, err = vshard.storage._call('storage_map', 0, 'vshard.storage._call',        \
+                               {'storage_unref', 0})
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - 'Can not delete a storage ref: no ref'
+ | ...
+
+--
+-- Map fail is handled and the router tries to send unrefs.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+need_throw = true
+ | ---
+ | ...
+function map_throw()                                                            \
+    if need_throw then                                                          \
+        raise_client_error()                                                    \
+    end                                                                         \
+    return '+'                                                                  \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+need_throw = false
+ | ---
+ | ...
+function map_throw()                                                            \
+    if need_throw then                                                          \
+        raise_client_error()                                                    \
+    end                                                                         \
+    return '+'                                                                  \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
+ | ---
+ | ...
+ok, err.message
+ | ---
+ | - null
+ | - Unknown error
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+need_throw = false
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+need_throw = true
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('router_1')
+ | ---
+ | ...
+ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
+ | ---
+ | ...
+ok, err.message
+ | ---
+ | - null
+ | - Unknown error
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('default')
+ | ---
+ | ...
+_ = test_run:cmd("stop server router_1")
+ | ---
+ | ...
+_ = test_run:cmd("cleanup server router_1")
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_1)
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_2)
+ | ---
+ | ...
+_ = test_run:cmd('clear filter')
+ | ---
+ | ...
diff --git a/test/router/map-reduce.test.lua b/test/router/map-reduce.test.lua
new file mode 100644
index 0000000..3b63248
--- /dev/null
+++ b/test/router/map-reduce.test.lua
@@ -0,0 +1,258 @@
+test_run = require('test_run').new()
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+test_run:create_cluster(REPLICASET_1, 'router')
+test_run:create_cluster(REPLICASET_2, 'router')
+util = require('util')
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+util.push_rs_filters(test_run)
+_ = test_run:cmd("create server router_1 with script='router/router_1.lua'")
+_ = test_run:cmd("start server router_1")
+
+_ = test_run:switch("router_1")
+util = require('util')
+
+--
+-- gh-147: consistent map-reduce.
+--
+big_timeout = 1000000
+big_timeout_opts = {timeout = big_timeout}
+vshard.router.cfg(cfg)
+vshard.router.bootstrap(big_timeout_opts)
+-- Trivial basic sanity test. Multireturn is not supported, should be truncated.
+vshard.router.map_callrw('echo', {1, 2, 3}, big_timeout_opts)
+
+--
+-- Fail during connecting to storages. For the succeeded storages the router
+-- tries to send unref.
+--
+timeout = 0.001
+timeout_opts = {timeout = timeout}
+
+test_run:cmd('stop server storage_1_a')
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+assert(not ok and err.message)
+-- Even if ref was sent successfully to storage_2_a, it was deleted before
+-- router returned an error.
+_ = test_run:switch('storage_2_a')
+lref = require('vshard.storage.ref')
+-- Wait because unref is sent asynchronously. Could arrive not immediately.
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('router_1')
+test_run:cmd('start server storage_1_a')
+-- Works again - router waited for connection being established.
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+
+--
+-- Do all the same but with another storage being stopped. The same test is done
+-- again because can't tell at which of the tests to where the router will go
+-- first.
+--
+test_run:cmd('stop server storage_2_a')
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+assert(not ok and err.message)
+_ = test_run:switch('storage_1_a')
+lref = require('vshard.storage.ref')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('router_1')
+test_run:cmd('start server storage_2_a')
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+
+--
+-- Fail at ref stage handling. Unrefs are sent to cancel those refs which
+-- succeeded. To simulate a ref fail make the router think there is a moving
+-- bucket.
+--
+_ = test_run:switch('storage_1_a')
+lsched = require('vshard.storage.sched')
+big_timeout = 1000000
+lsched.move_start(big_timeout)
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_2_a')
+lsched = require('vshard.storage.sched')
+lref = require('vshard.storage.ref')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+--
+-- Do all the same with another storage being busy with a 'move'.
+--
+big_timeout = 1000000
+lsched.move_start(big_timeout)
+
+_ = test_run:switch('storage_1_a')
+lref = require('vshard.storage.ref')
+lsched.move_end(1)
+assert(lref.count == 0)
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('storage_2_a')
+lsched.move_end(1)
+assert(lref.count == 0)
+
+_ = test_run:switch('router_1')
+vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+
+--
+-- Ref can fail earlier than by a timeout. Router still should broadcast unrefs
+-- correctly. To simulate ref fail add a duplicate manually.
+--
+_ = test_run:switch('storage_1_a')
+box.schema.user.grant('storage', 'super')
+router_sid = nil
+function save_router_sid()                                                      \
+    router_sid = box.session.id()                                               \
+end
+
+_ = test_run:switch('storage_2_a')
+box.schema.user.grant('storage', 'super')
+router_sid = nil
+function save_router_sid()                                                      \
+    router_sid = box.session.id()                                               \
+end
+
+_ = test_run:switch('router_1')
+vshard.router.map_callrw('save_router_sid', {}, big_timeout_opts)
+
+_ = test_run:switch('storage_1_a')
+lref.add(1, router_sid, big_timeout)
+
+_ = test_run:switch('router_1')
+vshard.router.internal.ref_id = 1
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_1_a')
+_ = lref.del(1, router_sid)
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+lref.add(1, router_sid, big_timeout)
+
+_ = test_run:switch('router_1')
+vshard.router.internal.ref_id = 1
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_2_a')
+_ = lref.del(1, router_sid)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+--
+-- Fail if some buckets are not visible. Even if all the known replicasets were
+-- scanned. It means consistency violation.
+--
+_ = test_run:switch('storage_1_a')
+bucket_id = box.space._bucket.index.pk:min().id
+vshard.storage.bucket_force_drop(bucket_id)
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+vshard.storage.bucket_force_create(bucket_id)
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+bucket_id = box.space._bucket.index.pk:min().id
+vshard.storage.bucket_force_drop(bucket_id)
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+vshard.storage.bucket_force_create(bucket_id)
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+--
+-- Storage map unit tests.
+--
+
+-- Map fails not being able to use the ref.
+ok, err = vshard.storage._call('storage_map', 0, 'echo', {1})
+ok, err.message
+
+-- Map fails and clears the ref when the user function fails.
+vshard.storage._call('storage_ref', 0, big_timeout)
+assert(lref.count == 1)
+ok, err = vshard.storage._call('storage_map', 0, 'raise_client_error', {})
+assert(lref.count == 0)
+assert(not ok and err.message)
+
+-- Map fails gracefully when couldn't delete the ref.
+vshard.storage._call('storage_ref', 0, big_timeout)
+ok, err = vshard.storage._call('storage_map', 0, 'vshard.storage._call',        \
+                               {'storage_unref', 0})
+assert(lref.count == 0)
+assert(not ok and err.message)
+
+--
+-- Map fail is handled and the router tries to send unrefs.
+--
+_ = test_run:switch('storage_1_a')
+need_throw = true
+function map_throw()                                                            \
+    if need_throw then                                                          \
+        raise_client_error()                                                    \
+    end                                                                         \
+    return '+'                                                                  \
+end
+
+_ = test_run:switch('storage_2_a')
+need_throw = false
+function map_throw()                                                            \
+    if need_throw then                                                          \
+        raise_client_error()                                                    \
+    end                                                                         \
+    return '+'                                                                  \
+end
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
+ok, err.message
+
+_ = test_run:switch('storage_1_a')
+need_throw = false
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('storage_2_a')
+need_throw = true
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('router_1')
+ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
+ok, err.message
+
+_ = test_run:switch('storage_1_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch('default')
+_ = test_run:cmd("stop server router_1")
+_ = test_run:cmd("cleanup server router_1")
+test_run:drop_cluster(REPLICASET_1)
+test_run:drop_cluster(REPLICASET_2)
+_ = test_run:cmd('clear filter')
\ No newline at end of file
diff --git a/test/router/router.result b/test/router/router.result
index 3c1d073..f9ee37c 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -1163,14 +1163,15 @@ error_messages
 - - Use replicaset:callro(...) instead of replicaset.callro(...)
   - Use replicaset:connect_master(...) instead of replicaset.connect_master(...)
   - Use replicaset:callre(...) instead of replicaset.callre(...)
-  - Use replicaset:connect_replica(...) instead of replicaset.connect_replica(...)
   - Use replicaset:down_replica_priority(...) instead of replicaset.down_replica_priority(...)
-  - Use replicaset:callrw(...) instead of replicaset.callrw(...)
+  - Use replicaset:connect(...) instead of replicaset.connect(...)
+  - Use replicaset:wait_connected(...) instead of replicaset.wait_connected(...)
+  - Use replicaset:up_replica_priority(...) instead of replicaset.up_replica_priority(...)
   - Use replicaset:callbro(...) instead of replicaset.callbro(...)
   - Use replicaset:connect_all(...) instead of replicaset.connect_all(...)
+  - Use replicaset:connect_replica(...) instead of replicaset.connect_replica(...)
   - Use replicaset:call(...) instead of replicaset.call(...)
-  - Use replicaset:connect(...) instead of replicaset.connect(...)
-  - Use replicaset:up_replica_priority(...) instead of replicaset.up_replica_priority(...)
+  - Use replicaset:callrw(...) instead of replicaset.callrw(...)
   - Use replicaset:callbre(...) instead of replicaset.callbre(...)
 ...
 _, replica = next(replicaset.replicas)
diff --git a/test/upgrade/upgrade.result b/test/upgrade/upgrade.result
index c2d54a3..833da3f 100644
--- a/test/upgrade/upgrade.result
+++ b/test/upgrade/upgrade.result
@@ -162,9 +162,12 @@ vshard.storage._call ~= nil
 vshard.storage._call('test_api', 1, 2, 3)
  | ---
  | - bucket_recv: true
+ |   storage_ref: true
  |   rebalancer_apply_routes: true
- |   test_api: true
+ |   storage_map: true
  |   rebalancer_request_state: true
+ |   test_api: true
+ |   storage_unref: true
  | - 1
  | - 2
  | - 3
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 7437e3b..56ea165 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -139,6 +139,39 @@ local function replicaset_connect_master(replicaset)
     return replicaset_connect_to_replica(replicaset, master)
 end
 
+--
+-- Wait until the master instance is connected. This is necessary at least for
+-- async requests because they fail immediately if the connection is not
+-- established.
+-- Returns the remaining timeout because it is expected to be used to connect
+-- to many replicasets in a loop, where such a return saves one clock get in
+-- the caller code and is just cleaner code.
+--
+local function replicaset_wait_connected(replicaset, timeout)
+    local deadline = fiber_clock() + timeout
+    local ok, res
+    while true do
+        local conn = replicaset_connect_master(replicaset)
+        if conn.state == 'active' then
+            return timeout
+        end
+        -- Netbox uses fiber_cond inside, which throws an irrelevant usage error
+        -- at negative timeout. Need to check the case manually.
+        if timeout < 0 then
+            return nil, lerror.timeout()
+        end
+        ok, res = pcall(conn.wait_connected, conn, timeout)
+        if not ok then
+            return nil, lerror.make(res)
+        end
+        if not res then
+            return nil, lerror.timeout()
+        end
+        timeout = deadline - fiber_clock()
+    end
+    assert(false)
+end
+
 --
 -- Create net.box connections to all replicas and master.
 --
@@ -483,6 +516,7 @@ local replicaset_mt = {
         connect_replica = replicaset_connect_to_replica;
         down_replica_priority = replicaset_down_replica_priority;
         up_replica_priority = replicaset_up_replica_priority;
+        wait_connected = replicaset_wait_connected,
         call = replicaset_master_call;
         callrw = replicaset_master_call;
         callro = replicaset_template_multicallro(false, false);
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 97bcb0a..8abd77f 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -44,6 +44,11 @@ if not M then
         module_version = 0,
         -- Number of router which require collecting lua garbage.
         collect_lua_garbage_cnt = 0,
+
+        ----------------------- Map-Reduce -----------------------
+        -- Storage Ref ID. It must be unique for each ref request
+        -- and therefore is global and monotonically growing.
+        ref_id = 0,
     }
 end
 
@@ -674,6 +679,177 @@ local function router_call(router, bucket_id, opts, ...)
                             ...)
 end
 
+local router_map_callrw
+
+if util.version_is_at_least(1, 10, 0) then
+--
+-- Consistent Map-Reduce. The given function is called on all masters in the
+-- cluster with a guarantee that in case of success it was executed with all
+-- buckets being accessible for reads and writes.
+--
+-- Consistency in scope of map-reduce means all the data was accessible, and
+-- didn't move during map requests execution. To preserve the consistency there
+-- is a third stage - Ref. So the algorithm is actually Ref-Map-Reduce.
+--
+-- Refs are broadcast before Map stage to pin the buckets to their storages, and
+-- ensure they won't move until maps are done.
+--
+-- Map requests are broadcast in case all refs are done successfully. They
+-- execute the user function + delete the refs to enable rebalancing again.
+--
+-- On the storages there are additional means to ensure map-reduces don't block
+-- rebalancing forever and vice versa.
+--
+-- The function is not as slow as it may seem - it uses netbox's feature
+-- is_async to send refs and maps in parallel. So cost of the function is about
+-- 2 network exchanges to the farthest storage in terms of time.
+--
+-- @param router Router instance to use.
+-- @param func Name of the function to call.
+-- @param args Function arguments passed in netbox style (as an array).
+-- @param opts Can only contain 'timeout' as a number of seconds. Note that the
+--     refs may end up being kept on the storages during this entire timeout if
+--     something goes wrong. For instance, network issues appear. This means
+--     better not use a value bigger than necessary. A stuck infinite ref can
+--     only be dropped by this router restart/reconnect or the storage restart.
+--
+-- @return In case of success - a map with replicaset UUID keys and values being
+--     what the function returned from the replicaset.
+--
+-- @return In case of an error - nil, error object, optional UUID of the
+--     replicaset where the error happened. UUID may be not present if it wasn't
+--     about concrete replicaset. For example, not all buckets were found even
+--     though all replicasets were scanned.
+--
+router_map_callrw = function(router, func, args, opts)
+    local replicasets = router.replicasets
+    local timeout = opts and opts.timeout or consts.CALL_TIMEOUT_MIN
+    local deadline = fiber_clock() + timeout
+    local err, err_uuid, res, ok, map
+    local futures = {}
+    local bucket_count = 0
+    local opts_async = {is_async = true}
+    local rs_count = 0
+    local rid = M.ref_id
+    M.ref_id = rid + 1
+    -- Nil checks are done explicitly here (== nil instead of 'not'), because
+    -- netbox requests return box.NULL instead of nils.
+
+    --
+    -- Ref stage: send.
+    --
+    for uuid, rs in pairs(replicasets) do
+        -- Netbox async requests work only with active connections. Need to wait
+        -- for the connection explicitly.
+        timeout, err = rs:wait_connected(timeout)
+        if timeout == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        res, err = rs:callrw('vshard.storage._call',
+                              {'storage_ref', rid, timeout}, opts_async)
+        if res == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        futures[uuid] = res
+        rs_count = rs_count + 1
+    end
+    map = table_new(0, rs_count)
+    --
+    -- Ref stage: collect.
+    --
+    for uuid, future in pairs(futures) do
+        res, err = future:wait_result(timeout)
+        -- Handle netbox error first.
+        if res == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        -- Ref returns nil,err or bucket count.
+        res, err = unpack(res)
+        if res == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        bucket_count = bucket_count + res
+        timeout = deadline - fiber_clock()
+    end
+    -- All refs are done but not all buckets are covered. This is odd and can
+    -- mean many things. The most likely ones: 1) outdated configuration on
+    -- the router and it does not see another replicaset with more buckets,
+    -- 2) some buckets are simply lost or duplicated - could happen as a bug, or
+    -- if the user does a maintenance of some kind by creating/deleting buckets.
+    -- In both cases can't guarantee all the data would be covered by Map calls.
+    if bucket_count ~= router.total_bucket_count then
+        err = lerror.vshard(lerror.code.UNKNOWN_BUCKETS,
+                            router.total_bucket_count - bucket_count)
+        goto fail
+    end
+    --
+    -- Map stage: send.
+    --
+    args = {'storage_map', rid, func, args}
+    for uuid, rs in pairs(replicasets) do
+        res, err = rs:callrw('vshard.storage._call', args, opts_async)
+        if res == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        futures[uuid] = res
+    end
+    --
+    -- Map stage: collect.
+    --
+    for uuid, f in pairs(futures) do
+        res, err = f:wait_result(timeout)
+        if res == nil then
+            err_uuid = uuid
+            goto fail
+        end
+        -- Map returns true,res or nil,err.
+        ok, res = unpack(res)
+        if ok == nil then
+            err = res
+            err_uuid = uuid
+            goto fail
+        end
+        if res ~= nil then
+            -- Store as a table so in future it could be extended for
+            -- multireturn.
+            map[uuid] = {res}
+        end
+        timeout = deadline - fiber_clock()
+    end
+    do return map end
+
+::fail::
+    for uuid, f in pairs(futures) do
+        f:discard()
+        -- Best effort to remove the created refs before exiting. Can help if
+        -- the timeout was big and the error happened early.
+        f = replicasets[uuid]:callrw('vshard.storage._call',
+                                     {'storage_unref', rid}, opts_async)
+        if f ~= nil then
+            -- Don't care waiting for a result - no time for this. But it won't
+            -- affect the request sending if the connection is still alive.
+            f:discard()
+        end
+    end
+    err = lerror.make(err)
+    return nil, err, err_uuid
+end
+
+-- Version >= 1.10.
+else
+-- Version < 1.10.
+
+router_map_callrw = function()
+    error('Supported for Tarantool >= 1.10')
+end
+
+end
+
 --
 -- Get replicaset object by bucket identifier.
 -- @param bucket_id Bucket identifier.
@@ -1268,6 +1444,7 @@ local router_mt = {
         callrw = router_callrw;
         callre = router_callre;
         callbre = router_callbre;
+        map_callrw = router_map_callrw,
         route = router_route;
         routeall = router_routeall;
         bucket_id = router_bucket_id,
@@ -1365,6 +1542,9 @@ end
 if not rawget(_G, MODULE_INTERNALS) then
     rawset(_G, MODULE_INTERNALS, M)
 else
+    if not M.ref_id then
+        M.ref_id = 0
+    end
     for _, router in pairs(M.routers) do
         router_cfg(router, router.current_cfg, true)
         setmetatable(router, router_mt)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 31f668f..0a14440 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -2415,6 +2415,50 @@ local function storage_call(bucket_id, mode, name, args)
     return ok, ret1, ret2, ret3
 end
 
+--
+-- Bind a new storage ref to the current box session. Is used as a part of
+-- Map-Reduce API.
+--
+local function storage_ref(rid, timeout)
+    local ok, err = lref.add(rid, box.session.id(), timeout)
+    if not ok then
+        return nil, err
+    end
+    return bucket_count()
+end
+
+--
+-- Drop a storage ref from the current box session. Is used as a part of
+-- Map-Reduce API.
+--
+local function storage_unref(rid)
+    return lref.del(rid, box.session.id())
+end
+
+--
+-- Execute a user's function under an infinite storage ref protecting from
+-- bucket moves. The ref should exist before, and is deleted after, regardless
+-- of the function result. Is used as a part of Map-Reduce API.
+--
+local function storage_map(rid, name, args)
+    local ok, err, res
+    local sid = box.session.id()
+    ok, err = lref.use(rid, sid)
+    if not ok then
+        return nil, err
+    end
+    ok, res = local_call(name, args)
+    if not ok then
+        lref.del(rid, sid)
+        return nil, lerror.make(res)
+    end
+    ok, err = lref.del(rid, sid)
+    if not ok then
+        return nil, err
+    end
+    return true, res
+end
+
 local service_call_api
 
 local function service_call_test_api(...)
@@ -2425,6 +2469,9 @@ service_call_api = setmetatable({
     bucket_recv = bucket_recv,
     rebalancer_apply_routes = rebalancer_apply_routes,
     rebalancer_request_state = rebalancer_request_state,
+    storage_ref = storage_ref,
+    storage_unref = storage_unref,
+    storage_map = storage_map,
     test_api = service_call_test_api,
 }, {__serialize = function(api)
     local res = {}
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count Vladislav Shpilevoy via Tarantool-patches
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Function local_call() works like netbox.self.call, but is
exception-safe, and uses cached values of 'netbox.self' and
'netbox.self.call'. This saves at least 3 indexing operations,
which are not free, as it turned out.

The cached values are not used directly in storage_call(), because
local_call() also will be used from the future function
storage_map() - a part of map-reduce API.
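
The pattern described above can be illustrated outside of Tarantool with a
rough sketch. It is not the patch code: 'fake_self' is a hypothetical
stand-in for netbox.self that looks a global function up by name, and 'echo'
and 'fail' are made-up test functions. The point is only the calling
convention: the object and its method are cached once in locals, and the
helper returns pcall() results as is, so the caller gets 'false, err'
instead of an exception.

```lua
-- Lua 5.1/LuaJIT has a global unpack(), newer versions have table.unpack().
local unpack = unpack or table.unpack

-- Hypothetical stand-in for netbox.self: calls a global function by name.
local fake_self = {}
function fake_self.call(self, func_name, args)
    local func = _G[func_name]
    if func == nil then
        error('Procedure ' .. func_name .. ' is not defined')
    end
    return func(unpack(args or {}))
end

-- Cache the object and the method once, like the patch does for
-- netbox.self and netbox.self.call.
local self_obj = fake_self
local self_call = fake_self.call

-- Exception-safe invocation: returns true + results, or false + error.
local function local_call(func_name, args)
    return pcall(self_call, self_obj, func_name, args)
end

-- Made-up functions to call.
function echo(...) return ... end
function fail() error('boom') end

local ok, a, b = local_call('echo', {1, 2})
local ok2, err = local_call('fail', {})
```

The first call succeeds and passes the arguments through, the second one
reports the error as a return value.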

Needed for #147
---
 vshard/storage/init.lua | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index e0ce31d..a3d383d 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -6,6 +6,8 @@ local trigger = require('internal.trigger')
 local ffi = require('ffi')
 local yaml_encode = require('yaml').encode
 local fiber_clock = lfiber.clock
+local netbox_self = netbox.self
+local netbox_self_call = netbox_self.call
 
 local MODULE_INTERNALS = '__module_vshard_storage'
 -- Reload requirements, in case this module is reloaded manually.
@@ -171,6 +173,16 @@ else
     bucket_ref_new = ffi.typeof("struct bucket_ref")
 end
 
+--
+-- Invoke a function on this instance. Arguments are unpacked into the function
+-- as arguments.
+-- The function returns pcall() as is, because it is used from places where
+-- exceptions are not allowed.
+--
+local function local_call(func_name, args)
+    return pcall(netbox_self_call, netbox_self, func_name, args)
+end
+
 --
 -- Trigger for on replace into _bucket to update its generation.
 --
@@ -2275,7 +2287,7 @@ local function storage_call(bucket_id, mode, name, args)
     if not ok then
         return ok, err
     end
-    ok, ret1, ret2, ret3 = pcall(netbox.self.call, netbox.self, name, args)
+    ok, ret1, ret2, ret3 = local_call(name, args)
     _, err = bucket_unref(bucket_id, mode)
     assert(not err)
     if not ok then
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution Vladislav Shpilevoy via Tarantool-patches
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Bucket count calculation costs 1 FFI call in Lua, and makes a few
actions and virtual calls in C. So it is not free even for memtx
spaces.

But it changes extremely rarely, which makes it reasonable to cache
the value.

Bucket count is not used much now, but will be used a lot in the
future storage_ref() function, which is a part of map-reduce API.

The idea is that a router will need to reference all the storages
and ensure that all the buckets in the cluster are pinned to their
storages. To check this, storage_ref() will return number of
buckets successfully pinned on the storage.

The router will sum counts from all storage_ref() calls and ensure
it equals the total configured bucket count.
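
As an illustration of that check only, here is a small sketch. The helper
name 'check_all_buckets_covered' and the 'rs1'/'rs2' keys are hypothetical;
the real router code does the summation inline while collecting the ref
responses.

```lua
-- Hypothetical helper: 'counts' maps a replicaset UUID to the bucket count
-- returned by storage_ref() on that replicaset.
local function check_all_buckets_covered(counts, total_bucket_count)
    local sum = 0
    for _, count in pairs(counts) do
        sum = sum + count
    end
    if sum ~= total_bucket_count then
        -- Report how many buckets were not seen anywhere.
        return nil, total_bucket_count - sum
    end
    return true
end

local ok = check_all_buckets_covered({rs1 = 1500, rs2 = 1500}, 3000)
local ok2, missing = check_all_buckets_covered({rs1 = 1500, rs2 = 1499}, 3000)
```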

This means bucket count is needed for each storage_ref() call,
whose count per second can be thousands and more.

The patch makes count calculation cost as much as one Lua function
call and a Lua table index operation (almost always).
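
A standalone sketch of the caching technique, runnable in plain Lua, may
help to see why no branches are needed. Everything here is a stand-in:
'expensive_count' plays the role of box.space._bucket:count(), and
'on_bucket_change' plays the role of the _bucket on_replace trigger. The
names are assumptions made for the example, not the patch API.

```lua
local expensive_calls = 0
local stored = 10

-- Stand-in for the real, relatively costly count calculation.
local function expensive_count()
    expensive_calls = expensive_calls + 1
    return stored
end

local cache = nil
local bucket_count

-- Fast path: just return the remembered number.
local function bucket_count_cached()
    return cache
end

-- Slow path: calculate, remember, and switch to the fast path.
local function bucket_count_not_cached()
    cache = expensive_count()
    bucket_count = bucket_count_cached
    return cache
end

bucket_count = bucket_count_not_cached

-- Stand-in for the trigger: any change drops the cache.
local function on_bucket_change(new_count)
    stored = new_count
    cache = nil
    bucket_count = bucket_count_not_cached
end

-- The function object behind 'bucket_count' changes at runtime, so code
-- which stores a reference (like a public API table) needs a proxy.
local function bucket_count_public()
    return bucket_count()
end

local c1 = bucket_count_public()
local c2 = bucket_count_public()
on_bucket_change(15)
local c3 = bucket_count_public()
```

Three reads cost two real calculations here: one initial and one after the
invalidation.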

Needed for #147
---
 test/storage/storage.result   | 45 +++++++++++++++++++++++++++++++++++
 test/storage/storage.test.lua | 18 ++++++++++++++
 vshard/storage/init.lua       | 44 +++++++++++++++++++++++++++++-----
 3 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/test/storage/storage.result b/test/storage/storage.result
index 0550ad1..edb45be 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -677,6 +677,51 @@ rs:callro('echo', {'some_data'})
 - null
 - null
 ...
+--
+-- Bucket count is calculated properly.
+--
+-- Cleanup after the previous tests.
+_ = test_run:switch('storage_1_a')
+---
+...
+buckets = vshard.storage.buckets_info()
+---
+...
+for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
+---
+...
+_ = test_run:switch('storage_2_a')
+---
+...
+buckets = vshard.storage.buckets_info()
+---
+...
+for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
+---
+...
+_ = test_run:switch('storage_1_a')
+---
+...
+assert(vshard.storage.buckets_count() == 0)
+---
+- true
+...
+vshard.storage.bucket_force_create(1, 5)
+---
+- true
+...
+assert(vshard.storage.buckets_count() == 5)
+---
+- true
+...
+vshard.storage.bucket_force_create(6, 5)
+---
+- true
+...
+assert(vshard.storage.buckets_count() == 10)
+---
+- true
+...
 _ = test_run:switch("default")
 ---
 ...
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index d8fbd94..db014ef 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -187,6 +187,24 @@ util.has_same_fields(old_internal, vshard.storage.internal)
 _, rs = next(vshard.storage.internal.replicasets)
 rs:callro('echo', {'some_data'})
 
+--
+-- Bucket count is calculated properly.
+--
+-- Cleanup after the previous tests.
+_ = test_run:switch('storage_1_a')
+buckets = vshard.storage.buckets_info()
+for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
+_ = test_run:switch('storage_2_a')
+buckets = vshard.storage.buckets_info()
+for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
+
+_ = test_run:switch('storage_1_a')
+assert(vshard.storage.buckets_count() == 0)
+vshard.storage.bucket_force_create(1, 5)
+assert(vshard.storage.buckets_count() == 5)
+vshard.storage.bucket_force_create(6, 5)
+assert(vshard.storage.buckets_count() == 10)
+
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index a3d383d..9b74bcb 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -110,6 +110,9 @@ if not M then
         -- replace the old function is to keep its reference.
         --
         bucket_on_replace = nil,
+        -- Fast alternative to box.space._bucket:count(). But may be nil. Reset
+        -- on each generation change.
+        bucket_count_cache = nil,
         -- Redirects for recently sent buckets. They are kept for a while to
         -- help routers to find a new location for sent and deleted buckets
         -- without whole cluster scan.
@@ -183,10 +186,44 @@ local function local_call(func_name, args)
     return pcall(netbox_self_call, netbox_self, func_name, args)
 end
 
+--
+-- Get number of buckets stored on this storage. Regardless of their state.
+--
+-- The idea is that all the code should use one function ref to get the bucket
+-- count. But the function itself never branches. Instead, it points at one of 2
+-- branch-less functions. Cached one simply returns a number which is supposed
+-- to be super fast. Non-cached remembers the count and changes the global
+-- function to the cached one. So on the next call it is cheap. No 'if's at all.
+--
+local bucket_count
+
+local function bucket_count_cache()
+    return M.bucket_count_cache
+end
+
+local function bucket_count_not_cache()
+    local count = box.space._bucket:count()
+    M.bucket_count_cache = count
+    bucket_count = bucket_count_cache
+    return count
+end
+
+bucket_count = bucket_count_not_cache
+
+--
+-- Can't expose bucket_count to the public API as is. Need this proxy-call.
+-- Because the original function changes at runtime.
+--
+local function bucket_count_public()
+    return bucket_count()
+end
+
 --
 -- Trigger for on replace into _bucket to update its generation.
 --
 local function bucket_generation_increment()
+    bucket_count = bucket_count_not_cache
+    M.bucket_count_cache = nil
     M.bucket_generation = M.bucket_generation + 1
     M.bucket_generation_cond:broadcast()
 end
@@ -2240,7 +2277,6 @@ local function rebalancer_request_state()
     if #status_index:select({consts.BUCKET.GARBAGE}, {limit = 1}) > 0 then
         return
     end
-    local bucket_count = _bucket:count()
     return {
         bucket_active_count = status_index:count({consts.BUCKET.ACTIVE}),
         bucket_pinned_count = status_index:count({consts.BUCKET.PINNED}),
@@ -2501,10 +2537,6 @@ end
 -- Monitoring
 --------------------------------------------------------------------------------
 
-local function storage_buckets_count()
-    return  box.space._bucket.index.pk:count()
-end
-
 local function storage_buckets_info(bucket_id)
     local ibuckets = setmetatable({}, { __serialize = 'mapping' })
 
@@ -2780,7 +2812,7 @@ return {
     cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end,
     info = storage_info,
     buckets_info = storage_buckets_info,
-    buckets_count = storage_buckets_count,
+    buckets_count = bucket_count_public,
     buckets_discovery = buckets_discovery,
     rebalancer_request_state = rebalancer_request_state,
     internal = M,
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait() Vladislav Shpilevoy via Tarantool-patches
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Registry is a way to resolve cyclic dependencies which normally
can exist between files of the same module/library.

It is a global table hidden in _G under a long name that is
unlikely to be used anywhere else.

Files that want to expose their API to other files, which in turn
can't require the former directly, should put their API into the
registry.

The files use the registry to get API of the other files. They
don't require() and use the latter directly.

At runtime, when all require() are done, the registry is full,
and all the files see API of each other.
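
A minimal sketch of the idea in plain Lua, under stated assumptions: the
registry is a plain local table instead of a hidden _G field, and the two
"files" are modelled as loader functions 'load_a' and 'load_b' with made-up
ping/pong functions. Each side resolves the other one through the registry
only at call time, so the load order does not matter.

```lua
-- Stand-in for the registry table.
local registry = {}

-- "File" A: uses B's API, but never requires B directly.
local function load_a(reg)
    local a = {}
    function a.ping(n)
        if n == 0 then
            return 'done by a'
        end
        -- Looked up at call time, when B is already registered.
        return reg.b.pong(n - 1)
    end
    reg.a = a
end

-- "File" B: uses A's API in the same way.
local function load_b(reg)
    local b = {}
    function b.pong(n)
        if n == 0 then
            return 'done by b'
        end
        return reg.a.ping(n - 1)
    end
    reg.b = b
end

-- A is loaded before B exists. It is fine, because nothing is called yet.
load_a(registry)
load_b(registry)

local r1 = registry.a.ping(3)
local r2 = registry.a.ping(2)
```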

Such circular dependency will exist between new files implementing
map-reduce engine as a set of relatively independent submodules of
the storage.

In particular there will be storage_ref and storage_sched. Both
require a few functions from the main storage file, and will use
API of each other.

Having the modules accessed via registry adds at least +1 indexing
operation at runtime when a function has to be fetched from there. But
sometimes it can be cached similar to how bucket count cache works
in the main storage file.

Main purpose is not to increase size of the main storage file
again. It wouldn't fix the circular deps anyway, and would make it
much harder to follow the code.

Part of #147
---
 vshard/CMakeLists.txt   |  3 +-
 vshard/registry.lua     | 67 +++++++++++++++++++++++++++++++++++++++++
 vshard/storage/init.lua |  5 ++-
 3 files changed, 73 insertions(+), 2 deletions(-)
 create mode 100644 vshard/registry.lua

diff --git a/vshard/CMakeLists.txt b/vshard/CMakeLists.txt
index 78a3f07..2a15df5 100644
--- a/vshard/CMakeLists.txt
+++ b/vshard/CMakeLists.txt
@@ -7,4 +7,5 @@ add_subdirectory(router)
 
 # Install module
 install(FILES cfg.lua error.lua consts.lua hash.lua init.lua replicaset.lua
-        util.lua lua_gc.lua rlist.lua heap.lua DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard)
+        util.lua lua_gc.lua rlist.lua heap.lua registry.lua
+        DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard)
diff --git a/vshard/registry.lua b/vshard/registry.lua
new file mode 100644
index 0000000..9583add
--- /dev/null
+++ b/vshard/registry.lua
@@ -0,0 +1,67 @@
+--
+-- Registry is a way to resolve cyclic dependencies which normally can exist
+-- between files of the same module/library.
+--
+-- Files that want to expose their API to other files, which in turn can't
+-- require the former directly, should put their API into the registry.
+--
+-- The files should use the registry to get API of the other files. They don't
+-- require() and use the latter directly if there is a known loop dependency
+-- between them.
+--
+-- At runtime, when all require() are done, the registry is full, and all the
+-- files see API of each other.
+--
+-- Having the modules accessed via the registry adds at least +1 indexing
+-- operation at runtime when a function is fetched from there. But sometimes it
+-- can be cached to reduce the effect in perf-sensitive code. For example, like
+-- this:
+--
+--     local lreg = require('vshard.registry')
+--
+--     local storage_func
+--
+--     local function storage_func_no_cache(...)
+--         storage_func = lreg.storage.func
+--         return storage_func(...)
+--     end
+--
+--     storage_func = storage_func_no_cache
+--
+-- The code will always call storage_func(), but will load it from the registry
+-- only on first invocation.
+--
+-- However in case reload is important, it is not possible - the original
+-- function object in the registry may change. In such a situation it still
+-- makes sense to cache at least 'lreg.storage' to save 1 indexing operation.
+--
+--     local lreg = require('vshard.registry')
+--
+--     local lstorage, storage_func
+--
+--     local function storage_func_cache(...)
+--         return lstorage.storage_func(...)
+--     end
+--
+--     local function storage_func_no_cache(...)
+--         lstorage = lreg.storage
+--         storage_func = storage_func_cache
+--         return lstorage.storage_func(...)
+--     end
+--
+--     storage_func = storage_func_no_cache
+--
+-- A harder way would be to use the first approach + add triggers on reload of
+-- the cached module to update the cached function refs. That is worth it only
+-- if the code is extremely perf-critical (which should not be Lua then).
+--
+
+local MODULE_INTERNALS = '__module_vshard_registry'
+
+local M = rawget(_G, MODULE_INTERNALS)
+if not M then
+    M = {}
+    rawset(_G, MODULE_INTERNALS, M)
+end
+
+return M
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 9b74bcb..b47665b 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -16,7 +16,7 @@ if rawget(_G, MODULE_INTERNALS) then
         'vshard.consts', 'vshard.error', 'vshard.cfg',
         'vshard.replicaset', 'vshard.util',
         'vshard.storage.reload_evolution',
-        'vshard.lua_gc', 'vshard.rlist'
+        'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
     }
     for _, module in pairs(vshard_modules) do
         package.loaded[module] = nil
@@ -29,6 +29,7 @@ local lcfg = require('vshard.cfg')
 local lreplicaset = require('vshard.replicaset')
 local util = require('vshard.util')
 local lua_gc = require('vshard.lua_gc')
+local lregistry = require('vshard.registry')
 local reload_evolution = require('vshard.storage.reload_evolution')
 local bucket_ref_new
 
@@ -2782,6 +2783,8 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
 M.schema_version_make = schema_version_make
 M.schema_bootstrap = schema_init_0_1_15_0
 
+lregistry.storage = M
+
 return {
     sync = sync,
     bucket_force_create = bucket_force_create,
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled() Vladislav Shpilevoy via Tarantool-patches
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Original fiber_cond:wait() has a few issues:

- Raises exception when fiber is canceled, which makes it
  inapplicable in exception-intolerant code;

- Raises an ugly misleading usage exception when timeout is
  negative which easily can happen if the caller's code calls
  wait() multiple times retrying something and does not want to
  bother with doing a part of cond's job;

- When it fails, it pushes an error of type 'TimedOut' which is not
  the same as 'ClientError' with box.error.TIMEOUT code. The latter
  is used more widely, at least in vshard.

The patch introduces util.fiber_cond_wait() function which fixes
the mentioned issues.

It is needed in the future map-reduce subsystem modules revolving
around waiting on various conditions.
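
The resulting nil,err convention can be tried without Tarantool using
the sketch below. It is not the patch code: make_stub_cond() is a
made-up replacement for fiber.cond(), and plain tables with a
'message' field stand in for the vshard error objects.

```lua
-- Stub condition. 'signal' makes wait() succeed, 'timeout' makes it
-- return false, 'cancel' makes it throw like a canceled fiber would.
local function make_stub_cond(behavior)
    return {
        wait = function(self, timeout)
            if behavior == 'cancel' then
                error({message = 'fiber is cancelled'})
            end
            return behavior == 'signal'
        end,
    }
end

-- Throwing version: turns a negative timeout and a failed wait into
-- one unified timeout error.
local function cond_wait_xc(cond, timeout)
    if timeout < 0 or not cond:wait(timeout) then
        error({message = 'Timeout exceeded'})
    end
end

-- Exception-safe version: returns true, or nil plus an error object.
local function cond_wait(cond, timeout)
    local ok, err = pcall(cond_wait_xc, cond, timeout)
    if ok then
        return true
    end
    return nil, err
end

local ok_signal = cond_wait(make_stub_cond('signal'), 1)
local ok_neg, err_neg = cond_wait(make_stub_cond('signal'), -1)
local ok_cancel, err_cancel = cond_wait(make_stub_cond('cancel'), 1)
```

A caller retrying in a loop can pass 'deadline - now' as the timeout
on each iteration and does not have to check its sign.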

Part of #147
---
 test/unit/util.result   | 82 +++++++++++++++++++++++++++++++++++++++++
 test/unit/util.test.lua | 31 ++++++++++++++++
 vshard/util.lua         | 35 ++++++++++++++++++
 3 files changed, 148 insertions(+)

diff --git a/test/unit/util.result b/test/unit/util.result
index 42a361a..679c087 100644
--- a/test/unit/util.result
+++ b/test/unit/util.result
@@ -184,3 +184,85 @@ t ~= res
 ---
 - true
 ...
+--
+-- Exception-safe cond wait.
+--
+cond_wait = util.fiber_cond_wait
+---
+...
+cond = fiber.cond()
+---
+...
+ok, err = cond_wait(cond, -1)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+-- Ensure it does not return 'false' like pcall(). It must conform to nil,err
+-- signature.
+assert(type(ok) == 'nil')
+---
+- true
+...
+ok, err = cond_wait(cond, 0)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+ok, err = cond_wait(cond, 0.000001)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+ok, err = nil
+---
+...
+_ = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
+---
+...
+fiber.yield()
+---
+...
+cond:signal()
+---
+...
+_ = test_run:wait_cond(function() return ok or err end)
+---
+...
+assert(ok and not err)
+---
+- true
+...
+ok, err = nil
+---
+...
+f = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
+---
+...
+fiber.yield()
+---
+...
+f:cancel()
+---
+...
+_ = test_run:wait_cond(function() return ok or err end)
+---
+...
+assert(not ok)
+---
+- true
+...
+err.message
+---
+- fiber is cancelled
+...
+assert(type(err) == 'table')
+---
+- true
+...
diff --git a/test/unit/util.test.lua b/test/unit/util.test.lua
index 9550a95..df3db6f 100644
--- a/test/unit/util.test.lua
+++ b/test/unit/util.test.lua
@@ -76,3 +76,34 @@ yield_count
 t
 res
 t ~= res
+
+--
+-- Exception-safe cond wait.
+--
+cond_wait = util.fiber_cond_wait
+cond = fiber.cond()
+ok, err = cond_wait(cond, -1)
+assert(not ok and err.message)
+-- Ensure it does not return 'false' like pcall(). It must conform to nil,err
+-- signature.
+assert(type(ok) == 'nil')
+ok, err = cond_wait(cond, 0)
+assert(not ok and err.message)
+ok, err = cond_wait(cond, 0.000001)
+assert(not ok and err.message)
+
+ok, err = nil
+_ = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
+fiber.yield()
+cond:signal()
+_ = test_run:wait_cond(function() return ok or err end)
+assert(ok and not err)
+
+ok, err = nil
+f = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
+fiber.yield()
+f:cancel()
+_ = test_run:wait_cond(function() return ok or err end)
+assert(not ok)
+err.message
+assert(type(err) == 'table')
diff --git a/vshard/util.lua b/vshard/util.lua
index 2362607..d78f3a5 100644
--- a/vshard/util.lua
+++ b/vshard/util.lua
@@ -1,6 +1,7 @@
 -- vshard.util
 local log = require('log')
 local fiber = require('fiber')
+local lerror = require('vshard.error')
 
 local MODULE_INTERNALS = '__module_vshard_util'
 local M = rawget(_G, MODULE_INTERNALS)
@@ -191,6 +192,39 @@ local function table_minus_yield(dst, src, interval)
     return dst
 end
 
+local function fiber_cond_wait_xc(cond, timeout)
+    -- Handle negative timeout specifically - otherwise wait() will throw an
+    -- ugly usage error.
+    -- Don't trust this check to the caller's code, because often it just calls
+    -- wait many times until it fails or the condition is met. Code looks much
+    -- cleaner when it does not need to check the timeout sign. On the other
+    -- hand, perf is not important here - anyway wait() yields which is slow on
+    -- its own, but also breaks JIT trace recording which makes pcall() in the
+    -- non-xc version of this function negligible.
+    if timeout < 0 or not cond:wait(timeout) then
+        -- Don't use the original error if cond sets it. Because it sets
+        -- TimedOut error. It does not have a proper error code, and may not be
+        -- detected by router as a special timeout case if necessary. Or at
+        -- least would complicate the handling in future. Instead, try to use a
+        -- unified timeout error where possible.
+        error(lerror.timeout())
+    end
+    -- Still possible though that the fiber is canceled and cond:wait() throws.
+    -- This is why the _xc() version of this function throws even the timeout -
+    -- anyway pcall() is inevitable.
+end
+
+--
+-- Exception-safe cond wait with unified errors in vshard format.
+--
+local function fiber_cond_wait(cond, timeout)
+    local ok, err = pcall(fiber_cond_wait_xc, cond, timeout)
+    if ok then
+        return true
+    end
+    return nil, lerror.make(err)
+end
+
 return {
     tuple_extract_key = tuple_extract_key,
     reloadable_fiber_create = reloadable_fiber_create,
@@ -200,4 +234,5 @@ return {
     version_is_at_least = version_is_at_least,
     table_copy_yield = table_copy_yield,
     table_minus_yield = table_minus_yield,
+    fiber_cond_wait = fiber_cond_wait,
 }
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (6 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait() Vladislav Shpilevoy via Tarantool-patches
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Original fiber.testcancel() has an issue - it is not
exception-safe. This makes it unusable for code which wants to do
cleanup before cancellation.

The patch introduces util.fiber_is_self_canceled() which checks if
the current fiber is canceled but returns true/false instead of
throwing an error.

The patch is going to be used in the map-reduce patches where it
will be necessary to check if the fiber is canceled. And if it
is - perform cleanup and quit whatever the code was doing.
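
The intended usage pattern might look like the sketch below. It runs
in plain Lua: testcancel() here is a stub for fiber.testcancel()
driven by a flag, and worker() with its counters is a hypothetical
example, not code from the map-reduce patches.

```lua
local canceled = false

-- Stub for fiber.testcancel(): throws once the cancel flag is set.
local function testcancel()
    if canceled then
        error('fiber is cancelled')
    end
end

-- Same shape as util.fiber_is_self_canceled() in the patch.
local function is_self_canceled()
    return not pcall(testcancel)
end

-- Does up to 'steps' units of work, checks for cancellation before
-- each one, and cleans up instead of being interrupted by an error.
local function worker(steps)
    local done, cleaned = 0, false
    for i = 1, steps do
        if is_self_canceled() then
            cleaned = true
            break
        end
        done = done + 1
        if i == 2 then
            -- Simulate a cancel arriving after the second step.
            canceled = true
        end
    end
    return done, cleaned
end

local done, cleaned = worker(5)
```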

Part of #147
---
 test/unit/util.result   | 28 ++++++++++++++++++++++++++++
 test/unit/util.test.lua | 14 ++++++++++++++
 vshard/util.lua         |  8 ++++++++
 3 files changed, 50 insertions(+)

diff --git a/test/unit/util.result b/test/unit/util.result
index 679c087..c83e80c 100644
--- a/test/unit/util.result
+++ b/test/unit/util.result
@@ -266,3 +266,31 @@ assert(type(err) == 'table')
 ---
 - true
 ...
+--
+-- Exception-safe fiber cancel check.
+--
+self_is_canceled = util.fiber_is_self_canceled
+---
+...
+assert(not self_is_canceled())
+---
+- true
+...
+ok = nil
+---
+...
+_ = fiber.create(function()                                                     \
+    local f = fiber.self()                                                      \
+    pcall(f.cancel, f)                                                          \
+    ok = self_is_canceled()                                                     \
+end)
+---
+...
+test_run:wait_cond(function() return ok ~= nil end)
+---
+- true
+...
+assert(ok)
+---
+- true
+...
diff --git a/test/unit/util.test.lua b/test/unit/util.test.lua
index df3db6f..881feb4 100644
--- a/test/unit/util.test.lua
+++ b/test/unit/util.test.lua
@@ -107,3 +107,17 @@ _ = test_run:wait_cond(function() return ok or err end)
 assert(not ok)
 err.message
 assert(type(err) == 'table')
+
+--
+-- Exception-safe fiber cancel check.
+--
+self_is_canceled = util.fiber_is_self_canceled
+assert(not self_is_canceled())
+ok = nil
+_ = fiber.create(function()                                                     \
+    local f = fiber.self()                                                      \
+    pcall(f.cancel, f)                                                          \
+    ok = self_is_canceled()                                                     \
+end)
+test_run:wait_cond(function() return ok ~= nil end)
+assert(ok)
diff --git a/vshard/util.lua b/vshard/util.lua
index d78f3a5..30a1e6e 100644
--- a/vshard/util.lua
+++ b/vshard/util.lua
@@ -225,6 +225,13 @@ local function fiber_cond_wait(cond, timeout)
     return nil, lerror.make(err)
 end
 
+--
+-- Exception-safe way to check if the current fiber is canceled.
+--
+local function fiber_is_self_canceled()
+    return not pcall(fiber.testcancel)
+end
+
 return {
     tuple_extract_key = tuple_extract_key,
     reloadable_fiber_create = reloadable_fiber_create,
@@ -235,4 +242,5 @@ return {
     table_copy_yield = table_copy_yield,
     table_minus_yield = table_minus_yield,
     fiber_cond_wait = fiber_cond_wait,
+    fiber_is_self_canceled = fiber_is_self_canceled,
 }
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (7 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw() Vladislav Shpilevoy via Tarantool-patches
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

The future map-reduce code will need to wait until all buckets on
the storage enter the writable state. If they are not writable, the
code should wait efficiently, without polling.

The patch adds a function bucket_generation_wait() which is
registered in registry.storage.

It helps to wait until the state of any bucket changes. If the
caller wants to wait for all buckets to enter the writable state,
it should wait on the generation and re-check the requested condition
until it matches or timeout happens.
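
A sketch of such a wait-and-recheck loop in plain Lua. Everything here
is an assumption made for illustration: wait_until() is a hypothetical
helper, clock() is a fake clock, and wait_change() imitates
bucket_generation_wait() by pretending that one bucket changes its
state per call.

```lua
-- Re-check 'check()' after every generation change until it holds or
-- the deadline passes. 'wait_change(timeout)' must follow the
-- true / nil,err convention.
local function wait_until(check, wait_change, timeout, clock)
    local deadline = clock() + timeout
    while not check() do
        local ok, err = wait_change(deadline - clock())
        if not ok then
            return nil, err
        end
    end
    return true
end

-- Fake environment for the demonstration.
local now = 0
local function clock()
    return now
end
local not_writable = 3
local function wait_change(timeout)
    if timeout <= 0 then
        return nil, {message = 'Timeout exceeded'}
    end
    now = now + 1
    not_writable = not_writable - 1
    return true
end
local function all_rw()
    return not_writable == 0
end

-- Three changes fit into the timeout of 10 fake seconds.
local ok1, err1 = wait_until(all_rw, wait_change, 10, clock)

-- Five pending changes do not fit into 2 fake seconds.
not_writable = 5
local ok2, err2 = wait_until(all_rw, wait_change, 2, clock)
```

The remaining timeout is recomputed on each iteration, so the total
wait is bounded by the original timeout regardless of how many times
the generation changes.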

Part of #147
---
 test/storage/storage.result   | 86 +++++++++++++++++++++++++++++++++++
 test/storage/storage.test.lua | 36 +++++++++++++++
 vshard/storage/init.lua       |  6 +++
 3 files changed, 128 insertions(+)

diff --git a/test/storage/storage.result b/test/storage/storage.result
index edb45be..4730e20 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -722,6 +722,92 @@ assert(vshard.storage.buckets_count() == 10)
 ---
 - true
 ...
+--
+-- Bucket_generation_wait() registry function.
+--
+lstorage = require('vshard.registry').storage
+---
+...
+ok, err = lstorage.bucket_generation_wait(-1)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+ok, err = lstorage.bucket_generation_wait(0)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+small_timeout = 0.000001
+---
+...
+ok, err = lstorage.bucket_generation_wait(small_timeout)
+---
+...
+assert(not ok and err.message)
+---
+- Timeout exceeded
+...
+ok, err = nil
+---
+...
+big_timeout = 1000000
+---
+...
+_ = fiber.create(function()                                                     \
+    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
+end)
+---
+...
+fiber.sleep(small_timeout)
+---
+...
+assert(not ok and not err)
+---
+- true
+...
+vshard.storage.bucket_force_drop(10)
+---
+- true
+...
+test_run:wait_cond(function() return ok or err end)
+---
+- true
+...
+assert(ok)
+---
+- true
+...
+-- Cancel should interrupt the waiting.
+ok, err = nil
+---
+...
+f = fiber.create(function()                                                     \
+    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
+end)
+---
+...
+fiber.sleep(small_timeout)
+---
+...
+assert(not ok and not err)
+---
+- true
+...
+f:cancel()
+---
+...
+_ = test_run:wait_cond(function() return ok or err end)
+---
+...
+assert(not ok and err.message)
+---
+- fiber is cancelled
+...
 _ = test_run:switch("default")
 ---
 ...
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index db014ef..86c5e33 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -205,6 +205,42 @@ assert(vshard.storage.buckets_count() == 5)
 vshard.storage.bucket_force_create(6, 5)
 assert(vshard.storage.buckets_count() == 10)
 
+--
+-- Bucket_generation_wait() registry function.
+--
+lstorage = require('vshard.registry').storage
+ok, err = lstorage.bucket_generation_wait(-1)
+assert(not ok and err.message)
+
+ok, err = lstorage.bucket_generation_wait(0)
+assert(not ok and err.message)
+
+small_timeout = 0.000001
+ok, err = lstorage.bucket_generation_wait(small_timeout)
+assert(not ok and err.message)
+
+ok, err = nil
+big_timeout = 1000000
+_ = fiber.create(function()                                                     \
+    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
+end)
+fiber.sleep(small_timeout)
+assert(not ok and not err)
+vshard.storage.bucket_force_drop(10)
+test_run:wait_cond(function() return ok or err end)
+assert(ok)
+
+-- Cancel should interrupt the waiting.
+ok, err = nil
+f = fiber.create(function()                                                     \
+    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
+end)
+fiber.sleep(small_timeout)
+assert(not ok and not err)
+f:cancel()
+_ = test_run:wait_cond(function() return ok or err end)
+assert(not ok and err.message)
+
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index b47665b..ffa48b6 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -31,6 +31,7 @@ local util = require('vshard.util')
 local lua_gc = require('vshard.lua_gc')
 local lregistry = require('vshard.registry')
 local reload_evolution = require('vshard.storage.reload_evolution')
+local fiber_cond_wait = util.fiber_cond_wait
 local bucket_ref_new
 
 local M = rawget(_G, MODULE_INTERNALS)
@@ -229,6 +230,10 @@ local function bucket_generation_increment()
     M.bucket_generation_cond:broadcast()
 end
 
+local function bucket_generation_wait(timeout)
+    return fiber_cond_wait(M.bucket_generation_cond, timeout)
+end
+
 --
 -- Check if this replicaset is locked. It means be invisible for
 -- the rebalancer.
@@ -2783,6 +2788,7 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
 M.schema_version_make = schema_version_make
 M.schema_bootstrap = schema_init_0_1_15_0
 
+M.bucket_generation_wait = bucket_generation_wait
 lregistry.storage = M
 
 return {
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw()
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (8 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

The future map-reduce code will need to check whether all buckets
on the storage are in the writable state. If they are, any request
can do anything with all the data on the storage.

Such an 'all writable' state will be pinned by a new module
'storage_ref' so that map-reduce requests can execute without
being afraid of the rebalancer.

The patch adds a function bucket_are_all_rw() which is registered
in registry.storage.

The function is not trivial because it tries to cache the returned
value. It makes a lot of sense, because the value changes very
rarely and the calculation costs a lot (4 lookups in an index by a
string key via FFI + each lookup returns a tuple which is +1 Lua
GC object).

The function is going to be used almost on each map-reduce
request, so it must be fast.
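
The caching scheme can be sketched in plain Lua as follows. The
'buckets' table, the 'scans' counter and set_status() are stand-ins
invented for the example: in the patch the data lives in
box.space._bucket and the cache is reset by the on_replace trigger.

```lua
-- Stand-in for _bucket: bucket id -> status.
local buckets = {[1] = 'active', [2] = 'active'}
local not_writable = {sending = true, sent = true, receiving = true,
                      garbage = true}
local scans = 0
local M = {are_all_rw_cache = nil}
local are_all_rw

local function are_all_rw_cached()
    return M.are_all_rw_cache
end

-- Expensive path: scan the statuses, store the result, then switch
-- the function pointer to the cheap cached version.
local function are_all_rw_not_cached()
    scans = scans + 1
    local res = true
    for _, status in pairs(buckets) do
        if not_writable[status] then
            res = false
            break
        end
    end
    M.are_all_rw_cache = res
    are_all_rw = are_all_rw_cached
    return res
end

are_all_rw = are_all_rw_not_cached

-- Any status change invalidates the cache, like a generation bump.
local function set_status(id, status)
    buckets[id] = status
    are_all_rw = are_all_rw_not_cached
    M.are_all_rw_cache = nil
end

-- Stable entry point: always calls whatever are_all_rw points to now.
local function are_all_rw_public()
    return are_all_rw()
end
```

Repeated calls between two status changes cost one function call and
one table lookup; the scan runs at most once per change.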

Part of #147
---
 test/storage/storage.result   | 37 +++++++++++++++++++++++++++++++++++
 test/storage/storage.test.lua | 14 +++++++++++++
 vshard/storage/init.lua       | 37 +++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/test/storage/storage.result b/test/storage/storage.result
index 4730e20..2c9784a 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -808,6 +808,43 @@ assert(not ok and err.message)
 ---
 - fiber is cancelled
 ...
+--
+-- Bucket_are_all_rw() registry function.
+--
+assert(lstorage.bucket_are_all_rw())
+---
+- true
+...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
+-- Let it stuck in the errinj.
+vshard.storage.recovery_wakeup()
+---
+...
+vshard.storage.bucket_force_create(10)
+---
+- true
+...
+box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.SENDING}})
+---
+- [10, 'sending']
+...
+assert(not lstorage.bucket_are_all_rw())
+---
+- true
+...
+box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.ACTIVE}})
+---
+- [10, 'active']
+...
+assert(lstorage.bucket_are_all_rw())
+---
+- true
+...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+---
+...
 _ = test_run:switch("default")
 ---
 ...
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index 86c5e33..33f0498 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -241,6 +241,20 @@ f:cancel()
 _ = test_run:wait_cond(function() return ok or err end)
 assert(not ok and err.message)
 
+--
+-- Bucket_are_all_rw() registry function.
+--
+assert(lstorage.bucket_are_all_rw())
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+-- Let it stuck in the errinj.
+vshard.storage.recovery_wakeup()
+vshard.storage.bucket_force_create(10)
+box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.SENDING}})
+assert(not lstorage.bucket_are_all_rw())
+box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.ACTIVE}})
+assert(lstorage.bucket_are_all_rw())
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index ffa48b6..c3ed236 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -115,6 +115,10 @@ if not M then
         -- Fast alternative to box.space._bucket:count(). But may be nil. Reset
         -- on each generation change.
         bucket_count_cache = nil,
+        -- Fast alternative to checking multiple keys presence in
+        -- box.space._bucket status index. But may be nil. Reset on each
+        -- generation change.
+        bucket_are_all_rw_cache = nil,
         -- Redirects for recently sent buckets. They are kept for a while to
         -- help routers to find a new location for sent and deleted buckets
         -- without whole cluster scan.
@@ -220,12 +224,44 @@ local function bucket_count_public()
     return bucket_count()
 end
 
+--
+-- Check if all buckets on the storage are writable. The idea is the same as
+-- with bucket count - the value changes very rarely, and is cached most of the
+-- time. But its non-cached calculation is more expensive than with count.
+--
+local bucket_are_all_rw
+
+local function bucket_are_all_rw_cache()
+    return M.bucket_are_all_rw_cache
+end
+
+local function bucket_are_all_rw_not_cache()
+    local status_index = box.space._bucket.index.status
+    local status = consts.BUCKET
+    local res = not status_index:min(status.SENDING) and
+       not status_index:min(status.SENT) and
+       not status_index:min(status.RECEIVING) and
+       not status_index:min(status.GARBAGE)
+
+    M.bucket_are_all_rw_cache = res
+    bucket_are_all_rw = bucket_are_all_rw_cache
+    return res
+end
+
+bucket_are_all_rw = bucket_are_all_rw_not_cache
+
+local function bucket_are_all_rw_public()
+    return bucket_are_all_rw()
+end
+
 --
 -- Trigger for on replace into _bucket to update its generation.
 --
 local function bucket_generation_increment()
     bucket_count = bucket_count_not_cache
+    bucket_are_all_rw = bucket_are_all_rw_not_cache
     M.bucket_count_cache = nil
+    M.bucket_are_all_rw_cache = nil
     M.bucket_generation = M.bucket_generation + 1
     M.bucket_generation_cond:broadcast()
 end
@@ -2788,6 +2824,7 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
 M.schema_version_make = schema_version_make
 M.schema_bootstrap = schema_init_0_1_15_0
 
+M.bucket_are_all_rw = bucket_are_all_rw_public
 M.bucket_generation_wait = bucket_generation_wait
 lregistry.storage = M
 
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (9 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-23  0:15 ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
                     ` (2 more replies)
  2021-03-12 23:13 ` [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
  2021-03-28 18:17 ` Vladislav Shpilevoy via Tarantool-patches
  12 siblings, 3 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-23  0:15 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

The 'vshard.storage.ref' module helps to ensure that all buckets on
the storage stay writable while there is at least one ref on the
storage. Having the storage referenced allows executing any kind of
request on all the visible data in all spaces in the locally stored
buckets.

This is useful when a request needs to access tons of buckets at
once, especially when the exact bucket IDs are not known.

Refs have deadlines, so that the storage doesn't freeze, unable to
move buckets until restart, in case a ref is not deleted due to an
error in the user's code or a disconnect.

The disconnects and restarts mean the refs can't be global.
Otherwise any kinds of global counters, uuids and so on, even
paired with any ids from a client could clash between clients on
their reconnects or storage restarts. Unless they establish a
TCP-like session, which would be too complicated.

Instead, the refs are spread over the existing box sessions. This
allows binding the refs of each client to its TCP connection without
caring about how to make them unique, how not to mess up the refs on
restart, and how to drop the refs when a client disconnects.

vshard.storage.ref does not depend on internals of the main file
(storage/init.lua), so it is implemented as a separate module to
keep it simple and isolated. It uses the storage via the registry
only to get a couple of functions from its API.

In addition, having it in a module simplifies the tests.

The API is not public so far, and is going to be privately used by
the future map-reduce API.
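
To illustrate the concept only, below is a toy model of
session-scoped refs with deadlines in plain Lua. It is not the
implementation from this patch: ref_table_new(), its methods, the
argument order and the injected clock are all assumptions, and the
real module works with box sessions and Tarantool fibers.

```lua
-- Toy ref table: session id -> {ref id -> deadline}.
local function ref_table_new(clock)
    local sessions = {}
    local count = 0
    local T = {}

    -- Drop refs whose deadline has passed.
    local function gc()
        local now = clock()
        for sid, refs in pairs(sessions) do
            for rid, deadline in pairs(refs) do
                if deadline <= now then
                    refs[rid] = nil
                    count = count - 1
                end
            end
            if next(refs) == nil then
                sessions[sid] = nil
            end
        end
    end

    -- Ref ids only need to be unique inside one session.
    function T.add(sid, rid, timeout)
        gc()
        local refs = sessions[sid]
        if not refs then
            refs = {}
            sessions[sid] = refs
        end
        if refs[rid] then
            return nil, 'duplicate ref'
        end
        refs[rid] = clock() + timeout
        count = count + 1
        return true
    end

    function T.del(sid, rid)
        local refs = sessions[sid]
        if not refs or not refs[rid] then
            return nil, 'no such ref'
        end
        refs[rid] = nil
        count = count - 1
        return true
    end

    -- A disconnect drops all refs of the session at once.
    function T.drop_session(sid)
        for _ in pairs(sessions[sid] or {}) do
            count = count - 1
        end
        sessions[sid] = nil
    end

    function T.is_referenced()
        gc()
        return count > 0
    end

    return T
end

local now = 0
local refs = ref_table_new(function() return now end)
```

In this model two clients can use the same ref id without clashing,
and a forgotten ref stops blocking bucket moves once its deadline
passes.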

Part of #147
---
 test/reload_evolution/storage.result   |  66 ++++
 test/reload_evolution/storage.test.lua |  28 ++
 test/storage/ref.result                | 399 +++++++++++++++++++++++++
 test/storage/ref.test.lua              | 166 ++++++++++
 test/unit-tap/ref.test.lua             | 202 +++++++++++++
 vshard/consts.lua                      |   1 +
 vshard/error.lua                       |  19 ++
 vshard/storage/CMakeLists.txt          |   2 +-
 vshard/storage/init.lua                |   9 +
 vshard/storage/ref.lua                 | 371 +++++++++++++++++++++++
 10 files changed, 1262 insertions(+), 1 deletion(-)
 create mode 100644 test/storage/ref.result
 create mode 100644 test/storage/ref.test.lua
 create mode 100755 test/unit-tap/ref.test.lua
 create mode 100644 vshard/storage/ref.lua

diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result
index 9d30a04..c4a0cdd 100644
--- a/test/reload_evolution/storage.result
+++ b/test/reload_evolution/storage.result
@@ -227,6 +227,72 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 ---
 - 1500
 ...
+--
+-- Ensure storage refs are enabled and work from the scratch via reload.
+--
+lref = require('vshard.storage.ref')
+---
+...
+vshard.storage.rebalancer_disable()
+---
+...
+big_timeout = 1000000
+---
+...
+timeout = 0.01
+---
+...
+lref.add(0, 0, big_timeout)
+---
+- true
+...
+status_index = box.space._bucket.index.status
+---
+...
+bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
+---
+...
+ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
+                                     {timeout = timeout})
+---
+...
+assert(not ok and err.message)
+---
+- Storage is referenced
+...
+lref.del(0, 0)
+---
+- true
+...
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
+                           {timeout = big_timeout})
+---
+- true
+...
+wait_bucket_is_collected(bucket_id_to_move)
+---
+...
+test_run:switch('storage_2_a')
+---
+- true
+...
+vshard.storage.rebalancer_disable()
+---
+...
+big_timeout = 1000000
+---
+...
+bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
+---
+...
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
+                           {timeout = big_timeout})
+---
+- true
+...
+wait_bucket_is_collected(bucket_id_to_move)
+---
+...
 test_run:switch('default')
 ---
 - true
diff --git a/test/reload_evolution/storage.test.lua b/test/reload_evolution/storage.test.lua
index 639553e..c351ada 100644
--- a/test/reload_evolution/storage.test.lua
+++ b/test/reload_evolution/storage.test.lua
@@ -83,6 +83,34 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 test_run:switch('storage_1_a')
 box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 
+--
+-- Ensure storage refs are enabled and work from the scratch via reload.
+--
+lref = require('vshard.storage.ref')
+vshard.storage.rebalancer_disable()
+
+big_timeout = 1000000
+timeout = 0.01
+lref.add(0, 0, big_timeout)
+status_index = box.space._bucket.index.status
+bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
+ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
+                                     {timeout = timeout})
+assert(not ok and err.message)
+lref.del(0, 0)
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
+                           {timeout = big_timeout})
+wait_bucket_is_collected(bucket_id_to_move)
+
+test_run:switch('storage_2_a')
+vshard.storage.rebalancer_disable()
+
+big_timeout = 1000000
+bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
+                           {timeout = big_timeout})
+wait_bucket_is_collected(bucket_id_to_move)
+
 test_run:switch('default')
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/test/storage/ref.result b/test/storage/ref.result
new file mode 100644
index 0000000..d5f4166
--- /dev/null
+++ b/test/storage/ref.result
@@ -0,0 +1,399 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+netbox = require('net.box')
+ | ---
+ | ...
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+ | ---
+ | ...
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+ | ---
+ | ...
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+ | ---
+ | ...
+test_run:create_cluster(REPLICASET_2, 'storage')
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+ | ---
+ | ...
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+ | ---
+ | ...
+
+--
+-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
+-- for map-reduce functionality to pin all buckets on all storages in the
+-- cluster to execute consistent map-reduce calls on all cluster data.
+--
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1501, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+
+--
+-- Bucket moves are not allowed under a ref.
+--
+util = require('util')
+ | ---
+ | ...
+sid = 0
+ | ---
+ | ...
+rid = 0
+ | ---
+ | ...
+big_timeout = 1000000
+ | ---
+ | ...
+small_timeout = 0.001
+ | ---
+ | ...
+lref.add(rid, sid, big_timeout)
+ | ---
+ | - true
+ | ...
+-- Send fails.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+lref.use(rid, sid)
+ | ---
+ | - true
+ | ...
+-- Still fails - use only makes ref undead until it is deleted explicitly.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+-- Receive (from another replicaset) also fails.
+big_timeout = 1000000
+ | ---
+ | ...
+ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+
+--
+-- After unref all the bucket moves are allowed again.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref.del(rid, sid)
+ | ---
+ | - true
+ | ...
+
+vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+--
+-- While bucket move is in progress, ref won't work.
+--
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
+                 {timeout = big_timeout})
+ | ---
+ | ...
+ok, err = lref.add(rid, sid, small_timeout)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+-- Ref will wait if timeout is big enough.
+ok, err = nil
+ | ---
+ | ...
+_ = fiber.create(function()                                                     \
+    ok, err = lref.add(rid, sid, big_timeout)                                   \
+end)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+test_run:wait_cond(function() return ok or err end)
+ | ---
+ | - true
+ | ...
+lref.use(rid, sid)
+ | ---
+ | - true
+ | ...
+lref.del(rid, sid)
+ | ---
+ | - true
+ | ...
+assert(ok and not err)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+--
+-- Refs are bound to sessions.
+--
+box.schema.user.grant('storage', 'super')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+small_timeout = 0.001
+ | ---
+ | ...
+function make_ref(rid, timeout)                                                 \
+    return lref.add(rid, box.session.id(), timeout)                             \
+end
+ | ---
+ | ...
+function use_ref(rid)                                                           \
+    return lref.use(rid, box.session.id())                                      \
+end
+ | ---
+ | ...
+function del_ref(rid)                                                           \
+    return lref.del(rid, box.session.id())                                      \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+netbox = require('net.box')
+ | ---
+ | ...
+remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
+ | ---
+ | ...
+c = netbox.connect(remote_uri)
+ | ---
+ | ...
+
+-- Ref is added and does not disappear anywhere on its own.
+c:call('make_ref', {1, small_timeout})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Use works.
+c:call('use_ref', {1})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Del works.
+c:call('del_ref', {1})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Expiration works. Try to add a second ref when the first one is expired - the
+-- first is collected and a subsequent use and del won't work.
+c:call('make_ref', {1, small_timeout})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+fiber.sleep(small_timeout)
+ | ---
+ | ...
+c:call('make_ref', {2, small_timeout})
+ | ---
+ | - true
+ | ...
+ok, err = c:call('use_ref', {1})
+ | ---
+ | ...
+assert(ok == nil and err.message)
+ | ---
+ | - 'Can not use a storage ref: no ref'
+ | ...
+ok, err = c:call('del_ref', {1})
+ | ---
+ | ...
+assert(ok == nil and err.message)
+ | ---
+ | - 'Can not delete a storage ref: no ref'
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+--
+-- Session disconnect removes its refs.
+--
+c:call('make_ref', {3, big_timeout})
+ | ---
+ | - true
+ | ...
+c:close()
+ | ---
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch("default")
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_2)
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_1)
+ | ---
+ | ...
diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
new file mode 100644
index 0000000..b34a294
--- /dev/null
+++ b/test/storage/ref.test.lua
@@ -0,0 +1,166 @@
+test_run = require('test_run').new()
+netbox = require('net.box')
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+test_run:create_cluster(REPLICASET_2, 'storage')
+util = require('util')
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+
+--
+-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
+-- for map-reduce functionality to pin all buckets on all storages in the
+-- cluster to execute consistent map-reduce calls on all cluster data.
+--
+
+_ = test_run:switch('storage_1_a')
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1, 1500)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1501, 1500)
+
+_ = test_run:switch('storage_1_a')
+lref = require('vshard.storage.ref')
+
+--
+-- Bucket moves are not allowed under a ref.
+--
+util = require('util')
+sid = 0
+rid = 0
+big_timeout = 1000000
+small_timeout = 0.001
+lref.add(rid, sid, big_timeout)
+-- Send fails.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+lref.use(rid, sid)
+-- Still fails - use only makes ref undead until it is deleted explicitly.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_2_a')
+-- Receive (from another replicaset) also fails.
+big_timeout = 1000000
+ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+
+--
+-- After unref all the bucket moves are allowed again.
+--
+_ = test_run:switch('storage_1_a')
+lref.del(rid, sid)
+
+vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+--
+-- While bucket move is in progress, ref won't work.
+--
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+
+_ = test_run:switch('storage_1_a')
+fiber = require('fiber')
+_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
+                 {timeout = big_timeout})
+ok, err = lref.add(rid, sid, small_timeout)
+assert(not ok and err.message)
+-- Ref will wait if timeout is big enough.
+ok, err = nil
+_ = fiber.create(function()                                                     \
+    ok, err = lref.add(rid, sid, big_timeout)                                   \
+end)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+
+_ = test_run:switch('storage_1_a')
+wait_bucket_is_collected(1)
+test_run:wait_cond(function() return ok or err end)
+lref.use(rid, sid)
+lref.del(rid, sid)
+assert(ok and not err)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+--
+-- Refs are bound to sessions.
+--
+box.schema.user.grant('storage', 'super')
+lref = require('vshard.storage.ref')
+small_timeout = 0.001
+function make_ref(rid, timeout)                                                 \
+    return lref.add(rid, box.session.id(), timeout)                             \
+end
+function use_ref(rid)                                                           \
+    return lref.use(rid, box.session.id())                                      \
+end
+function del_ref(rid)                                                           \
+    return lref.del(rid, box.session.id())                                      \
+end
+
+_ = test_run:switch('storage_1_a')
+netbox = require('net.box')
+remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
+c = netbox.connect(remote_uri)
+
+-- Ref is added and does not disappear anywhere on its own.
+c:call('make_ref', {1, small_timeout})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+-- Use works.
+c:call('use_ref', {1})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+-- Del works.
+c:call('del_ref', {1})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 0)
+_ = test_run:switch('storage_1_a')
+
+-- Expiration works. Try to add a second ref when the first one is expired - the
+-- first is collected and a subsequent use and del won't work.
+c:call('make_ref', {1, small_timeout})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+fiber.sleep(small_timeout)
+c:call('make_ref', {2, small_timeout})
+ok, err = c:call('use_ref', {1})
+assert(ok == nil and err.message)
+ok, err = c:call('del_ref', {1})
+assert(ok == nil and err.message)
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+--
+-- Session disconnect removes its refs.
+--
+c:call('make_ref', {3, big_timeout})
+c:close()
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch("default")
+test_run:drop_cluster(REPLICASET_2)
+test_run:drop_cluster(REPLICASET_1)
diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
new file mode 100755
index 0000000..d987a63
--- /dev/null
+++ b/test/unit-tap/ref.test.lua
@@ -0,0 +1,202 @@
+#!/usr/bin/env tarantool
+
+local tap = require('tap')
+local test = tap.test('ref')
+local fiber = require('fiber')
+local lregistry = require('vshard.registry')
+local lref = require('vshard.storage.ref')
+
+local big_timeout = 1000000
+local small_timeout = 0.000001
+local sid = 0
+local sid2 = 1
+local sid3 = 2
+
+--
+-- gh-147: refs allow pinning all the buckets on the storage at once. They were
+-- invented for map-reduce, to pin all buckets on all storages in the cluster
+-- and execute consistent map-reduce calls on all cluster data.
+--
+
+--
+-- Refs use the storage API to get the bucket space state and wait on its
+-- changes. That is not important for these unit tests, so it is stubbed out.
+--
+local function bucket_are_all_rw()
+    return true
+end
+
+lregistry.storage = {
+    bucket_are_all_rw = bucket_are_all_rw,
+}
+
+--
+-- Basic ref add, use, and del; bad and duplicate IDs.
+--
+local function test_ref_basic(test)
+    test:plan(15)
+
+    local rid = 0
+    local ok, err
+    --
+    -- Basic ref/unref.
+    --
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(ok and not err, '+1 ref')
+    test:is(lref.count, 1, 'accounted')
+    ok, err = lref.use(rid, sid)
+    test:ok(ok and not err, 'use the ref')
+    test:is(lref.count, 1, 'but still accounted')
+    ok, err = lref.del(rid, sid)
+    test:ok(ok and not err, '-1 ref')
+    test:is(lref.count, 0, 'accounted')
+
+    --
+    -- Bad ref ID.
+    --
+    rid = 1
+    ok, err = lref.use(rid, sid)
+    test:ok(not ok and err, 'invalid RID at use')
+    ok, err = lref.del(rid, sid)
+    test:ok(not ok and err, 'invalid RID at del')
+
+    --
+    -- Bad session ID.
+    --
+    lref.kill(sid)
+    rid = 0
+    ok, err = lref.use(rid, sid)
+    test:ok(not ok and err, 'invalid SID at use')
+    ok, err = lref.del(rid, sid)
+    test:ok(not ok and err, 'invalid SID at del')
+
+    --
+    -- Duplicate ID.
+    --
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(ok and not err, 'add ref')
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(not ok and err, 'duplicate ref')
+    test:is(lref.count, 1, 'did not affect count')
+    test:ok(lref.use(rid, sid) and lref.del(rid, sid), 'del old ref')
+    test:is(lref.count, 0, 'accounted')
+end
+
+local function test_ref_incremental_gc(test)
+    test:plan(20)
+
+    --
+    -- Ref addition expires 2 old refs.
+    --
+    local ok, err
+    for i = 0, 2 do
+        assert(lref.add(i, sid, small_timeout))
+    end
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 3, 'expired refs are still here')
+    test:ok(lref.add(3, sid, 0), 'add new ref')
+    -- 3 + 1 new - 2 old = 2.
+    test:is(lref.count, 2, 'it collected 2 old refs')
+    test:ok(lref.add(4, sid, 0), 'add new ref')
+    -- 2 + 1 new - 1 old = 2. Only one of the 3 initial refs was left to collect.
+    test:is(lref.count, 2, 'it collected 1 old ref')
+    test:ok(lref.del(4, sid), 'del the latest manually')
+
+    --
+    -- Incremental GC works fine if only one ref was GCed.
+    --
+    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
+    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
+    fiber.sleep(small_timeout)
+    test:ok(lref.add(2, sid, 0), 'add ref with 0 timeout')
+    test:is(lref.count, 2, 'collected 1 old ref, 1 is kept')
+    test:ok(lref.del(2, sid), 'del newest ref, it was not collected')
+    test:ok(lref.del(1, sid), 'del ref with big timeout')
+    test:is(lref.count, 0, 'all is deleted')
+
+    --
+    -- GC works fine when only one ref was left and it was expired.
+    --
+    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
+    test:is(lref.count, 1, '1 ref total')
+    fiber.sleep(small_timeout)
+    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
+    test:is(lref.count, 1, 'collected the old one')
+    lref.gc()
+    test:is(lref.count, 1, 'still 1 - timeout was big')
+    test:ok(lref.del(1, sid), 'delete it')
+    test:is(lref.count, 0, 'no refs')
+end
+
+local function test_ref_gc(test)
+    test:plan(7)
+
+    --
+    -- Generic GC works fine with multiple sessions.
+    --
+    assert(lref.add(0, sid, big_timeout))
+    assert(lref.add(1, sid, small_timeout))
+    assert(lref.add(0, sid3, small_timeout))
+    assert(lref.add(0, sid2, small_timeout))
+    assert(lref.add(1, sid2, big_timeout))
+    assert(lref.add(1, sid3, big_timeout))
+    test:is(lref.count, 6, 'add 6 refs total')
+    fiber.sleep(small_timeout)
+    lref.gc()
+    test:is(lref.count, 3, '3 collected')
+    test:ok(lref.del(0, sid), 'del first')
+    test:ok(lref.del(1, sid2), 'del second')
+    test:ok(lref.del(1, sid3), 'del third')
+    test:is(lref.count, 0, '3 deleted')
+    lref.gc()
+    test:is(lref.count, 0, 'gc on empty refs did not break anything')
+end
+
+local function test_ref_use(test)
+    test:plan(7)
+
+    --
+    -- Ref use updates the session heap.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.add(0, sid2, big_timeout))
+    test:is(lref.count, 2, 'add 2 refs')
+    test:ok(lref.use(0, sid), 'use one with small timeout')
+    lref.gc()
+    test:is(lref.count, 2, 'still 2 refs')
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 2, 'still 2 refs after sleep')
+    test:ok(lref.del(0, sid), 'del first')
+    test:ok(lref.del(0, sid2), 'del second')
+    test:is(lref.count, 0, 'now all is deleted')
+end
+
+local function test_ref_del(test)
+    test:plan(7)
+
+    --
+    -- Ref del updates the session heap.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.add(0, sid2, big_timeout))
+    test:is(lref.count, 2, 'add 2 refs')
+    test:ok(lref.del(0, sid), 'del with small timeout')
+    lref.gc()
+    test:is(lref.count, 1, '1 ref remains')
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 1, '1 ref remains after sleep')
+    lref.gc()
+    test:is(lref.count, 1, '1 ref remains after sleep and gc')
+    test:ok(lref.del(0, sid2), 'del with big timeout')
+    test:is(lref.count, 0, 'now all is deleted')
+end
+
+test:plan(5)
+
+test:test('basic', test_ref_basic)
+test:test('incremental gc', test_ref_incremental_gc)
+test:test('gc', test_ref_gc)
+test:test('use', test_ref_use)
+test:test('del', test_ref_del)
+
+os.exit(test:check() and 0 or 1)
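
The 'incremental gc' unit test above relies on the "remove more than add"
principle: every added ref first tries to collect up to two expired ones. Below
is a minimal standalone sketch of that accounting in plain Lua. It is a
simplified model, not the code from vshard/storage/ref.lua: the names `refs`,
`gc_step` and `add` are hypothetical, a sorted array stands in for the
vshard.heap module, there are no sessions, and time is passed in explicitly
instead of being read from fiber.clock().

```lua
-- Hypothetical standalone model of incremental ref GC.
-- Array of refs kept sorted by deadline, the earliest one first.
local refs = {}
local count = 0

-- Collect at most two refs whose deadline is strictly in the past.
local function gc_step(now)
    for _ = 1, 2 do
        local top = refs[1]
        if not top or top.deadline >= now then
            return
        end
        table.remove(refs, 1)
        count = count - 1
    end
end

-- Add a ref: first do one GC step, then insert keeping the array sorted.
local function add(id, deadline, now)
    gc_step(now)
    local pos = #refs + 1
    for i, ref in ipairs(refs) do
        if deadline < ref.deadline then
            pos = i
            break
        end
    end
    table.insert(refs, pos, {id = id, deadline = deadline})
    count = count + 1
    return count
end

-- Three refs expire at t = 1, then the time moves to t = 5.
add(0, 1, 0)
add(1, 1, 0)
add(2, 1, 0)
assert(count == 3)
-- 3 + 1 new - 2 old = 2.
assert(add(3, 100, 5) == 2)
-- 2 + 1 new - 1 old = 2: only one expired ref was left.
assert(add(4, 100, 5) == 2)
```

Since each addition removes up to two expired refs, the expired ones cannot
accumulate without bound even though no background fiber collects them.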
diff --git a/vshard/consts.lua b/vshard/consts.lua
index cf3f422..0ffe0e2 100644
--- a/vshard/consts.lua
+++ b/vshard/consts.lua
@@ -48,4 +48,5 @@ return {
     DISCOVERY_TIMEOUT = 10,
 
     TIMEOUT_INFINITY = 500 * 365 * 86400,
+    DEADLINE_INFINITY = math.huge,
 }
diff --git a/vshard/error.lua b/vshard/error.lua
index a6f46a9..b02bfe9 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -130,6 +130,25 @@ local error_message_template = {
         name = 'TOO_MANY_RECEIVING',
         msg = 'Too many receiving buckets at once, please, throttle'
     },
+    [26] = {
+        name = 'STORAGE_IS_REFERENCED',
+        msg = 'Storage is referenced'
+    },
+    [27] = {
+        name = 'STORAGE_REF_ADD',
+        msg = 'Can not add a storage ref: %s',
+        args = {'reason'},
+    },
+    [28] = {
+        name = 'STORAGE_REF_USE',
+        msg = 'Can not use a storage ref: %s',
+        args = {'reason'},
+    },
+    [29] = {
+        name = 'STORAGE_REF_DEL',
+        msg = 'Can not delete a storage ref: %s',
+        args = {'reason'},
+    },
 }
 
 --
diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
index 3f4ed43..7c1e97d 100644
--- a/vshard/storage/CMakeLists.txt
+++ b/vshard/storage/CMakeLists.txt
@@ -1,2 +1,2 @@
-install(FILES init.lua reload_evolution.lua
+install(FILES init.lua reload_evolution.lua ref.lua
         DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index c3ed236..2957f48 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -17,6 +17,7 @@ if rawget(_G, MODULE_INTERNALS) then
         'vshard.replicaset', 'vshard.util',
         'vshard.storage.reload_evolution',
         'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
+        'vshard.heap', 'vshard.storage.ref',
     }
     for _, module in pairs(vshard_modules) do
         package.loaded[module] = nil
@@ -30,6 +31,7 @@ local lreplicaset = require('vshard.replicaset')
 local util = require('vshard.util')
 local lua_gc = require('vshard.lua_gc')
 local lregistry = require('vshard.registry')
+local lref = require('vshard.storage.ref')
 local reload_evolution = require('vshard.storage.reload_evolution')
 local fiber_cond_wait = util.fiber_cond_wait
 local bucket_ref_new
@@ -1140,6 +1142,9 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
             return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
                                       from)
         end
+        if lref.count > 0 then
+            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
+        end
         if is_this_replicaset_locked() then
             return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
         end
@@ -1441,6 +1446,9 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
 
     local _bucket = box.space._bucket
     local bucket = _bucket:get({bucket_id})
+    if lref.count > 0 then
+        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
+    end
     if is_this_replicaset_locked() then
         return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
     end
@@ -2528,6 +2536,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
         box.space._bucket:on_replace(nil, M.bucket_on_replace)
         M.bucket_on_replace = nil
     end
+    lref.cfg()
     if is_master then
         box.space._bucket:on_replace(bucket_generation_increment)
         M.bucket_on_replace = bucket_generation_increment
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
new file mode 100644
index 0000000..7589cb9
--- /dev/null
+++ b/vshard/storage/ref.lua
@@ -0,0 +1,371 @@
+--
+-- 'Ref' module helps to ensure that all buckets on the storage stay writable
+-- while there is at least one ref on the storage.
+-- Having the storage referenced allows executing any kind of request on all
+-- the visible data in all spaces in locally stored buckets. This is useful
+-- when many buckets must be accessed at once, especially when exact bucket IDs
+-- are not known.
+--
+-- Refs have deadlines, so that the storage does not freeze, unable to move
+-- buckets until restart, when a ref is not deleted due to an error in the
+-- user's code or a disconnect.
+--
+-- The disconnects and restarts mean the refs can't be global. Otherwise any
+-- kinds of global counters, uuids and so on, even paired with any ids from a
+-- client could clash between clients on their reconnects or storage restarts.
+-- Unless they establish a TCP-like session, which would be too complicated.
+--
+-- Instead, the refs are spread over the existing box sessions. This binds the
+-- refs of each client to its TCP connection, with no need to care about how to
+-- make them unique across all sessions, how not to mix up the refs on restart,
+-- or how to drop the refs when a client disconnects.
+--
+
+local MODULE_INTERNALS = '__module_vshard_storage_ref'
+-- Update when the behaviour of anything in the file changes, to allow reload.
+local MODULE_VERSION = 1
+
+local lfiber = require('fiber')
+local lheap = require('vshard.heap')
+local lerror = require('vshard.error')
+local lconsts = require('vshard.consts')
+local lregistry = require('vshard.registry')
+local fiber_clock = lfiber.clock
+local fiber_yield = lfiber.yield
+local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
+local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
+
+--
+-- Binary heap sort. Object with the closest deadline should be on top.
+--
+local function heap_min_deadline_cmp(ref1, ref2)
+    return ref1.deadline < ref2.deadline
+end
+
+local M = rawget(_G, MODULE_INTERNALS)
+if not M then
+    M = {
+        module_version = MODULE_VERSION,
+        -- Total number of references in all sessions.
+        count = 0,
+        -- Heap of session objects. Each session has refs sorted by their
+        -- deadline. The sessions themselves are also sorted by deadlines.
+        -- Session deadline is defined as the closest deadline of all its refs.
+        -- Or infinity in case there are no refs in it.
+        session_heap = lheap.new(heap_min_deadline_cmp),
+        -- Map of session objects. This is used to get session object by its ID.
+        session_map = {},
+        -- On session disconnect trigger to kill the dead sessions. It is saved
+        -- here for the sake of future reload to be able to delete the old
+        -- on disconnect function before setting a new one.
+        on_disconnect = nil,
+    }
+else
+    -- No reload so far. This is a first version. Return as is.
+    return M
+end
+
+local function ref_session_new(sid)
+    -- Session object does not store its internal hot attributes in a table,
+    -- because then access to any session attribute would cost at least one
+    -- table indexing operation. Instead, all internal fields are stored as
+    -- upvalues referenced by the methods defined as closures.
+    --
+    -- This means session creation may not be very suitable for jitting, but it
+    -- is very rare; the design optimizes the most common case instead.
+    --
+    -- Still the public functions take the 'self' object to look normal. They
+    -- even use it a bit.
+
+    -- Ref map to get ref object by its ID.
+    local ref_map = {}
+    -- Ref heap sorted by their deadlines.
+    local ref_heap = lheap.new(heap_min_deadline_cmp)
+    -- Total number of refs of the session. Is used to drop the session without
+    -- fullscan of the ref map. Heap size can't be used because not all refs are
+    -- stored here. See more on that below.
+    local count = 0
+    -- Cache global session storages as upvalues to save on M indexing.
+    local global_heap = M.session_heap
+    local global_map = M.session_map
+
+    local function ref_session_discount(self, del_count)
+        local new_count = M.count - del_count
+        assert(new_count >= 0)
+        M.count = new_count
+
+        new_count = count - del_count
+        assert(new_count >= 0)
+        count = new_count
+    end
+
+    local function ref_session_update_deadline(self)
+        local ref = ref_heap:top()
+        if not ref then
+            self.deadline = DEADLINE_INFINITY
+            global_heap:update(self)
+        else
+            local deadline = ref.deadline
+            if deadline ~= self.deadline then
+                self.deadline = deadline
+                global_heap:update(self)
+            end
+        end
+    end
+
+    --
+    -- Garbage collect at most 2 expired refs. The idea is that there is no
+    -- dedicated fiber for expired refs collection. It would be too expensive to
+    -- wake up a fiber on each added, removed, or updated ref.
+    --
+    -- Instead, ref GC is mostly incremental and works by the principle "remove
+    -- more than add". On each new ref added, two old refs try to expire. This
+    -- way refs don't stack infinitely, and the expired refs are eventually
+    -- removed. Because removal is faster than addition: -2 for each +1.
+    --
+    local function ref_session_gc_step(self, now)
+        -- These are 2 inlined iterations of the more general GC procedure. The
+        -- latter is not called, to avoid the cost of a loop, additional
+        -- branches, and variables.
+        if self.deadline > now then
+            return
+        end
+        local top = ref_heap:top()
+        ref_heap:remove_top()
+        ref_map[top.id] = nil
+        top = ref_heap:top()
+        if not top then
+            self.deadline = DEADLINE_INFINITY
+            global_heap:update(self)
+            ref_session_discount(self, 1)
+            return
+        end
+        local deadline = top.deadline
+        if deadline >= now then
+            self.deadline = deadline
+            global_heap:update(self)
+            ref_session_discount(self, 1)
+            return
+        end
+        ref_heap:remove_top()
+        ref_map[top.id] = nil
+        top = ref_heap:top()
+        if not top then
+            self.deadline = DEADLINE_INFINITY
+        else
+            self.deadline = top.deadline
+        end
+        global_heap:update(self)
+        ref_session_discount(self, 2)
+    end
+
+    --
+    -- GC expired refs until they end or the limit on the number of iterations
+    -- is exhausted. The limit is supposed to prevent too long GC which would
+    -- occupy TX thread unfairly.
+    --
+    -- Returns false if nothing to GC, or number of iterations left from the
+    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
+    -- until it returns false.
+    -- The function itself does not yield, because it is used from a more
+    -- generic function GCing all sessions. A per-session yield would never
+    -- happen if each session had fewer refs than the limit, even if the total
+    -- ref count were much bigger.
+    --
+    -- Besides, the session might be killed during general GC. There must not be
+    -- any yields in session methods, so that dead sessions never have to be
+    -- supported.
+    --
+    local function ref_session_gc(self, limit, now)
+        if self.deadline >= now then
+            return false
+        end
+        local top = ref_heap:top()
+        local del = 1
+        local rest = 0
+        local deadline
+        repeat
+            ref_heap:remove_top()
+            ref_map[top.id] = nil
+            top = ref_heap:top()
+            if not top then
+                self.deadline = DEADLINE_INFINITY
+                rest = limit - del
+                break
+            end
+            deadline = top.deadline
+            if deadline >= now then
+                self.deadline = deadline
+                rest = limit - del
+                break
+            end
+            del = del + 1
+        until del >= limit
+        ref_session_discount(self, del)
+        global_heap:update(self)
+        return rest
+    end
+
+    local function ref_session_add(self, rid, deadline, now)
+        if ref_map[rid] then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_ADD,
+                                      'duplicate ref')
+        end
+        local ref = {
+            deadline = deadline,
+            id = rid,
+            -- Used by the heap.
+            index = -1,
+        }
+        ref_session_gc_step(self, now)
+        ref_map[rid] = ref
+        ref_heap:push(ref)
+        if deadline < self.deadline then
+            self.deadline = deadline
+            global_heap:update(self)
+        end
+        count = count + 1
+        M.count = M.count + 1
+        return true
+    end
+
+    --
+    -- Ref use means it can't be expired until deleted explicitly. Should be
+    -- done when the request affecting the whole storage starts. After use it is
+    -- important to call del - GC won't delete the ref automatically anymore,
+    -- unless the entire session is killed.
+    --
+    local function ref_session_use(self, rid)
+        local ref = ref_map[rid]
+        if not ref then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no ref')
+        end
+        ref_heap:remove(ref)
+        ref_session_update_deadline(self)
+        return true
+    end
+
+    local function ref_session_del(self, rid)
+        local ref = ref_map[rid]
+        if not ref then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no ref')
+        end
+        ref_heap:remove_try(ref)
+        ref_map[rid] = nil
+        ref_session_update_deadline(self)
+        ref_session_discount(self, 1)
+        return true
+    end
+
+    local function ref_session_kill(self)
+        global_map[sid] = nil
+        global_heap:remove(self)
+        ref_session_discount(self, count)
+    end
+
+    -- Don't use __index. It is useless since all sessions use closures as
+    -- methods. Also it is probably slower, because each method call would need
+    -- to get the metatable, get __index, and find the method here. While now
+    -- it is only an index operation on the session object.
+    local session = {
+        deadline = DEADLINE_INFINITY,
+        -- Used by the heap.
+        index = -1,
+        -- Methods.
+        del = ref_session_del,
+        gc = ref_session_gc,
+        add = ref_session_add,
+        use = ref_session_use,
+        kill = ref_session_kill,
+    }
+    global_map[sid] = session
+    global_heap:push(session)
+    return session
+end
+
+local function ref_gc()
+    local session_heap = M.session_heap
+    local session = session_heap:top()
+    if not session then
+        return
+    end
+    local limit = LUA_CHUNK_SIZE
+    local now = fiber_clock()
+    repeat
+        limit = session:gc(limit, now)
+        if not limit then
+            return
+        end
+        if limit == 0 then
+            fiber_yield()
+            limit = LUA_CHUNK_SIZE
+            now = fiber_clock()
+        end
+        session = session_heap:top()
+    until not session
+end
+
+local function ref_add(rid, sid, timeout)
+    local now = fiber_clock()
+    local deadline = now + timeout
+    local ok, err, session
+    local storage = lregistry.storage
+    while not storage.bucket_are_all_rw() do
+        ok, err = storage.bucket_generation_wait(timeout)
+        if not ok then
+            return nil, err
+        end
+        now = fiber_clock()
+        timeout = deadline - now
+    end
+    session = M.session_map[sid]
+    if not session then
+        session = ref_session_new(sid)
+    end
+    return session:add(rid, deadline, now)
+end
+
+local function ref_use(rid, sid)
+    local session = M.session_map[sid]
+    if not session then
+        return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no session')
+    end
+    return session:use(rid)
+end
+
+local function ref_del(rid, sid)
+    local session = M.session_map[sid]
+    if not session then
+        return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no session')
+    end
+    return session:del(rid)
+end
+
+local function ref_kill_session(sid)
+    local session = M.session_map[sid]
+    if session then
+        session:kill()
+    end
+end
+
+local function ref_on_session_disconnect()
+    ref_kill_session(box.session.id())
+end
+
+local function ref_cfg()
+    if M.on_disconnect then
+        pcall(box.session.on_disconnect, nil, M.on_disconnect)
+    end
+    box.session.on_disconnect(ref_on_session_disconnect)
+    M.on_disconnect = ref_on_session_disconnect
+end
+
+M.del = ref_del
+M.gc = ref_gc
+M.add = ref_add
+M.use = ref_use
+M.cfg = ref_cfg
+M.kill = ref_kill_session
+lregistry.storage_ref = M
+
+return M
-- 
2.24.3 (Apple Git-128)
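
The "-2 for each +1" rule of ref_session_gc_step() can be tried in plain Lua without Tarantool. The sketch below is only an illustration under stated assumptions: a sorted array stands in for the heap used by the module, there is no global session heap, and the names `session_new` and `gc_step` are made up for this example.

```lua
-- Toy model of incremental ref GC: each add() first tries to expire up to
-- two refs whose deadline has passed. A sorted array replaces the binary
-- heap of the real module (an assumption made for brevity).
local function session_new()
    local refs = {}   -- Sorted by deadline, earliest first.
    local by_id = {}
    local self = {count = 0}

    -- Drop at most 2 refs with deadline < now.
    local function gc_step(now)
        for _ = 1, 2 do
            local top = refs[1]
            if not top or top.deadline >= now then
                return
            end
            table.remove(refs, 1)
            by_id[top.id] = nil
            self.count = self.count - 1
        end
    end

    function self.add(id, deadline, now)
        if by_id[id] then
            return nil, 'duplicate ref'
        end
        gc_step(now)
        local ref = {id = id, deadline = deadline}
        -- Find the position which keeps the array sorted by deadline.
        local pos = #refs + 1
        for i, r in ipairs(refs) do
            if deadline < r.deadline then
                pos = i
                break
            end
        end
        table.insert(refs, pos, ref)
        by_id[id] = ref
        self.count = self.count + 1
        return true
    end

    return self
end

local s = session_new()
-- Three refs expiring at t = 1, 2, 3.
assert(s.add(1, 1, 0))
assert(s.add(2, 2, 0))
assert(s.add(3, 3, 0))
assert(s.count == 3)
-- At t = 10 all three are expired; one add removes two of them.
assert(s.add(4, 20, 10))
assert(s.count == 2)
-- The next add removes the last expired ref.
assert(s.add(5, 20, 10))
assert(s.count == 2)
```

Since every addition can remove two expired refs, the number of expired refs kept in a session cannot grow while new refs keep arriving.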


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:46     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for your patch.

Personally, I vote for dropping 1.9 support (it's already broken - #256).

But if you want to eliminate "long and ugly" ways you could do something like:

```
local make_timeout
if box.error.new ~= nil then
    make_timeout = function() return box.error.new(box.error.TIMEOUT) end
else
    make_timeout = function() return select(2, pcall(...)) end
end
```
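
A plain-Lua sketch of the same idea, runnable outside Tarantool. Here `fake_box_error_make` builds a hypothetical stand-in for box.error (it is not the real API), and the code 78 is taken from the test output quoted below. The point is only that the constructor is chosen once at load time, and that the fallback obtains the error object by raising it and catching it with pcall().

```lua
-- Hypothetical stand-in for box.error: calling it raises an error object.
-- Pass with_new = true to emulate a version which also has a constructor.
local function fake_box_error_make(with_new)
    local mt = {__call = function(_, code)
        error({code = code, message = 'Timeout exceeded'})
    end}
    local e = setmetatable({TIMEOUT = 78}, mt)
    if with_new then
        e.new = function(code)
            return {code = code, message = 'Timeout exceeded'}
        end
    end
    return e
end

-- Pick the implementation once, not on every call.
local function make_timeout_factory(box_error)
    if box_error.new ~= nil then
        return function() return box_error.new(box_error.TIMEOUT) end
    end
    return function()
        -- Raise and immediately catch: pcall() returns false, err.
        return select(2, pcall(box_error, box_error.TIMEOUT))
    end
end

local old = make_timeout_factory(fake_box_error_make(false))
local new = make_timeout_factory(fake_box_error_make(true))
assert(old().code == 78 and new().code == 78)
assert(old().message == new().message)
```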


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> The function returns a box.error.TIMEOUT error converted to the
> format used by vshard.
>
> Probably it wouldn't be needed if only Tarantool >= 1.10 was
> supported - then error.make(box.error.new(box.error.TIMEOUT))
> wouldn't be so bad. But 1.9 is supposed to work as well, and to
> create a timeout error on <= 1.9 it is necessary to make a pcall()
> which is long and ugly.
>
> vshard.error.timeout() provides a version-agnostic way of
> returning timeout errors.
>
> The patch is motivated by timeout error being actively used in the
> future patches about map-reduce.
>
> Needed for #147
> ---
>   test/router/sync.result   | 10 +++++++---
>   test/router/sync.test.lua |  3 ++-
>   test/unit/error.result    | 22 ++++++++++++++++++++++
>   test/unit/error.test.lua  |  9 +++++++++
>   vshard/error.lua          | 10 ++++++++++
>   vshard/replicaset.lua     |  3 +--
>   vshard/router/init.lua    |  6 ++----
>   vshard/storage/init.lua   |  6 ++----
>   8 files changed, 55 insertions(+), 14 deletions(-)
>
> diff --git a/test/router/sync.result b/test/router/sync.result
> index 6f0821d..040d611 100644
> --- a/test/router/sync.result
> +++ b/test/router/sync.result
> @@ -45,10 +45,14 @@ vshard.router.bootstrap()
>   ---
>   - true
>   ...
> -vshard.router.sync(-1)
> +res, err = vshard.router.sync(-1)
>   ---
> -- null
> -- Timeout exceeded
> +...
> +util.portable_error(err)
> +---
> +- type: ClientError
> +  code: 78
> +  message: Timeout exceeded
>   ...
>   res, err = vshard.router.sync(0)
>   ---
> diff --git a/test/router/sync.test.lua b/test/router/sync.test.lua
> index 3150343..cb36b0e 100644
> --- a/test/router/sync.test.lua
> +++ b/test/router/sync.test.lua
> @@ -15,7 +15,8 @@ util = require('util')
>   
>   vshard.router.bootstrap()
>   
> -vshard.router.sync(-1)
> +res, err = vshard.router.sync(-1)
> +util.portable_error(err)
>   res, err = vshard.router.sync(0)
>   util.portable_error(err)
>   
> diff --git a/test/unit/error.result b/test/unit/error.result
> index 8552d91..738cfeb 100644
> --- a/test/unit/error.result
> +++ b/test/unit/error.result
> @@ -97,3 +97,25 @@ util.portable_error(err)
>     code: 32
>     message: '[string "function raise_lua_err() assert(false) end "]:1: assertion failed!'
>   ...
> +--
> +-- lerror.timeout() - portable alternative to box.error.new(box.error.TIMEOUT).
> +--
> +err = lerror.timeout()
> +---
> +...
> +type(err)
> +---
> +- table
> +...
> +assert(err.code == box.error.TIMEOUT)
> +---
> +- true
> +...
> +err.type
> +---
> +- ClientError
> +...
> +err.message
> +---
> +- Timeout exceeded
> +...
> diff --git a/test/unit/error.test.lua b/test/unit/error.test.lua
> index 859414e..0a51d33 100644
> --- a/test/unit/error.test.lua
> +++ b/test/unit/error.test.lua
> @@ -36,3 +36,12 @@ function raise_lua_err() assert(false) end
>   ok, err = pcall(raise_lua_err)
>   err = lerror.make(err)
>   util.portable_error(err)
> +
> +--
> +-- lerror.timeout() - portable alternative to box.error.new(box.error.TIMEOUT).
> +--
> +err = lerror.timeout()
> +type(err)
> +assert(err.code == box.error.TIMEOUT)
> +err.type
> +err.message
> diff --git a/vshard/error.lua b/vshard/error.lua
> index 65da763..a6f46a9 100644
> --- a/vshard/error.lua
> +++ b/vshard/error.lua
> @@ -212,10 +212,20 @@ local function make_alert(code, ...)
>       return setmetatable(r, { __serialize = 'seq' })
>   end
>   
> +--
> +-- Create a timeout error object. box.error.new() can't be used because it is
> +-- present only since 1.10.
> +--
> +local function make_timeout()
> +    local _, err = pcall(box.error, box.error.TIMEOUT)
> +    return make_error(err)
> +end
> +
>   return {
>       code = error_code,
>       box = box_error,
>       vshard = vshard_error,
>       make = make_error,
>       alert = make_alert,
> +    timeout = make_timeout,
>   }
> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index 9c792b3..7437e3b 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -401,8 +401,7 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>           local timeout = opts.timeout or consts.CALL_TIMEOUT_MAX
>           local net_status, storage_status, retval, err, replica
>           if timeout <= 0 then
> -            net_status, err = pcall(box.error, box.error.TIMEOUT)
> -            return nil, lerror.make(err)
> +            return nil, lerror.timeout()
>           end
>           local end_time = fiber_clock() + timeout
>           while not net_status and timeout > 0 do
> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index eeb7515..97bcb0a 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -628,8 +628,7 @@ local function router_call_impl(router, bucket_id, mode, prefer_replica,
>       if err then
>           return nil, err
>       else
> -        local _, boxerror = pcall(box.error, box.error.TIMEOUT)
> -        return nil, lerror.box(boxerror)
> +        return nil, lerror.timeout()
>       end
>   end
>   
> @@ -1235,8 +1234,7 @@ local function router_sync(router, timeout)
>       local opts = {timeout = timeout}
>       for rs_uuid, replicaset in pairs(router.replicasets) do
>           if timeout < 0 then
> -            local ok, err = pcall(box.error, box.error.TIMEOUT)
> -            return nil, err
> +            return nil, lerror.timeout()
>           end
>           local status, err = replicaset:callrw('vshard.storage.sync', arg, opts)
>           if not status then
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index a3e7008..e0ce31d 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -756,8 +756,7 @@ local function sync(timeout)
>           lfiber.sleep(0.001)
>       until fiber_clock() > tstart + timeout
>       log.warn("Timed out during synchronizing replicaset")
> -    local ok, err = pcall(box.error, box.error.TIMEOUT)
> -    return nil, lerror.make(err)
> +    return nil, lerror.timeout()
>   end
>   
>   --------------------------------------------------------------------------------
> @@ -1344,8 +1343,7 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>       while ref.rw ~= 0 do
>           timeout = deadline - fiber_clock()
>           if not M.bucket_rw_lock_is_ready_cond:wait(timeout) then
> -            status, err = pcall(box.error, box.error.TIMEOUT)
> -            return nil, lerror.make(err)
> +            return nil, lerror.timeout()
>           end
>           lfiber.testcancel()
>       end


* Re: [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for your patch. LGTM.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Function local_call() works like netbox.self.call, but is
> exception-safe, and uses cached values of 'netbox.self' and
> 'netbox.self.call'. This saves at least 3 indexing operations,
> which are not free as it appeared.
>
> The cached values are not used directly in storage_call(), because
> local_call() also will be used from the future function
> storage_map() - a part of map-reduce API.
>
> Needed for #147
> ---
>   vshard/storage/init.lua | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index e0ce31d..a3d383d 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -6,6 +6,8 @@ local trigger = require('internal.trigger')
>   local ffi = require('ffi')
>   local yaml_encode = require('yaml').encode
>   local fiber_clock = lfiber.clock
> +local netbox_self = netbox.self
> +local netbox_self_call = netbox_self.call
>   
>   local MODULE_INTERNALS = '__module_vshard_storage'
>   -- Reload requirements, in case this module is reloaded manually.
> @@ -171,6 +173,16 @@ else
>       bucket_ref_new = ffi.typeof("struct bucket_ref")
>   end
>   
> +--
> +-- Invoke a function on this instance. Arguments are unpacked into the function
> +-- as arguments.
> +-- The function returns pcall() as is, because it is used from places where
> +-- exceptions are not allowed.
> +--
> +local function local_call(func_name, args)
> +    return pcall(netbox_self_call, netbox_self, func_name, args)
> +end
> +
>   --
>   -- Trigger for on replace into _bucket to update its generation.
>   --
> @@ -2275,7 +2287,7 @@ local function storage_call(bucket_id, mode, name, args)
>       if not ok then
>           return ok, err
>       end
> -    ok, ret1, ret2, ret3 = pcall(netbox.self.call, netbox.self, name, args)
> +    ok, ret1, ret2, ret3 = local_call(name, args)
>       _, err = bucket_unref(bucket_id, mode)
>       assert(not err)
>       if not ok then
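
The combination of cached lookups and pcall() used by local_call() can be tried in plain Lua. In this sketch `fake_netbox_self` is a hypothetical object standing in for netbox.self; it only mimics the call(name, args) shape and is not the real net.box API.

```lua
local unpack = unpack or table.unpack  -- Lua 5.1/LuaJIT vs 5.2+.

-- Hypothetical stand-in for netbox.self with a call() method.
local funcs = {
    echo = function(...) return ... end,
    fail = function() error('boom') end,
}
local fake_netbox_self = {}
function fake_netbox_self:call(name, args)
    return funcs[name](unpack(args or {}))
end

-- Resolve both lookups once, at load time.
local netbox_self = fake_netbox_self
local netbox_self_call = netbox_self.call

-- Never throws: returns pcall() results as is.
local function local_call(func_name, args)
    return pcall(netbox_self_call, netbox_self, func_name, args)
end

local ok, a, b = local_call('echo', {1, 2})
assert(ok and a == 1 and b == 2)
local ok2, err = local_call('fail')
assert(not ok2 and tostring(err):find('boom'))
```

The caller gets `true` plus the results, or `false` plus the error, so it can release resources (such as a bucket ref) before deciding how to report the failure.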


* Re: [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:47     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch! LGTM.

I see calls like "status_index:count({consts.BUCKET.ACTIVE})". Maybe it is
worth caching the whole bucket stats as well?

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Bucket count calculation costs 1 FFI call in Lua, and makes a few
> actions and virtual calls in C. So it is not free even for memtx
> spaces.
>
> But it changes extremely rarely, which makes it reasonable to cache
> the value.
>
> Bucket count is not used much now, but will be used a lot in the
> future storage_ref() function, which is a part of map-reduce API.
>
> The idea is that a router will need to reference all the storages
> and ensure that all the buckets in the cluster are pinned to their
> storages. To check this, storage_ref() will return number of
> buckets successfully pinned on the storage.
>
> The router will sum counts from all storage_ref() calls and ensure
> the sum equals the total configured bucket count.
>
> This means bucket count is needed for each storage_ref() call,
> whose count per second can be thousands and more.
>
> The patch makes count calculation cost as much as one Lua function
> call and a Lua table index operation (almost always).
>
> Needed for #147
> ---
>   test/storage/storage.result   | 45 +++++++++++++++++++++++++++++++++++
>   test/storage/storage.test.lua | 18 ++++++++++++++
>   vshard/storage/init.lua       | 44 +++++++++++++++++++++++++++++-----
>   3 files changed, 101 insertions(+), 6 deletions(-)
>
> diff --git a/test/storage/storage.result b/test/storage/storage.result
> index 0550ad1..edb45be 100644
> --- a/test/storage/storage.result
> +++ b/test/storage/storage.result
> @@ -677,6 +677,51 @@ rs:callro('echo', {'some_data'})
>   - null
>   - null
>   ...
> +--
> +-- Bucket count is calculated properly.
> +--
> +-- Cleanup after the previous tests.
> +_ = test_run:switch('storage_1_a')
> +---
> +...
> +buckets = vshard.storage.buckets_info()
> +---
> +...
> +for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
> +---
> +...
> +_ = test_run:switch('storage_2_a')
> +---
> +...
> +buckets = vshard.storage.buckets_info()
> +---
> +...
> +for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
> +---
> +...
> +_ = test_run:switch('storage_1_a')
> +---
> +...
> +assert(vshard.storage.buckets_count() == 0)
> +---
> +- true
> +...
> +vshard.storage.bucket_force_create(1, 5)
> +---
> +- true
> +...
> +assert(vshard.storage.buckets_count() == 5)
> +---
> +- true
> +...
> +vshard.storage.bucket_force_create(6, 5)
> +---
> +- true
> +...
> +assert(vshard.storage.buckets_count() == 10)
> +---
> +- true
> +...
>   _ = test_run:switch("default")
>   ---
>   ...
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index d8fbd94..db014ef 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -187,6 +187,24 @@ util.has_same_fields(old_internal, vshard.storage.internal)
>   _, rs = next(vshard.storage.internal.replicasets)
>   rs:callro('echo', {'some_data'})
>   
> +--
> +-- Bucket count is calculated properly.
> +--
> +-- Cleanup after the previous tests.
> +_ = test_run:switch('storage_1_a')
> +buckets = vshard.storage.buckets_info()
> +for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
> +_ = test_run:switch('storage_2_a')
> +buckets = vshard.storage.buckets_info()
> +for bid, _ in pairs(buckets) do vshard.storage.bucket_force_drop(bid) end
> +
> +_ = test_run:switch('storage_1_a')
> +assert(vshard.storage.buckets_count() == 0)
> +vshard.storage.bucket_force_create(1, 5)
> +assert(vshard.storage.buckets_count() == 5)
> +vshard.storage.bucket_force_create(6, 5)
> +assert(vshard.storage.buckets_count() == 10)
> +
>   _ = test_run:switch("default")
>   test_run:drop_cluster(REPLICASET_2)
>   test_run:drop_cluster(REPLICASET_1)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index a3d383d..9b74bcb 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -110,6 +110,9 @@ if not M then
>           -- replace the old function is to keep its reference.
>           --
>           bucket_on_replace = nil,
> +        -- Fast alternative to box.space._bucket:count(). But may be nil. Reset
> +        -- on each generation change.
> +        bucket_count_cache = nil,
>           -- Redirects for recently sent buckets. They are kept for a while to
>           -- help routers to find a new location for sent and deleted buckets
>           -- without whole cluster scan.
> @@ -183,10 +186,44 @@ local function local_call(func_name, args)
>       return pcall(netbox_self_call, netbox_self, func_name, args)
>   end
>   
> +--
> +-- Get number of buckets stored on this storage. Regardless of their state.
> +--
> +-- The idea is that all the code should use one function ref to get the bucket
> +-- count. But inside the function never branches. Instead, it points at one of 2
> +-- branch-less functions. Cached one simply returns a number which is supposed
> +-- to be super fast. Non-cached remembers the count and changes the global
> +-- function to the cached one. So on the next call it is cheap. No 'if's at all.
> +--
> +local bucket_count
> +
> +local function bucket_count_cache()
> +    return M.bucket_count_cache
> +end
> +
> +local function bucket_count_not_cache()
> +    local count = box.space._bucket:count()
> +    M.bucket_count_cache = count
> +    bucket_count = bucket_count_cache
> +    return count
> +end
> +
> +bucket_count = bucket_count_not_cache
> +
> +--
> +-- Can't expose bucket_count to the public API as is. Need this proxy-call.
> +-- Because the original function changes at runtime.
> +--
> +local function bucket_count_public()
> +    return bucket_count()
> +end
> +
>   --
>   -- Trigger for on replace into _bucket to update its generation.
>   --
>   local function bucket_generation_increment()
> +    bucket_count = bucket_count_not_cache
> +    M.bucket_count_cache = nil
>       M.bucket_generation = M.bucket_generation + 1
>       M.bucket_generation_cond:broadcast()
>   end
> @@ -2240,7 +2277,6 @@ local function rebalancer_request_state()
>       if #status_index:select({consts.BUCKET.GARBAGE}, {limit = 1}) > 0 then
>           return
>       end
> -    local bucket_count = _bucket:count()
>       return {
>           bucket_active_count = status_index:count({consts.BUCKET.ACTIVE}),
>           bucket_pinned_count = status_index:count({consts.BUCKET.PINNED}),
> @@ -2501,10 +2537,6 @@ end
>   -- Monitoring
>   --------------------------------------------------------------------------------
>   
> -local function storage_buckets_count()
> -    return  box.space._bucket.index.pk:count()
> -end
> -
>   local function storage_buckets_info(bucket_id)
>       local ibuckets = setmetatable({}, { __serialize = 'mapping' })
>   
> @@ -2780,7 +2812,7 @@ return {
>       cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end,
>       info = storage_info,
>       buckets_info = storage_buckets_info,
> -    buckets_count = storage_buckets_count,
> +    buckets_count = bucket_count_public,
>       buckets_discovery = buckets_discovery,
>       rebalancer_request_state = rebalancer_request_state,
>       internal = M,
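
The branch-free cache from the quoted patch can be reproduced in plain Lua. In this sketch `expensive_count` and `invalidate` are hypothetical names standing in for box.space._bucket:count() and for the generation trigger; the call counter exists only to show when the slow path runs.

```lua
local slow_calls = 0
local stored = 5

-- Hypothetical stand-in for box.space._bucket:count().
local function expensive_count()
    slow_calls = slow_calls + 1
    return stored
end

local cache
local bucket_count

local function bucket_count_cache()
    return cache
end

local function bucket_count_not_cache()
    cache = expensive_count()
    -- Next calls go straight to the cached version.
    bucket_count = bucket_count_cache
    return cache
end

bucket_count = bucket_count_not_cache

-- Stable entry point: callers must not keep a reference to bucket_count
-- itself, because that variable is reassigned at runtime.
local function bucket_count_public()
    return bucket_count()
end

-- Stand-in for the on_replace trigger which bumps the generation.
local function invalidate(new_value)
    stored = new_value
    cache = nil
    bucket_count = bucket_count_not_cache
end

assert(bucket_count_public() == 5)
assert(bucket_count_public() == 5)
assert(slow_calls == 1)
invalidate(10)
assert(bucket_count_public() == 10)
assert(slow_calls == 2)
```

The slow path runs once per invalidation, and the fast path contains no condition at all, which is the property the commit message asks for.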


* Re: [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch. LGTM.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Registry is a way to resolve cyclic dependencies which normally
> can exist between files of the same module/library.
>
> It is a global table hidden in _G under a long name that is
> unlikely to be used anywhere else.
>
> Files that want to expose their API to other files, which in
> turn can't require the former directly, should put their API into
> the registry.
>
> The files use the registry to get API of the other files. They
> don't require() and use the latter directly.
>
> At runtime, when all require() are done, the registry is full,
> and all the files see API of each other.
>
> Such circular dependency will exist between new files implementing
> map-reduce engine as a set of relatively independent submodules of
> the storage.
>
> In particular there will be storage_ref and storage_sched. Both
> require a few functions from the main storage file, and will use
> API of each other.
>
> Having the modules accessed via registry adds at least +1 indexing
> operation at runtime when a function is fetched from there. But
> sometimes it can be cached similar to how bucket count cache works
> in the main storage file.
>
> Main purpose is not to increase size of the main storage file
> again. It wouldn't fix the circular deps anyway, and would make it
> much harder to follow the code.
>
> Part of #147
> ---
>   vshard/CMakeLists.txt   |  3 +-
>   vshard/registry.lua     | 67 +++++++++++++++++++++++++++++++++++++++++
>   vshard/storage/init.lua |  5 ++-
>   3 files changed, 73 insertions(+), 2 deletions(-)
>   create mode 100644 vshard/registry.lua
>
> diff --git a/vshard/CMakeLists.txt b/vshard/CMakeLists.txt
> index 78a3f07..2a15df5 100644
> --- a/vshard/CMakeLists.txt
> +++ b/vshard/CMakeLists.txt
> @@ -7,4 +7,5 @@ add_subdirectory(router)
>   
>   # Install module
>   install(FILES cfg.lua error.lua consts.lua hash.lua init.lua replicaset.lua
> -        util.lua lua_gc.lua rlist.lua heap.lua DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard)
> +        util.lua lua_gc.lua rlist.lua heap.lua registry.lua
> +        DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard)
> diff --git a/vshard/registry.lua b/vshard/registry.lua
> new file mode 100644
> index 0000000..9583add
> --- /dev/null
> +++ b/vshard/registry.lua
> @@ -0,0 +1,67 @@
> +--
> +-- Registry is a way to resolve cyclic dependencies which normally can exist
> +-- between files of the same module/library.
> +--
> +-- Files that want to expose their API to other files, which in turn can't
> +-- require the former directly, should put their API into the registry.
> +--
> +-- The files should use the registry to get API of the other files. They don't
> +-- require() and use the latter directly if there is a known loop dependency
> +-- between them.
> +--
> +-- At runtime, when all require() are done, the registry is full, and all the
> +-- files see API of each other.
> +--
> +-- Having the modules accessed via the registry adds at least +1 indexing
> +-- operation at runtime when a function is fetched from there. But sometimes it
> +-- can be cached to reduce the effect in perf-sensitive code. For example, like
> +-- this:
> +--
> +--     local lreg = require('vshard.registry')
> +--
> +--     local storage_func
> +--
> +--     local function storage_func_no_cache(...)
> +--         storage_func = lreg.storage.func
> +--         return storage_func(...)
> +--     end
> +--
> +--     storage_func = storage_func_no_cache
> +--
> +-- The code will always call storage_func(), but will load it from the registry
> +-- only on first invocation.
> +--
> +-- However, when reload matters, this is not possible - the original
> +-- function object in the registry may change. In such a situation it still
> +-- makes sense to cache at least 'lreg.storage' to save 1 indexing operation.
> +--
> +--     local lreg = require('vshard.registry')
> +--
> +--     local lstorage
> +--
> +--     local function storage_func_cache(...)
> +--         return lstorage.storage_func(...)
> +--     end
> +--
> +--     local function storage_func_no_cache(...)
> +--         lstorage = lreg.storage
> +--         storage_func = storage_func_cache
> +--         return lstorage.storage_func(...)
> +--     end
> +--
> +--     storage_func = storage_func_no_cache
> +--
> +-- A harder way would be to use the first approach + add triggers on reload of
> +-- the cached module to update the cached function refs. This pays off only
> +-- if the code is extremely perf-critical (which should not be Lua then).
> +--
> +
> +local MODULE_INTERNALS = '__module_vshard_registry'
> +
> +local M = rawget(_G, MODULE_INTERNALS)
> +if not M then
> +    M = {}
> +    rawset(_G, MODULE_INTERNALS, M)
> +end
> +
> +return M
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 9b74bcb..b47665b 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -16,7 +16,7 @@ if rawget(_G, MODULE_INTERNALS) then
>           'vshard.consts', 'vshard.error', 'vshard.cfg',
>           'vshard.replicaset', 'vshard.util',
>           'vshard.storage.reload_evolution',
> -        'vshard.lua_gc', 'vshard.rlist'
> +        'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
>       }
>       for _, module in pairs(vshard_modules) do
>           package.loaded[module] = nil
> @@ -29,6 +29,7 @@ local lcfg = require('vshard.cfg')
>   local lreplicaset = require('vshard.replicaset')
>   local util = require('vshard.util')
>   local lua_gc = require('vshard.lua_gc')
> +local lregistry = require('vshard.registry')
>   local reload_evolution = require('vshard.storage.reload_evolution')
>   local bucket_ref_new
>   
> @@ -2782,6 +2783,8 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
>   M.schema_version_make = schema_version_make
>   M.schema_bootstrap = schema_init_0_1_15_0
>   
> +lregistry.storage = M
> +
>   return {
>       sync = sync,
>       bucket_force_create = bucket_force_create,
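
How the registry breaks a require() cycle can be shown in a single plain-Lua file. The two "modules" below are ordinary tables, and the names `mod_a`, `mod_b`, and `'__example_registry'` are invented for this sketch. Each side publishes its API into the shared table and looks the other side up only at call time, when both are already registered.

```lua
local KEY = '__example_registry'

-- Same get-or-create logic as in the quoted registry.lua.
local function registry()
    local r = rawget(_G, KEY)
    if not r then
        r = {}
        rawset(_G, KEY, r)
    end
    return r
end

-- "Module A": uses B, but only when called.
do
    local reg = registry()
    local A = {}
    function A.ping(n)
        if n == 0 then return 'done' end
        return reg.mod_b.pong(n - 1)
    end
    reg.mod_a = A
end

-- "Module B": uses A the same way.
do
    local reg = registry()
    local B = {}
    function B.pong(n)
        if n == 0 then return 'done' end
        return reg.mod_a.ping(n - 1)
    end
    reg.mod_b = B
end

assert(registry().mod_a.ping(3) == 'done')
-- A repeated lookup sees the same table instead of creating a new one.
assert(registry() == registry())
```

Neither block needs the other to exist at load time, so the order in which they are loaded does not matter.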


* Re: [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for your patch. LGTM.

I see several usages of cond:wait() in code. Maybe after introducing this
helper you could use it there, e.g. in the "bucket_send_xc" function.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Original fiber_cond:wait() has a few issues:
>
> - Raises exception when fiber is canceled, which makes it
>    inapplicable in exception-intolerant code;
>
> - Raises an ugly misleading usage exception when timeout is
>    negative which easily can happen if the caller's code calls
>    wait() multiple times retrying something and does not want to
>    bother with doing a part of cond's job;
>
> - When fails, pushes an error of type 'TimedOut' which is not the
>    same as 'ClientError' with box.error.TIMEOUT code. The latter is
>    used wider, at least in vshard.
>
> The patch introduces util.fiber_cond_wait() function which fixes
> the mentioned issues.
>
> It is needed in the future map-reduce subsystem modules revolving
> around waiting on various conditions.
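
The nil,err contract described above can be tried outside Tarantool with
stand-in objects. The sketch below is only a model under stated
assumptions: make_timeout_error() and the three fake cond tables are
hypothetical and merely imitate lerror.timeout() and fiber.cond(). It
shows how a single pcall() turns both a timeout and a thrown cancellation
into a returned error object.

```lua
-- Hypothetical stand-in for lerror.timeout(); only the shape matters here.
local function make_timeout_error()
    return {type = 'ClientError', message = 'Timeout exceeded'}
end

-- Throwing version: a negative timeout is reported as a plain timeout and
-- is never passed down to wait().
local function cond_wait_xc(cond, timeout)
    if timeout < 0 or not cond:wait(timeout) then
        error(make_timeout_error())
    end
end

-- Exception-safe version: true on success, nil + error object otherwise.
local function cond_wait(cond, timeout)
    local ok, err = pcall(cond_wait_xc, cond, timeout)
    if ok then
        return true
    end
    return nil, err
end

-- Fake conds (assumptions, not the fiber.cond() API): one that times out,
-- one that is signaled, one whose wait() throws like in a canceled fiber.
local cond_timed_out = {wait = function() return false end}
local cond_signaled = {wait = function() return true end}
local cond_canceled = {wait = function()
    error({type = 'FiberIsCancelled', message = 'fiber is cancelled'})
end}

local ok, err = cond_wait(cond_timed_out, -1)
print(ok, err.message)
ok, err = cond_wait(cond_signaled, 1)
print(ok, err)
ok, err = cond_wait(cond_canceled, 1)
print(ok, err.message)
```

The caller never needs its own pcall() or a check of the timeout sign,
which is the point of the helper.
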
>
> Part of #147
> ---
>   test/unit/util.result   | 82 +++++++++++++++++++++++++++++++++++++++++
>   test/unit/util.test.lua | 31 ++++++++++++++++
>   vshard/util.lua         | 35 ++++++++++++++++++
>   3 files changed, 148 insertions(+)
>
> diff --git a/test/unit/util.result b/test/unit/util.result
> index 42a361a..679c087 100644
> --- a/test/unit/util.result
> +++ b/test/unit/util.result
> @@ -184,3 +184,85 @@ t ~= res
>   ---
>   - true
>   ...
> +--
> +-- Exception-safe cond wait.
> +--
> +cond_wait = util.fiber_cond_wait
> +---
> +...
> +cond = fiber.cond()
> +---
> +...
> +ok, err = cond_wait(cond, -1)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +-- Ensure it does not return 'false' like pcall(). It must conform to nil,err
> +-- signature.
> +assert(type(ok) == 'nil')
> +---
> +- true
> +...
> +ok, err = cond_wait(cond, 0)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +ok, err = cond_wait(cond, 0.000001)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +ok, err = nil
> +---
> +...
> +_ = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
> +---
> +...
> +fiber.yield()
> +---
> +...
> +cond:signal()
> +---
> +...
> +_ = test_run:wait_cond(function() return ok or err end)
> +---
> +...
> +assert(ok and not err)
> +---
> +- true
> +...
> +ok, err = nil
> +---
> +...
> +f = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
> +---
> +...
> +fiber.yield()
> +---
> +...
> +f:cancel()
> +---
> +...
> +_ = test_run:wait_cond(function() return ok or err end)
> +---
> +...
> +assert(not ok)
> +---
> +- true
> +...
> +err.message
> +---
> +- fiber is cancelled
> +...
> +assert(type(err) == 'table')
> +---
> +- true
> +...
> diff --git a/test/unit/util.test.lua b/test/unit/util.test.lua
> index 9550a95..df3db6f 100644
> --- a/test/unit/util.test.lua
> +++ b/test/unit/util.test.lua
> @@ -76,3 +76,34 @@ yield_count
>   t
>   res
>   t ~= res
> +
> +--
> +-- Exception-safe cond wait.
> +--
> +cond_wait = util.fiber_cond_wait
> +cond = fiber.cond()
> +ok, err = cond_wait(cond, -1)
> +assert(not ok and err.message)
> +-- Ensure it does not return 'false' like pcall(). It must conform to nil,err
> +-- signature.
> +assert(type(ok) == 'nil')
> +ok, err = cond_wait(cond, 0)
> +assert(not ok and err.message)
> +ok, err = cond_wait(cond, 0.000001)
> +assert(not ok and err.message)
> +
> +ok, err = nil
> +_ = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
> +fiber.yield()
> +cond:signal()
> +_ = test_run:wait_cond(function() return ok or err end)
> +assert(ok and not err)
> +
> +ok, err = nil
> +f = fiber.create(function() ok, err = cond_wait(cond, 1000000) end)
> +fiber.yield()
> +f:cancel()
> +_ = test_run:wait_cond(function() return ok or err end)
> +assert(not ok)
> +err.message
> +assert(type(err) == 'table')
> diff --git a/vshard/util.lua b/vshard/util.lua
> index 2362607..d78f3a5 100644
> --- a/vshard/util.lua
> +++ b/vshard/util.lua
> @@ -1,6 +1,7 @@
>   -- vshard.util
>   local log = require('log')
>   local fiber = require('fiber')
> +local lerror = require('vshard.error')
>   
>   local MODULE_INTERNALS = '__module_vshard_util'
>   local M = rawget(_G, MODULE_INTERNALS)
> @@ -191,6 +192,39 @@ local function table_minus_yield(dst, src, interval)
>       return dst
>   end
>   
> +local function fiber_cond_wait_xc(cond, timeout)
> +    -- Handle negative timeout specifically - otherwise wait() will throw an
> +    -- ugly usage error.
> +    -- Don't trust this check to the caller's code, because often it just calls
> +    -- wait many times until it fails or the condition is met. Code looks much
> +    -- cleaner when it does not need to check the timeout sign. On the other
> +    -- hand, perf is not important here - anyway wait() yields which is slow on
> +    -- its own, but also breaks JIT trace recording which makes pcall() in the
> +    -- non-xc version of this function inconsiderable.
> +    if timeout < 0 or not cond:wait(timeout) then
> +        -- Don't use the original error if cond sets it. Because it sets
> +        -- TimedOut error. It does not have a proper error code, and may not be
> +        -- detected by router as a special timeout case if necessary. Or at
> +        -- least would complicate the handling in future. Instead, try to use a
> +        -- unified timeout error where possible.
> +        error(lerror.timeout())
> +    end
> +    -- Still possible though that the fiber is canceled and cond:wait() throws.
> +    -- This is why the _xc() version of this function throws even the timeout -
> +    -- anyway pcall() is inevitable.
> +end
> +
> +--
> +-- Exception-safe cond wait with unified errors in vshard format.
> +--
> +local function fiber_cond_wait(cond, timeout)
> +    local ok, err = pcall(fiber_cond_wait_xc, cond, timeout)
> +    if ok then
> +        return true
> +    end
> +    return nil, lerror.make(err)
> +end
> +
>   return {
>       tuple_extract_key = tuple_extract_key,
>       reloadable_fiber_create = reloadable_fiber_create,
> @@ -200,4 +234,5 @@ return {
>       version_is_at_least = version_is_at_least,
>       table_copy_yield = table_copy_yield,
>       table_minus_yield = table_minus_yield,
> +    fiber_cond_wait = fiber_cond_wait,
>   }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch. LGTM.


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Original fiber.testcancel() has an issue - it is not
> exception-safe. This makes it unusable for code which wants to do
> cleanup before cancellation.
>
> The patch introduces util.fiber_is_self_canceled() which checks if
> the current fiber is canceled but returns true/false instead of
> throwing an error.
>
> The patch is going to be used in the map-reduce patches where it
> will be necessary to check if the fiber is canceled. And if it
> is - perform cleanup and quit whatever the code was doing.
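
The "check, clean up, quit" pattern mentioned above might look roughly
like the sketch below. It is a model, not vshard code: testcancel() here
is a hypothetical stand-in for fiber.testcancel() driven by a plain flag,
and worker() with its log table exists only for illustration.

```lua
local canceled = false

-- Hypothetical stand-in for fiber.testcancel(): throws once a cancel was
-- requested, does nothing otherwise.
local function testcancel()
    if canceled then
        error({message = 'fiber is cancelled'})
    end
end

-- Same idea as in the patch: turn the exception into a boolean.
local function is_self_canceled()
    return not pcall(testcancel)
end

local log = {}

-- A loop that gets a chance to clean up before it stops.
local function worker(steps)
    for i = 1, steps do
        if is_self_canceled() then
            table.insert(log, 'cleanup')
            return nil, 'canceled'
        end
        table.insert(log, 'step ' .. i)
        if i == 2 then
            canceled = true -- simulate a cancel arriving during step 2
        end
    end
    return true
end

local ok, err = worker(5)
print(ok, err, table.concat(log, ', '))
```

With a throwing check the 'cleanup' entry would be skipped unless every
call site wrapped the check in its own pcall().
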
>
> Part of #147
> ---
>   test/unit/util.result   | 28 ++++++++++++++++++++++++++++
>   test/unit/util.test.lua | 14 ++++++++++++++
>   vshard/util.lua         |  8 ++++++++
>   3 files changed, 50 insertions(+)
>
> diff --git a/test/unit/util.result b/test/unit/util.result
> index 679c087..c83e80c 100644
> --- a/test/unit/util.result
> +++ b/test/unit/util.result
> @@ -266,3 +266,31 @@ assert(type(err) == 'table')
>   ---
>   - true
>   ...
> +--
> +-- Exception-safe fiber cancel check.
> +--
> +self_is_canceled = util.fiber_is_self_canceled
> +---
> +...
> +assert(not self_is_canceled())
> +---
> +- true
> +...
> +ok = nil
> +---
> +...
> +_ = fiber.create(function()                                                     \
> +    local f = fiber.self()                                                      \
> +    pcall(f.cancel, f)                                                          \
> +    ok = self_is_canceled()                                                     \
> +end)
> +---
> +...
> +test_run:wait_cond(function() return ok ~= nil end)
> +---
> +- true
> +...
> +assert(ok)
> +---
> +- true
> +...
> diff --git a/test/unit/util.test.lua b/test/unit/util.test.lua
> index df3db6f..881feb4 100644
> --- a/test/unit/util.test.lua
> +++ b/test/unit/util.test.lua
> @@ -107,3 +107,17 @@ _ = test_run:wait_cond(function() return ok or err end)
>   assert(not ok)
>   err.message
>   assert(type(err) == 'table')
> +
> +--
> +-- Exception-safe fiber cancel check.
> +--
> +self_is_canceled = util.fiber_is_self_canceled
> +assert(not self_is_canceled())
> +ok = nil
> +_ = fiber.create(function()                                                     \
> +    local f = fiber.self()                                                      \
> +    pcall(f.cancel, f)                                                          \
> +    ok = self_is_canceled()                                                     \
> +end)
> +test_run:wait_cond(function() return ok ~= nil end)
> +assert(ok)
> diff --git a/vshard/util.lua b/vshard/util.lua
> index d78f3a5..30a1e6e 100644
> --- a/vshard/util.lua
> +++ b/vshard/util.lua
> @@ -225,6 +225,13 @@ local function fiber_cond_wait(cond, timeout)
>       return nil, lerror.make(err)
>   end
>   
> +--
> +-- Exception-safe way to check if the current fiber is canceled.
> +--
> +local function fiber_is_self_canceled()
> +    return not pcall(fiber.testcancel)
> +end
> +
>   return {
>       tuple_extract_key = tuple_extract_key,
>       reloadable_fiber_create = reloadable_fiber_create,
> @@ -235,4 +242,5 @@ return {
>       table_copy_yield = table_copy_yield,
>       table_minus_yield = table_minus_yield,
>       fiber_cond_wait = fiber_cond_wait,
> +    fiber_is_self_canceled = fiber_is_self_canceled,
>   }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch. LGTM.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> The future map-reduce code will need to wait until all buckets on
> the storage enter writable state. If they are not writable, the
> code should wait efficiently, without polling.
>
> The patch adds a function bucket_generation_wait() which is
> registered in registry.storage.
>
> It helps to wait until the state of any bucket changes. A caller
> that wants to wait for all buckets to enter writable state should
> wait on the generation and re-check the requested condition until
> it matches or the timeout expires.
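
The wait-and-re-check loop described above can be sketched as follows.
Everything here is an assumption made for illustration: wait_until() is a
hypothetical helper, and the fake clock(), wait_change() and is_ready()
stand in for fiber.clock(), bucket_generation_wait() and a bucket state
check. The model runs in plain Lua with simulated time.

```lua
-- Re-check is_ready() each time wait_change() reports a change, until the
-- deadline. All three callbacks are hypothetical parameters.
local function wait_until(is_ready, wait_change, clock, timeout)
    local deadline = clock() + timeout
    while not is_ready() do
        -- The remaining time may become zero or negative after several
        -- rounds; the waiting function is expected to report a timeout then.
        local ok, err = wait_change(deadline - clock())
        if not ok then
            return nil, err
        end
    end
    return true
end

-- Simulated environment: each successful wait takes 1 second and bumps the
-- generation; the condition is met after 3 changes.
local now = 0
local generation = 0

local function clock()
    return now
end

local function wait_change(timeout)
    if timeout < 1 then
        now = now + math.max(timeout, 0)
        return nil, {message = 'Timeout exceeded'}
    end
    now = now + 1
    generation = generation + 1
    return true
end

local function is_ready()
    return generation >= 3
end

print(wait_until(is_ready, wait_change, clock, 10))
generation = 0
local ok, err = wait_until(is_ready, wait_change, clock, 2.5)
print(ok, err.message)
```

The loop passes whatever time is left to the waiting function without
checking its sign, which works only because the waiting function treats a
negative timeout as an ordinary timeout.
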
>
> Part of #147
> ---
>   test/storage/storage.result   | 86 +++++++++++++++++++++++++++++++++++
>   test/storage/storage.test.lua | 36 +++++++++++++++
>   vshard/storage/init.lua       |  6 +++
>   3 files changed, 128 insertions(+)
>
> diff --git a/test/storage/storage.result b/test/storage/storage.result
> index edb45be..4730e20 100644
> --- a/test/storage/storage.result
> +++ b/test/storage/storage.result
> @@ -722,6 +722,92 @@ assert(vshard.storage.buckets_count() == 10)
>   ---
>   - true
>   ...
> +--
> +-- Bucket_generation_wait() registry function.
> +--
> +lstorage = require('vshard.registry').storage
> +---
> +...
> +ok, err = lstorage.bucket_generation_wait(-1)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +ok, err = lstorage.bucket_generation_wait(0)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +small_timeout = 0.000001
> +---
> +...
> +ok, err = lstorage.bucket_generation_wait(small_timeout)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Timeout exceeded
> +...
> +ok, err = nil
> +---
> +...
> +big_timeout = 1000000
> +---
> +...
> +_ = fiber.create(function()                                                     \
> +    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
> +end)
> +---
> +...
> +fiber.sleep(small_timeout)
> +---
> +...
> +assert(not ok and not err)
> +---
> +- true
> +...
> +vshard.storage.bucket_force_drop(10)
> +---
> +- true
> +...
> +test_run:wait_cond(function() return ok or err end)
> +---
> +- true
> +...
> +assert(ok)
> +---
> +- true
> +...
> +-- Cancel should interrupt the waiting.
> +ok, err = nil
> +---
> +...
> +f = fiber.create(function()                                                     \
> +    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
> +end)
> +---
> +...
> +fiber.sleep(small_timeout)
> +---
> +...
> +assert(not ok and not err)
> +---
> +- true
> +...
> +f:cancel()
> +---
> +...
> +_ = test_run:wait_cond(function() return ok or err end)
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- fiber is cancelled
> +...
>   _ = test_run:switch("default")
>   ---
>   ...
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index db014ef..86c5e33 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -205,6 +205,42 @@ assert(vshard.storage.buckets_count() == 5)
>   vshard.storage.bucket_force_create(6, 5)
>   assert(vshard.storage.buckets_count() == 10)
>   
> +--
> +-- Bucket_generation_wait() registry function.
> +--
> +lstorage = require('vshard.registry').storage
> +ok, err = lstorage.bucket_generation_wait(-1)
> +assert(not ok and err.message)
> +
> +ok, err = lstorage.bucket_generation_wait(0)
> +assert(not ok and err.message)
> +
> +small_timeout = 0.000001
> +ok, err = lstorage.bucket_generation_wait(small_timeout)
> +assert(not ok and err.message)
> +
> +ok, err = nil
> +big_timeout = 1000000
> +_ = fiber.create(function()                                                     \
> +    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
> +end)
> +fiber.sleep(small_timeout)
> +assert(not ok and not err)
> +vshard.storage.bucket_force_drop(10)
> +test_run:wait_cond(function() return ok or err end)
> +assert(ok)
> +
> +-- Cancel should interrupt the waiting.
> +ok, err = nil
> +f = fiber.create(function()                                                     \
> +    ok, err = lstorage.bucket_generation_wait(big_timeout)                      \
> +end)
> +fiber.sleep(small_timeout)
> +assert(not ok and not err)
> +f:cancel()
> +_ = test_run:wait_cond(function() return ok or err end)
> +assert(not ok and err.message)
> +
>   _ = test_run:switch("default")
>   test_run:drop_cluster(REPLICASET_2)
>   test_run:drop_cluster(REPLICASET_1)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index b47665b..ffa48b6 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -31,6 +31,7 @@ local util = require('vshard.util')
>   local lua_gc = require('vshard.lua_gc')
>   local lregistry = require('vshard.registry')
>   local reload_evolution = require('vshard.storage.reload_evolution')
> +local fiber_cond_wait = util.fiber_cond_wait
>   local bucket_ref_new
>   
>   local M = rawget(_G, MODULE_INTERNALS)
> @@ -229,6 +230,10 @@ local function bucket_generation_increment()
>       M.bucket_generation_cond:broadcast()
>   end
>   
> +local function bucket_generation_wait(timeout)
> +    return fiber_cond_wait(M.bucket_generation_cond, timeout)
> +end
> +
>   --
>   -- Check if this replicaset is locked. It means be invisible for
>   -- the rebalancer.
> @@ -2783,6 +2788,7 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
>   M.schema_version_make = schema_version_make
>   M.schema_bootstrap = schema_init_0_1_15_0
>   
> +M.bucket_generation_wait = bucket_generation_wait
>   lregistry.storage = M
>   
>   return {

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:27 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch.

It seems I should return here to a point from one of my previous e-mails.

Maybe it's reasonable to cache all bucket stats?


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> The future map-reduce code will need to check whether all buckets
> on the storage are in writable state. If they are, any request
> can do anything with all the data on the storage.
>
> Such an 'all writable' state will be pinned by a new module
> 'storage_ref' so that map-reduce requests can execute without
> being afraid of the rebalancer.
>
> The patch adds a function bucket_are_all_rw() which is registered
> in registry.storage.
>
> The function is not trivial because it tries to cache the returned
> value. That makes a lot of sense, because the value changes very
> rarely and the calculation costs a lot (4 lookups in an index by a
> string key via FFI + each lookup returns a tuple which is +1 Lua
> GC object).
>
> The function is going to be used on almost every map-reduce
> request, so it must be fast.
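
The caching trick described above (swap the function instead of checking
a flag on every call) can be modeled in plain Lua. This is a simplified
sketch under assumptions: bucket statuses live in an ordinary table
instead of box.space._bucket, and set_status() is a hypothetical
replacement for the on_replace generation trigger.

```lua
-- The four statuses the patch treats as not writable.
local not_writable = {sending = true, sent = true, receiving = true,
                      garbage = true}
local M = {statuses = {'active', 'active'}, all_rw_cache = nil, scans = 0}

-- Points either at the cached or at the non-cached implementation.
local all_rw

local function all_rw_cache()
    return M.all_rw_cache
end

local function all_rw_not_cache()
    M.scans = M.scans + 1
    local res = true
    for _, status in ipairs(M.statuses) do
        if not_writable[status] then
            res = false
            break
        end
    end
    M.all_rw_cache = res
    -- Later calls skip the scan until the cache is invalidated.
    all_rw = all_rw_cache
    return res
end

all_rw = all_rw_not_cache

-- Stable wrapper: callers keep one function while the target is swapped.
local function all_rw_public()
    return all_rw()
end

-- Hypothetical counterpart of the generation trigger: any status change
-- drops the cache and restores the slow path.
local function set_status(i, status)
    M.statuses[i] = status
    M.all_rw_cache = nil
    all_rw = all_rw_not_cache
end

local first = all_rw_public()   -- scans the statuses
local second = all_rw_public()  -- answered from the cache
print(first, second, M.scans)
set_status(2, 'sending')
local third = all_rw_public()   -- scans again after the invalidation
print(third, M.scans)
```

The public wrapper matters: a caller that stored all_rw itself would keep
calling whichever implementation was current at that moment.
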
>
> Part of #147
> ---
>   test/storage/storage.result   | 37 +++++++++++++++++++++++++++++++++++
>   test/storage/storage.test.lua | 14 +++++++++++++
>   vshard/storage/init.lua       | 37 +++++++++++++++++++++++++++++++++++
>   3 files changed, 88 insertions(+)
>
> diff --git a/test/storage/storage.result b/test/storage/storage.result
> index 4730e20..2c9784a 100644
> --- a/test/storage/storage.result
> +++ b/test/storage/storage.result
> @@ -808,6 +808,43 @@ assert(not ok and err.message)
>   ---
>   - fiber is cancelled
>   ...
> +--
> +-- Bucket_are_all_rw() registry function.
> +--
> +assert(lstorage.bucket_are_all_rw())
> +---
> +- true
> +...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> +-- Let it stuck in the errinj.
> +vshard.storage.recovery_wakeup()
> +---
> +...
> +vshard.storage.bucket_force_create(10)
> +---
> +- true
> +...
> +box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.SENDING}})
> +---
> +- [10, 'sending']
> +...
> +assert(not lstorage.bucket_are_all_rw())
> +---
> +- true
> +...
> +box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.ACTIVE}})
> +---
> +- [10, 'active']
> +...
> +assert(lstorage.bucket_are_all_rw())
> +---
> +- true
> +...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
>   _ = test_run:switch("default")
>   ---
>   ...
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index 86c5e33..33f0498 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -241,6 +241,20 @@ f:cancel()
>   _ = test_run:wait_cond(function() return ok or err end)
>   assert(not ok and err.message)
>   
> +--
> +-- Bucket_are_all_rw() registry function.
> +--
> +assert(lstorage.bucket_are_all_rw())
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +-- Let it stuck in the errinj.
> +vshard.storage.recovery_wakeup()
> +vshard.storage.bucket_force_create(10)
> +box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.SENDING}})
> +assert(not lstorage.bucket_are_all_rw())
> +box.space._bucket:update(10, {{'=', 2, vshard.consts.BUCKET.ACTIVE}})
> +assert(lstorage.bucket_are_all_rw())
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
>   _ = test_run:switch("default")
>   test_run:drop_cluster(REPLICASET_2)
>   test_run:drop_cluster(REPLICASET_1)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index ffa48b6..c3ed236 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -115,6 +115,10 @@ if not M then
>           -- Fast alternative to box.space._bucket:count(). But may be nil. Reset
>           -- on each generation change.
>           bucket_count_cache = nil,
> +        -- Fast alternative to checking multiple keys presence in
> +        -- box.space._bucket status index. But may be nil. Reset on each
> +        -- generation change.
> +        bucket_are_all_rw_cache = nil,
>           -- Redirects for recently sent buckets. They are kept for a while to
>           -- help routers to find a new location for sent and deleted buckets
>           -- without whole cluster scan.
> @@ -220,12 +224,44 @@ local function bucket_count_public()
>       return bucket_count()
>   end
>   
> +--
> +-- Check if all buckets on the storage are writable. The idea is the same as
> +-- with bucket count - the value changes very rare, and is cached most of the
> +-- time. Only that its non-cached calculation is more expensive than with count.
> +--
> +local bucket_are_all_rw
> +
> +local function bucket_are_all_rw_cache()
> +    return M.bucket_are_all_rw_cache
> +end
> +
> +local function bucket_are_all_rw_not_cache()
> +    local status_index = box.space._bucket.index.status
> +    local status = consts.BUCKET
> +    local res = not status_index:min(status.SENDING) and
> +       not status_index:min(status.SENT) and
> +       not status_index:min(status.RECEIVING) and
> +       not status_index:min(status.GARBAGE)
> +
> +    M.bucket_are_all_rw_cache = res
> +    bucket_are_all_rw = bucket_are_all_rw_cache
> +    return res
> +end
> +
> +bucket_are_all_rw = bucket_are_all_rw_not_cache
> +
> +local function bucket_are_all_rw_public()
> +    return bucket_are_all_rw()
> +end
> +
>   --
>   -- Trigger for on replace into _bucket to update its generation.
>   --
>   local function bucket_generation_increment()
>       bucket_count = bucket_count_not_cache
> +    bucket_are_all_rw = bucket_are_all_rw_not_cache
>       M.bucket_count_cache = nil
> +    M.bucket_are_all_rw_cache = nil
>       M.bucket_generation = M.bucket_generation + 1
>       M.bucket_generation_cond:broadcast()
>   end
> @@ -2788,6 +2824,7 @@ M.schema_upgrade_handlers = schema_upgrade_handlers
>   M.schema_version_make = schema_version_make
>   M.schema_bootstrap = schema_init_0_1_15_0
>   
> +M.bucket_are_all_rw = bucket_are_all_rw_public
>   M.bucket_generation_wait = bucket_generation_wait
>   lregistry.storage = M
>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:49     ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-04 21:22   ` Oleg Babin via Tarantool-patches
  2021-03-21 18:49   ` Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch. This is a brief review - I hope to look at this
patch once again.

Consider a question below.


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> 'vshard.storage.ref' module helps to ensure that all buckets on
> the storage stay writable while there is at least one ref on the
> storage. Having the storage referenced allows executing any kind
> of request on all the visible data in all spaces in locally
> stored buckets.
>
> This is useful when there is a need to access tons of buckets at
> once, especially when exact bucket IDs are not known.
>
> Refs have deadlines, so that the storage does not freeze, unable
> to move buckets until restart, in case a ref is not deleted due
> to an error in user's code or a disconnect.
>
> The disconnects and restarts mean the refs can't be global.
> Otherwise any kinds of global counters, uuids and so on, even
> paired with any ids from a client could clash between clients on
> their reconnects or storage restarts. Unless they establish a
> TCP-like session, which would be too complicated.
>
> Instead, the refs are spread over the existing box sessions. This
> allows binding refs of each client to its TCP connection without
> caring about how to make them unique, how not to mess up the refs
> on restart, and how to drop the refs when a client disconnects.
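
The per-session design described above can be illustrated with a small
model. It is not the ref.lua implementation: the table layout, the error
strings, ref_gc() and session_disconnect() are hypothetical, and time is
passed in explicitly instead of being read from a clock. The add/use/del
names and the (rid, sid) argument order follow the tests in this patch;
treating a used ref as immune to its deadline is a reading of the test
comment about use(), not a confirmed detail.

```lua
local sessions = {} -- sid -> {rid -> ref}
local ref_count = 0 -- total refs; > 0 would mean "refuse bucket moves"

local function ref_add(rid, sid, timeout, now)
    local refs = sessions[sid]
    if not refs then
        refs = {}
        sessions[sid] = refs
    end
    if refs[rid] then
        return nil, 'duplicate ref'
    end
    refs[rid] = {deadline = now + timeout, in_use = false}
    ref_count = ref_count + 1
    return true
end

-- A used ref no longer expires; it lives until deleted explicitly or until
-- its session goes away.
local function ref_use(rid, sid)
    local ref = sessions[sid] and sessions[sid][rid]
    if not ref then
        return nil, 'no such ref'
    end
    ref.in_use = true
    return true
end

local function ref_del(rid, sid)
    local refs = sessions[sid]
    if not refs or not refs[rid] then
        return nil, 'no such ref'
    end
    refs[rid] = nil
    ref_count = ref_count - 1
    return true
end

-- Drop unused refs whose deadline has passed.
local function ref_gc(now)
    for _, refs in pairs(sessions) do
        for rid, ref in pairs(refs) do
            if not ref.in_use and ref.deadline <= now then
                refs[rid] = nil
                ref_count = ref_count - 1
            end
        end
    end
end

-- A disconnect drops every ref of the session at once.
local function session_disconnect(sid)
    for _ in pairs(sessions[sid] or {}) do
        ref_count = ref_count - 1
    end
    sessions[sid] = nil
end

assert(ref_add(0, 1, 10, 0))
assert(ref_add(1, 1, 10, 0))
assert(ref_add(0, 2, 5, 0)) -- same rid in another session: no clash
ref_use(1, 1)
ref_gc(20)                  -- unused refs expire, the used one stays
print(ref_count)
session_disconnect(1)
print(ref_count)
```

Because ref IDs only need to be unique within one session, two clients
can both use rid 0 without any coordination.
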
>
> Vshard.storage.ref does not depend on internals of the main file
> (storage/init.lua), so it is implemented as a separate module to
> keep it simple and isolated. It uses the storage via the registry
> only to get a couple of functions from its API.
>
> In addition, having it in a module simplifies the tests.
>
> The API is not public so far, and is going to be privately used by
> the future map-reduce API.
>
> Part of #147
> ---
>   test/reload_evolution/storage.result   |  66 ++++
>   test/reload_evolution/storage.test.lua |  28 ++
>   test/storage/ref.result                | 399 +++++++++++++++++++++++++
>   test/storage/ref.test.lua              | 166 ++++++++++
>   test/unit-tap/ref.test.lua             | 202 +++++++++++++
>   vshard/consts.lua                      |   1 +
>   vshard/error.lua                       |  19 ++
>   vshard/storage/CMakeLists.txt          |   2 +-
>   vshard/storage/init.lua                |   9 +
>   vshard/storage/ref.lua                 | 371 +++++++++++++++++++++++
>   10 files changed, 1262 insertions(+), 1 deletion(-)
>   create mode 100644 test/storage/ref.result
>   create mode 100644 test/storage/ref.test.lua
>   create mode 100755 test/unit-tap/ref.test.lua
>   create mode 100644 vshard/storage/ref.lua
>
> diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result
> index 9d30a04..c4a0cdd 100644
> --- a/test/reload_evolution/storage.result
> +++ b/test/reload_evolution/storage.result
> @@ -227,6 +227,72 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
>   ---
>   - 1500
>   ...
> +--
> +-- Ensure storage refs are enabled and work from the scratch via reload.
> +--
> +lref = require('vshard.storage.ref')
> +---
> +...
> +vshard.storage.rebalancer_disable()
> +---
> +...
> +big_timeout = 1000000
> +---
> +...
> +timeout = 0.01
> +---
> +...
> +lref.add(0, 0, big_timeout)
> +---
> +- true
> +...
> +status_index = box.space._bucket.index.status
> +---
> +...
> +bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
> +---
> +...
> +ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
> +                                     {timeout = timeout})
> +---
> +...
> +assert(not ok and err.message)
> +---
> +- Storage is referenced
> +...
> +lref.del(0, 0)
> +---
> +- true
> +...
> +vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
> +                           {timeout = big_timeout})
> +---
> +- true
> +...
> +wait_bucket_is_collected(bucket_id_to_move)
> +---
> +...
> +test_run:switch('storage_2_a')
> +---
> +- true
> +...
> +vshard.storage.rebalancer_disable()
> +---
> +...
> +big_timeout = 1000000
> +---
> +...
> +bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
> +---
> +...
> +vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
> +                           {timeout = big_timeout})
> +---
> +- true
> +...
> +wait_bucket_is_collected(bucket_id_to_move)
> +---
> +...
>   test_run:switch('default')
>   ---
>   - true
> diff --git a/test/reload_evolution/storage.test.lua b/test/reload_evolution/storage.test.lua
> index 639553e..c351ada 100644
> --- a/test/reload_evolution/storage.test.lua
> +++ b/test/reload_evolution/storage.test.lua
> @@ -83,6 +83,34 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
>   test_run:switch('storage_1_a')
>   box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
>   
> +--
> +-- Ensure storage refs are enabled and work from the scratch via reload.
> +--
> +lref = require('vshard.storage.ref')
> +vshard.storage.rebalancer_disable()
> +
> +big_timeout = 1000000
> +timeout = 0.01
> +lref.add(0, 0, big_timeout)
> +status_index = box.space._bucket.index.status
> +bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
> +ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
> +                                     {timeout = timeout})
> +assert(not ok and err.message)
> +lref.del(0, 0)
> +vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
> +                           {timeout = big_timeout})
> +wait_bucket_is_collected(bucket_id_to_move)
> +
> +test_run:switch('storage_2_a')
> +vshard.storage.rebalancer_disable()
> +
> +big_timeout = 1000000
> +bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
> +vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
> +                           {timeout = big_timeout})
> +wait_bucket_is_collected(bucket_id_to_move)
> +
>   test_run:switch('default')
>   test_run:drop_cluster(REPLICASET_2)
>   test_run:drop_cluster(REPLICASET_1)
> diff --git a/test/storage/ref.result b/test/storage/ref.result
> new file mode 100644
> index 0000000..d5f4166
> --- /dev/null
> +++ b/test/storage/ref.result
> @@ -0,0 +1,399 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +netbox = require('net.box')
> + | ---
> + | ...
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> + | ---
> + | ...
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> + | ---
> + | ...
> +
> +test_run:create_cluster(REPLICASET_1, 'storage')
> + | ---
> + | ...
> +test_run:create_cluster(REPLICASET_2, 'storage')
> + | ---
> + | ...
> +util = require('util')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> + | ---
> + | ...
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> + | ---
> + | ...
> +
> +--
> +-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
> +-- for map-reduce functionality to pin all buckets on all storages in the
> +-- cluster to execute consistent map-reduce calls on all cluster data.
> +--
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +vshard.storage.rebalancer_disable()
> + | ---
> + | ...
> +vshard.storage.bucket_force_create(1, 1500)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +vshard.storage.rebalancer_disable()
> + | ---
> + | ...
> +vshard.storage.bucket_force_create(1501, 1500)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +
> +--
> +-- Bucket moves are not allowed under a ref.
> +--
> +util = require('util')
> + | ---
> + | ...
> +sid = 0
> + | ---
> + | ...
> +rid = 0
> + | ---
> + | ...
> +big_timeout = 1000000
> + | ---
> + | ...
> +small_timeout = 0.001
> + | ---
> + | ...
> +lref.add(rid, sid, big_timeout)
> + | ---
> + | - true
> + | ...
> +-- Send fails.
> +ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> +                                     {timeout = big_timeout})
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Storage is referenced
> + | ...
> +lref.use(rid, sid)
> + | ---
> + | - true
> + | ...
> +-- Still fails - use only makes ref undead until it is deleted explicitly.
> +ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> +                                     {timeout = big_timeout})
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Storage is referenced
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +-- Receive (from another replicaset) also fails.
> +big_timeout = 1000000
> + | ---
> + | ...
> +ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
> +                                     {timeout = big_timeout})
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Storage is referenced
> + | ...
> +
> +--
> +-- After unref all the bucket moves are allowed again.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lref.del(rid, sid)
> + | ---
> + | - true
> + | ...
> +
> +vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
> + | ---
> + | - true
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
> + | ---
> + | - true
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +
> +--
> +-- While bucket move is in progress, ref won't work.
> +--
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
> +                 {timeout = big_timeout})
> + | ---
> + | ...
> +ok, err = lref.add(rid, sid, small_timeout)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Timeout exceeded
> + | ...
> +-- Ref will wait if timeout is big enough.
> +ok, err = nil
> + | ---
> + | ...
> +_ = fiber.create(function()                                                     \
> +    ok, err = lref.add(rid, sid, big_timeout)                                   \
> +end)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +test_run:wait_cond(function() return ok or err end)
> + | ---
> + | - true
> + | ...
> +lref.use(rid, sid)
> + | ---
> + | - true
> + | ...
> +lref.del(rid, sid)
> + | ---
> + | - true
> + | ...
> +assert(ok and not err)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
> + | ---
> + | - true
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +
> +--
> +-- Refs are bound to sessions.
> +--
> +box.schema.user.grant('storage', 'super')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +small_timeout = 0.001
> + | ---
> + | ...
> +function make_ref(rid, timeout)                                                 \
> +    return lref.add(rid, box.session.id(), timeout)                             \
> +end
> + | ---
> + | ...
> +function use_ref(rid)                                                           \
> +    return lref.use(rid, box.session.id())                                      \
> +end
> + | ---
> + | ...
> +function del_ref(rid)                                                           \
> +    return lref.del(rid, box.session.id())                                      \
> +end
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +netbox = require('net.box')
> + | ---
> + | ...
> +remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
> + | ---
> + | ...
> +c = netbox.connect(remote_uri)
> + | ---
> + | ...
> +
> +-- Ref is added and does not disappear anywhere on its own.
> +c:call('make_ref', {1, small_timeout})
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +assert(lref.count == 1)
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +-- Use works.
> +c:call('use_ref', {1})
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +assert(lref.count == 1)
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +-- Del works.
> +c:call('del_ref', {1})
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +-- Expiration works. Try to add a second ref when the first one is expired - the
> +-- first is collected and a subsequent use and del won't work.
> +c:call('make_ref', {1, small_timeout})
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +assert(lref.count == 1)
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +fiber.sleep(small_timeout)
> + | ---
> + | ...
> +c:call('make_ref', {2, small_timeout})
> + | ---
> + | - true
> + | ...
> +ok, err = c:call('use_ref', {1})
> + | ---
> + | ...
> +assert(ok == nil and err.message)
> + | ---
> + | - 'Can not use a storage ref: no ref'
> + | ...
> +ok, err = c:call('del_ref', {1})
> + | ---
> + | ...
> +assert(ok == nil and err.message)
> + | ---
> + | - 'Can not delete a storage ref: no ref'
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +assert(lref.count == 1)
> + | ---
> + | - true
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +--
> +-- Session disconnect removes its refs.
> +--
> +c:call('make_ref', {3, big_timeout})
> + | ---
> + | - true
> + | ...
> +c:close()
> + | ---
> + | ...
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch("default")
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_2)
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_1)
> + | ---
> + | ...
> diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
> new file mode 100644
> index 0000000..b34a294
> --- /dev/null
> +++ b/test/storage/ref.test.lua
> @@ -0,0 +1,166 @@
> +test_run = require('test_run').new()
> +netbox = require('net.box')
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> +
> +test_run:create_cluster(REPLICASET_1, 'storage')
> +test_run:create_cluster(REPLICASET_2, 'storage')
> +util = require('util')
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> +
> +--
> +-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
> +-- for map-reduce functionality to pin all buckets on all storages in the
> +-- cluster to execute consistent map-reduce calls on all cluster data.
> +--
> +
> +_ = test_run:switch('storage_1_a')
> +vshard.storage.rebalancer_disable()
> +vshard.storage.bucket_force_create(1, 1500)
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.rebalancer_disable()
> +vshard.storage.bucket_force_create(1501, 1500)
> +
> +_ = test_run:switch('storage_1_a')
> +lref = require('vshard.storage.ref')
> +
> +--
> +-- Bucket moves are not allowed under a ref.
> +--
> +util = require('util')
> +sid = 0
> +rid = 0
> +big_timeout = 1000000
> +small_timeout = 0.001
> +lref.add(rid, sid, big_timeout)
> +-- Send fails.
> +ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> +                                     {timeout = big_timeout})
> +assert(not ok and err.message)
> +lref.use(rid, sid)
> +-- Still fails - use only makes ref undead until it is deleted explicitly.
> +ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> +                                     {timeout = big_timeout})
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_2_a')
> +-- Receive (from another replicaset) also fails.
> +big_timeout = 1000000
> +ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
> +                                     {timeout = big_timeout})
> +assert(not ok and err.message)
> +
> +--
> +-- After unref all the bucket moves are allowed again.
> +--
> +_ = test_run:switch('storage_1_a')
> +lref.del(rid, sid)
> +
> +vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
> +wait_bucket_is_collected(1)
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
> +wait_bucket_is_collected(1)
> +
> +--
> +-- While bucket move is in progress, ref won't work.
> +--
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> +
> +_ = test_run:switch('storage_1_a')
> +fiber = require('fiber')
> +_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
> +                 {timeout = big_timeout})
> +ok, err = lref.add(rid, sid, small_timeout)
> +assert(not ok and err.message)
> +-- Ref will wait if timeout is big enough.
> +ok, err = nil
> +_ = fiber.create(function()                                                     \
> +    ok, err = lref.add(rid, sid, big_timeout)                                   \
> +end)
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
> +
> +_ = test_run:switch('storage_1_a')
> +wait_bucket_is_collected(1)
> +test_run:wait_cond(function() return ok or err end)
> +lref.use(rid, sid)
> +lref.del(rid, sid)
> +assert(ok and not err)
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
> +wait_bucket_is_collected(1)
> +
> +--
> +-- Refs are bound to sessions.
> +--
> +box.schema.user.grant('storage', 'super')
> +lref = require('vshard.storage.ref')
> +small_timeout = 0.001
> +function make_ref(rid, timeout)                                                 \
> +    return lref.add(rid, box.session.id(), timeout)                             \
> +end
> +function use_ref(rid)                                                           \
> +    return lref.use(rid, box.session.id())                                      \
> +end
> +function del_ref(rid)                                                           \
> +    return lref.del(rid, box.session.id())                                      \
> +end
> +
> +_ = test_run:switch('storage_1_a')
> +netbox = require('net.box')
> +remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
> +c = netbox.connect(remote_uri)
> +
> +-- Ref is added and does not disappear anywhere on its own.
> +c:call('make_ref', {1, small_timeout})
> +_ = test_run:switch('storage_2_a')
> +assert(lref.count == 1)
> +_ = test_run:switch('storage_1_a')
> +
> +-- Use works.
> +c:call('use_ref', {1})
> +_ = test_run:switch('storage_2_a')
> +assert(lref.count == 1)
> +_ = test_run:switch('storage_1_a')
> +
> +-- Del works.
> +c:call('del_ref', {1})
> +_ = test_run:switch('storage_2_a')
> +assert(lref.count == 0)
> +_ = test_run:switch('storage_1_a')
> +
> +-- Expiration works. Try to add a second ref when the first one is expired - the
> +-- first is collected and a subsequent use and del won't work.
> +c:call('make_ref', {1, small_timeout})
> +_ = test_run:switch('storage_2_a')
> +assert(lref.count == 1)
> +_ = test_run:switch('storage_1_a')
> +
> +fiber.sleep(small_timeout)
> +c:call('make_ref', {2, small_timeout})
> +ok, err = c:call('use_ref', {1})
> +assert(ok == nil and err.message)
> +ok, err = c:call('del_ref', {1})
> +assert(ok == nil and err.message)
> +_ = test_run:switch('storage_2_a')
> +assert(lref.count == 1)
> +_ = test_run:switch('storage_1_a')
> +
> +--
> +-- Session disconnect removes its refs.
> +--
> +c:call('make_ref', {3, big_timeout})
> +c:close()
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch("default")
> +test_run:drop_cluster(REPLICASET_2)
> +test_run:drop_cluster(REPLICASET_1)
> diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
> new file mode 100755
> index 0000000..d987a63
> --- /dev/null
> +++ b/test/unit-tap/ref.test.lua
> @@ -0,0 +1,202 @@
> +#!/usr/bin/env tarantool
> +
> +local tap = require('tap')
> +local test = tap.test('cfg')
> +local fiber = require('fiber')
> +local lregistry = require('vshard.registry')
> +local lref = require('vshard.storage.ref')
> +
> +local big_timeout = 1000000
> +local small_timeout = 0.000001
> +local sid = 0
> +local sid2 = 1
> +local sid3 = 2
> +
> +--
> +-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
> +-- for map-reduce functionality to pin all buckets on all storages in the
> +-- cluster to execute consistent map-reduce calls on all cluster data.
> +--
> +
> +--
> +-- Refs use the storage API to get the bucket space state and wait on its
> +-- changes. That is not important for these unit tests, so it is stubbed.
> +--
> +local function bucket_are_all_rw()
> +    return true
> +end
> +
> +lregistry.storage = {
> +    bucket_are_all_rw = bucket_are_all_rw,
> +}
> +
> +--
> +-- Min heap fill and empty.
> +--
> +local function test_ref_basic(test)
> +    test:plan(15)
> +
> +    local rid = 0
> +    local ok, err
> +    --
> +    -- Basic ref/unref.
> +    --
> +    ok, err = lref.add(rid, sid, big_timeout)
> +    test:ok(ok and not err, '+1 ref')
> +    test:is(lref.count, 1, 'accounted')
> +    ok, err = lref.use(rid, sid)
> +    test:ok(ok and not err, 'use the ref')
> +    test:is(lref.count, 1, 'but still accounted')
> +    ok, err = lref.del(rid, sid)
> +    test:ok(ok and not err, '-1 ref')
> +    test:is(lref.count, 0, 'accounted')
> +
> +    --
> +    -- Bad ref ID.
> +    --
> +    rid = 1
> +    ok, err = lref.use(rid, sid)
> +    test:ok(not ok and err, 'invalid RID at use')
> +    ok, err = lref.del(rid, sid)
> +    test:ok(not ok and err, 'invalid RID at del')
> +
> +    --
> +    -- Bad session ID.
> +    --
> +    lref.kill(sid)
> +    rid = 0
> +    ok, err = lref.use(rid, sid)
> +    test:ok(not ok and err, 'invalid SID at use')
> +    ok, err = lref.del(rid, sid)
> +    test:ok(not ok and err, 'invalid SID at del')
> +
> +    --
> +    -- Duplicate ID.
> +    --
> +    ok, err = lref.add(rid, sid, big_timeout)
> +    test:ok(ok and not err, 'add ref')
> +    ok, err = lref.add(rid, sid, big_timeout)
> +    test:ok(not ok and err, 'duplicate ref')
> +    test:is(lref.count, 1, 'did not affect count')
> +    test:ok(lref.use(rid, sid) and lref.del(rid, sid), 'del old ref')
> +    test:is(lref.count, 0, 'accounted')
> +end
> +
> +local function test_ref_incremental_gc(test)
> +    test:plan(20)
> +
> +    --
> +    -- Ref addition expires 2 old refs.
> +    --
> +    local ok, err
> +    for i = 0, 2 do
> +        assert(lref.add(i, sid, small_timeout))
> +    end
> +    fiber.sleep(small_timeout)
> +    test:is(lref.count, 3, 'expired refs are still here')
> +    test:ok(lref.add(3, sid, 0), 'add new ref')
> +    -- 3 + 1 new - 2 old = 2.
> +    test:is(lref.count, 2, 'it collected 2 old refs')
> +    test:ok(lref.add(4, sid, 0), 'add new ref')
> +    -- 2 + 1 new - 1 old = 2.
> +    test:is(lref.count, 2, 'it collected 1 old ref')
> +    test:ok(lref.del(4, sid), 'del the latest manually')
> +
> +    --
> +    -- Incremental GC works fine if only one ref was GCed.
> +    --
> +    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
> +    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
> +    fiber.sleep(small_timeout)
> +    test:ok(lref.add(2, sid, 0), 'add ref with 0 timeout')
> +    test:is(lref.count, 2, 'collected 1 old ref, 1 is kept')
> +    test:ok(lref.del(2, sid), 'del newest ref, it was not collected')
> +    test:ok(lref.del(1, sid), 'del ref with big timeout')
> +    test:is(lref.count, 0, 'all is deleted')
> +
> +    --
> +    -- GC works fine when only one ref was left and it was expired.
> +    --
> +    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
> +    test:is(lref.count, 1, '1 ref total')
> +    fiber.sleep(small_timeout)
> +    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
> +    test:is(lref.count, 1, 'collected the old one')
> +    lref.gc()
> +    test:is(lref.count, 1, 'still 1 - timeout was big')
> +    test:ok(lref.del(1, sid), 'delete it')
> +    test:is(lref.count, 0, 'no refs')
> +end
> +
> +local function test_ref_gc(test)
> +    test:plan(7)
> +
> +    --
> +    -- Generic GC works fine with multiple sessions.
> +    --
> +    assert(lref.add(0, sid, big_timeout))
> +    assert(lref.add(1, sid, small_timeout))
> +    assert(lref.add(0, sid3, small_timeout))
> +    assert(lref.add(0, sid2, small_timeout))
> +    assert(lref.add(1, sid2, big_timeout))
> +    assert(lref.add(1, sid3, big_timeout))
> +    test:is(lref.count, 6, 'add 6 refs total')
> +    fiber.sleep(small_timeout)
> +    lref.gc()
> +    test:is(lref.count, 3, '3 collected')
> +    test:ok(lref.del(0, sid), 'del first')
> +    test:ok(lref.del(1, sid2), 'del second')
> +    test:ok(lref.del(1, sid3), 'del third')
> +    test:is(lref.count, 0, '3 deleted')
> +    lref.gc()
> +    test:is(lref.count, 0, 'gc on empty refs did not break anything')
> +end
> +
> +local function test_ref_use(test)
> +    test:plan(7)
> +
> +    --
> +    -- Ref use updates the session heap.
> +    --
> +    assert(lref.add(0, sid, small_timeout))
> +    assert(lref.add(0, sid2, big_timeout))
> +    test:is(lref.count, 2, 'add 2 refs')
> +    test:ok(lref.use(0, sid), 'use one with small timeout')
> +    lref.gc()
> +    test:is(lref.count, 2, 'still 2 refs')
> +    fiber.sleep(small_timeout)
> +    test:is(lref.count, 2, 'still 2 refs after sleep')
> +    test:ok(lref.del(0, sid), 'del first')
> +    test:ok(lref.del(0, sid2), 'del second')
> +    test:is(lref.count, 0, 'now all is deleted')
> +end
> +
> +local function test_ref_del(test)
> +    test:plan(7)
> +
> +    --
> +    -- Ref del updates the session heap.
> +    --
> +    assert(lref.add(0, sid, small_timeout))
> +    assert(lref.add(0, sid2, big_timeout))
> +    test:is(lref.count, 2, 'add 2 refs')
> +    test:ok(lref.del(0, sid), 'del with small timeout')
> +    lref.gc()
> +    test:is(lref.count, 1, '1 ref remains')
> +    fiber.sleep(small_timeout)
> +    test:is(lref.count, 1, '1 ref remains after sleep')
> +    lref.gc()
> +    test:is(lref.count, 1, '1 ref remains after sleep and gc')
> +    test:ok(lref.del(0, sid2), 'del with big timeout')
> +    test:is(lref.count, 0, 'now all is deleted')
> +end
> +
> +test:plan(5)
> +
> +test:test('basic', test_ref_basic)
> +test:test('incremental gc', test_ref_incremental_gc)
> +test:test('gc', test_ref_gc)
> +test:test('use', test_ref_use)
> +test:test('del', test_ref_del)
> +
> +os.exit(test:check() and 0 or 1)
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index cf3f422..0ffe0e2 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -48,4 +48,5 @@ return {
>       DISCOVERY_TIMEOUT = 10,
>   
>       TIMEOUT_INFINITY = 500 * 365 * 86400,
> +    DEADLINE_INFINITY = math.huge,
>   }
> diff --git a/vshard/error.lua b/vshard/error.lua
> index a6f46a9..b02bfe9 100644
> --- a/vshard/error.lua
> +++ b/vshard/error.lua
> @@ -130,6 +130,25 @@ local error_message_template = {
>           name = 'TOO_MANY_RECEIVING',
>           msg = 'Too many receiving buckets at once, please, throttle'
>       },
> +    [26] = {
> +        name = 'STORAGE_IS_REFERENCED',
> +        msg = 'Storage is referenced'
> +    },
> +    [27] = {
> +        name = 'STORAGE_REF_ADD',
> +        msg = 'Can not add a storage ref: %s',
> +        args = {'reason'},
> +    },
> +    [28] = {
> +        name = 'STORAGE_REF_USE',
> +        msg = 'Can not use a storage ref: %s',
> +        args = {'reason'},
> +    },
> +    [29] = {
> +        name = 'STORAGE_REF_DEL',
> +        msg = 'Can not delete a storage ref: %s',
> +        args = {'reason'},
> +    },
>   }
>   
>   --
> diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
> index 3f4ed43..7c1e97d 100644
> --- a/vshard/storage/CMakeLists.txt
> +++ b/vshard/storage/CMakeLists.txt
> @@ -1,2 +1,2 @@
> -install(FILES init.lua reload_evolution.lua
> +install(FILES init.lua reload_evolution.lua ref.lua
>           DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index c3ed236..2957f48 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -17,6 +17,7 @@ if rawget(_G, MODULE_INTERNALS) then
>           'vshard.replicaset', 'vshard.util',
>           'vshard.storage.reload_evolution',
>           'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
> +        'vshard.heap', 'vshard.storage.ref',
>       }
>       for _, module in pairs(vshard_modules) do
>           package.loaded[module] = nil
> @@ -30,6 +31,7 @@ local lreplicaset = require('vshard.replicaset')
>   local util = require('vshard.util')
>   local lua_gc = require('vshard.lua_gc')
>   local lregistry = require('vshard.registry')
> +local lref = require('vshard.storage.ref')
>   local reload_evolution = require('vshard.storage.reload_evolution')
>   local fiber_cond_wait = util.fiber_cond_wait
>   local bucket_ref_new
> @@ -1140,6 +1142,9 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>               return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
>                                         from)
>           end
> +        if lref.count > 0 then
> +            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
> +        end


You will remove this part in the next patch. Do you really need it? Or did
you add it just for tests?


>           if is_this_replicaset_locked() then
>               return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>           end
> @@ -1441,6 +1446,9 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>   
>       local _bucket = box.space._bucket
>       local bucket = _bucket:get({bucket_id})
> +    if lref.count > 0 then
> +        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
> +    end


Ditto.

>       if is_this_replicaset_locked() then
>           return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>       end
> @@ -2528,6 +2536,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
>           box.space._bucket:on_replace(nil, M.bucket_on_replace)
>           M.bucket_on_replace = nil
>       end
> +    lref.cfg()
>       if is_master then
>           box.space._bucket:on_replace(bucket_generation_increment)
>           M.bucket_on_replace = bucket_generation_increment
> diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
> new file mode 100644
> index 0000000..7589cb9
> --- /dev/null
> +++ b/vshard/storage/ref.lua
> @@ -0,0 +1,371 @@
> +--
> +-- 'Ref' module helps to ensure that all buckets on the storage stay writable
> +-- while there is at least one ref on the storage.
> +-- Having the storage referenced allows executing any kind of request on all
> +-- the visible data in all spaces in locally stored buckets. This is useful
> +-- when many buckets must be accessed at once, especially when the exact bucket
> +-- IDs are not known.
> +--
> +-- Refs have deadlines, so that the storage does not freeze, unable to move
> +-- buckets until restart, when a ref is not deleted due to an error in the
> +-- user's code or a disconnect.
> +--
> +-- The disconnects and restarts mean the refs can't be global. Otherwise any
> +-- kinds of global counters, uuids and so on, even paired with any ids from a
> +-- client could clash between clients on their reconnects or storage restarts.
> +-- Unless they establish a TCP-like session, which would be too complicated.
> +--
> +-- Instead, the refs are spread over the existing box sessions. This allows to
> +-- bind refs of each client to its TCP connection and not care about how to make
> +-- them unique across all sessions, how not to mess the refs on restart, and how
> +-- to drop the refs when a client disconnects.
> +--
> +
> +local MODULE_INTERNALS = '__module_vshard_storage_ref'
> +-- Update when change behaviour of anything in the file, to be able to reload.
> +local MODULE_VERSION = 1
> +
> +local lfiber = require('fiber')
> +local lheap = require('vshard.heap')
> +local lerror = require('vshard.error')
> +local lconsts = require('vshard.consts')
> +local lregistry = require('vshard.registry')
> +local fiber_clock = lfiber.clock
> +local fiber_yield = lfiber.yield
> +local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
> +local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
> +
> +--
> +-- Binary heap sort. Object with the closest deadline should be on top.
> +--
> +local function heap_min_deadline_cmp(ref1, ref2)
> +    return ref1.deadline < ref2.deadline
> +end
> +
> +local M = rawget(_G, MODULE_INTERNALS)
> +if not M then
> +    M = {
> +        module_version = MODULE_VERSION,
> +        -- Total number of references in all sessions.
> +        count = 0,
> +        -- Heap of session objects. Each session has refs sorted by their
> +        -- deadline. The sessions themselves are also sorted by deadlines.
> +        -- Session deadline is defined as the closest deadline of all its refs.
> +        -- Or infinity in case there are no refs in it.
> +        session_heap = lheap.new(heap_min_deadline_cmp),
> +        -- Map of session objects. This is used to get session object by its ID.
> +        session_map = {},
> +        -- On session disconnect trigger to kill the dead sessions. It is saved
> +        -- here for the sake of future reload to be able to delete the old
> +        -- on disconnect function before setting a new one.
> +        on_disconnect = nil,
> +    }
> +else
> +    -- No reload so far. This is a first version. Return as is.
> +    return M
> +end
> +
> +local function ref_session_new(sid)
> +    -- Session object does not store its internal hot attributes in a table,
> +    -- because then access to any session attribute would cost at least one
> +    -- table indexing operation. Instead, all internal fields are stored as
> +    -- upvalues referenced by the methods defined as closures.
> +    --
> +    -- This means session creation may not be very suitable for jitting, but it
> +    -- is very rare, and the design optimizes the most common case.
> +    --
> +    -- Still the public functions take 'self' object to make it look normally.
> +    -- They even use it a bit.
> +
> +    -- Ref map to get ref object by its ID.
> +    local ref_map = {}
> +    -- Ref heap sorted by their deadlines.
> +    local ref_heap = lheap.new(heap_min_deadline_cmp)
> +    -- Total number of refs of the session. Is used to drop the session without
> +    -- fullscan of the ref map. Heap size can't be used because not all refs are
> +    -- stored here. See more on that below.
> +    local count = 0
> +    -- Cache global session storages as upvalues to save on M indexing.
> +    local global_heap = M.session_heap
> +    local global_map = M.session_map
> +
> +    local function ref_session_discount(self, del_count)
> +        local new_count = M.count - del_count
> +        assert(new_count >= 0)
> +        M.count = new_count
> +
> +        new_count = count - del_count
> +        assert(new_count >= 0)
> +        count = new_count
> +    end
> +
> +    local function ref_session_update_deadline(self)
> +        local ref = ref_heap:top()
> +        if not ref then
> +            self.deadline = DEADLINE_INFINITY
> +            global_heap:update(self)
> +        else
> +            local deadline = ref.deadline
> +            if deadline ~= self.deadline then
> +                self.deadline = deadline
> +                global_heap:update(self)
> +            end
> +        end
> +    end
> +
> +    --
> +    -- Garbage collect at most 2 expired refs. The idea is that there is no
> +    -- dedicated fiber for expired refs collection. It would be too expensive to
> +    -- wake up a fiber on each added or removed or updated ref.
> +    --
> +    -- Instead, ref GC is mostly incremental and works by the principle "remove
> +    -- more than add". On each new ref added, two old refs try to expire. This
> +    -- way refs don't stack infinitely, and the expired refs are eventually
> +    -- removed. Because removal is faster than addition: -2 for each +1.
> +    --
> +    local function ref_session_gc_step(self, now)
> +        -- This is inlined 2 iterations of the more general GC procedure. The
> +        -- latter is not called in order to save on not having a loop,
> +        -- additional branches and variables.
> +        if self.deadline > now then
> +            return
> +        end
> +        local top = ref_heap:top()
> +        ref_heap:remove_top()
> +        ref_map[top.id] = nil
> +        top = ref_heap:top()
> +        if not top then
> +            self.deadline = DEADLINE_INFINITY
> +            global_heap:update(self)
> +            ref_session_discount(self, 1)
> +            return
> +        end
> +        local deadline = top.deadline
> +        if deadline >= now then
> +            self.deadline = deadline
> +            global_heap:update(self)
> +            ref_session_discount(self, 1)
> +            return
> +        end
> +        ref_heap:remove_top()
> +        ref_map[top.id] = nil
> +        top = ref_heap:top()
> +        if not top then
> +            self.deadline = DEADLINE_INFINITY
> +        else
> +            self.deadline = top.deadline
> +        end
> +        global_heap:update(self)
> +        ref_session_discount(self, 2)
> +    end
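
To check my understanding of the "remove more than add" principle, here is a
minimal stand-alone sketch. It is not the patch's code: it assumes a sorted
array instead of vshard.heap, treats a ref as expired only when its deadline is
strictly below 'now', and all the names (new_session, gc_step, add, count) are
hypothetical.

```lua
-- Hypothetical model, not the vshard.storage.ref implementation.
-- Refs live in an array sorted by deadline, a stand-in for the min-heap.
local function new_session()
    local refs = {}

    -- Drop at most max_steps expired refs from the front of the array.
    local function gc_step(now, max_steps)
        local removed = 0
        while removed < max_steps and refs[1] and refs[1].deadline < now do
            table.remove(refs, 1)
            removed = removed + 1
        end
        return removed
    end

    local function add(id, deadline, now)
        -- "Remove more than add": up to 2 collections per 1 insertion.
        gc_step(now, 2)
        local pos = #refs + 1
        for i = 1, #refs do
            if deadline < refs[i].deadline then
                pos = i
                break
            end
        end
        table.insert(refs, pos, {id = id, deadline = deadline})
    end

    local function count()
        return #refs
    end

    return {add = add, count = count}
end

local s = new_session()
-- Three refs, all expiring at t = 1.
for id = 1, 3 do
    s.add(id, 1, 0)
end
-- At t = 10 all three are expired. Each add collects at most two of them.
s.add(4, 100, 10)
local after_first = s.count()   -- 3 - 2 + 1 = 2
s.add(5, 100, 10)
local after_second = s.count()  -- 2 - 1 + 1 = 2
print(after_first, after_second)
```

In this model the expired refs never pile up without bound as long as new refs
keep arriving, which is how I read the intent of the comment above.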
> +
> +    --
> +    -- GC expired refs until they end or the limit on the number of iterations
> +    -- is exhausted. The limit is supposed to prevent too long GC which would
> +    -- occupy TX thread unfairly.
> +    --
> +    -- Returns false if nothing to GC, or number of iterations left from the
> +    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
> +    -- until it returns false.
> +    -- The function itself does not yield, because it is used from a more
> +    -- generic function GCing all sessions. It would not ever yield if all
> +    -- sessions would have less than limit refs, even if total ref count would
> +    -- be much bigger.
> +    --
> +    -- Besides, the session might be killed during general GC. There must not be
> +    -- any yields in session methods so as not to introduce a support of dead
> +    -- sessions.
> +    --
> +    local function ref_session_gc(self, limit, now)
> +        if self.deadline >= now then
> +            return false
> +        end
> +        local top = ref_heap:top()
> +        local del = 1
> +        local rest = 0
> +        local deadline
> +        repeat
> +            ref_heap:remove_top()
> +            ref_map[top.id] = nil
> +            top = ref_heap:top()
> +            if not top then
> +                self.deadline = DEADLINE_INFINITY
> +                rest = limit - del
> +                break
> +            end
> +            deadline = top.deadline
> +            if deadline >= now then
> +                self.deadline = deadline
> +                rest = limit - del
> +                break
> +            end
> +            del = del + 1
> +        until del >= limit
> +        ref_session_discount(self, del)
> +        global_heap:update(self)
> +        return rest
> +    end
> +
> +    local function ref_session_add(self, rid, deadline, now)
> +        if ref_map[rid] then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_ADD,
> +                                      'duplicate ref')
> +        end
> +        local ref = {
> +            deadline = deadline,
> +            id = rid,
> +            -- Used by the heap.
> +            index = -1,
> +        }
> +        ref_session_gc_step(self, now)
> +        ref_map[rid] = ref
> +        ref_heap:push(ref)
> +        if deadline < self.deadline then
> +            self.deadline = deadline
> +            global_heap:update(self)
> +        end
> +        count = count + 1
> +        M.count = M.count + 1
> +        return true
> +    end
> +
> +    --
> +    -- Ref use means it can't be expired until deleted explicitly. Should be
> +    -- done when the request affecting the whole storage starts. After use it
> +    -- is important to call del - GC won't delete it automatically anymore,
> +    -- unless the entire session is killed.
> +    --
> +    local function ref_session_use(self, rid)
> +        local ref = ref_map[rid]
> +        if not ref then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no ref')
> +        end
> +        ref_heap:remove(ref)
> +        ref_session_update_deadline(self)
> +        return true
> +    end
> +
> +    local function ref_session_del(self, rid)
> +        local ref = ref_map[rid]
> +        if not ref then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no ref')
> +        end
> +        ref_heap:remove_try(ref)
> +        ref_map[rid] = nil
> +        ref_session_update_deadline(self)
> +        ref_session_discount(self, 1)
> +        return true
> +    end
> +
> +    local function ref_session_kill(self)
> +        global_map[sid] = nil
> +        global_heap:remove(self)
> +        ref_session_discount(self, count)
> +    end
> +
> +    -- Don't use __index. It is useless since all sessions use closures as
> +    -- methods. Also it is probably slower, because each method call would
> +    -- need to get the metatable, get __index, and find the method there. Now
> +    -- it is only an index operation on the session object.
> +    local session = {
> +        deadline = DEADLINE_INFINITY,
> +        -- Used by the heap.
> +        index = -1,
> +        -- Methods.
> +        del = ref_session_del,
> +        gc = ref_session_gc,
> +        add = ref_session_add,
> +        use = ref_session_use,
> +        kill = ref_session_kill,
> +    }
> +    global_map[sid] = session
> +    global_heap:push(session)
> +    return session
> +end
> +
> +local function ref_gc()
> +    local session_heap = M.session_heap
> +    local session = session_heap:top()
> +    if not session then
> +        return
> +    end
> +    local limit = LUA_CHUNK_SIZE
> +    local now = fiber_clock()
> +    repeat
> +        limit = session:gc(limit, now)
> +        if not limit then
> +            return
> +        end
> +        if limit == 0 then
> +            fiber_yield()
> +            limit = LUA_CHUNK_SIZE
> +            now = fiber_clock()
> +        end
> +        session = session_heap:top()
> +    until not session
> +end
> +
> +local function ref_add(rid, sid, timeout)
> +    local now = fiber_clock()
> +    local deadline = now + timeout
> +    local ok, err, session
> +    local storage = lregistry.storage
> +    while not storage.bucket_are_all_rw() do
> +        ok, err = storage.bucket_generation_wait(timeout)
> +        if not ok then
> +            return nil, err
> +        end
> +        now = fiber_clock()
> +        timeout = deadline - now
> +    end
> +    session = M.session_map[sid]
> +    if not session then
> +        session = ref_session_new(sid)
> +    end
> +    return session:add(rid, deadline, now)
> +end
> +
> +local function ref_use(rid, sid)
> +    local session = M.session_map[sid]
> +    if not session then
> +        return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no session')
> +    end
> +    return session:use(rid)
> +end
> +
> +local function ref_del(rid, sid)
> +    local session = M.session_map[sid]
> +    if not session then
> +        return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no session')
> +    end
> +    return session:del(rid)
> +end
> +
> +local function ref_kill_session(sid)
> +    local session = M.session_map[sid]
> +    if session then
> +        session:kill()
> +    end
> +end
> +
> +local function ref_on_session_disconnect()
> +    ref_kill_session(box.session.id())
> +end
> +
> +local function ref_cfg()
> +    if M.on_disconnect then
> +        pcall(box.session.on_disconnect, nil, M.on_disconnect)
> +    end
> +    box.session.on_disconnect(ref_on_session_disconnect)
> +    M.on_disconnect = ref_on_session_disconnect
> +end
> +
> +M.del = ref_del
> +M.gc = ref_gc
> +M.add = ref_add
> +M.use = ref_use
> +M.cfg = ref_cfg
> +M.kill = ref_kill_session
> +lregistry.storage_ref = M
> +
> +return M

* Re: [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-02-24 21:50     ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-04 21:02   ` Oleg Babin via Tarantool-patches
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your patch. This is a brief review - I hope to look at this
patch once more.

Consider 2 comments below.


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> 'vshard.storage.sched' module ensures that two incompatible
> operations share storage time fairly - storage refs and bucket
> moves.
>
> Storage refs are going to be used by map-reduce API to preserve
> data consistency while map requests are in progress on all
> storages.
>
> It means storage refs will be used as commonly as bucket refs,
> and should not block the rebalancer. However it is hard not to
> block the rebalancer forever if there are always refs on the
> storage.
>
> With bucket refs it was easy - temporarily blocking one bucket is
> not a big deal. So the rebalancer always has a higher priority than
> bucket refs, and it still does not block requests for the other
> buckets or read-only requests on the subject bucket.
>
> With storage refs, giving the rebalancer a higher priority would
> make map-reduce requests die in the entire cluster for the whole
> time of rebalancing, which can be as long as hours or even days.
> That would not be acceptable.
>
> The new module vshard.storage.sched shares time between moves and
> storage refs fairly. They both get time to execute, in proportions
> configured by the user. The proportions depend on how big a bucket
> is and how long the map-reduce requests are expected to be.
> Typically, the longer a request is, the less quota it should be
> given.
>
> The patch introduces new storage options to configure the
> scheduling.
>
> Part of #147
>
> @TarantoolBot document
> Title: vshard.storage.cfg new options - sched_ref_quota and sched_move_quota
>
> There are new options for `vshard.storage.cfg`: `sched_ref_quota`
> and `sched_move_quota`. The options control how much time should
> be given to storage refs and bucket moves - two incompatible but
> important operations.
>
> Storage refs are used by router's map-reduce API. Each map-reduce
> call creates storage refs on all storages to prevent data
> migration on them for the map execution.
>
> Bucket moves are used by the rebalancer. Obviously, they are
> incompatible with the storage refs.
>
> If vshard always preferred one operation to the other, it would
> lead to starvation of one of them. For example, if storage refs
> were preferred, rebalancing might never work, because under
> constant map-reduce load there are always refs. If bucket moves
> were preferred, storage refs (and therefore map-reduce) would stop
> for the entire rebalancing time, which can be quite long (hours,
> days).
>
> The new options control how much time is given to each of the two
> operations.
>
> `sched_ref_quota` tells how many storage refs (therefore
> map-reduce requests) can be executed on the storage in a row if
> there are pending bucket moves, before they are blocked to let the
> moves work. Default value is 300.
>
> `sched_move_quota` controls the same, but vice-versa: how many
> bucket moves can be done in a row if there are pending refs.
> Default value is 1.
>
> Map-reduce requests are expected to be much shorter than bucket
> moves, so storage refs by default have a higher quota.
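
For illustration, a storage config using the two options could look
like the fragment below. The option names and default values are taken
from the description above. The variable name `cfg` and the omitted
topology part are assumptions, so treat it as a sketch rather than a
complete config.

```lua
-- Sketch only: a real config also needs the usual topology part
-- ('sharding' and so on), which is omitted here.
local cfg = {
    -- Up to 300 storage refs in a row while bucket moves are pending.
    sched_ref_quota = 300,
    -- Then up to 1 bucket move while storage refs are pending.
    sched_move_quota = 1,
}
-- On a storage instance (instance_uuid is a placeholder here):
-- vshard.storage.cfg(cfg, instance_uuid)
```

Storages with longer map-reduce requests would probably want a smaller
`sched_ref_quota`, following the reasoning in the commit message.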
>
> Here is an example of how it works. Assume map-reduce requests
> start. They execute one after another, 150 requests in a row. Now
> the rebalancer wakes up and wants to move some buckets. It joins a
> queue and waits for the storage refs to be gone.
>
> But the ref quota is not reached yet, so the storage can still
> execute 150 more map-reduce requests even with bucket moves queued.
> After that new refs are blocked, and the moves start.
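
The example above can be replayed with a toy model of the quota logic.
This is not the code of vshard/storage/sched.lua: there are no fibers
and no timeouts, and every name in it (`sched_new`, `can_start_ref`,
`move_waiting`, ...) is hypothetical. It also assumes that starting a
ref resets the move strike, symmetric to what the unit test shows for
moves resetting the ref strike.

```lua
-- Toy model of quota-based scheduling between refs and moves.
-- All names are hypothetical, this is not the vshard implementation.
local function sched_new(ref_quota, move_quota)
    return {
        ref_quota = ref_quota, move_quota = move_quota,
        ref_strike = 0, move_strike = 0,   -- Started in a row.
        ref_count = 0, move_count = 0,     -- In progress right now.
        ref_waiting = 0, move_waiting = 0, -- Queued requests.
    }
end

-- A ref may start if no move is running, and either nobody waits for a
-- move or the ref quota is not used up yet.
local function can_start_ref(s)
    if s.move_count > 0 then
        return false
    end
    return s.move_waiting == 0 or s.ref_strike < s.ref_quota
end

-- The same for moves, vice versa.
local function can_start_move(s)
    if s.ref_count > 0 then
        return false
    end
    return s.ref_waiting == 0 or s.move_strike < s.move_quota
end

local function ref_start(s)
    if not can_start_ref(s) then
        return false
    end
    s.ref_count = s.ref_count + 1
    s.ref_strike = s.ref_strike + 1
    s.move_strike = 0
    return true
end

local function ref_end(s)
    s.ref_count = s.ref_count - 1
end

local function move_start(s)
    if not can_start_move(s) then
        return false
    end
    s.move_count = s.move_count + 1
    s.move_strike = s.move_strike + 1
    s.ref_strike = 0
    return true
end

-- Replay the example with the default quotas.
local s = sched_new(300, 1)
for _ = 1, 150 do
    assert(ref_start(s))
    ref_end(s)
end
-- The rebalancer wakes up and queues a move.
s.move_waiting = 1
-- Refs keep going until the quota is reached.
local extra = 0
while ref_start(s) do
    ref_end(s)
    extra = extra + 1
end
print(extra) -- 150
```

Once the loop stops, `can_start_move(s)` is true in this model: no refs
are in progress and new ones are blocked, so the queued move can run.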
> ---
>   test/reload_evolution/storage.result |   2 +-
>   test/storage/ref.result              |  19 +-
>   test/storage/ref.test.lua            |   9 +-
>   test/storage/scheduler.result        | 410 ++++++++++++++++++++
>   test/storage/scheduler.test.lua      | 178 +++++++++
>   test/unit-tap/ref.test.lua           |   7 +-
>   test/unit-tap/scheduler.test.lua     | 555 +++++++++++++++++++++++++++
>   test/unit/config.result              |  59 +++
>   test/unit/config.test.lua            |  23 ++
>   vshard/cfg.lua                       |   8 +
>   vshard/consts.lua                    |   5 +
>   vshard/storage/CMakeLists.txt        |   2 +-
>   vshard/storage/init.lua              |  54 ++-
>   vshard/storage/ref.lua               |  30 +-
>   vshard/storage/sched.lua             | 231 +++++++++++
>   15 files changed, 1567 insertions(+), 25 deletions(-)
>   create mode 100644 test/storage/scheduler.result
>   create mode 100644 test/storage/scheduler.test.lua
>   create mode 100755 test/unit-tap/scheduler.test.lua
>   create mode 100644 vshard/storage/sched.lua
>
> diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result
> index c4a0cdd..77010a2 100644
> --- a/test/reload_evolution/storage.result
> +++ b/test/reload_evolution/storage.result
> @@ -258,7 +258,7 @@ ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],
>   ...
>   assert(not ok and err.message)
>   ---
> -- Storage is referenced
> +- Timeout exceeded
>   ...
>   lref.del(0, 0)
>   ---
> diff --git a/test/storage/ref.result b/test/storage/ref.result
> index d5f4166..59f07f4 100644
> --- a/test/storage/ref.result
> +++ b/test/storage/ref.result
> @@ -84,18 +84,22 @@ big_timeout = 1000000
>   small_timeout = 0.001
>    | ---
>    | ...
> +
> +timeout = 0.01
> + | ---
> + | ...
>   lref.add(rid, sid, big_timeout)
>    | ---
>    | - true
>    | ...
>   -- Send fails.
>   ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>    | ---
>    | ...
>   assert(not ok and err.message)
>    | ---
> - | - Storage is referenced
> + | - Timeout exceeded
>    | ...
>   lref.use(rid, sid)
>    | ---
> @@ -103,12 +107,12 @@ lref.use(rid, sid)
>    | ...
>   -- Still fails - use only makes ref undead until it is deleted explicitly.
>   ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>    | ---
>    | ...
>   assert(not ok and err.message)
>    | ---
> - | - Storage is referenced
> + | - Timeout exceeded
>    | ...
>   
>   _ = test_run:switch('storage_2_a')
> @@ -118,13 +122,16 @@ _ = test_run:switch('storage_2_a')
>   big_timeout = 1000000
>    | ---
>    | ...
> +timeout = 0.01
> + | ---
> + | ...
>   ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>    | ---
>    | ...
>   assert(not ok and err.message)
>    | ---
> - | - Storage is referenced
> + | - Timeout exceeded
>    | ...
>   
>   --
> diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
> index b34a294..24303e2 100644
> --- a/test/storage/ref.test.lua
> +++ b/test/storage/ref.test.lua
> @@ -35,22 +35,25 @@ sid = 0
>   rid = 0
>   big_timeout = 1000000
>   small_timeout = 0.001
> +
> +timeout = 0.01
>   lref.add(rid, sid, big_timeout)
>   -- Send fails.
>   ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>   assert(not ok and err.message)
>   lref.use(rid, sid)
>   -- Still fails - use only makes ref undead until it is deleted explicitly.
>   ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>   assert(not ok and err.message)
>   
>   _ = test_run:switch('storage_2_a')
>   -- Receive (from another replicaset) also fails.
>   big_timeout = 1000000
> +timeout = 0.01
>   ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
> -                                     {timeout = big_timeout})
> +                                     {timeout = timeout})
>   assert(not ok and err.message)
>   
>   --
> diff --git a/test/storage/scheduler.result b/test/storage/scheduler.result
> new file mode 100644
> index 0000000..0f53e42
> --- /dev/null
> +++ b/test/storage/scheduler.result
> @@ -0,0 +1,410 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> + | ---
> + | ...
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> + | ---
> + | ...
> +
> +test_run:create_cluster(REPLICASET_1, 'storage')
> + | ---
> + | ...
> +test_run:create_cluster(REPLICASET_2, 'storage')
> + | ---
> + | ...
> +util = require('util')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> + | ---
> + | ...
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> + | ---
> + | ...
> +util.push_rs_filters(test_run)
> + | ---
> + | ...
> +
> +--
> +-- gh-147: scheduler helps to share time fairly between incompatible but
> +-- necessary operations - storage refs and bucket moves. Refs are used for the
> +-- consistent map-reduce feature when the whole cluster can be scanned without
> +-- being afraid that some data may slip through requests on behalf of the
> +-- rebalancer.
> +--
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +
> +vshard.storage.rebalancer_disable()
> + | ---
> + | ...
> +vshard.storage.bucket_force_create(1, 1500)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +vshard.storage.rebalancer_disable()
> + | ---
> + | ...
> +vshard.storage.bucket_force_create(1501, 1500)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +--
> +-- Bucket_send() uses the scheduler.
> +--
> +lsched = require('vshard.storage.sched')
> + | ---
> + | ...
> +assert(lsched.move_strike == 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.move_count == 0)
> + | ---
> + | - true
> + | ...
> +big_timeout = 1000000
> + | ---
> + | ...
> +big_timeout_opts = {timeout = big_timeout}
> + | ---
> + | ...
> +vshard.storage.bucket_send(1, util.replicasets[2], big_timeout_opts)
> + | ---
> + | - true
> + | ...
> +assert(lsched.move_strike == 1)
> + | ---
> + | - true
> + | ...
> +assert(lsched.move_count == 0)
> + | ---
> + | - true
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +lsched = require('vshard.storage.sched')
> + | ---
> + | ...
> +--
> +-- Bucket_recv() uses the scheduler.
> +--
> +assert(lsched.move_strike == 1)
> + | ---
> + | - true
> + | ...
> +assert(lsched.move_count == 0)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- When move is in progress, it is properly accounted.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +big_timeout = 1000000
> + | ---
> + | ...
> +big_timeout_opts = {timeout = big_timeout}
> + | ---
> + | ...
> +ok, err = nil
> + | ---
> + | ...
> +assert(lsched.move_strike == 1)
> + | ---
> + | - true
> + | ...
> +_ = fiber.create(function()                                                     \
> +    ok, err = vshard.storage.bucket_send(1, util.replicasets[1],                \
> +                                         big_timeout_opts)                      \
> +end)
> + | ---
> + | ...
> +-- Strike increase does not mean the move finished. It means it was successfully
> +-- scheduled.
> +assert(lsched.move_strike == 2)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lsched.move_strike == 2 end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ref is not allowed during move.
> +--
> +small_timeout = 0.000001
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +ok, err = lref.add(0, 0, small_timeout)
> + | ---
> + | ...
> +assert(not ok)
> + | ---
> + | - true
> + | ...
> +err.message
> + | ---
> + | - Timeout exceeded
> + | ...
> +-- Put it to wait until move is done.
> +ok, err = nil
> + | ---
> + | ...
> +_ = fiber.create(function() ok, err = lref.add(0, 0, big_timeout) end)
> + | ---
> + | ...
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return ok or err end)
> + | ---
> + | - true
> + | ...
> +ok, err
> + | ---
> + | - true
> + | - null
> + | ...
> +assert(lsched.move_count == 0)
> + | ---
> + | - true
> + | ...
> +wait_bucket_is_collected(1)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return ok or err end)
> + | ---
> + | - true
> + | ...
> +ok, err
> + | ---
> + | - true
> + | - null
> + | ...
> +assert(lsched.move_count == 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.ref_count == 1)
> + | ---
> + | - true
> + | ...
> +lref.del(0, 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.ref_count == 0)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Refs can't block sends infinitely. The scheduler must be fair and share time
> +-- between ref/move.
> +--
> +do_refs = true
> + | ---
> + | ...
> +ref_worker_count = 10
> + | ---
> + | ...
> +function ref_worker()                                                           \
> +    while do_refs do                                                            \
> +        lref.add(0, 0, big_timeout)                                             \
> +        fiber.sleep(small_timeout)                                              \
> +        lref.del(0, 0)                                                          \
> +    end                                                                         \
> +    ref_worker_count = ref_worker_count - 1                                     \
> +end
> + | ---
> + | ...
> +-- Simulate many fibers doing something with a ref being kept.
> +for i = 1, ref_worker_count do fiber.create(ref_worker) end
> + | ---
> + | ...
> +assert(lref.count > 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.ref_count > 0)
> + | ---
> + | - true
> + | ...
> +-- Ensure it passes with default opts (when move is in great unfairness). It is
> +-- important. Because moves are expected to be much longer than refs, and must
> +-- not happen too often with ref load in progress. But still should eventually
> +-- be processed.
> +bucket_count = 100
> + | ---
> + | ...
> +bucket_id = 1
> + | ---
> + | ...
> +bucket_worker_count = 5
> + | ---
> + | ...
> +function bucket_worker()                                                        \
> +    while bucket_id <= bucket_count do                                          \
> +        local id = bucket_id                                                    \
> +        bucket_id = bucket_id + 1                                               \
> +        assert(vshard.storage.bucket_send(id, util.replicasets[2]))             \
> +    end                                                                         \
> +    bucket_worker_count = bucket_worker_count - 1                               \
> +end
> + | ---
> + | ...
> +-- Simulate many rebalancer fibers like when max_sending is increased.
> +for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
> + | ---
> + | ...
> +test_run:wait_cond(function() return bucket_worker_count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +do_refs = false
> + | ---
> + | ...
> +test_run:wait_cond(function() return ref_worker_count == 0 end)
> + | ---
> + | - true
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.ref_count == 0)
> + | ---
> + | - true
> + | ...
> +
> +for i = 1, bucket_count do wait_bucket_is_collected(i) end
> + | ---
> + | ...
> +
> +--
> +-- Refs can't block recvs infinitely.
> +--
> +do_refs = true
> + | ---
> + | ...
> +for i = 1, ref_worker_count do fiber.create(ref_worker) end
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +bucket_count = 100
> + | ---
> + | ...
> +bucket_id = 1
> + | ---
> + | ...
> +bucket_worker_count = 5
> + | ---
> + | ...
> +function bucket_worker()                                                        \
> +    while bucket_id <= bucket_count do                                          \
> +        local id = bucket_id                                                    \
> +        bucket_id = bucket_id + 1                                               \
> +        assert(vshard.storage.bucket_send(id, util.replicasets[1]))             \
> +    end                                                                         \
> +    bucket_worker_count = bucket_worker_count - 1                               \
> +end
> + | ---
> + | ...
> +for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
> + | ---
> + | ...
> +test_run:wait_cond(function() return bucket_worker_count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +do_refs = false
> + | ---
> + | ...
> +test_run:wait_cond(function() return ref_worker_count == 0 end)
> + | ---
> + | - true
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +assert(lsched.ref_count == 0)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +for i = 1, bucket_count do wait_bucket_is_collected(i) end
> + | ---
> + | ...
> +
> +_ = test_run:switch("default")
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_2)
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_1)
> + | ---
> + | ...
> +_ = test_run:cmd('clear filter')
> + | ---
> + | ...
> diff --git a/test/storage/scheduler.test.lua b/test/storage/scheduler.test.lua
> new file mode 100644
> index 0000000..8628f0e
> --- /dev/null
> +++ b/test/storage/scheduler.test.lua
> @@ -0,0 +1,178 @@
> +test_run = require('test_run').new()
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> +
> +test_run:create_cluster(REPLICASET_1, 'storage')
> +test_run:create_cluster(REPLICASET_2, 'storage')
> +util = require('util')
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> +util.push_rs_filters(test_run)
> +
> +--
> +-- gh-147: scheduler helps to share time fairly between incompatible but
> +-- necessary operations - storage refs and bucket moves. Refs are used for the
> +-- consistent map-reduce feature when the whole cluster can be scanned without
> +-- being afraid that some data may slip through requests on behalf of the
> +-- rebalancer.
> +--
> +
> +_ = test_run:switch('storage_1_a')
> +
> +vshard.storage.rebalancer_disable()
> +vshard.storage.bucket_force_create(1, 1500)
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.rebalancer_disable()
> +vshard.storage.bucket_force_create(1501, 1500)
> +
> +_ = test_run:switch('storage_1_a')
> +--
> +-- Bucket_send() uses the scheduler.
> +--
> +lsched = require('vshard.storage.sched')
> +assert(lsched.move_strike == 0)
> +assert(lsched.move_count == 0)
> +big_timeout = 1000000
> +big_timeout_opts = {timeout = big_timeout}
> +vshard.storage.bucket_send(1, util.replicasets[2], big_timeout_opts)
> +assert(lsched.move_strike == 1)
> +assert(lsched.move_count == 0)
> +wait_bucket_is_collected(1)
> +
> +_ = test_run:switch('storage_2_a')
> +lsched = require('vshard.storage.sched')
> +--
> +-- Bucket_recv() uses the scheduler.
> +--
> +assert(lsched.move_strike == 1)
> +assert(lsched.move_count == 0)
> +
> +--
> +-- When move is in progress, it is properly accounted.
> +--
> +_ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> +
> +_ = test_run:switch('storage_2_a')
> +big_timeout = 1000000
> +big_timeout_opts = {timeout = big_timeout}
> +ok, err = nil
> +assert(lsched.move_strike == 1)
> +_ = fiber.create(function()                                                     \
> +    ok, err = vshard.storage.bucket_send(1, util.replicasets[1],                \
> +                                         big_timeout_opts)                      \
> +end)
> +-- Strike increase does not mean the move finished. It means it was successfully
> +-- scheduled.
> +assert(lsched.move_strike == 2)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lsched.move_strike == 2 end)
> +
> +--
> +-- Ref is not allowed during move.
> +--
> +small_timeout = 0.000001
> +lref = require('vshard.storage.ref')
> +ok, err = lref.add(0, 0, small_timeout)
> +assert(not ok)
> +err.message
> +-- Put it to wait until move is done.
> +ok, err = nil
> +_ = fiber.create(function() ok, err = lref.add(0, 0, big_timeout) end)
> +vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
> +
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return ok or err end)
> +ok, err
> +assert(lsched.move_count == 0)
> +wait_bucket_is_collected(1)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return ok or err end)
> +ok, err
> +assert(lsched.move_count == 0)
> +assert(lsched.ref_count == 1)
> +lref.del(0, 0)
> +assert(lsched.ref_count == 0)
> +
> +--
> +-- Refs can't block sends infinitely. The scheduler must be fair and share time
> +-- between ref/move.
> +--
> +do_refs = true
> +ref_worker_count = 10
> +function ref_worker()                                                           \
> +    while do_refs do                                                            \
> +        lref.add(0, 0, big_timeout)                                             \
> +        fiber.sleep(small_timeout)                                              \
> +        lref.del(0, 0)                                                          \
> +    end                                                                         \
> +    ref_worker_count = ref_worker_count - 1                                     \
> +end
> +-- Simulate many fibers doing something with a ref being kept.
> +for i = 1, ref_worker_count do fiber.create(ref_worker) end
> +assert(lref.count > 0)
> +assert(lsched.ref_count > 0)
> +-- Ensure it passes with default opts (when move is in great unfairness). It is
> +-- important. Because moves are expected to be much longer than refs, and must
> +-- not happen too often with ref load in progress. But still should eventually
> +-- be processed.
> +bucket_count = 100
> +bucket_id = 1
> +bucket_worker_count = 5
> +function bucket_worker()                                                        \
> +    while bucket_id <= bucket_count do                                          \
> +        local id = bucket_id                                                    \
> +        bucket_id = bucket_id + 1                                               \
> +        assert(vshard.storage.bucket_send(id, util.replicasets[2]))             \
> +    end                                                                         \
> +    bucket_worker_count = bucket_worker_count - 1                               \
> +end
> +-- Simulate many rebalancer fibers like when max_sending is increased.
> +for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
> +test_run:wait_cond(function() return bucket_worker_count == 0 end)
> +
> +do_refs = false
> +test_run:wait_cond(function() return ref_worker_count == 0 end)
> +assert(lref.count == 0)
> +assert(lsched.ref_count == 0)
> +
> +for i = 1, bucket_count do wait_bucket_is_collected(i) end
> +
> +--
> +-- Refs can't block recvs infinitely.
> +--
> +do_refs = true
> +for i = 1, ref_worker_count do fiber.create(ref_worker) end
> +
> +_ = test_run:switch('storage_2_a')
> +bucket_count = 100
> +bucket_id = 1
> +bucket_worker_count = 5
> +function bucket_worker()                                                        \
> +    while bucket_id <= bucket_count do                                          \
> +        local id = bucket_id                                                    \
> +        bucket_id = bucket_id + 1                                               \
> +        assert(vshard.storage.bucket_send(id, util.replicasets[1]))             \
> +    end                                                                         \
> +    bucket_worker_count = bucket_worker_count - 1                               \
> +end
> +for i = 1, bucket_worker_count do fiber.create(bucket_worker) end
> +test_run:wait_cond(function() return bucket_worker_count == 0 end)
> +
> +_ = test_run:switch('storage_1_a')
> +do_refs = false
> +test_run:wait_cond(function() return ref_worker_count == 0 end)
> +assert(lref.count == 0)
> +assert(lsched.ref_count == 0)
> +
> +_ = test_run:switch('storage_2_a')
> +for i = 1, bucket_count do wait_bucket_is_collected(i) end
> +
> +_ = test_run:switch("default")
> +test_run:drop_cluster(REPLICASET_2)
> +test_run:drop_cluster(REPLICASET_1)
> +_ = test_run:cmd('clear filter')
> diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
> index d987a63..ba95eee 100755
> --- a/test/unit-tap/ref.test.lua
> +++ b/test/unit-tap/ref.test.lua
> @@ -5,6 +5,7 @@ local test = tap.test('cfg')
>   local fiber = require('fiber')
>   local lregistry = require('vshard.registry')
>   local lref = require('vshard.storage.ref')
> +require('vshard.storage.sched')
>   
>   local big_timeout = 1000000
>   local small_timeout = 0.000001
> @@ -19,9 +20,11 @@ local sid3 = 2
>   --
>   
>   --
> --- Refs used storage API to get bucket space state and wait on its changes. But
> --- not important for these unit tests.
> +-- Refs use storage API to get bucket space state and wait on its changes. And
> +-- scheduler API to sync with bucket moves. But not important for these unit
> +-- tests.
>   --
> +
>   local function bucket_are_all_rw()
>       return true
>   end
> diff --git a/test/unit-tap/scheduler.test.lua b/test/unit-tap/scheduler.test.lua
> new file mode 100755
> index 0000000..0af4f5e
> --- /dev/null
> +++ b/test/unit-tap/scheduler.test.lua
> @@ -0,0 +1,555 @@
> +#!/usr/bin/env tarantool
> +
> +local fiber = require('fiber')
> +local tap = require('tap')
> +local test = tap.test('cfg')
> +local lregistry = require('vshard.registry')
> +local lref = require('vshard.storage.ref')
> +local lsched = require('vshard.storage.sched')
> +
> +local big_timeout = 1000000
> +local small_timeout = 0.000001
> +
> +--
> +-- gh-147: scheduler helps to share time fairly between incompatible but
> +-- necessary operations - storage refs and bucket moves. Refs are used for the
> +-- consistent map-reduce feature when the whole cluster can be scanned without
> +-- being afraid that some data may slip through requests on behalf of the
> +-- rebalancer.
> +--
> +
> +box.cfg{
> +    log = 'log.txt'
> +}
> +-- io.write = function(...) require('log').info(...) end
> +
> +--
> +-- Storage registry is used by the ref module. The ref module is used in the
> +-- tests in order to ensure the scheduler performs ref garbage collection.
> +--
> +local function bucket_are_all_rw()
> +    return true
> +end
> +
> +lregistry.storage = {
> +    bucket_are_all_rw = bucket_are_all_rw,
> +}
> +
> +local function fiber_csw()
> +    return fiber.info()[fiber.self():id()].csw
> +end
> +
> +local function fiber_set_joinable()
> +    fiber.self():set_joinable(true)
> +end
> +
> +local function test_basic(test)
> +    test:plan(32)
> +
> +    local ref_strike = lsched.ref_strike
> +    --
> +    -- Simplest possible test - start and end a ref.
> +    --
> +    test:is(lsched.ref_start(big_timeout), big_timeout, 'start ref')
> +    test:is(lsched.ref_count, 1, '1 ref')
> +    test:is(lsched.ref_strike, ref_strike + 1, '+1 ref in a row')
> +    lsched.ref_end(1)
> +    test:is(lsched.ref_count, 0, '0 refs after end')
> +    test:is(lsched.ref_strike, ref_strike + 1, 'strike is kept')
> +
> +    lsched.ref_start(big_timeout)
> +    lsched.ref_end(1)
> +    test:is(lsched.ref_strike, ref_strike + 2, 'strike grows')
> +    test:is(lsched.ref_count, 0, 'count does not')
> +
> +    --
> +    -- Move ends ref strike.
> +    --
> +    test:is(lsched.move_start(big_timeout), big_timeout, 'start move')
> +    test:is(lsched.move_count, 1, '1 move')
> +    test:is(lsched.move_strike, 1, '+1 move strike')
> +    test:is(lsched.ref_strike, 0, 'ref strike is interrupted')
> +
> +    --
> +    -- Ref times out if there is a move in progress.
> +    --
> +    local ok, err = lsched.ref_start(small_timeout)
> +    test:ok(not ok and err, 'ref fails')
> +    test:is(lsched.move_count, 1, 'still 1 move')
> +    test:is(lsched.move_strike, 1, 'still 1 move strike')
> +    test:is(lsched.ref_count, 0, 'could not add ref')
> +    test:is(lsched.ref_queue, 0, 'empty ref queue')
> +
> +    --
> +    -- Ref succeeds when move ends.
> +    --
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    fiber.sleep(small_timeout)
> +    lsched.move_end(1)
> +    local new_timeout
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'correct timeout')
> +    test:is(lsched.move_count, 0, 'no moves')
> +    test:is(lsched.move_strike, 0, 'move strike ends')
> +    test:is(lsched.ref_count, 1, '+1 ref')
> +    test:is(lsched.ref_strike, 1, '+1 ref strike')
> +
> +    --
> +    -- Move succeeds when ref ends.
> +    --
> +    f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    fiber.sleep(small_timeout)
> +    lsched.ref_end(1)
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'correct timeout')
> +    test:is(lsched.ref_count, 0, 'no refs')
> +    test:is(lsched.ref_strike, 0, 'ref strike ends')
> +    test:is(lsched.move_count, 1, '+1 move')
> +    test:is(lsched.move_strike, 1, '+1 move strike')
> +    lsched.move_end(1)
> +
> +    --
> +    -- Move times out when there is a ref.
> +    --
> +    test:is(lsched.ref_start(big_timeout), big_timeout, '+ ref')
> +    ok, err = lsched.move_start(small_timeout)
> +    test:ok(not ok and err, 'move fails')
> +    test:is(lsched.ref_count, 1, 'still 1 ref')
> +    test:is(lsched.ref_strike, 1, 'still 1 ref strike')
> +    test:is(lsched.move_count, 0, 'could not add move')
> +    test:is(lsched.move_queue, 0, 'empty move queue')
> +    lsched.ref_end(1)
> +end
> +
> +local function test_negative_timeout(test)
> +    test:plan(12)
> +
> +    --
> +    -- Move works even with negative timeout if no refs.
> +    --
> +    test:is(lsched.move_start(-1), -1, 'timeout does not matter if no refs')
> +    test:is(lsched.move_count, 1, '+1 move')
> +
> +    --
> +    -- Ref fails immediately if timeout negative and has moves.
> +    --
> +    local csw = fiber_csw()
> +    local ok, err = lsched.ref_start(-1)
> +    test:ok(not ok and err, 'ref fails')
> +    test:is(csw, fiber_csw(), 'no yields')
> +    test:is(lsched.ref_count, 0, 'no refs')
> +    test:is(lsched.ref_queue, 0, 'no ref queue')
> +
> +    --
> +    -- Ref works even with negative timeout if no moves.
> +    --
> +    lsched.move_end(1)
> +    test:is(lsched.ref_start(-1), -1, 'timeout does not matter if no moves')
> +    test:is(lsched.ref_count, 1, '+1 ref')
> +
> +    --
> +    -- Move fails immediately if timeout is negative and has refs.
> +    --
> +    csw = fiber_csw()
> +    ok, err = lsched.move_start(-1)
> +    test:ok(not ok and err, 'move fails')
> +    test:is(csw, fiber_csw(), 'no yields')
> +    test:is(lsched.move_count, 0, 'no moves')
> +    test:is(lsched.move_queue, 0, 'no move queue')
> +    lsched.ref_end(1)
> +end
> +
> +local function test_move_gc_ref(test)
> +    test:plan(10)
> +
> +    --
> +    -- Move deletes expired refs if it may help to start the move.
> +    --
> +    for sid = 1, 10 do
> +        for rid = 1, 5 do
> +            lref.add(rid, sid, small_timeout)
> +        end
> +    end
> +    test:is(lsched.ref_count, 50, 'refs are in progress')
> +    local ok, err = lsched.move_start(-1)
> +    test:ok(not ok and err, 'move without timeout failed')
> +
> +    fiber.sleep(small_timeout)
> +    test:is(lsched.move_start(-1), -1, 'succeeds even with negative timeout')
> +    test:is(lsched.ref_count, 0, 'all refs are expired and deleted')
> +    test:is(lref.count, 0, 'ref module knows about it')
> +    test:is(lsched.move_count, 1, 'move is started')
> +    lsched.move_end(1)
> +
> +    --
> +    -- May need more than 1 GC step.
> +    --
> +    for rid = 1, 5 do
> +        lref.add(0, rid, small_timeout)
> +    end
> +    for rid = 1, 5 do
> +        lref.add(1, rid, small_timeout * 100)
> +    end
> +    local new_timeout = lsched.move_start(big_timeout)
> +    test:ok(new_timeout < big_timeout, 'succeeds by doing 2 gc steps')
> +    test:is(lsched.ref_count, 0, 'all refs are expired and deleted')
> +    test:is(lref.count, 0, 'ref module knows about it')
> +    test:is(lsched.move_count, 1, 'move is started')
> +    lsched.move_end(1)
> +end
> +
> +local function test_ref_strike(test)
> +    test:plan(10)
> +
> +    local quota = lsched.ref_quota
> +    --
> +    -- Strike should stop new refs if they exceed the quota and there is a
> +    -- pending move.
> +    --
> +    -- End ref strike if there was one.
> +    lsched.move_start(small_timeout)
> +    lsched.move_end(1)
> +    -- Ref strike starts.
> +    assert(lsched.ref_start(small_timeout))
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    test:is(lsched.move_queue, 1, 'move is queued')
> +    --
> +    -- New refs should work only until quota is reached, because there is a
> +    -- pending move.
> +    --
> +    for i = 1, quota - 1 do
> +        assert(lsched.ref_start(small_timeout))
> +    end
> +    local ok, err = lsched.ref_start(small_timeout)
> +    test:ok(not ok and err, 'too long strike with move queue not empty')
> +    test:is(lsched.ref_strike, quota, 'max strike is reached')
> +    -- Even if number of current refs decreases, new still are not accepted.
> +    -- Because there were too many in a row while a new move was waiting.
> +    lsched.ref_end(1)
> +    ok, err = lsched.ref_start(small_timeout)
> +    test:ok(not ok and err, 'still too long strike after one unref')
> +    test:is(lsched.ref_strike, quota, 'strike is unchanged')
> +
> +    lsched.ref_end(quota - 1)
> +    local new_timeout
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'move succeeded')
> +    test:is(lsched.move_count, 1, '+1 move')
> +    test:is(lsched.move_strike, 1, '+1 move strike')
> +    test:is(lsched.ref_count, 0, 'no refs')
> +    test:is(lsched.ref_strike, 0, 'no ref strike')
> +    lsched.move_end(1)
> +end
> +
> +local function test_move_strike(test)
> +    test:plan(10)
> +
> +    local quota = lsched.move_quota
> +    --
> +    -- Strike should stop new moves if they exceed the quota and there is a
> +    -- pending ref.
> +    --
> +    -- End move strike if there was one.
> +    lsched.ref_start(small_timeout)
> +    lsched.ref_end(1)
> +    -- Move strike starts.
> +    assert(lsched.move_start(small_timeout))
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    test:is(lsched.ref_queue, 1, 'ref is queued')
> +    --
> +    -- New moves should work only until quota is reached, because there is a
> +    -- pending ref.
> +    --
> +    for i = 1, quota - 1 do
> +        assert(lsched.move_start(small_timeout))
> +    end
> +    local ok, err = lsched.move_start(small_timeout)
> +    test:ok(not ok and err, 'too long strike with ref queue not empty')
> +    test:is(lsched.move_strike, quota, 'max strike is reached')
> +    -- Even if number of current moves decreases, new still are not accepted.
> +    -- Because there were too many in a row while a new ref was waiting.
> +    lsched.move_end(1)
> +    ok, err = lsched.move_start(small_timeout)
> +    test:ok(not ok and err, 'still too long strike after one move end')
> +    test:is(lsched.move_strike, quota, 'strike is unchanged')
> +
> +    lsched.move_end(quota - 1)
> +    local new_timeout
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'ref succeeded')
> +    test:is(lsched.ref_count, 1, '+1 ref')
> +    test:is(lsched.ref_strike, 1, '+1 ref strike')
> +    test:is(lsched.move_count, 0, 'no moves')
> +    test:is(lsched.move_strike, 0, 'no move strike')
> +    lsched.ref_end(1)
> +end
> +
> +local function test_ref_increase_quota(test)
> +    test:plan(4)
> +
> +    local quota = lsched.ref_quota
> +    --
> +    -- Ref quota increase allows to do more refs even if there are pending
> +    -- moves.
> +    --
> +    -- End ref strike if there was one.
> +    lsched.move_start(big_timeout)
> +    lsched.move_end(1)
> +    -- Fill the quota.
> +    for _ = 1, quota do
> +        assert(lsched.ref_start(big_timeout))
> +    end
> +    -- Start move to block new refs by quota.
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    test:ok(not lsched.ref_start(small_timeout), 'can not add ref - full quota')
> +
> +    lsched.cfg({sched_ref_quota = quota + 1})
> +    test:ok(lsched.ref_start(small_timeout), 'now can add - quota is extended')
> +
> +    -- Decrease quota - should not accept new refs again.
> +    lsched.cfg{sched_ref_quota = quota}
> +    test:ok(not lsched.ref_start(small_timeout), 'full quota again')
> +
> +    lsched.ref_end(quota + 1)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'move started')
> +    lsched.move_end(1)
> +end
> +
> +local function test_move_increase_quota(test)
> +    test:plan(4)
> +
> +    local quota = lsched.move_quota
> +    --
> +    -- Move quota increase allows to do more moves even if there are pending
> +    -- refs.
> +    --
> +    -- End move strike if there was one.
> +    lsched.ref_start(big_timeout)
> +    lsched.ref_end(1)
> +    -- Fill the quota.
> +    for _ = 1, quota do
> +        assert(lsched.move_start(big_timeout))
> +    end
> +    -- Start ref to block new moves by quota.
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    test:ok(not lsched.move_start(small_timeout), 'can not add move - full quota')
> +
> +    lsched.cfg({sched_move_quota = quota + 1})
> +    test:ok(lsched.move_start(small_timeout), 'now can add - quota is extended')
> +
> +    -- Decrease quota - should not accept new moves again.
> +    lsched.cfg{sched_move_quota = quota}
> +    test:ok(not lsched.move_start(small_timeout), 'full quota again')
> +
> +    lsched.move_end(quota + 1)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout < big_timeout, 'ref started')
> +    lsched.ref_end(1)
> +end
> +
> +local function test_ref_decrease_quota(test)
> +    test:plan(4)
> +
> +    local old_quota = lsched.ref_quota
> +    --
> +    -- Quota decrease should not affect any existing operations or break
> +    -- anything.
> +    --
> +    lsched.cfg({sched_ref_quota = 10})
> +    for _ = 1, 5 do
> +        assert(lsched.ref_start(big_timeout))
> +    end
> +    test:is(lsched.ref_count, 5, 'started refs below quota')
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    test:ok(lsched.ref_start(big_timeout), 'another ref after move queued')
> +
> +    lsched.cfg({sched_ref_quota = 2})
> +    test:ok(not lsched.ref_start(small_timeout), 'quota decreased - can not '..
> +            'start ref')
> +
> +    lsched.ref_end(6)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'move is started')
> +    lsched.move_end(1)
> +
> +    lsched.cfg({sched_ref_quota = old_quota})
> +end
> +
> +local function test_move_decrease_quota(test)
> +    test:plan(4)
> +
> +    local old_quota = lsched.move_quota
> +    --
> +    -- Quota decrease should not affect any existing operations or break
> +    -- anything.
> +    --
> +    lsched.cfg({sched_move_quota = 10})
> +    for _ = 1, 5 do
> +        assert(lsched.move_start(big_timeout))
> +    end
> +    test:is(lsched.move_count, 5, 'started moves below quota')
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    test:ok(lsched.move_start(big_timeout), 'another move after ref queued')
> +
> +    lsched.cfg({sched_move_quota = 2})
> +    test:ok(not lsched.move_start(small_timeout), 'quota decreased - can not '..
> +            'start move')
> +
> +    lsched.move_end(6)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'ref is started')
> +    lsched.ref_end(1)
> +
> +    lsched.cfg({sched_move_quota = old_quota})
> +end
> +
> +local function test_ref_zero_quota(test)
> +    test:plan(6)
> +
> +    local old_quota = lsched.ref_quota
> +    --
> +    -- Zero quota is a valid value. Moreover, it is special. It means the
> +    -- 0-quoted operation should always be paused in favor of the other
> +    -- operation.
> +    --
> +    lsched.cfg({sched_ref_quota = 0})
> +    test:ok(lsched.ref_start(big_timeout), 'started ref with 0 quota')
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    test:ok(not lsched.ref_start(small_timeout), 'can not add more refs if '..
> +            'move is queued - quota 0')
> +
> +    lsched.ref_end(1)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'move is started')
> +
> +    -- Ensure ref never starts if there are always moves, when quota is 0.
> +    f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    local move_count = lsched.move_quota + 3
> +    -- Start from 2 to account the already existing move.
> +    for _ = 2, move_count do
> +        -- Start one new move.
> +        assert(lsched.move_start(big_timeout))
> +        -- Start second new move.
> +        assert(lsched.move_start(big_timeout))
> +        -- End first move.
> +        lsched.move_end(1)
> +        -- In result the moves are always interleaving - no time for refs at
> +        -- all.
> +    end
> +    test:is(lsched.move_count, move_count, 'moves exceed quota')
> +    test:ok(lsched.move_strike > move_count, 'strike is not interrupted')
> +
> +    lsched.move_end(move_count)
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'ref finally started')
> +    lsched.ref_end(1)
> +
> +    lsched.cfg({sched_ref_quota = old_quota})
> +end
> +
> +local function test_move_zero_quota(test)
> +    test:plan(6)
> +
> +    local old_quota = lsched.move_quota
> +    --
> +    -- Zero quota is a valid value. Moreover, it is special. It means the
> +    -- 0-quoted operation should always be paused in favor of the other
> +    -- operation.
> +    --
> +    lsched.cfg({sched_move_quota = 0})
> +    test:ok(lsched.move_start(big_timeout), 'started move with 0 quota')
> +
> +    local f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.ref_start(big_timeout)
> +    end)
> +    test:ok(not lsched.move_start(small_timeout), 'can not add more moves if '..
> +            'ref is queued - quota 0')
> +
> +    lsched.move_end(1)
> +    local ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'ref is started')
> +
> +    -- Ensure move never starts if there are always refs, when quota is 0.
> +    f = fiber.create(function()
> +        fiber_set_joinable()
> +        return lsched.move_start(big_timeout)
> +    end)
> +    local ref_count = lsched.ref_quota + 3
> +    -- Start from 2 to account the already existing ref.
> +    for _ = 2, ref_count do
> +        -- Start one new ref.
> +        assert(lsched.ref_start(big_timeout))
> +        -- Start second new ref.
> +        assert(lsched.ref_start(big_timeout))
> +        -- End first ref.
> +        lsched.ref_end(1)
> +        -- In result the refs are always interleaving - no time for moves at
> +        -- all.
> +    end
> +    test:is(lsched.ref_count, ref_count, 'refs exceed quota')
> +    test:ok(lsched.ref_strike > ref_count, 'strike is not interrupted')
> +
> +    lsched.ref_end(ref_count)
> +    ok, new_timeout = f:join()
> +    test:ok(ok and new_timeout, 'move finally started')
> +    lsched.move_end(1)
> +
> +    lsched.cfg({sched_move_quota = old_quota})
> +end
> +
> +test:plan(11)
> +
> +-- Change default values. Move is 1 by default, which would reduce the number of
> +-- possible tests. Ref is decreased to speed the tests up.
> +lsched.cfg({sched_ref_quota = 10, sched_move_quota = 5})
> +
> +test:test('basic', test_basic)
> +test:test('negative timeout', test_negative_timeout)
> +test:test('ref gc', test_move_gc_ref)
> +test:test('ref strike', test_ref_strike)
> +test:test('move strike', test_move_strike)
> +test:test('ref add quota', test_ref_increase_quota)
> +test:test('move add quota', test_move_increase_quota)
> +test:test('ref decrease quota', test_ref_decrease_quota)
> +test:test('move decrease quota', test_move_decrease_quota)
> +test:test('ref zero quota', test_ref_zero_quota)
> +test:test('move zero quota', test_move_zero_quota)
> +
> +os.exit(test:check() and 0 or 1)
> diff --git a/test/unit/config.result b/test/unit/config.result
> index e0b2482..9df3bf1 100644
> --- a/test/unit/config.result
> +++ b/test/unit/config.result
> @@ -597,3 +597,62 @@ cfg.collect_bucket_garbage_interval = 100
>   _ = lcfg.check(cfg)
>   ---
>   ...
> +--
> +-- gh-147: router map-reduce. It adds scheduler options on the storage.
> +--
> +cfg.sched_ref_quota = 100
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_ref_quota = 1
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_ref_quota = 0
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_ref_quota = -1
> +---
> +...
> +util.check_error(lcfg.check, cfg)
> +---
> +- Scheduler storage ref quota must be non-negative number
> +...
> +cfg.sched_ref_quota = nil
> +---
> +...
> +cfg.sched_move_quota = 100
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_move_quota = 1
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_move_quota = 0
> +---
> +...
> +_ = lcfg.check(cfg)
> +---
> +...
> +cfg.sched_move_quota = -1
> +---
> +...
> +util.check_error(lcfg.check, cfg)
> +---
> +- Scheduler bucket move quota must be non-negative number
> +...
> +cfg.sched_move_quota = nil
> +---
> +...
> diff --git a/test/unit/config.test.lua b/test/unit/config.test.lua
> index a1c9f07..473e460 100644
> --- a/test/unit/config.test.lua
> +++ b/test/unit/config.test.lua
> @@ -241,3 +241,26 @@ cfg.rebalancer_max_sending = nil
>   --
>   cfg.collect_bucket_garbage_interval = 100
>   _ = lcfg.check(cfg)
> +
> +--
> +-- gh-147: router map-reduce. It adds scheduler options on the storage.
> +--
> +cfg.sched_ref_quota = 100
> +_ = lcfg.check(cfg)
> +cfg.sched_ref_quota = 1
> +_ = lcfg.check(cfg)
> +cfg.sched_ref_quota = 0
> +_ = lcfg.check(cfg)
> +cfg.sched_ref_quota = -1
> +util.check_error(lcfg.check, cfg)
> +cfg.sched_ref_quota = nil
> +
> +cfg.sched_move_quota = 100
> +_ = lcfg.check(cfg)
> +cfg.sched_move_quota = 1
> +_ = lcfg.check(cfg)
> +cfg.sched_move_quota = 0
> +_ = lcfg.check(cfg)
> +cfg.sched_move_quota = -1
> +util.check_error(lcfg.check, cfg)
> +cfg.sched_move_quota = nil
> diff --git a/vshard/cfg.lua b/vshard/cfg.lua
> index 63d5414..30f8794 100644
> --- a/vshard/cfg.lua
> +++ b/vshard/cfg.lua
> @@ -274,6 +274,14 @@ local cfg_template = {
>           type = 'string', name = 'Discovery mode: on, off, once',
>           is_optional = true, default = 'on', check = check_discovery_mode
>       },
> +    sched_ref_quota = {
> +        name = 'Scheduler storage ref quota', type = 'non-negative number',
> +        is_optional = true, default = consts.DEFAULT_SCHED_REF_QUOTA
> +    },
> +    sched_move_quota = {
> +        name = 'Scheduler bucket move quota', type = 'non-negative number',
> +        is_optional = true, default = consts.DEFAULT_SCHED_MOVE_QUOTA
> +    },
>   }
>   
>   --
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index 0ffe0e2..47a893b 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -41,6 +41,11 @@ return {
>       GC_BACKOFF_INTERVAL = 5,
>       RECOVERY_BACKOFF_INTERVAL = 5,
>       COLLECT_LUA_GARBAGE_INTERVAL = 100;
> +    DEFAULT_BUCKET_SEND_TIMEOUT = 10,
> +    DEFAULT_BUCKET_RECV_TIMEOUT = 10,
> +
> +    DEFAULT_SCHED_REF_QUOTA = 300,
> +    DEFAULT_SCHED_MOVE_QUOTA = 1,
>   
>       DISCOVERY_IDLE_INTERVAL = 10,
>       DISCOVERY_WORK_INTERVAL = 1,
> diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
> index 7c1e97d..396664a 100644
> --- a/vshard/storage/CMakeLists.txt
> +++ b/vshard/storage/CMakeLists.txt
> @@ -1,2 +1,2 @@
> -install(FILES init.lua reload_evolution.lua ref.lua
> +install(FILES init.lua reload_evolution.lua ref.lua sched.lua
>           DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 2957f48..31f668f 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -17,7 +17,7 @@ if rawget(_G, MODULE_INTERNALS) then
>           'vshard.replicaset', 'vshard.util',
>           'vshard.storage.reload_evolution',
>           'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
> -        'vshard.heap', 'vshard.storage.ref',
> +        'vshard.heap', 'vshard.storage.ref', 'vshard.storage.sched',
>       }
>       for _, module in pairs(vshard_modules) do
>           package.loaded[module] = nil
> @@ -32,6 +32,7 @@ local util = require('vshard.util')
>   local lua_gc = require('vshard.lua_gc')
>   local lregistry = require('vshard.registry')
>   local lref = require('vshard.storage.ref')
> +local lsched = require('vshard.storage.sched')
>   local reload_evolution = require('vshard.storage.reload_evolution')
>   local fiber_cond_wait = util.fiber_cond_wait
>   local bucket_ref_new
> @@ -1142,16 +1143,33 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>               return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
>                                         from)
>           end
> -        if lref.count > 0 then
> -            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
> -        end
>           if is_this_replicaset_locked() then
>               return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>           end
>           if not bucket_receiving_quota_add(-1) then
>               return nil, lerror.vshard(lerror.code.TOO_MANY_RECEIVING)
>           end
> -        _bucket:insert({bucket_id, recvg, from})
> +        local timeout = opts and opts.timeout or
> +                        consts.DEFAULT_BUCKET_SEND_TIMEOUT
> +        local ok, err = lsched.move_start(timeout)
> +        if not ok then
> +            return nil, err
> +        end
> +        assert(lref.count == 0)
> +        -- Move schedule is done only for the time of _bucket update.
> +        -- The reason is that one bucket_send() calls bucket_recv() on the
> +        -- remote storage multiple times. If the latter would schedule new moves
> +        -- on each call, it could happen that the scheduler would block it in
> +        -- favor of refs right in the middle of bucket_send().
> +        -- It would lead to a deadlock, because refs won't be able to start -
> +        -- the bucket won't be writable.
> +        -- This way still provides fair scheduling, but does not have the
> +        -- described issue.
> +        ok, err = pcall(_bucket.insert, _bucket, {bucket_id, recvg, from})
> +        lsched.move_end(1)
> +        if not ok then
> +            return nil, lerror.make(err)
> +        end
>       elseif b.status ~= recvg then
>           local msg = string.format("bucket state is changed: was receiving, "..
>                                     "became %s", b.status)
> @@ -1434,7 +1452,7 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>       ref.rw_lock = true
>       exception_guard.ref = ref
>       exception_guard.drop_rw_lock = true
> -    local timeout = opts and opts.timeout or 10
> +    local timeout = opts and opts.timeout or consts.DEFAULT_BUCKET_SEND_TIMEOUT
>       local deadline = fiber_clock() + timeout
>       while ref.rw ~= 0 do
>           timeout = deadline - fiber_clock()
> @@ -1446,9 +1464,6 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>   
>       local _bucket = box.space._bucket
>       local bucket = _bucket:get({bucket_id})
> -    if lref.count > 0 then
> -        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
> -    end
>       if is_this_replicaset_locked() then
>           return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>       end
> @@ -1468,7 +1483,25 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>       local idx = M.shard_index
>       local bucket_generation = M.bucket_generation
>       local sendg = consts.BUCKET.SENDING
> -    _bucket:replace({bucket_id, sendg, destination})
> +
> +    local ok, err = lsched.move_start(timeout)
> +    if not ok then
> +        return nil, err
> +    end
> +    assert(lref.count == 0)
> +    -- Move is scheduled only for the time of _bucket update because:
> +    --
> +    -- * it is consistent with bucket_recv() (see its comments);
> +    --
> +    -- * gives the same effect as if move was in the scheduler for the whole
> +    --   bucket_send() time, because refs won't be able to start anyway - the
> +    --   bucket is not writable.
> +    ok, err = pcall(_bucket.replace, _bucket, {bucket_id, sendg, destination})
> +    lsched.move_end(1)
> +    if not ok then
> +        return nil, lerror.make(err)
> +    end
> +
>       -- From this moment the bucket is SENDING. Such a status is
>       -- even stronger than the lock.
>       ref.rw_lock = false
> @@ -2542,6 +2575,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
>           M.bucket_on_replace = bucket_generation_increment
>       end
>   
> +    lsched.cfg(vshard_cfg)
>       lreplicaset.rebind_replicasets(new_replicasets, M.replicasets)
>       lreplicaset.outdate_replicasets(M.replicasets)
>       M.replicasets = new_replicasets
> diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
> index 7589cb9..2daad6b 100644
> --- a/vshard/storage/ref.lua
> +++ b/vshard/storage/ref.lua
> @@ -33,6 +33,7 @@ local lregistry = require('vshard.registry')
>   local fiber_clock = lfiber.clock
>   local fiber_yield = lfiber.yield
>   local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
> +local TIMEOUT_INFINITY = lconsts.TIMEOUT_INFINITY
>   local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
>   
>   --
> @@ -88,6 +89,7 @@ local function ref_session_new(sid)
>       -- Cache global session storages as upvalues to save on M indexing.
>       local global_heap = M.session_heap
>       local global_map = M.session_map
> +    local sched = lregistry.storage_sched
>   
>       local function ref_session_discount(self, del_count)
>           local new_count = M.count - del_count
> @@ -97,6 +99,8 @@ local function ref_session_new(sid)
>           new_count = count - del_count
>           assert(new_count >= 0)
>           count = new_count
> +
> +        sched.ref_end(del_count)
>       end
>   
>       local function ref_session_update_deadline(self)
> @@ -310,10 +314,17 @@ local function ref_add(rid, sid, timeout)
>       local deadline = now + timeout
>       local ok, err, session
>       local storage = lregistry.storage
> +    local sched = lregistry.storage_sched
> +
> +    timeout, err = sched.ref_start(timeout)
> +    if not timeout then
> +        return nil, err
> +    end
> +
>       while not storage.bucket_are_all_rw() do
>           ok, err = storage.bucket_generation_wait(timeout)
>           if not ok then
> -            return nil, err
> +            goto fail_sched
>           end
>           now = fiber_clock()
>           timeout = deadline - now
> @@ -322,7 +333,13 @@ local function ref_add(rid, sid, timeout)
>       if not session then
>           session = ref_session_new(sid)
>       end
> -    return session:add(rid, deadline, now)
> +    ok, err = session:add(rid, deadline, now)
> +    if ok then
> +        return true
> +    end
> +::fail_sched::
> +    sched.ref_end(1)
> +    return nil, err
>   end
>   
>   local function ref_use(rid, sid)
> @@ -341,6 +358,14 @@ local function ref_del(rid, sid)
>       return session:del(rid)
>   end
>   
> +local function ref_next_deadline()
> +    local session = M.session_heap:top()
> +    if not session then
> +        return fiber_clock() + TIMEOUT_INFINITY
> +    end

Does it make sense? inf + fiber_clock() = inf
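
A minimal sketch of what I mean, in plain Lua. The values are my assumptions and are not taken from vshard.consts: `inf_timeout` stands in for a truly infinite timeout, `big_timeout` for a merely large finite one, and `now` is a made-up clock reading instead of fiber_clock().

```lua
-- Hypothetical stand-ins, not the real vshard constants.
local inf_timeout = math.huge          -- truly infinite timeout
local big_timeout = 500 * 365 * 86400  -- large but finite timeout, in seconds
local now = 12345.678                  -- made-up clock value

-- Adding a finite clock value to infinity changes nothing.
local deadline_inf = now + inf_timeout
assert(deadline_inf == inf_timeout)

-- With a finite constant the sum is a distinct, finite deadline.
local deadline_big = now + big_timeout
assert(deadline_big ~= big_timeout)
assert(deadline_big < math.huge)
```

If the constant is really infinite, the addition is redundant and an infinite deadline could be returned directly. If it is only a large finite number, the sum is a finite deadline and the addition does matter.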


> +    return session.deadline
> +end
> +
>   local function ref_kill_session(sid)
>       local session = M.session_map[sid]
>       if session then
> @@ -366,6 +391,7 @@ M.add = ref_add
>   M.use = ref_use
>   M.cfg = ref_cfg
>   M.kill = ref_kill_session
> +M.next_deadline = ref_next_deadline
>   lregistry.storage_ref = M
>   
>   return M
> diff --git a/vshard/storage/sched.lua b/vshard/storage/sched.lua
> new file mode 100644
> index 0000000..0ac71f4
> --- /dev/null
> +++ b/vshard/storage/sched.lua
> @@ -0,0 +1,231 @@
> +--
> +-- Scheduler module ensures fair time sharing between incompatible operations:
> +-- storage refs and bucket moves.
> +-- Storage ref is supposed to prevent all bucket moves and provide safe
> +-- environment for all kinds of possible requests on entire dataset of all
> +-- spaces stored on the instance.
> +-- Bucket move, on the contrary, wants to make part of the dataset temporarily
> +-- unusable.
> +-- Without a scheduler it would be possible to always keep at least one ref on
> +-- the storage and block bucket moves forever. Or vice versa - during
> +-- rebalancing block all incoming refs for the entire time of data migration,
> +-- essentially making map-reduce not usable since it heavily depends on refs.
> +--
> +-- The scheduler divides storage time between refs and moves so both of them can
> +-- execute without blocking each other. Division proportions depend on the
> +-- configuration settings.
> +--
> +-- Idea of non-blockage is based on quotas and strikes. Move and ref both have
> +-- quotas. When one op executes more than quota requests in a row (makes a
> +-- strike) while the other op has queued requests, the first op stops accepting
> +-- new requests until the other op executes.
> +--
> +
> +local MODULE_INTERNALS = '__module_vshard_storage_sched'
> +-- Update when change behaviour of anything in the file, to be able to reload.
> +local MODULE_VERSION = 1
> +
> +local lfiber = require('fiber')
> +local lerror = require('vshard.error')
> +local lconsts = require('vshard.consts')
> +local lregistry = require('vshard.registry')
> +local lutil = require('vshard.util')
> +local fiber_clock = lfiber.clock
> +local fiber_cond_wait = lutil.fiber_cond_wait
> +local fiber_is_self_canceled = lutil.fiber_is_self_canceled
> +
> +local M = rawget(_G, MODULE_INTERNALS)
> +if not M then
> +    M = {
> +        ---------------- Common module attributes ----------------
> +        module_version = MODULE_VERSION,
> +        -- Scheduler condition is signaled every time anything significant
> +        -- happens - count of an operation type drops to 0, or quota increased,
> +        -- etc.
> +        cond = lfiber.cond(),
> +
> +        -------------------------- Refs --------------------------
> +        -- Number of ref requests waiting for start.
> +        ref_queue = 0,
> +        -- Number of ref requests being executed. It is the same as ref's module
> +        -- counter, but is duplicated here for the sake of isolation and
> +        -- symmetry with moves.
> +        ref_count = 0,
> +        -- Number of ref requests executed in a row. When becomes bigger than
> +        -- quota, any next queued move blocks new refs.
> +        ref_strike = 0,
> +        ref_quota = lconsts.DEFAULT_SCHED_REF_QUOTA,
> +
> +        ------------------------- Moves --------------------------
> +        -- Number of move requests waiting for start.
> +        move_queue = 0,
> +        -- Number of move requests being executed.
> +        move_count = 0,
> +        -- Number of move requests executed in a row. When becomes bigger than
> +        -- quota, any next queued ref blocks new moves.
> +        move_strike = 0,
> +        move_quota = lconsts.DEFAULT_SCHED_MOVE_QUOTA,
> +    }
> +else
> +    return M
> +end
> +
> +local function sched_wait_anything(timeout)
> +    return fiber_cond_wait(M.cond, timeout)
> +end
> +
> +--
> +-- Return the remaining timeout in case there was a yield. This helps to save
> +-- current clock get in the caller code if there were no yields.
> +--
> +local function sched_ref_start(timeout)
> +    local deadline = fiber_clock() + timeout

Let's compute the deadline after the fast-path check to avoid an unnecessary
fiber_clock() call.

There are several similar places below. Please fix them as well.
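
Roughly what I mean, as a sketch only. The clock is a fake counter so the snippet runs without Tarantool, the state table is trimmed down, and `ref_start_sketch` is a hypothetical name; the waiting loop is left out:

```lua
-- Fake clock: advances by one "second" per call and counts the calls.
-- In the patch this would be fiber.clock().
local clock_calls = 0
local function fiber_clock()
    clock_calls = clock_calls + 1
    return clock_calls
end

-- Trimmed-down scheduler state, only the fields the sketch touches.
local M = {move_count = 0, move_queue = 0, ref_count = 0, ref_strike = 0}

local function ref_start_sketch(timeout)
    -- Fast path first: no moves running or queued, no deadline needed,
    -- so the clock is not touched at all.
    if M.move_count == 0 and M.move_queue == 0 then
        M.ref_count = M.ref_count + 1
        M.ref_strike = M.ref_strike + 1
        return timeout
    end
    -- Slow path only: now the deadline is worth computing.
    local deadline = fiber_clock() + timeout
    -- The real code would wait on M.cond here and retry. The sketch just
    -- reports how much of the timeout is left after one more clock read.
    return deadline - fiber_clock()
end
```

With this ordering the common case (no moves at all) costs zero clock reads.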

> +    local ok, err
> +    -- Fast-path. Moves are extremely rare. No need to inc-dec the ref queue
> +    -- then nor try to start some loops.
> +    if M.move_count == 0 and M.move_queue == 0 then
> +        goto success
> +    end
> +
> +    M.ref_queue = M.ref_queue + 1
> +
> +::retry::
> +    if M.move_count > 0 then
> +        goto wait_and_retry
> +    end
> +    -- Even if move count is zero, must ensure the time usage is fair. Does not
> +    -- matter in case the moves have no quota at all. That allows to ignore them
> +    -- infinitely until all refs end voluntarily.
> +    if M.move_queue > 0 and M.ref_strike >= M.ref_quota and
> +       M.move_quota > 0 then
> +        goto wait_and_retry
> +    end
> +
> +    M.ref_queue = M.ref_queue - 1
> +
> +::success::
> +    M.ref_count = M.ref_count + 1
> +    M.ref_strike = M.ref_strike + 1
> +    M.move_strike = 0
> +    do return timeout end
> +
> +::wait_and_retry::
> +    ok, err = sched_wait_anything(timeout)
> +    if not ok then
> +        M.ref_queue = M.ref_queue - 1
> +        return nil, err
> +    end
> +    timeout = deadline - fiber_clock()
> +    goto retry
> +end
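
To check my understanding of the admission rule in the function above, here it is restated as a pure predicate. This is a sketch: `ref_may_start` is a hypothetical helper, and the state tables are made up for illustration.

```lua
-- Returns true when a new ref may start right now, false when it has to wait.
-- 's' has the same field names as the scheduler's M table in the patch.
local function ref_may_start(s)
    -- A running move always blocks new refs.
    if s.move_count > 0 then
        return false
    end
    -- No running moves. Refs still yield to queued moves once the ref
    -- strike reached the quota, unless moves have no quota at all.
    if s.move_queue > 0 and s.ref_strike >= s.ref_quota and
       s.move_quota > 0 then
        return false
    end
    return true
end

-- Made-up states: an idle storage, and a storage where refs used up their
-- quota while one move is waiting in the queue.
local idle = {move_count = 0, move_queue = 0, ref_strike = 0,
              ref_quota = 300, move_quota = 1}
local exhausted = {move_count = 0, move_queue = 1, ref_strike = 300,
                   ref_quota = 300, move_quota = 1}
print(ref_may_start(idle), ref_may_start(exhausted))
```

So with move_quota = 0 a queued move never stops new refs, it only waits until ref_count drops to zero.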
> +
> +local function sched_ref_end(count)
> +    count = M.ref_count - count
> +    M.ref_count = count
> +    if count == 0 and M.move_queue > 0 then
> +        M.cond:broadcast()
> +    end
> +end
> +
> +--
> +-- Return the remaining timeout in case there was a yield. This helps to save
> +-- current clock get in the caller code if there were no yields.
> +--
> +local function sched_move_start(timeout)
> +    local deadline = fiber_clock() + timeout
> +    local ok, err, ref_deadline
> +    local lref = lregistry.storage_ref
> +    -- Fast-path. Refs are not extremely rare *when used*. But they are not
> +    -- expected to be used in a lot of installations. So most of the times the
> +    -- moves should work right away.
> +    if M.ref_count == 0 and M.ref_queue == 0 then
> +        goto success
> +    end
> +
> +    M.move_queue = M.move_queue + 1
> +
> +::retry::
> +    if M.ref_count > 0 then
> +        ref_deadline = lref.next_deadline()
> +        if ref_deadline < deadline then
> +            timeout = ref_deadline - fiber_clock()
> +        end
> +        ok, err = sched_wait_anything(timeout)
> +        timeout = deadline - fiber_clock()
> +        if ok then
> +            goto retry
> +        end
> +        if fiber_is_self_canceled() then
> +            goto fail
> +        end
> +        -- Even if the timeout has expired already (or was 0 from the
> +        -- beginning), it is still possible the move can be started if all the
> +        -- present refs are expired too and can be collected.
> +        lref.gc()
> +        -- GC could yield - need to refetch the clock again.
> +        timeout = deadline - fiber_clock()
> +        if M.ref_count > 0 then
> +            if timeout < 0 then
> +                goto fail
> +            end
> +            goto retry
> +        end
> +    end
> +
> +    if M.ref_queue > 0 and M.move_strike >= M.move_quota and
> +       M.ref_quota > 0 then
> +        ok, err = sched_wait_anything(timeout)
> +        if not ok then
> +            goto fail
> +        end
> +        timeout = deadline - fiber_clock()
> +        goto retry
> +    end
> +
> +    M.move_queue = M.move_queue - 1
> +
> +::success::
> +    M.move_count = M.move_count + 1
> +    M.move_strike = M.move_strike + 1
> +    M.ref_strike = 0
> +    do return timeout end
> +
> +::fail::
> +    M.move_queue = M.move_queue - 1
> +    return nil, err
> +end
> +
> +local function sched_move_end(count)
> +    count = M.move_count - count
> +    M.move_count = count
> +    if count == 0 and M.ref_queue > 0 then
> +        M.cond:broadcast()
> +    end
> +end
> +
> +local function sched_cfg(cfg)
> +    local new_ref_quota = cfg.sched_ref_quota
> +    local new_move_quota = cfg.sched_move_quota
> +
> +    if new_ref_quota then
> +        if new_ref_quota > M.ref_quota then
> +            M.cond:broadcast()
> +        end
> +        M.ref_quota = new_ref_quota
> +    end
> +    if new_move_quota then
> +        if new_move_quota > M.move_quota then
> +            M.cond:broadcast()
> +        end
> +        M.move_quota = new_move_quota
> +    end
> +end
> +
> +M.ref_start = sched_ref_start
> +M.ref_end = sched_ref_end
> +M.move_start = sched_move_start
> +M.move_end = sched_move_end
> +M.cfg = sched_cfg
> +lregistry.storage_sched = M
> +
> +return M


* Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw() Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-02-24 22:04     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-24 10:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks a lot for your patch! See 5 comments below.


On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> Closes #147

Will read-only map-reduce functions be done in the scope of a separate
issue/patch?

I know about #173, but it seems we need to keep the information about the
map_callro function somewhere.

> @TarantoolBot document
> Title: vshard.router.map_callrw()
>
> `vshard.router.map_callrw()` implements consistent map-reduce over
> the entire cluster. Consistency means all the data was accessible,
> and didn't move during map requests execution.
>
> It is useful when need to access potentially all the data in the
> cluster or simply huge number of buckets scattered over the
> instances and whose individual `vshard.router.call()` would take
> too long.
>
> `Map_callrw()` takes name of the function to call on the storages,
> arguments in the format of array, and not required options map.
> The only supported option for now is timeout which is applied to
> the entire call. Not to individual calls for each storage.
> ```
> vshard.router.map_callrw(func_name, args[, {timeout = <seconds>}])
> ```
>
> The chosen function is called on the master node of each
> replicaset with the given arguments.
>
> In case of success `vshard.router.map_callrw()` returns a map with
> replicaset UUIDs as keys and results of the user's function as
> values, like this:
> ```
> {uuid1 = {res1}, uuid2 = {res2}, ...}
> ```
> If the function returned `nil` or `box.NULL` from one of the
> storages, it won't be present in the result map.
>
> In case of fail it returns nil, error object, and optional
> replicaset UUID where the error happened. UUID may not be returned
> if the error wasn't about a concrete replicaset.
>
> For instance, the method fails if not all buckets were found even
> if all replicasets were scanned successfully.
>
> Handling the result looks like this:
> ```Lua
> res, err, uuid = vshard.router.map_callrw(...)
> if not res then
>      -- Error.
>      -- 'err' - error object. 'uuid' - optional UUID of replicaset
>      -- where the error happened.
>      ...
> else
>      -- Success.
>      for uuid, value in pairs(res) do
>          ...
>      end
> end
> ```
>
> Map-Reduce in vshard works in 3 stages: Ref, Map, Reduce. Ref is
> an internal stage which is supposed to ensure data consistency
> during user's function execution on all nodes.
>
> Reduce is not performed by vshard. It is what user's code does
> with results of `map_callrw()`.
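
Maybe the doc could show a tiny Reduce step too. A sketch, assuming the result map has the `{uuid = {res}}` shape described above and the user's function returns a number; `reduce_sum`, the UUID strings and the counts are all made up:

```lua
-- Sum the first returned value of every replicaset. Storages whose function
-- returned nil are absent from the map, so they are skipped naturally.
local function reduce_sum(res)
    local total = 0
    for _, value in pairs(res) do
        total = total + value[1]
    end
    return total
end

-- Stand-in for what map_callrw() would return on success.
local fake_res = {
    ['uuid-1'] = {10},
    ['uuid-2'] = {32},
}
print(reduce_sum(fake_res))
```

In real code `fake_res` would be the first return value of `vshard.router.map_callrw(...)`, checked for nil first.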
>
> Consistency, as it is defined for map-reduce, is not compatible
> with rebalancing. Because any bucket move would make the sender
> and receiver nodes 'inconsistent' - it is not possible to call a
> function on them which could simply access all the data without
> doing `vshard.storage.bucket_ref()`.
>
> This makes Ref stage very intricate as it must work together with
> rebalancer to ensure neither of them block each other.
>
> For this storage has a scheduler specifically for bucket moves and
> storage refs which shares storage time between them fairly.
>
> Definition of fairness depends on how long and frequent the moves
> and refs are. This can be configured using storage options
> `sched_move_quota` and `sched_ref_quota`. See more details about
> them in the corresponding doc section.
>
> The scheduler configuration may affect map-reduce requests if they
> are used a lot during rebalancing.
>
> Keep in mind that it is not a good idea to use too big timeouts
> for `map_callrw()`. Because the router will try to block the
> bucket moves for the given timeout on all storages. And in case
> something will go wrong, the block will remain for the entire
> timeout. This means, in particular, having the timeout longer
> than, say, minutes is a super bad way to go unless it is for
> tests only.
>
> Also it is important to remember that `map_callrw()` does not
> work on replicas. It works only on masters. This makes it unusable
> if at least one replicaset has its master node down.
> ---
>   test/router/map-reduce.result   | 636 ++++++++++++++++++++++++++++++++
>   test/router/map-reduce.test.lua | 258 +++++++++++++
>   test/router/router.result       |   9 +-
>   test/upgrade/upgrade.result     |   5 +-
>   vshard/replicaset.lua           |  34 ++
>   vshard/router/init.lua          | 180 +++++++++
>   vshard/storage/init.lua         |  47 +++
>   7 files changed, 1164 insertions(+), 5 deletions(-)
>   create mode 100644 test/router/map-reduce.result
>   create mode 100644 test/router/map-reduce.test.lua
>
> diff --git a/test/router/map-reduce.result b/test/router/map-reduce.result
> new file mode 100644
> index 0000000..1e8995a
> --- /dev/null
> +++ b/test/router/map-reduce.result
> @@ -0,0 +1,636 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> + | ---
> + | ...
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> + | ---
> + | ...
> +test_run:create_cluster(REPLICASET_1, 'router')
> + | ---
> + | ...
> +test_run:create_cluster(REPLICASET_2, 'router')
> + | ---
> + | ...
> +util = require('util')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> + | ---
> + | ...
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> + | ---
> + | ...
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> + | ---
> + | ...
> +util.push_rs_filters(test_run)
> + | ---
> + | ...
> +_ = test_run:cmd("create server router_1 with script='router/router_1.lua'")
> + | ---
> + | ...
> +_ = test_run:cmd("start server router_1")
> + | ---
> + | ...
> +
> +_ = test_run:switch("router_1")
> + | ---
> + | ...
> +util = require('util')
> + | ---
> + | ...
> +
> +--
> +-- gh-147: consistent map-reduce.
> +--
> +big_timeout = 1000000
> + | ---
> + | ...
> +big_timeout_opts = {timeout = big_timeout}
> + | ---
> + | ...
> +vshard.router.cfg(cfg)
> + | ---
> + | ...
> +vshard.router.bootstrap(big_timeout_opts)
> + | ---
> + | - true
> + | ...
> +-- Trivial basic sanity test. Multireturn is not supported, should be truncated.
> +vshard.router.map_callrw('echo', {1, 2, 3}, big_timeout_opts)
> + | ---
> + | - <replicaset_2>:
> + |   - 1
> + |   <replicaset_1>:
> + |   - 1
> + | ...
> +
> +--
> +-- Fail during connecting to storages. For the succeeded storages the router
> +-- tries to send unref.
> +--
> +timeout = 0.001
> + | ---
> + | ...
> +timeout_opts = {timeout = timeout}
> + | ---
> + | ...
> +
> +test_run:cmd('stop server storage_1_a')
> + | ---
> + | - true
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Timeout exceeded
> + | ...
> +-- Even if ref was sent successfully to storage_2_a, it was deleted before
> +-- router returned an error.
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +-- Wait because unref is sent asynchronously. Could arrive not immediately.
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +test_run:cmd('start server storage_1_a')
> + | ---
> + | - true
> + | ...
> +-- Works again - router waited for connection being established.
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | - <replicaset_2>:
> + |   - 1
> + |   <replicaset_1>:
> + |   - 1
> + | ...
> +
> +--
> +-- Do all the same but with another storage being stopped. The same test is done
> +-- again because can't tell at which of the tests to where the router will go
> +-- first.
> +--
> +test_run:cmd('stop server storage_2_a')
> + | ---
> + | - true
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Timeout exceeded
> + | ...
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +test_run:cmd('start server storage_2_a')
> + | ---
> + | - true
> + | ...
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | - <replicaset_2>:
> + |   - 1
> + |   <replicaset_1>:
> + |   - 1
> + | ...
> +
> +--
> +-- Fail at ref stage handling. Unrefs are sent to cancel those refs which
> +-- succeeded. To simulate a ref fail make the router think there is a moving
> +-- bucket.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lsched = require('vshard.storage.sched')
> + | ---
> + | ...
> +big_timeout = 1000000
> + | ---
> + | ...
> +lsched.move_start(big_timeout)
> + | ---
> + | - 1000000
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Timeout exceeded
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +lsched = require('vshard.storage.sched')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Do all the same with another storage being busy with a 'move'.
> +--
> +big_timeout = 1000000
> + | ---
> + | ...
> +lsched.move_start(big_timeout)
> + | ---
> + | - 1000000
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lref = require('vshard.storage.ref')
> + | ---
> + | ...
> +lsched.move_end(1)
> + | ---
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Timeout exceeded
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +lsched.move_end(1)
> + | ---
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | - <replicaset_2>:
> + |   - 1
> + |   <replicaset_1>:
> + |   - 1
> + | ...
> +
> +--
> +-- Ref can fail earlier than by a timeout. Router still should broadcast unrefs
> +-- correctly. To simulate ref fail add a duplicate manually.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +box.schema.user.grant('storage', 'super')
> + | ---
> + | ...
> +router_sid = nil
> + | ---
> + | ...
> +function save_router_sid()                                                      \
> +    router_sid = box.session.id()                                               \
> +end
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +box.schema.user.grant('storage', 'super')
> + | ---
> + | ...
> +router_sid = nil
> + | ---
> + | ...
> +function save_router_sid()                                                      \
> +    router_sid = box.session.id()                                               \
> +end
> + | ---
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +vshard.router.map_callrw('save_router_sid', {}, big_timeout_opts)
> + | ---
> + | - []
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +lref.add(1, router_sid, big_timeout)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +vshard.router.internal.ref_id = 1
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - 'Can not add a storage ref: duplicate ref'
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +_ = lref.del(1, router_sid)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +lref.add(1, router_sid, big_timeout)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +vshard.router.internal.ref_id = 1
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - 'Can not add a storage ref: duplicate ref'
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +_ = lref.del(1, router_sid)
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Fail if some buckets are not visible. Even if all the known replicasets were
> +-- scanned. It means consistency violation.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +bucket_id = box.space._bucket.index.pk:min().id
> + | ---
> + | ...
> +vshard.storage.bucket_force_drop(bucket_id)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - 1 buckets are not discovered
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +vshard.storage.bucket_force_create(bucket_id)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +bucket_id = box.space._bucket.index.pk:min().id
> + | ---
> + | ...
> +vshard.storage.bucket_force_drop(bucket_id)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> + | ---
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - 1 buckets are not discovered
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +vshard.storage.bucket_force_create(bucket_id)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Storage map unit tests.
> +--
> +
> +-- Map fails not being able to use the ref.
> +ok, err = vshard.storage._call('storage_map', 0, 'echo', {1})
> + | ---
> + | ...
> +ok, err.message
> + | ---
> + | - null
> + | - 'Can not use a storage ref: no session'
> + | ...
> +
> +-- Map fails and clears the ref when the user function fails.
> +vshard.storage._call('storage_ref', 0, big_timeout)
> + | ---
> + | - 1500
> + | ...
> +assert(lref.count == 1)
> + | ---
> + | - true
> + | ...
> +ok, err = vshard.storage._call('storage_map', 0, 'raise_client_error', {})
> + | ---
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - Unknown error
> + | ...
> +
> +-- Map fails gracefully when couldn't delete the ref.
> +vshard.storage._call('storage_ref', 0, big_timeout)
> + | ---
> + | - 1500
> + | ...
> +ok, err = vshard.storage._call('storage_map', 0, 'vshard.storage._call',        \
> +                               {'storage_unref', 0})
> + | ---
> + | ...
> +assert(lref.count == 0)
> + | ---
> + | - true
> + | ...
> +assert(not ok and err.message)
> + | ---
> + | - 'Can not delete a storage ref: no ref'
> + | ...
> +
> +--
> +-- Map fail is handled and the router tries to send unrefs.
> +--
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +need_throw = true
> + | ---
> + | ...
> +function map_throw()                                                            \
> +    if need_throw then                                                          \
> +        raise_client_error()                                                    \
> +    end                                                                         \
> +    return '+'                                                                  \
> +end
> + | ---
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +need_throw = false
> + | ---
> + | ...
> +function map_throw()                                                            \
> +    if need_throw then                                                          \
> +        raise_client_error()                                                    \
> +    end                                                                         \
> +    return '+'                                                                  \
> +end
> + | ---
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
> + | ---
> + | ...
> +ok, err.message
> + | ---
> + | - null
> + | - Unknown error
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +need_throw = false
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +need_throw = true
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('router_1')
> + | ---
> + | ...
> +ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
> + | ---
> + | ...
> +ok, err.message
> + | ---
> + | - null
> + | - Unknown error
> + | ...
> +
> +_ = test_run:switch('storage_1_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('storage_2_a')
> + | ---
> + | ...
> +test_run:wait_cond(function() return lref.count == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +_ = test_run:switch('default')
> + | ---
> + | ...
> +_ = test_run:cmd("stop server router_1")
> + | ---
> + | ...
> +_ = test_run:cmd("cleanup server router_1")
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_1)
> + | ---
> + | ...
> +test_run:drop_cluster(REPLICASET_2)
> + | ---
> + | ...
> +_ = test_run:cmd('clear filter')
> + | ---
> + | ...
> diff --git a/test/router/map-reduce.test.lua b/test/router/map-reduce.test.lua
> new file mode 100644
> index 0000000..3b63248
> --- /dev/null
> +++ b/test/router/map-reduce.test.lua
> @@ -0,0 +1,258 @@
> +test_run = require('test_run').new()
> +REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
> +REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
> +test_run:create_cluster(REPLICASET_1, 'router')
> +test_run:create_cluster(REPLICASET_2, 'router')
> +util = require('util')
> +util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
> +util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> +util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
> +util.push_rs_filters(test_run)
> +_ = test_run:cmd("create server router_1 with script='router/router_1.lua'")
> +_ = test_run:cmd("start server router_1")
> +
> +_ = test_run:switch("router_1")
> +util = require('util')
> +
> +--
> +-- gh-147: consistent map-reduce.
> +--
> +big_timeout = 1000000
> +big_timeout_opts = {timeout = big_timeout}
> +vshard.router.cfg(cfg)
> +vshard.router.bootstrap(big_timeout_opts)
> +-- Trivial basic sanity test. Multireturn is not supported, should be truncated.
> +vshard.router.map_callrw('echo', {1, 2, 3}, big_timeout_opts)
> +
> +--
> +-- Fail during connecting to storages. For the succeeded storages the router
> +-- tries to send unref.
> +--
> +timeout = 0.001
> +timeout_opts = {timeout = timeout}
> +
> +test_run:cmd('stop server storage_1_a')
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> +assert(not ok and err.message)
> +-- Even if ref was sent successfully to storage_2_a, it was deleted before
> +-- router returned an error.
> +_ = test_run:switch('storage_2_a')
> +lref = require('vshard.storage.ref')
> +-- Wait because unref is sent asynchronously. Could arrive not immediately.
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('router_1')
> +test_run:cmd('start server storage_1_a')
> +-- Works again - router waited for connection being established.
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +
> +--
> +-- Do all the same but with another storage being stopped. The same test is done
> +-- again because can't tell at which of the tests to where the router will go
> +-- first.
> +--
> +test_run:cmd('stop server storage_2_a')
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> +assert(not ok and err.message)
> +_ = test_run:switch('storage_1_a')
> +lref = require('vshard.storage.ref')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('router_1')
> +test_run:cmd('start server storage_2_a')
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +
> +--
> +-- Fail at ref stage handling. Unrefs are sent to cancel those refs which
> +-- succeeded. To simulate a ref fail make the router think there is a moving
> +-- bucket.
> +--
> +_ = test_run:switch('storage_1_a')
> +lsched = require('vshard.storage.sched')
> +big_timeout = 1000000
> +lsched.move_start(big_timeout)
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_2_a')
> +lsched = require('vshard.storage.sched')
> +lref = require('vshard.storage.ref')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +--
> +-- Do all the same with another storage being busy with a 'move'.
> +--
> +big_timeout = 1000000
> +lsched.move_start(big_timeout)
> +
> +_ = test_run:switch('storage_1_a')
> +lref = require('vshard.storage.ref')
> +lsched.move_end(1)
> +assert(lref.count == 0)
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('echo', {1}, timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('storage_2_a')
> +lsched.move_end(1)
> +assert(lref.count == 0)
> +
> +_ = test_run:switch('router_1')
> +vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +
> +--
> +-- Ref can fail earlier than by a timeout. Router still should broadcast unrefs
> +-- correctly. To simulate ref fail add a duplicate manually.
> +--
> +_ = test_run:switch('storage_1_a')
> +box.schema.user.grant('storage', 'super')
> +router_sid = nil
> +function save_router_sid()                                                      \
> +    router_sid = box.session.id()                                               \
> +end
> +
> +_ = test_run:switch('storage_2_a')
> +box.schema.user.grant('storage', 'super')
> +router_sid = nil
> +function save_router_sid()                                                      \
> +    router_sid = box.session.id()                                               \
> +end
> +
> +_ = test_run:switch('router_1')
> +vshard.router.map_callrw('save_router_sid', {}, big_timeout_opts)
> +
> +_ = test_run:switch('storage_1_a')
> +lref.add(1, router_sid, big_timeout)
> +
> +_ = test_run:switch('router_1')
> +vshard.router.internal.ref_id = 1
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_1_a')
> +_ = lref.del(1, router_sid)
> +
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +lref.add(1, router_sid, big_timeout)
> +
> +_ = test_run:switch('router_1')
> +vshard.router.internal.ref_id = 1
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_2_a')
> +_ = lref.del(1, router_sid)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +--
> +-- Fail if some buckets are not visible. Even if all the known replicasets were
> +-- scanned. It means consistency violation.
> +--
> +_ = test_run:switch('storage_1_a')
> +bucket_id = box.space._bucket.index.pk:min().id
> +vshard.storage.bucket_force_drop(bucket_id)
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +vshard.storage.bucket_force_create(bucket_id)
> +
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +bucket_id = box.space._bucket.index.pk:min().id
> +vshard.storage.bucket_force_drop(bucket_id)
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('echo', {1}, big_timeout_opts)
> +assert(not ok and err.message)
> +
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +vshard.storage.bucket_force_create(bucket_id)
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +--
> +-- Storage map unit tests.
> +--
> +
> +-- Map fails not being able to use the ref.
> +ok, err = vshard.storage._call('storage_map', 0, 'echo', {1})
> +ok, err.message
> +
> +-- Map fails and clears the ref when the user function fails.
> +vshard.storage._call('storage_ref', 0, big_timeout)
> +assert(lref.count == 1)
> +ok, err = vshard.storage._call('storage_map', 0, 'raise_client_error', {})
> +assert(lref.count == 0)
> +assert(not ok and err.message)
> +
> +-- Map fails gracefully when couldn't delete the ref.
> +vshard.storage._call('storage_ref', 0, big_timeout)
> +ok, err = vshard.storage._call('storage_map', 0, 'vshard.storage._call',        \
> +                               {'storage_unref', 0})
> +assert(lref.count == 0)
> +assert(not ok and err.message)
> +
> +--
> +-- Map fail is handled and the router tries to send unrefs.
> +--
> +_ = test_run:switch('storage_1_a')
> +need_throw = true
> +function map_throw()                                                            \
> +    if need_throw then                                                          \
> +        raise_client_error()                                                    \
> +    end                                                                         \
> +    return '+'                                                                  \
> +end
> +
> +_ = test_run:switch('storage_2_a')
> +need_throw = false
> +function map_throw()                                                            \
> +    if need_throw then                                                          \
> +        raise_client_error()                                                    \
> +    end                                                                         \
> +    return '+'                                                                  \
> +end
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
> +ok, err.message
> +
> +_ = test_run:switch('storage_1_a')
> +need_throw = false
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('storage_2_a')
> +need_throw = true
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('router_1')
> +ok, err = vshard.router.map_callrw('raise_client_error', {}, big_timeout_opts)
> +ok, err.message
> +
> +_ = test_run:switch('storage_1_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('storage_2_a')
> +test_run:wait_cond(function() return lref.count == 0 end)
> +
> +_ = test_run:switch('default')
> +_ = test_run:cmd("stop server router_1")
> +_ = test_run:cmd("cleanup server router_1")
> +test_run:drop_cluster(REPLICASET_1)
> +test_run:drop_cluster(REPLICASET_2)
> +_ = test_run:cmd('clear filter')
> \ No newline at end of file
> diff --git a/test/router/router.result b/test/router/router.result
> index 3c1d073..f9ee37c 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -1163,14 +1163,15 @@ error_messages
>   - - Use replicaset:callro(...) instead of replicaset.callro(...)
>     - Use replicaset:connect_master(...) instead of replicaset.connect_master(...)
>     - Use replicaset:callre(...) instead of replicaset.callre(...)
> -  - Use replicaset:connect_replica(...) instead of replicaset.connect_replica(...)
>     - Use replicaset:down_replica_priority(...) instead of replicaset.down_replica_priority(...)
> -  - Use replicaset:callrw(...) instead of replicaset.callrw(...)
> +  - Use replicaset:connect(...) instead of replicaset.connect(...)
> +  - Use replicaset:wait_connected(...) instead of replicaset.wait_connected(...)
> +  - Use replicaset:up_replica_priority(...) instead of replicaset.up_replica_priority(...)
>     - Use replicaset:callbro(...) instead of replicaset.callbro(...)
>     - Use replicaset:connect_all(...) instead of replicaset.connect_all(...)
> +  - Use replicaset:connect_replica(...) instead of replicaset.connect_replica(...)
>     - Use replicaset:call(...) instead of replicaset.call(...)
> -  - Use replicaset:connect(...) instead of replicaset.connect(...)
> -  - Use replicaset:up_replica_priority(...) instead of replicaset.up_replica_priority(...)
> +  - Use replicaset:callrw(...) instead of replicaset.callrw(...)
>     - Use replicaset:callbre(...) instead of replicaset.callbre(...)
>   ...
>   _, replica = next(replicaset.replicas)
> diff --git a/test/upgrade/upgrade.result b/test/upgrade/upgrade.result
> index c2d54a3..833da3f 100644
> --- a/test/upgrade/upgrade.result
> +++ b/test/upgrade/upgrade.result
> @@ -162,9 +162,12 @@ vshard.storage._call ~= nil
>   vshard.storage._call('test_api', 1, 2, 3)
>    | ---
>    | - bucket_recv: true
> + |   storage_ref: true
>    |   rebalancer_apply_routes: true
> - |   test_api: true
> + |   storage_map: true
>    |   rebalancer_request_state: true
> + |   test_api: true
> + |   storage_unref: true
>    | - 1
>    | - 2
>    | - 3
> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index 7437e3b..56ea165 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -139,6 +139,39 @@ local function replicaset_connect_master(replicaset)
>       return replicaset_connect_to_replica(replicaset, master)
>   end
>   
> +--
> +-- Wait until the master instance is connected. This is necessary at least for
> +-- async requests because they fail immediately if the connection is not
> +-- established.
> +-- Returns the remaining timeout because it is expected to be used to connect to
> +-- many replicasets in a loop, where such a return saves one clock get in the
> +-- caller code and keeps that code cleaner.
> +--
> +local function replicaset_wait_connected(replicaset, timeout)
> +    local deadline = fiber_clock() + timeout
> +    local ok, res
> +    while true do
> +        local conn = replicaset_connect_master(replicaset)
> +        if conn.state == 'active' then
> +            return timeout
> +        end

Why don't you use conn:is_connected()? It also treats "fetch_schema" as an
appropriate state.

> +        -- Netbox uses fiber_cond inside, which throws an irrelevant usage error
> +        -- at negative timeout. Need to check the case manually.
> +        if timeout < 0 then
> +            return nil, lerror.timeout()
> +        end
> +        ok, res = pcall(conn.wait_connected, conn, timeout)
> +        if not ok then
> +            return nil, lerror.make(res)
> +        end
> +        if not res then
> +            return nil, lerror.timeout()
> +        end
> +        timeout = deadline - fiber_clock()
> +    end
> +    assert(false)
> +end
> +
>   --
>   -- Create net.box connections to all replicas and master.
>   --
> @@ -483,6 +516,7 @@ local replicaset_mt = {
>           connect_replica = replicaset_connect_to_replica;
>           down_replica_priority = replicaset_down_replica_priority;
>           up_replica_priority = replicaset_up_replica_priority;
> +        wait_connected = replicaset_wait_connected,
>           call = replicaset_master_call;
>           callrw = replicaset_master_call;
>           callro = replicaset_template_multicallro(false, false);
> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 97bcb0a..8abd77f 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -44,6 +44,11 @@ if not M then
>           module_version = 0,
>           -- Number of router which require collecting lua garbage.
>           collect_lua_garbage_cnt = 0,
> +
> +        ----------------------- Map-Reduce -----------------------
> +        -- Storage Ref ID. It must be unique for each ref request
> +        -- and therefore is global and monotonically growing.
> +        ref_id = 0,

Maybe 0ULL?

>       }
>   end
>   
> @@ -674,6 +679,177 @@ local function router_call(router, bucket_id, opts, ...)
>                               ...)
>   end
>   
> +local router_map_callrw
> +
> +if util.version_is_at_least(1, 10, 0) then
> +--
> +-- Consistent Map-Reduce. The given function is called on all masters in the
> +-- cluster with a guarantee that in case of success it was executed with all
> +-- buckets being accessible for reads and writes.
> +--
> +-- Consistency in scope of map-reduce means all the data was accessible, and
> +-- didn't move during map requests execution. To preserve the consistency there
> +-- is a third stage - Ref. So the algorithm is actually Ref-Map-Reduce.
> +--
> +-- Refs are broadcast before Map stage to pin the buckets to their storages, and
> +-- ensure they won't move until maps are done.
> +--
> +-- Map requests are broadcast in case all refs are done successfully. They
> +-- execute the user function + delete the refs to enable rebalancing again.
> +--
> +-- On the storages there are additional means to ensure map-reduces don't block
> +-- rebalancing forever and vice versa.
> +--
> +-- The function is not as slow as it may seem - it uses netbox's feature
> +-- is_async to send refs and maps in parallel. So cost of the function is about
> +-- 2 network exchanges to the most far storage in terms of time.
> +--
> +-- @param router Router instance to use.
> +-- @param func Name of the function to call.
> +-- @param args Function arguments passed in netbox style (as an array).
> +-- @param opts Can only contain 'timeout' as a number of seconds. Note that the
> +--     refs may end up being kept on the storages during this entire timeout if
> +--     something goes wrong. For instance, network issues appear. This means
> +--     better not use a value bigger than necessary. A stuck infinite ref can
> +--     only be dropped by this router restart/reconnect or the storage restart.
> +--
> +-- @return In case of success - a map with replicaset UUID keys and values being
> +--     what the function returned from the replicaset.
> +--
> +-- @return In case of an error - nil, error object, optional UUID of the
> +--     replicaset where the error happened. UUID may be not present if it wasn't
> +--     about concrete replicaset. For example, not all buckets were found even
> +--     though all replicasets were scanned.
> +--
> +router_map_callrw = function(router, func, args, opts)
> +    local replicasets = router.replicasets

It would be great to filter out replicasets with bucket_count = 0 and
weight = 0 here.

If such "dummy" replicasets are disabled, we get a "connection refused"
error.

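The filtering suggested above could look roughly like the sketch below. It is
plain Lua, not part of the patch: the helper name is hypothetical, and the
'weight' and 'bucket_count' field names on a replicaset object are assumptions
made for illustration.

```lua
-- Hypothetical helper, not part of the patch: keep only the replicasets that
-- may hold data. The 'weight' and 'bucket_count' fields are assumptions.
local function filter_active_replicasets(replicasets)
    local res = {}
    for uuid, rs in pairs(replicasets) do
        local is_dummy = (rs.weight or 0) == 0 and (rs.bucket_count or 0) == 0
        if not is_dummy then
            res[uuid] = rs
        end
    end
    return res
end

local replicasets = {
    ['uuid-1'] = {weight = 1, bucket_count = 1500},
    ['uuid-2'] = {weight = 0, bucket_count = 0},
    ['uuid-3'] = {weight = 0, bucket_count = 10},
}
local active = filter_active_replicasets(replicasets)
```

Only 'uuid-2' is dropped here: a replicaset with zero weight but some buckets
left still has to be referenced. If the router's bucket counts can be stale, a
zero count alone might not be a safe reason to skip a replicaset.
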
> +    local timeout = opts and opts.timeout or consts.CALL_TIMEOUT_MIN
> +    local deadline = fiber_clock() + timeout
> +    local err, err_uuid, res, ok, map
> +    local futures = {}
> +    local bucket_count = 0
> +    local opts_async = {is_async = true}
> +    local rs_count = 0
> +    local rid = M.ref_id
> +    M.ref_id = rid + 1
> +    -- Nil checks are done explicitly here (== nil instead of 'not'), because
> +    -- netbox requests return box.NULL instead of nils.
> +
> +    --
> +    -- Ref stage: send.
> +    --
> +    for uuid, rs in pairs(replicasets) do
> +        -- Netbox async requests work only with active connections. Need to wait
> +        -- for the connection explicitly.
> +        timeout, err = rs:wait_connected(timeout)
> +        if timeout == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        res, err = rs:callrw('vshard.storage._call',
> +                              {'storage_ref', rid, timeout}, opts_async)
> +        if res == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        futures[uuid] = res
> +        rs_count = rs_count + 1
> +    end
> +    map = table_new(0, rs_count)
> +    --
> +    -- Ref stage: collect.
> +    --
> +    for uuid, future in pairs(futures) do
> +        res, err = future:wait_result(timeout)
> +        -- Handle netbox error first.
> +        if res == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        -- Ref returns nil,err or bucket count.
> +        res, err = unpack(res)

Seems `res, err = res[1], res[2]` could be a bit faster.
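
To show that the two forms are interchangeable for a response of this shape,
here is a small plain-Lua sketch. The response table is an assumption made up
for the example; whether indexing is measurably faster would need a benchmark.

```lua
-- Lua 5.1/LuaJIT expose unpack() as a global, newer Lua as table.unpack().
local unpack = table.unpack or unpack

-- Shape of a decoded ref response assumed here: {bucket_count} on success.
local response = {1500}

-- Variant from the patch.
local res1, err1 = unpack(response)
-- Variant from the review comment: plain indexing, no function call.
local res2, err2 = response[1], response[2]
```

Both variants yield the same pair of values. Indexing also does not depend on
how the table length is computed, which matters only if the table has holes.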

> +        if res == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        bucket_count = bucket_count + res
> +        timeout = deadline - fiber_clock()
> +    end
> +    -- All refs are done but not all buckets are covered. This is odd and can
> +    -- mean many things. The most possible ones: 1) outdated configuration on
> +    -- the router and it does not see another replicaset with more buckets,
> +    -- 2) some buckets are simply lost or duplicated - could happen as a bug, or
> +    -- if the user does a maintenance of some kind by creating/deleting buckets.
> +    -- In both cases can't guarantee all the data would be covered by Map calls.
> +    if bucket_count ~= router.total_bucket_count then
> +        err = lerror.vshard(lerror.code.UNKNOWN_BUCKETS,
> +                            router.total_bucket_count - bucket_count)
> +        goto fail
> +    end
> +    --
> +    -- Map stage: send.
> +    --
> +    args = {'storage_map', rid, func, args}
> +    for uuid, rs in pairs(replicasets) do
> +        res, err = rs:callrw('vshard.storage._call', args, opts_async)
> +        if res == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        futures[uuid] = res
> +    end
> +    --
> +    -- Map stage: collect.
> +    --
> +    for uuid, f in pairs(futures) do
> +        res, err = f:wait_result(timeout)
> +        if res == nil then
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        -- Map returns true,res or nil,err.
> +        ok, res = unpack(res)
> +        if ok == nil then
> +            err = res
> +            err_uuid = uuid
> +            goto fail
> +        end
> +        if res ~= nil then
> +            -- Store as a table so in future it could be extended for
> +            -- multireturn.
> +            map[uuid] = {res}
> +        end
> +        timeout = deadline - fiber_clock()
> +    end
> +    do return map end
> +
> +::fail::
> +    for uuid, f in pairs(futures) do
> +        f:discard()
> +        -- Best effort to remove the created refs before exiting. Can help if
> +        -- the timeout was big and the error happened early.
> +        f = replicasets[uuid]:callrw('vshard.storage._call',
> +                                     {'storage_unref', rid}, opts_async)
> +        if f ~= nil then
> +            -- Don't care waiting for a result - no time for this. But it won't
> +            -- affect the request sending if the connection is still alive.
> +            f:discard()
> +        end
> +    end
> +    err = lerror.make(err)
> +    return nil, err, err_uuid
> +end
> +
> +-- Version >= 1.10.
> +else
> +-- Version < 1.10.
> +
> +router_map_callrw = function()
> +    error('Supported for Tarantool >= 1.10')
> +end
> +
> +end
> +
>   --
>   -- Get replicaset object by bucket identifier.
>   -- @param bucket_id Bucket identifier.
> @@ -1268,6 +1444,7 @@ local router_mt = {
>           callrw = router_callrw;
>           callre = router_callre;
>           callbre = router_callbre;
> +        map_callrw = router_map_callrw,
>           route = router_route;
>           routeall = router_routeall;
>           bucket_id = router_bucket_id,
> @@ -1365,6 +1542,9 @@ end
>   if not rawget(_G, MODULE_INTERNALS) then
>       rawset(_G, MODULE_INTERNALS, M)
>   else
> +    if not M.ref_id then
> +        M.ref_id = 0
> +    end
>       for _, router in pairs(M.routers) do
>           router_cfg(router, router.current_cfg, true)
>           setmetatable(router, router_mt)
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 31f668f..0a14440 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -2415,6 +2415,50 @@ local function storage_call(bucket_id, mode, name, args)
>       return ok, ret1, ret2, ret3
>   end
>   
> +--
> +-- Bind a new storage ref to the current box session. Is used as a part of
> +-- Map-Reduce API.
> +--
> +local function storage_ref(rid, timeout)
> +    local ok, err = lref.add(rid, box.session.id(), timeout)
> +    if not ok then
> +        return nil, err
> +    end
> +    return bucket_count()
> +end
> +
> +--
> +-- Drop a storage ref from the current box session. Is used as a part of
> +-- Map-Reduce API.
> +--
> +local function storage_unref(rid)
> +    return lref.del(rid, box.session.id())
> +end
> +
> +--
> +-- Execute a user's function under an infinite storage ref protecting from
> +-- bucket moves. The ref should exist before, and is deleted after, regardless
> +-- of the function result. Is used as a part of Map-Reduce API.
> +--
> +local function storage_map(rid, name, args)
> +    local ok, err, res
> +    local sid = box.session.id()
> +    ok, err = lref.use(rid, sid)
> +    if not ok then
> +        return nil, err
> +    end
> +    ok, res = local_call(name, args)
> +    if not ok then
> +        lref.del(rid, sid)
> +        return nil, lerror.make(res)
> +    end
> +    ok, err = lref.del(rid, sid)
> +    if not ok then
> +        return nil, err
> +    end
> +    return true, res
> +end
> +
>   local service_call_api
>   
>   local function service_call_test_api(...)
> @@ -2425,6 +2469,9 @@ service_call_api = setmetatable({
>       bucket_recv = bucket_recv,
>       rebalancer_apply_routes = rebalancer_apply_routes,
>       rebalancer_request_state = rebalancer_request_state,
> +    storage_ref = storage_ref,
> +    storage_unref = storage_unref,
> +    storage_map = storage_map,
>       test_api = service_call_test_api,
>   }, {__serialize = function(api)
>       local res = {}

* Re: [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout()
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:46     ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:46 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for the review!

On 24.02.2021 11:27, Oleg Babin wrote:
> Hi! Thanks for your patch.
> 
> Personally, I vote for dropping 1.9 support (it's already broken - #256).

It kind of works. Just with some spam in the logs. But yeah, I would like
to finally drop it. I talked to Mons and he approved this:
https://github.com/tarantool/vshard/issues/267

> But if you want to eliminate "long and ugly" ways you could do something like:
> 
> 
> ```
> 
> local make_timeout
> if box.error.new ~= nil then
>     make_timeout = function() return box.error.new(box.error.TIMEOUT) end
> else
>     make_timeout = function() return select(2, pcall(...)) end
> end
> 
> ```

Which is longer and uglier, right? Where is the win?

* Re: [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:47     ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:47 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:27, Oleg Babin wrote:
> Thanks for your patch! LGTM.
> 
> I see calls like "status_index:count({consts.BUCKET.ACTIVE})". Maybe it is worth
> caching the whole bucket stats as well?

I thought about it a lot. But realized that I need only a few cached
metrics used for most of the requests. Count of active buckets is not
one of them, and caching it would waste time on invalidating the cache
on each generation update.

Talking specifically, count({consts.BUCKET.ACTIVE}) is used only by the
rebalancer, which runs extremely rarely. So there is no win in
optimizing it for normal cluster operation.

Even now I worry about doing too much in the generation increment
trigger. To calculate and keep the stat up to date I would need to
make it more universal. So for example store number of buckets of
each type. Then I face the issues:

- In on_replace trigger I need to extract bucket status from the old
  and new tuple, update the relevant counters. I mostly worry about
  extracting the statuses (too long).

- I need to handle the rollback to somehow revert the counters back.

I could do something similar to the cache in this patch (I simply
calculate the counts on demand and invalidate them all on each generation
update), but it does not fix the real issue with the counts - counting
can take long if there are millions of buckets, and the cache will be
invalidated a lot during rebalancing. Exactly when a cache could help most.

In the end I decided not to bother with this now in scope of map-reduce.
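
The on-demand cache described above (calculate lazily, drop everything on each
generation update) can be sketched in plain Lua as follows. This is only a
model: the 'storage' table, its fields, and both functions are hypothetical
stand-ins, not the code from the patch.

```lua
-- Hypothetical model of a storage: 'buckets' stands in for the _bucket space,
-- 'generation' for the bucket generation counter.
local storage = {
    buckets = {},
    generation = 0,
    cached_count = nil,
    scans = 0,
}

-- Any change of the bucket set bumps the generation and drops the cache.
local function bucket_set(id, status)
    storage.buckets[id] = status
    storage.generation = storage.generation + 1
    storage.cached_count = nil
end

-- The count is calculated on demand and reused until the next invalidation.
local function bucket_count()
    if storage.cached_count == nil then
        local count = 0
        for _ in pairs(storage.buckets) do
            count = count + 1
        end
        storage.cached_count = count
        storage.scans = storage.scans + 1
    end
    return storage.cached_count
end

bucket_set(1, 'active')
bucket_set(2, 'active')
local first = bucket_count()   -- Full scan.
local second = bucket_count()  -- Served from the cache.
bucket_set(3, 'receiving')
local third = bucket_count()   -- Scans again after the invalidation.
```

The sketch also shows the weakness mentioned above: every bucket change forces
a new full scan on the next read, so frequent changes during rebalancing make
the cache nearly useless exactly then.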

* Re: [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait()
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:48 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:27, Oleg Babin wrote:
> Hi! Thanks for your patch. LGTM.
> 
> I see several usages of cond:wait() in code. Maybe after introducing this helper you could use it.
> 
> E.g. in "bucket_send_xc" function.

Yeah, I checked the existing usages, but they are fine as they
are.

In bucket_send_xc() it is ok to throw. This is why it is 'xc' -
'exception'. It is ok, because there are other ops which can
throw, and eventually I decided not to wrap them all into pcalls.

In global fibers I kept the normal waits, because a throw there
is fine. They don't have any finalization work to do after the waits.
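
For contrast, the difference between a throwing wait and a safe one can be
sketched in plain Lua as below. The fake condition object and the
safe_cond_wait() name are made up for the example and are not the helper from
the patch; the only assumption about a real condition variable is that wait()
returns false on a timeout and may throw.

```lua
-- Stand-in for a fiber condition variable. wait() may throw, for instance
-- when the waiting fiber is cancelled, and returns false on a timeout.
local function new_fake_cond(behaviour)
    return {
        wait = function(self, timeout)
            if behaviour == 'throw' then
                error('fiber is cancelled')
            end
            return behaviour == 'signal'
        end,
    }
end

-- Never throws: returns true on a signal, or nil plus an error otherwise.
local function safe_cond_wait(cond, timeout)
    local ok, res = pcall(cond.wait, cond, timeout)
    if not ok then
        return nil, res
    end
    if not res then
        return nil, 'timed out'
    end
    return true
end

local ok1, err1 = safe_cond_wait(new_fake_cond('signal'), 1)
local ok2, err2 = safe_cond_wait(new_fake_cond('timeout'), 1)
local ok3, err3 = safe_cond_wait(new_fake_cond('throw'), 1)
```

A caller with cleanup to do after the wait can check the returned pair and
still run its finalization, which a plain throwing wait would skip.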

* Re: [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw()
  2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:48 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:27, Oleg Babin via Tarantool-patches wrote:
> Thanks for your patch.
> 
> Seems here I should return to one of my previous e-mails.
> 
> Maybe it's reasonable to cache all bucket stats?

Responded in the other email. Short answer: the other stats are
rarely needed. Mostly during rebalancing, when the caches would be
invalidated constantly anyway.

The ones I optimized in this patchset are going to be needed
often. Super often if map-reduce is actively used.

* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:49     ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:49 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:28, Oleg Babin wrote:
> Thanks for your patch. It's a brief review - I hope I'll look once again on this patch.
> 
> Consider a question below.

A huge request - could you please remove the irrelevant parts of
the original emails from your responses? I need to scroll tons of
text to find your comments, and can accidentally miss some.

Or at least provide some markers I could grep and jump to quickly.

>> diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
>> index 3f4ed43..7c1e97d 100644
>> --- a/vshard/storage/CMakeLists.txt
>> +++ b/vshard/storage/CMakeLists.txt
>> @@ -1,2 +1,2 @@
>> -install(FILES init.lua reload_evolution.lua
>> +install(FILES init.lua reload_evolution.lua ref.lua
>>           DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
>> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
>> index c3ed236..2957f48 100644
>> --- a/vshard/storage/init.lua
>> +++ b/vshard/storage/init.lua
>> @@ -1140,6 +1142,9 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>>               return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
>>                                         from)
>>           end
>> +        if lref.count > 0 then
>> +            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
>> +        end
> 
> 
> You will remove this part in the next patch. Do you really need it? Or you add it just for tests?

For the tests and for making the patch atomic, so that it doesn't depend on the
next patch.

>>           if is_this_replicaset_locked() then
>>               return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>>           end
>> @@ -1441,6 +1446,9 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>>         local _bucket = box.space._bucket
>>       local bucket = _bucket:get({bucket_id})
>> +    if lref.count > 0 then
>> +        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
>> +    end
> 
> 
> Ditto.

* Re: [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 21:50     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 21:50 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:28, Oleg Babin wrote:
> Thanks for your patch. It's a brief review - I hope I'll look once again on this patch.
> 
> Consider 2 comments below.
> 
> 
>> diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
>> index 7589cb9..2daad6b 100644
>> --- a/vshard/storage/ref.lua
>> +++ b/vshard/storage/ref.lua
>> @@ -341,6 +358,14 @@ local function ref_del(rid, sid)
>>       return session:del(rid)
>>   end
>>   +local function ref_next_deadline()
>> +    local session = M.session_heap:top()
>> +    if not session then
>> +        return fiber_clock() + TIMEOUT_INFINITY
>> +    end
> 
> Does it make sense? inf + fiber_clock() = inf

Indeed. I could simply return the infinite deadline.

====================
 local fiber_clock = lfiber.clock
 local fiber_yield = lfiber.yield
 local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
-local TIMEOUT_INFINITY = lconsts.TIMEOUT_INFINITY
 local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
====================
 local function ref_next_deadline()
     local session = M.session_heap:top()
-    if not session then
-        return fiber_clock() + TIMEOUT_INFINITY
-    end
-    return session.deadline
+    return session and session.deadline or DEADLINE_INFINITY
 end
====================
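
The simplified function relies on the Lua 'a and b or c' idiom. A standalone
plain-Lua sketch of it follows. The heap objects are hypothetical stand-ins,
and the infinite deadline being math.huge is an assumption of the sketch, not
something stated in this thread.

```lua
-- Assumption for this sketch: the infinite deadline is math.huge.
local DEADLINE_INFINITY = math.huge

-- 'heap' is a hypothetical stand-in for the session heap; top() returns the
-- session with the smallest deadline, or nil when there are no sessions.
local function next_deadline(heap)
    local session = heap:top()
    return session and session.deadline or DEADLINE_INFINITY
end

local empty_heap = {top = function() return nil end}
local busy_heap = {top = function() return {deadline = 42.5} end}

local d1 = next_deadline(empty_heap)
local d2 = next_deadline(busy_heap)
-- Adding a finite clock value to an infinite one changes nothing.
local d3 = math.huge + 100
```

The idiom is safe here only because a deadline is always a number. If the
middle operand could be nil or false, the expression would fall through to the
default even for an existing session.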

>> diff --git a/vshard/storage/sched.lua b/vshard/storage/sched.lua
>> new file mode 100644
>> index 0000000..0ac71f4
>> --- /dev/null
>> +++ b/vshard/storage/sched.lua
>> @@ -0,0 +1,231 @@
>> +local function sched_wait_anything(timeout)
>> +    return fiber_cond_wait(M.cond, timeout)
>> +end
>> +
>> +--
>> +-- Return the remaining timeout in case there was a yield. This helps to save
>> +-- current clock get in the caller code if there were no yields.
>> +--
>> +local function sched_ref_start(timeout)
>> +    local deadline = fiber_clock() + timeout
> 
> Let's do it after fast check to eliminate excess fiber_clock call.
> 
> Also there are several similar places below. Please fix them as well.

Good idea, fixed. However there are just 2 such places.

====================
@@ -79,13 +79,13 @@ end
 -- current clock get in the caller code if there were no yields.
 --
 local function sched_ref_start(timeout)
-    local deadline = fiber_clock() + timeout
-    local ok, err
+    local deadline, ok, err
     -- Fast-path. Moves are extremely rare. No need to inc-dec the ref queue
     -- then nor try to start some loops.
     if M.move_count == 0 and M.move_queue == 0 then
         goto success
     end
+    deadline = fiber_clock() + timeout
 
     M.ref_queue = M.ref_queue + 1
 
@@ -132,8 +132,7 @@ end
 -- current clock get in the caller code if there were no yields.
 --
 local function sched_move_start(timeout)
-    local deadline = fiber_clock() + timeout
-    local ok, err, ref_deadline
+    local ok, err, deadline, ref_deadline
     local lref = lregistry.storage_ref
     -- Fast-path. Refs are not extremely rare *when used*. But they are not
     -- expected to be used in a lot of installations. So most of the times the
@@ -141,6 +140,7 @@ local function sched_move_start(timeout)
     if M.ref_count == 0 and M.ref_queue == 0 then
         goto success
     end
+    deadline = fiber_clock() + timeout
 
     M.move_queue = M.move_queue + 1
====================
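
The effect of moving the deadline calculation below the fast path can be
checked with a fake clock that counts its reads. The sketch below is plain Lua
with made-up names; it imitates only the shape of sched_ref_start(), not its
real waiting logic.

```lua
-- Fake clock that counts how often it is read.
local clock_reads = 0
local function fake_clock()
    clock_reads = clock_reads + 1
    return 100
end

-- Hypothetical scheduler state: nothing is moving or queued.
local state = {move_count = 0, move_queue = 0}

local function ref_start(timeout)
    -- Fast path: no conflicting operations, so no deadline is needed.
    if state.move_count == 0 and state.move_queue == 0 then
        return timeout
    end
    -- Slow path: only now pay for the clock read.
    local deadline = fake_clock() + timeout
    -- A real implementation would wait here. The sketch goes straight to
    -- reporting what is left of the timeout.
    return deadline - fake_clock()
end

local fast = ref_start(5)
local reads_after_fast = clock_reads
state.move_count = 1
local slow = ref_start(5)
local reads_after_slow = clock_reads
```

On the fast path the clock is never touched, so the common case costs two
comparisons and a return.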

I also removed some debug code which I forgot to remove the first time:

====================
@@ -18,11 +18,6 @@ local small_timeout = 0.000001
 -- rebalancer.
 --
 
-box.cfg{
-    log = 'log.txt'
-}
--- io.write = function(...) require('log').info(...) end
-
 --
====================

* Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
@ 2021-02-24 22:04     ` Vladislav Shpilevoy via Tarantool-patches
  2021-02-25 12:43       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-24 22:04 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 24.02.2021 11:28, Oleg Babin via Tarantool-patches wrote:
> Thanks a lot for your patch! See 5 comments below.
> 
> 
> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>> Closes #147
> 
> Will read-only map-reduce functions be done in the scope of a separate issue/patch?
> 
> I know about #173, but it seems we need to keep the information about the map_callro function.

Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
pointless. The implementation depends on how exactly to fix 173.

At the moment I haven't even designed it yet. I am thinking about whether I
need to account ro storage refs separately from rw refs in the scheduler or
not. The implementation heavily depends on this.

If I go for considering ro refs separated, it adds third group of operations
to the scheduler complicating it significantly, and a second type of refs to
the refs module obviously.

It would allow sending buckets and marking them as GARBAGE/SENT while keeping
ro refs. But it would complicate the garbage collector a bit, as it would need
to consider storage ro refs.

On the other hand, it would heavily complicate the code in the scheduler.
Like really heavy, I suppose, but I can be wrong.

Also the win will be zeroed if we ever implement rebalancing of writable
buckets. Besides, the main reason I am more into unified type of refs is
that they are used solely for map-reduce. Which means they are taken on all
storages. This means if you have SENDING readable bucket on one storage,
you have RECEIVING non-readable bucket on another storage. Which makes
such ro refs pointless for full cluster scan. Simply won't be able to take
ro ref on the RECEIVING storage.

If we go for unified refs, it would allow to keep the scheduler relatively
simple, but requires urgent fix of 173. Otherwise you can't be sure your
data is consistent. Even now you can't be sure with normal requests, which
terrifies me and raises a huge question - why the fuck nobody cares? I
think I already had a couple of nightmares about 173.

Another issue - ro refs won't stop rebalancer from working and don't
even participate in the scheduler, if we fix 173 in the simplest way -
just invalidate all refs on the replica if a bucket starts moving. If
the rebalancer works actively, nearly all your ro map-reduces will return
errors during data migration because buckets will move constantly. No
throttling via the scheduler.

One way to go - switch all replica weights to defaults for the time of
data migration manually. So map_callro() will go to master nodes and
will participate in the scheduling.

Another way to go - don't care and fix 173 in the simplest way. Weights
anyway are used super rare AFAIK.

Third way to go - implement some smarter fix of 173. I have many ideas here.
But none of them is quick.

>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>> index 97bcb0a..8abd77f 100644
>> --- a/vshard/router/init.lua
>> +++ b/vshard/router/init.lua
>> @@ -44,6 +44,11 @@ if not M then
>>           module_version = 0,
>>           -- Number of router which require collecting lua garbage.
>>           collect_lua_garbage_cnt = 0,
>> +
>> +        ----------------------- Map-Reduce -----------------------
>> +        -- Storage Ref ID. It must be unique for each ref request
>> +        -- and therefore is global and monotonically growing.
>> +        ref_id = 0,
> 
> Maybe 0ULL?

Wouldn't break anyway - doubles are precise until 2^53. But an integer
should be faster I hope.

Changed to 0ULL.

But still not sure. I asked Igor about this. If ULL/LL are more performant,
I will also use them in storage.ref and storage.sched where possible. They
have many counters.
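
To make the 2^53 remark concrete, here is a minimal plain-Lua sketch. It only
illustrates double precision; the `0ULL` form is LuaJIT-specific and is not
used in the code. The increment rate of one million per second is an
assumption picked for illustration.

```lua
-- Doubles represent every integer exactly up to 2^53.
local limit = 2^53

-- Below the limit an increment is still exact.
local below = limit - 2
assert(below + 1 == limit - 1)

-- At the limit the increment is lost: 2^53 + 1 rounds back to 2^53.
local stuck = (limit + 1 == limit)
print(stuck)

-- A counter incremented a million times per second (an assumed rate)
-- would need 2^53 / 1e6 seconds to get there.
local years = limit / 1e6 / (3600 * 24 * 365)
print(math.floor(years))
```

So a double-based ref_id would not realistically wrap or lose precision; the
open question in this thread is only which type is faster.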

>>   @@ -674,6 +679,177 @@ local function router_call(router, bucket_id, opts, ...)
>>                               ...)
>>   end
>>   +local router_map_callrw
>> +
>> +if util.version_is_at_least(1, 10, 0) then
>> +--
>> +-- Consistent Map-Reduce. The given function is called on all masters in the
>> +-- cluster with a guarantee that in case of success it was executed with all
>> +-- buckets being accessible for reads and writes.
>> +--
>> +-- Consistency in scope of map-reduce means all the data was accessible, and
>> +-- didn't move during map requests execution. To preserve the consistency there
>> +-- is a third stage - Ref. So the algorithm is actually Ref-Map-Reduce.
>> +--
>> +-- Refs are broadcast before Map stage to pin the buckets to their storages, and
>> +-- ensure they won't move until maps are done.
>> +--
>> +-- Map requests are broadcast in case all refs are done successfully. They
>> +-- execute the user function + delete the refs to enable rebalancing again.
>> +--
>> +-- On the storages there are additional means to ensure map-reduces don't block
>> +-- rebalancing forever and vice versa.
>> +--
>> +-- The function is not as slow as it may seem - it uses netbox's feature
>> +-- is_async to send refs and maps in parallel. So cost of the function is about
>> +-- 2 network exchanges to the most far storage in terms of time.
>> +--
>> +-- @param router Router instance to use.
>> +-- @param func Name of the function to call.
>> +-- @param args Function arguments passed in netbox style (as an array).
>> +-- @param opts Can only contain 'timeout' as a number of seconds. Note that the
>> +--     refs may end up being kept on the storages during this entire timeout if
>> +--     something goes wrong. For instance, network issues appear. This means
>> +--     better not use a value bigger than necessary. A stuck infinite ref can
>> +--     only be dropped by this router restart/reconnect or the storage restart.
>> +--
>> +-- @return In case of success - a map with replicaset UUID keys and values being
>> +--     what the function returned from the replicaset.
>> +--
>> +-- @return In case of an error - nil, error object, optional UUID of the
>> +--     replicaset where the error happened. UUID may be not present if it wasn't
>> +--     about concrete replicaset. For example, not all buckets were found even
>> +--     though all replicasets were scanned.
>> +--
>> +router_map_callrw = function(router, func, args, opts)
>> +    local replicasets = router.replicasets
> 
> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.

Information on the router may be outdated. Even if bucket_count is 0,
it may still mean the discovery didn't get to there yet. Or the
discovery is simply disabled or dead due to a bug.

Weight 0 also does not say anything certain about buckets on the
storage. It can be that config on the router is outdated, or it is
outdated on the storage. Or someone simply does not set the weights
for router configs because they are not used here. Or the weights are
really 0 and set everywhere, but rebalancing is in progress - buckets
move from the storage slowly, and the scheduler allows to squeeze some
map-reduces in the meantime.

> In case if such "dummy" replicasets are disabled we get an error "connection refused".

I need more info here. What is 'disabled' replicaset? And why would
0 discovered buckets or 0 weight lead to the refused connection?

In case this is some cartridge specific shit, this is bad. As you
can see above, I can't ignore such replicasets. I need to send requests
to them anyway.

There is a workaround though - even if an error has occurred, continue
execution if even without the failed storages I still cover all the buckets.
Then having 'disabled' replicasets in the config would result in some
unnecessary faulty requests for each map call, but they would work.

Although I don't know how to 'return' such errors. I don't want to log them
on each request, and don't have a concept of a 'warning' object or something
like this. Weird option - in case of this uncertain success return the
result, error, uuid. So the function could return

- nil, err[, uuid] - fail;
- res              - success;
- res, err[, uuid] - success but with a suspicious issue;
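
The three outcomes listed above could be told apart roughly like this. This is
a sketch under assumptions, not the patch code: `map_finish`, the shape of
`replies` and the error strings are all hypothetical, and the only rule taken
from the text is "succeed if the answered storages still cover all buckets".

```lua
-- Hypothetical helper, not part of vshard: decide what a map call returns
-- when some storages failed but the rest may still cover all buckets.
-- replies: {[uuid] = {bucket_count = N, result = value} | {err = value}}
local function map_finish(replies, total_bucket_count)
    local res, covered = {}, 0
    local first_err, first_err_uuid
    for uuid, reply in pairs(replies) do
        if reply.err ~= nil then
            -- Remember one failure to report it to the caller.
            if first_err == nil then
                first_err, first_err_uuid = reply.err, uuid
            end
        else
            covered = covered + reply.bucket_count
            res[uuid] = reply.result
        end
    end
    if covered ~= total_bucket_count then
        -- Fail: some buckets were not visited.
        if first_err == nil then
            return nil, 'not all buckets were found'
        end
        return nil, first_err, first_err_uuid
    end
    -- Success, possibly with a suspicious issue attached.
    return res, first_err, first_err_uuid
end

-- An empty 'disabled' replicaset fails, but all 10 buckets are covered.
local res, err, uuid = map_finish({
    rs1 = {bucket_count = 10, result = 'a'},
    rs2 = {err = 'Connection refused'},
}, 10)
```

A caller would then check `res == nil` for a hard failure and treat a non-nil
`err` next to a non-nil `res` as a warning.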

>> +    local timeout = opts and opts.timeout or consts.CALL_TIMEOUT_MIN
>> +    local deadline = fiber_clock() + timeout
>> +    local err, err_uuid, res, ok, map
>> +    local futures = {}
>> +    local bucket_count = 0
>> +    local opts_async = {is_async = true}
>> +    local rs_count = 0
>> +    local rid = M.ref_id
>> +    M.ref_id = rid + 1
>> +    -- Nil checks are done explicitly here (== nil instead of 'not'), because
>> +    -- netbox requests return box.NULL instead of nils.
>> +
>> +    --
>> +    -- Ref stage: send.
>> +    --
>> +    for uuid, rs in pairs(replicasets) do
>> +        -- Netbox async requests work only with active connections. Need to wait
>> +        -- for the connection explicitly.
>> +        timeout, err = rs:wait_connected(timeout)
>> +        if timeout == nil then
>> +            err_uuid = uuid
>> +            goto fail
>> +        end
>> +        res, err = rs:callrw('vshard.storage._call',
>> +                              {'storage_ref', rid, timeout}, opts_async)
>> +        if res == nil then
>> +            err_uuid = uuid
>> +            goto fail
>> +        end
>> +        futures[uuid] = res
>> +        rs_count = rs_count + 1
>> +    end
>> +    map = table_new(0, rs_count)
>> +    --
>> +    -- Ref stage: collect.
>> +    --
>> +    for uuid, future in pairs(futures) do
>> +        res, err = future:wait_result(timeout)
>> +        -- Handle netbox error first.
>> +        if res == nil then
>> +            err_uuid = uuid
>> +            goto fail
>> +        end
>> +        -- Ref returns nil,err or bucket count.
>> +        res, err = unpack(res)
> 
> Seems `res, err = res[1], res[2]` could be a bit faster.

Indeed. Applied:

====================
@@ -767,7 +767,7 @@ router_map_callrw = function(router, func, args, opts)
             goto fail
         end
         -- Ref returns nil,err or bucket count.
-        res, err = unpack(res)
+        res, err = res[1], res[2]
         if res == nil then
             err_uuid = uuid
====================
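
For a reply that holds at most two values, both forms yield the same pair; the
indexed form just avoids a function call. A small plain-Lua sketch, where
`reply` is a made-up stand-in for a decoded ref response:

```lua
-- unpack moved to table.unpack in Lua 5.2+; LuaJIT keeps the global.
local unpack = table.unpack or unpack

-- A stand-in for a decoded ref reply: {bucket_count} on success.
local reply = {3000}

local a, b = unpack(reply)
local c, d = reply[1], reply[2]
assert(a == c and b == d)
```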

On an unrelated note, I gave more thought to your idea of
doing a 'map-reduce' but not on the whole cluster. For this we
could introduce 'sca_callrw/sca_callro', which means 'scatter'.

They could send your requests to the given replicasets (or all by
default) and return whatever is back. Just a wrapper on top of a
couple of loops with is_async netbox calls. Not sure if need to
take storage refs for the calls. Their usage might be not related
to buckets really. Just to access the storages and their own local
data, not sharded data.

In case you want scattering for accessing a set of buckets, I
could add `bat_callrw/bat_callro`. Which means 'batch'. They would
take a set of bucket ids, match them to replicasets, go to each
replicaset just one time for all its buckets, take storage ref
there, and execute your function. Or ref all the buckets individually
but I don't know what to do if their number is big. Would increase
request size, and referencing overhead too much.

To attach some context to the buckets bat_call could accept pairs
{bucket_id = {func, args}}. Or accept one function name and for
each bucket have {bucket_id = {args}}. And bat_call will call your
function with the given args on the storage. Maybe with such approach
individual refs make more sense. Don't know.
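
The grouping step of such a batch call could look roughly like this. It is
only a sketch of the idea: `group_by_replicaset` and the `route` table
(bucket id to replicaset UUID) are hypothetical names, and the refs and the
actual network calls are left out.

```lua
-- Hypothetical sketch of the grouping step of a 'bat_call'.
-- route maps bucket_id -> replicaset uuid and is assumed to be known
-- to the caller. batch maps bucket_id -> args for that bucket.
local function group_by_replicaset(route, batch)
    local groups = {}
    for bucket_id, args in pairs(batch) do
        local uuid = route[bucket_id]
        if uuid == nil then
            return nil, 'no route for bucket ' .. tostring(bucket_id)
        end
        local group = groups[uuid]
        if group == nil then
            group = {}
            groups[uuid] = group
        end
        group[bucket_id] = args
    end
    return groups
end

-- Buckets 1 and 2 live on rs1, bucket 3 on rs2: two requests in total.
local groups = group_by_replicaset(
    {[1] = 'rs1', [2] = 'rs1', [3] = 'rs2'},
    {[1] = {'x'}, [3] = {'y'}, [2] = {'z'}})
```

Each entry of `groups` would then become one request to its replicaset, which
is what keeps the request count independent of the number of buckets.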

If you like these ideas, you can file tickets for them so as they
wouldn't be lost and maybe eventually would be implemented.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout()
  2021-02-24 21:46     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-25 12:42 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your answers, consider my comments below.


On 25.02.2021 00:46, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 24.02.2021 11:27, Oleg Babin wrote:
>> Hi! Thanks for your patch.
>>
>> Personally, I vote for dropping 1.9 support (it's already broken - #256).
> It kind of works. Just with some spam in the logs. But yeah, I would like
> to finally drop it. I talked to Mons and he approved this:
> https://github.com/tarantool/vshard/issues/267

Good news!

>> But if you want to eliminate "long and ugly" ways you could do something like:
>>
>>
>> ```
>>
>> local make_timeout
>> if box.error.new ~= nil then
>>      make_timeout = function() return box.error.new(box.error.TIMEOUT) end
>> else
>>      make_timeout = function() return select(2, pcall(...)) end
>> end
>>
>> ```
> Which is longer and uglier, right? Where is the win?

Not sure it's important for errors but pcall is a bit slower.


```

local clock = require('clock')

local start = clock.time()
for _ = 1,1e5 do
     pcall(box.error, box.error.TIMEOUT)
end
print(clock.time() - start)

local start = clock.time()
for _ = 1,1e5 do
     box.error.new(box.error.TIMEOUT)
end
print(clock.time() - start)

```

0.22471904754639
0.12087297439575

(on my Mac)


Feel free to ignore. LGTM.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count
  2021-02-24 21:47     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-25 12:42 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your answer. You are right, let's not overcomplicate this task.

On 25.02.2021 00:47, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 24.02.2021 11:27, Oleg Babin wrote:
>> Thanks for your patch! LGTM.
>>
> >> I see calls like "status_index:count({consts.BUCKET.ACTIVE})". Maybe it is
> >> worth caching the whole bucket stats as well?
> I thought about it a lot. But realized that I need only a few cached
> metrics used for most of the requests. Count of active buckets is not
> one of them, but would waste time on invalidating the cache on each
> generation update.
>
> Talking specifically, count({consts.BUCKET.ACTIVE}) is used by
> rebalancer only which happens extremely rare. So there is no win in
> optimizing it for normal cluster operation.
>
> Even now I worry about doing too much in the generation increment
> trigger. To calculate and keep the stat up to date I would need to
> make it more universal. So for example store number of buckets of
> each type. Then I face the issues:
>
> - In on_replace trigger I need to extract bucket status from the old
>    and new tuple, update the relevant counters. I mostly worry about
>    extracting the statuses (too long).
>
> - I need to handle the rollback to somehow revert the counters back.
>
> I could do something similar to the cache in this patch (I simply
> calculate the counts on demand and invalidate them all on each generation
> update), but it does not fix the real issue with the counts - they
> can be long if bucket count is millions, and the cache will be invalidated
> a lot during rebalancing. Exactly when a cache could help most.
>
> In the end I decided not to bother with this now in scope of map-reduce.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait()
  2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-25 12:42 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your answer!

On 25.02.2021 00:48, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 24.02.2021 11:27, Oleg Babin wrote:
>> Hi! Thanks for your patch. LGTM.
>>
>> I see several usages of cond:wait() in code. Maybe after introducing this helper you could use it.
>>
>> E.g. in "bucket_send_xc" function.
> Yeah, I checked the existing usages, but it is fine as is
> there.
>
> In bucket_send_xc() it is ok to throw. This is why it is 'xc' -
> 'exception'. It is ok, because there are other ops which can
> throw, and eventually I decided not to wrap them all into pcalls.
>
> In global fibers I kept the normal waits, because its throw here
> is fine. They don't have any finalization work to do after the waits.

Ok.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-24 21:49     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-25 12:42 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your answers.

On 25.02.2021 00:49, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 24.02.2021 11:28, Oleg Babin wrote:
> >> Thanks for your patch. It's a brief review - I hope I'll look once again at this patch.
>>
>> Consider a question below.
> A huge request - could you please remove the irrelevant parts of
> the original emails from your responses? I need to scroll tons of
> text to find your comments, and can accidentally miss some.
>
> Or at least provide some markers I could grep and jump to quickly.

No problem. I'll remove irrelevant parts.


>>> diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
>>> index 3f4ed43..7c1e97d 100644
>>> --- a/vshard/storage/CMakeLists.txt
>>> +++ b/vshard/storage/CMakeLists.txt
>>> @@ -1,2 +1,2 @@
>>> -install(FILES init.lua reload_evolution.lua
>>> +install(FILES init.lua reload_evolution.lua ref.lua
>>>            DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
>>> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
>>> index c3ed236..2957f48 100644
>>> --- a/vshard/storage/init.lua
>>> +++ b/vshard/storage/init.lua
>>> @@ -1140,6 +1142,9 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>>>                return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
>>>                                          from)
>>>            end
>>> +        if lref.count > 0 then
>>> +            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
>>> +        end
>> You will remove this part in the next patch. Do you really need it? Or you add it just for tests?
> For the tests and for making the patch atomic. So as it wouldn't depend on the next
> patch.
>
Ok. Thanks for your explanation.


>>>            if is_this_replicaset_locked() then
>>>                return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
>>>            end
>>> @@ -1441,6 +1446,9 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
>>>          local _bucket = box.space._bucket
>>>        local bucket = _bucket:get({bucket_id})
>>> +    if lref.count > 0 then
>>> +        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
>>> +    end
>> Ditto.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-24 22:04     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-02-25 12:43       ` Oleg Babin via Tarantool-patches
  2021-02-26 23:58         ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-02-25 12:43 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your detailed explanation. See my comments/answers below.

On 25.02.2021 01:04, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 24.02.2021 11:28, Oleg Babin via Tarantool-patches wrote:
>> Thanks a lot for your patch! See 5 comments below.
>>
>>
>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>> Closes #147
>> Will read-only map-reduce functions be done in the scope of separate issue/patch?
>>
>> I know about #173 but seems we need to keep information about map_callro function.
> Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
> pointless. The implementation depends on how exactly to fix 173.
>
> At this moment I didn't even design it yet. I am thinking if I need to
> account ro storage refs separated from rw refs in the scheduler or not.
> Implementation heavily depends on this.
>
> If I go for considering ro refs separated, it adds third group of operations
> to the scheduler complicating it significantly, and a second type of refs to
> the refs module obviously.
>
> It would allow to send buckets and mark them as GARBAGE/SENT while keep ro
> refs. But would complicate the garbage collector a bit as it would need to
> consider storage ro refs.
>
> On the other hand, it would heavily complicate the code in the scheduler.
> Like really heavy, I suppose, but I can be wrong.
>
> Also the win will be zeroed if we ever implement rebalancing of writable
> buckets. Besides, the main reason I am more into unified type of refs is
> that they are used solely for map-reduce. Which means they are taken on all
> storages. This means if you have SENDING readable bucket on one storage,
> you have RECEIVING non-readable bucket on another storage. Which makes
> such ro refs pointless for full cluster scan. Simply won't be able to take
> ro ref on the RECEIVING storage.
>
> If we go for unified refs, it would allow to keep the scheduler relatively
> simple, but requires urgent fix of 173. Otherwise you can't be sure your
> data is consistent. Even now you can't be sure with normal requests, which
> terrifies me and raises a huge question - why the fuck nobody cares? I
> think I already had a couple of nightmares about 173.

Yes, it's a problem, but rebalancing is not a frequent operation. So, in some
cases routeall() was enough. Anyway, we didn't have any alternatives. But
map-reduce is a really frequent operation - any request over a secondary index,
and you have to scan the whole cluster.

> Another issue - ro refs won't stop rebalancer from working and don't
> even participate in the scheduler, if we fix 173 in the simplest way -
> just invalidate all refs on the replica if a bucket starts moving. If
> the rebalancer works actively, nearly all your ro map-reduces will return
> errors during data migration because buckets will move constantly. No
> throttling via the scheduler.
>
> One way to go - switch all replica weights to defaults for the time of
> data migration manually. So map_callro() will go to master nodes and
> will participate in the scheduling.
>
> Another way to go - don't care and fix 173 in the simplest way. Weights
> anyway are used super rare AFAIK.
>
> Third way to go - implement some smarter fix of 173. I have many ideas here.
> But neither of them are quick.
>
>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>> index 97bcb0a..8abd77f 100644
>>> --- a/vshard/router/init.lua
>>> +++ b/vshard/router/init.lua
>>> @@ -44,6 +44,11 @@ if not M then
>>>            module_version = 0,
>>>            -- Number of router which require collecting lua garbage.
>>>            collect_lua_garbage_cnt = 0,
>>> +
>>> +        ----------------------- Map-Reduce -----------------------
>>> +        -- Storage Ref ID. It must be unique for each ref request
>>> +        -- and therefore is global and monotonically growing.
>>> +        ref_id = 0,
>> Maybe 0ULL?
> Wouldn't break anyway - doubles are precise until 2^53. But an integer
> should be faster I hope.
>
> Changed to 0ULL.
>
> But still not sure. I asked Igor about this. If ULL/LL are more performant,
> I will also use them in storage.ref and storage.sched where possible. They
> have many counters.
>
>>>    @@ -674,6 +679,177 @@ local function router_call(router, bucket_id, opts, ...)
>>>                                ...)
>>>    end
>>>    +local router_map_callrw
>>> +
>>> +if util.version_is_at_least(1, 10, 0) then
>>> +--
>>> +-- Consistent Map-Reduce. The given function is called on all masters in the
>>> +-- cluster with a guarantee that in case of success it was executed with all
>>> +-- buckets being accessible for reads and writes.
>>> +--
>>> +-- Consistency in scope of map-reduce means all the data was accessible, and
>>> +-- didn't move during map requests execution. To preserve the consistency there
>>> +-- is a third stage - Ref. So the algorithm is actually Ref-Map-Reduce.
>>> +--
>>> +-- Refs are broadcast before Map stage to pin the buckets to their storages, and
>>> +-- ensure they won't move until maps are done.
>>> +--
>>> +-- Map requests are broadcast in case all refs are done successfully. They
>>> +-- execute the user function + delete the refs to enable rebalancing again.
>>> +--
>>> +-- On the storages there are additional means to ensure map-reduces don't block
>>> +-- rebalancing forever and vice versa.
>>> +--
>>> +-- The function is not as slow as it may seem - it uses netbox's feature
>>> +-- is_async to send refs and maps in parallel. So cost of the function is about
>>> +-- 2 network exchanges to the most far storage in terms of time.
>>> +--
>>> +-- @param router Router instance to use.
>>> +-- @param func Name of the function to call.
>>> +-- @param args Function arguments passed in netbox style (as an array).
>>> +-- @param opts Can only contain 'timeout' as a number of seconds. Note that the
>>> +--     refs may end up being kept on the storages during this entire timeout if
>>> +--     something goes wrong. For instance, network issues appear. This means
>>> +--     better not use a value bigger than necessary. A stuck infinite ref can
>>> +--     only be dropped by this router restart/reconnect or the storage restart.
>>> +--
>>> +-- @return In case of success - a map with replicaset UUID keys and values being
>>> +--     what the function returned from the replicaset.
>>> +--
>>> +-- @return In case of an error - nil, error object, optional UUID of the
>>> +--     replicaset where the error happened. UUID may be not present if it wasn't
>>> +--     about concrete replicaset. For example, not all buckets were found even
>>> +--     though all replicasets were scanned.
>>> +--
>>> +router_map_callrw = function(router, func, args, opts)
>>> +    local replicasets = router.replicasets
>> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.
> Information on the router may be outdated. Even if bucket_count is 0,
> it may still mean the discovery didn't get to there yet. Or the
> discovery is simply disabled or dead due to a bug.
>
> Weight 0 also does not say anything certain about buckets on the
> storage. It can be that config on the router is outdated, or it is
> outdated on the storage. Or someone simply does not set the weights
> for router configs because they are not used here. Or the weights are
> really 0 and set everywhere, but rebalancing is in progress - buckets
> move from the storage slowly, and the scheduler allows to squeeze some
> map-reduces in the meantime.
>
>> In case if such "dummy" replicasets are disabled we get an error "connection refused".
> I need more info here. What is 'disabled' replicaset? And why would
> 0 discovered buckets or 0 weight lead to the refused connection?
>
> In case this is some cartridge specific shit, this is bad. As you
> can see above, I can't ignore such replicasets. I need to send requests
> to them anyway.
>
> There is a workaround though - even if an error has occurred, continue
> execution if even without the failed storages I still cover all the buckets.
> Then having 'disabled' replicasets in the config would result in some
> unnecessary faulty requests for each map call, but they would work.
>
> Although I don't know how to 'return' such errors. I don't want to log them
> on each request, and don't have a concept of a 'warning' object or something
> like this. Weird option - in case of this uncertain success return the
> result, error, uuid. So the function could return
>
> - nil, err[, uuid] - fail;
> - res              - success;
> - res, err[, uuid] - success but with a suspicious issue;
>
Seems I didn't fully state the problem. Maybe it's not even a relevant issue,
and users that do it just create their own problems. But there was a case when
our customers added a new replicaset (in fact a single instance) to the
cluster. This instance didn't have any data and had weight=0.

Then they just turned off this instance after some time. And all requests that
perform map-reduce started to fail with a "Connection refused" error.

It raises a question: "Why do our requests fail if we disable an instance that
doesn't have any data?"

Yes, the weight=0 requirement is obviously not enough - because if rebalancing
is in progress, a replicaset with weight=0 could still contain some data.


As for the opportunity to finish requests if someone failed - I'm not sure
that it's really needed. Usually we don't need a partial result (especially if
it adds some workload).


Regarding the question of errors. Cartridge has an example of an error object
for the map_call function. For errors cartridge creates an "object" that
contains info about all errors that were returned. It's not an excellent
solution, but I think Yaroslav could give some thoughts here.


https://github.com/tarantool/cartridge/blob/cf195bc9576eb460d66d609c357bec5014a90d21/cartridge/pool.lua#L235


>>> +    local timeout = opts and opts.timeout or consts.CALL_TIMEOUT_MIN
>>> +    local deadline = fiber_clock() + timeout
>>> +    local err, err_uuid, res, ok, map
>>> +    local futures = {}
>>> +    local bucket_count = 0
>>> +    local opts_async = {is_async = true}
>>> +    local rs_count = 0
>>> +    local rid = M.ref_id
>>> +    M.ref_id = rid + 1
>>> +    -- Nil checks are done explicitly here (== nil instead of 'not'), because
>>> +    -- netbox requests return box.NULL instead of nils.
>>> +
>>> +    --
>>> +    -- Ref stage: send.
>>> +    --
>>> +    for uuid, rs in pairs(replicasets) do
>>> +        -- Netbox async requests work only with active connections. Need to wait
>>> +        -- for the connection explicitly.
>>> +        timeout, err = rs:wait_connected(timeout)
>>> +        if timeout == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        res, err = rs:callrw('vshard.storage._call',
>>> +                              {'storage_ref', rid, timeout}, opts_async)
>>> +        if res == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        futures[uuid] = res
>>> +        rs_count = rs_count + 1
>>> +    end
>>> +    map = table_new(0, rs_count)
>>> +    --
>>> +    -- Ref stage: collect.
>>> +    --
>>> +    for uuid, future in pairs(futures) do
>>> +        res, err = future:wait_result(timeout)
>>> +        -- Handle netbox error first.
>>> +        if res == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        -- Ref returns nil,err or bucket count.
>>> +        res, err = unpack(res)
>> Seems `res, err = res[1], res[2]` could be a bit faster.
> Indeed. Applied:
>
> ====================
> @@ -767,7 +767,7 @@ router_map_callrw = function(router, func, args, opts)
>               goto fail
>           end
>           -- Ref returns nil,err or bucket count.
> -        res, err = unpack(res)
> +        res, err = res[1], res[2]
>           if res == nil then
>               err_uuid = uuid
> ====================
>
> On a not related note, I gave more thought to your idea with
> doing a 'map-reduce' but not on the whole cluster. And for this we
> could introduce 'sca_callrw/sca_callro'. Which mean 'scatter'.
>
> They could send your requests to the given replicasets (or all by
> default) and return whatever is back. Just a wrapper on top of a
> couple of loops with is_async netbox calls. Not sure if need to
> take storage refs for the calls. Their usage might be not related
> to buckets really. Just to access the storages and their own local
> data, not sharded data.
>
> In case you want scattering for accessing a set of buckets, I
> could add `bat_callrw/bat_callro`. Which means 'batch'. They would
> take a set of bucket ids, match them to replicasets, go to each
> replicaset just one time for all its buckets, take storage ref
> there, and execute your function. Or ref all the buckets individually
> but I don't know what to do if their number is big. Would increase
> request size, and referencing overhead too much.
>
> To attach some context to the buckets bat_call could accept pairs
> {bucket_id = {func, args}}. Or accept one function name and for
> each bucket have {bucket_id = {args}}. And bat_call will call your
> function with the given args on the storage. Maybe with such approach
> individual refs make more sense. Don't know.
>
> If you like these ideas, you can file tickets for them so as they
> wouldn't be lost and maybe eventually would be implemented.

Batch requests are quite a common case, and it seems an issue is already
filed: https://github.com/tarantool/vshard/issues/176.

I'll think about "sca_callrw/sca_callro"; if I find some use cases, I'll
file an issue. But currently it seems my thoughts about scatter operations
are closely related to batch requests.
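
To make the batching idea a bit more concrete, here is a minimal sketch of
the grouping step only, in plain Lua. The names `group_buckets_by_replicaset`
and `route_map` are hypothetical, and the bucket-to-replicaset mapping is just
a table here; this is not how the router stores its routes.

```lua
-- Hypothetical sketch: group bucket ids by the replicaset that owns them,
-- so that each replicaset is visited once for all of its buckets.
-- `route_map` is an assumed table {bucket_id = replicaset_uuid}.
local function group_buckets_by_replicaset(route_map, bucket_ids)
    local batches = {}
    local unknown = {}
    for _, bucket_id in ipairs(bucket_ids) do
        local uuid = route_map[bucket_id]
        if uuid == nil then
            -- The owner is not known; the caller decides what to do.
            table.insert(unknown, bucket_id)
        else
            local batch = batches[uuid]
            if batch == nil then
                batch = {}
                batches[uuid] = batch
            end
            table.insert(batch, bucket_id)
        end
    end
    return batches, unknown
end

local route_map = {[1] = 'rs-a', [2] = 'rs-b', [3] = 'rs-a'}
local batches, unknown = group_buckets_by_replicaset(route_map, {1, 2, 3, 4})
```

With one batch per replicaset, the request size grows with the number of
buckets, but the number of network calls is bounded by the number of
replicasets.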




* Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-25 12:43       ` Oleg Babin via Tarantool-patches
@ 2021-02-26 23:58         ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-01 10:58           ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-02-26 23:58 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

>>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>>> Closes #147
>>> Will read-only map-reduce functions be done in the scope of separate issue/patch?
>>>
>>> I know about #173 but seems we need to keep information about map_callro function.
>> Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
>> pointless. The implementation depends on how exactly to fix 173.
>>
>> At this moment I didn't even design it yet. I am thinking if I need to
>> account ro storage refs separated from rw refs in the scheduler or not.
>> Implementation heavily depends on this.
>>
>> If I go for considering ro refs separated, it adds third group of operations
>> to the scheduler complicating it significantly, and a second type of refs to
>> the refs module obviously.
>>
>> It would allow to send buckets and mark them as GARBAGE/SENT while keep ro
>> refs. But would complicate the garbage collector a bit as it would need to
>> consider storage ro refs.
>>
>> On the other hand, it would heavily complicate the code in the scheduler.
>> Like really heavy, I suppose, but I can be wrong.
>>
>> Also the win will be zeroed if we ever implement rebalancing of writable
>> buckets. Besides, the main reason I am more into unified type of refs is
>> that they are used solely for map-reduce. Which means they are taken on all
>> storages. This means if you have SENDING readable bucket on one storage,
>> you have RECEIVING non-readable bucket on another storage. Which makes
>> such ro refs pointless for full cluster scan. Simply won't be able to take
>> ro ref on the RECEIVING storage.
>>
>> If we go for unified refs, it would allow to keep the scheduler relatively
>> simple, but requires urgent fix of 173. Otherwise you can't be sure your
>> data is consistent. Even now you can't be sure with normal requests, which
>> terrifies me and raises a huge question - why the fuck nobody cares? I
>> think I already had a couple of nightmares about 173.
> 
> Yes, it's a problem, but rebalancing is not a frequent operation. So in some
> cases routeall() was enough. Anyway, we didn't have any alternatives. But
> map-reduce is a really frequent operation: any request over a secondary
> index means you have to scan the whole cluster.

Have you tried building secondary indexes over buckets? There is an
algorithm, in case it is something perf sensitive.

You can store the secondary index in another space and shard it
independently. And there are ways to deal with the inability to atomically
update it and the main space together. So almost always you will have at
most 2 network hops to at most 2 nodes to find the primary key, regardless
of cluster size.
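
A toy illustration of the two-hop lookup, in plain Lua. Everything here is an
assumption made for the sketch: shards are plain tables, `shard_of` is a
made-up hash, and the non-atomic update of the two spaces is not handled.

```lua
local SHARD_COUNT = 4

-- Toy shard function: sum of the key bytes modulo the shard count.
local function shard_of(key)
    local sum = 0
    for i = 1, #key do
        sum = sum + key:byte(i)
    end
    return sum % SHARD_COUNT + 1
end

local shards = {}
for i = 1, SHARD_COUNT do
    shards[i] = {main = {}, index = {}}
end

local function insert(pk, email, data)
    -- The main space is sharded by the primary key.
    shards[shard_of(pk)].main[pk] = {email = email, data = data}
    -- The index space is sharded independently, by the secondary key.
    -- In a real cluster these two writes are not atomic.
    shards[shard_of(email)].index[email] = pk
end

local function find_by_email(email)
    -- Hop 1: the node owning the secondary key.
    local hops = 1
    local pk = shards[shard_of(email)].index[email]
    if pk == nil then
        return nil, hops
    end
    -- Hop 2: the node owning the primary key.
    hops = hops + 1
    return shards[shard_of(pk)].main[pk], hops
end

insert('user:1', 'a@example.com', 'first')
insert('user:2', 'b@example.com', 'second')
local row, hops = find_by_email('b@example.com')
```

The number of hops does not depend on `SHARD_COUNT`, which is the point of
the approach.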

>> Another issue - ro refs won't stop rebalancer from working and don't
>> even participate in the scheduler, if we fix 173 in the simplest way -
>> just invalidate all refs on the replica if a bucket starts moving. If
>> the rebalancer works actively, nearly all your ro map-reduces will return
>> errors during data migration because buckets will move constantly. No
>> throttling via the scheduler.
>>
>> One way to go - switch all replica weights to defaults for the time of
>> data migration manually. So map_callro() will go to master nodes and
>> will participate in the scheduling.
>>
>> Another way to go - don't care and fix 173 in the simplest way. Weights
>> anyway are used super rare AFAIK.
>>
>> Third way to go - implement some smarter fix of 173. I have many ideas here.
>> But neither of them are quick.
>>
>>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>>> index 97bcb0a..8abd77f 100644
>>>> --- a/vshard/router/init.lua
>>>> +++ b/vshard/router/init.lua
>>>> @@ -44,6 +44,11 @@ if not M then
>>>>            module_version = 0,
>>>>            -- Number of router which require collecting lua garbage.
>>>>            collect_lua_garbage_cnt = 0,
>>>> +
>>>> +        ----------------------- Map-Reduce -----------------------
>>>> +        -- Storage Ref ID. It must be unique for each ref request
>>>> +        -- and therefore is global and monotonically growing.
>>>> +        ref_id = 0,
>>> Maybe 0ULL?
>> Wouldn't break anyway - doubles are precise until 2^53. But an integer
>> should be faster I hope.
>>
>> Changed to 0ULL.

I reverted it back. Asked Igor and he reminded me it is cdata. So it
involves heavy stuff with metatables and shit. It is cheaper to simply
increment the plain number. I didn't measure though.
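
For the record, a small sketch of what the plain number gives us. It only
relies on numbers being doubles; `next_ref_id` is a hypothetical helper for
the illustration, not a function from the patch.

```lua
-- A plain number counter: no cdata is involved.
local ref_id = 0
local function next_ref_id()
    ref_id = ref_id + 1
    return ref_id
end

local first = next_ref_id()
local second = next_ref_id()

-- Doubles represent integers exactly up to 2^53. Past that point an
-- increment by one is lost, because 2^53 + 1 is not representable.
local limit = 2^53
local still_exact = ((limit - 1) + 1 == limit)
local increment_lost = (limit + 1 == limit)
```

At one million refs per second it would take roughly 285 years to reach 2^53,
so the precision limit should not matter in practice.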

>>>> +router_map_callrw = function(router, func, args, opts)
>>>> +    local replicasets = router.replicasets
>>> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.
>> Information on the router may be outdated. Even if bucket_count is 0,
>> it may still mean the discovery didn't get to there yet. Or the
>> discovery is simply disabled or dead due to a bug.
>>
>> Weight 0 also does not say anything certain about buckets on the
>> storage. It can be that config on the router is outdated, or it is
>> outdated on the storage. Or someone simply does not set the weights
>> for router configs because they are not used here. Or the weights are
>> really 0 and set everywhere, but rebalancing is in progress - buckets
>> move from the storage slowly, and the scheduler allows to squeeze some
>> map-reduces in the meantime.
>>
>>> In case if such "dummy" replicasets are disabled we get an error "connection refused".
>> I need more info here. What is 'disabled' replicaset? And why would
>> 0 discovered buckets or 0 weight lead to the refused connection?
>>
>> In case this is some cartridge specific shit, this is bad. As you
>> can see above, I can't ignore such replicasets. I need to send requests
>> to them anyway.
>>
>> There is a workaround though - even if an error has occurred, continue
>> execution if even without the failed storages I still cover all the buckets.
>> Then having 'disabled' replicasets in the config would result in some
>> unnecessary faulty requests for each map call, but they would work.
>>
>> Although I don't know how to 'return' such errors. I don't want to log them
>> on each request, and don't have a concept of a 'warning' object or something
>> like this. Weird option - in case of this uncertain success return the
>> result, error, uuid. So the function could return
>>
>> - nil, err[, uuid] - fail;
>> - res              - success;
>> - res, err[, uuid] - success but with a suspicious issue;
>>
> Seems I didn't fully state the problem. Maybe it's not even a relevant issue,
> and users that do this just create their own problems. But there was a case
> when our customers added a new replicaset (in fact a single instance) to the
> cluster. This instance didn't have any data and had weight=0.
>
> Then they just turned this instance off after some time, and all requests
> that perform map-reduce started to fail with a "Connection refused" error.
>
> It raises a question: "Why do our requests fail if we disable an instance
> that doesn't have any data?"
>
> Yes, the weight=0 requirement is obviously not enough, because if
> rebalancing is in progress, a replicaset with weight=0 could still contain
> some data.
>
> As for the opportunity to finish requests if some of them failed, I'm not
> sure it's really needed. Usually we don't need a partial result (moreover
> if it adds some workload).

I want to emphasize - it won't be partial. It will be a full result. In case
the 'disabled' node does not have any data, the requests to the other nodes
will see that they covered all the buckets. So the loss of this 'disabled'
node is not critical.

Map-Reduce, at least how I understood it, is about providing access to all
buckets. You can succeed at this even if you didn't manage to look into each
node, as long as the nodes that did not respond didn't have any data.

But it sounds like a crutch for an outdated config. If this is not something
regular I need to support, I'd better leave it as is for now. A normal fail.
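
A rough sketch of the coverage check described above, assuming each
successful ref reply carries the bucket count of its replicaset. The name
`is_full_coverage` and the shape of `replies` are hypothetical, used only for
the illustration.

```lua
-- Hypothetical sketch: decide whether the replicasets that responded cover
-- all the buckets, even though some replicasets failed.
-- `replies` is an assumed table {uuid = bucket_count or false}, where false
-- means the replicaset did not respond.
local function is_full_coverage(replies, total_bucket_count)
    local covered = 0
    local failed = {}
    for uuid, bucket_count in pairs(replies) do
        if bucket_count then
            covered = covered + bucket_count
        else
            table.insert(failed, uuid)
        end
    end
    return covered == total_bucket_count, failed
end

-- The 'disabled' replicaset had no data: the others cover everything.
local ok, failed = is_full_coverage(
    {['rs-a'] = 1500, ['rs-b'] = 1500, ['rs-c'] = false}, 3000)

-- Some buckets are missing: the failure must be reported as a normal error.
local ok2 = is_full_coverage(
    {['rs-a'] = 1500, ['rs-b'] = 1000, ['rs-c'] = false}, 3000)
```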


* Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()
  2021-02-26 23:58         ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-01 10:58           ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-01 10:58 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi!


On 27.02.2021 02:58, Vladislav Shpilevoy wrote:
>>>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>>>> Closes #147
>>>> Will read-only map-reduce functions be done in the scope of separate issue/patch?
>>>>
>>>> I know about #173 but seems we need to keep information about map_callro function.
>>> Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
>>> pointless. The implementation depends on how exactly to fix 173.
>>>
>>> At this moment I didn't even design it yet. I am thinking if I need to
>>> account ro storage refs separated from rw refs in the scheduler or not.
>>> Implementation heavily depends on this.
>>>
>>> If I go for considering ro refs separated, it adds third group of operations
>>> to the scheduler complicating it significantly, and a second type of refs to
>>> the refs module obviously.
>>>
>>> It would allow to send buckets and mark them as GARBAGE/SENT while keep ro
>>> refs. But would complicate the garbage collector a bit as it would need to
>>> consider storage ro refs.
>>>
>>> On the other hand, it would heavily complicate the code in the scheduler.
>>> Like really heavy, I suppose, but I can be wrong.
>>>
>>> Also the win will be zeroed if we ever implement rebalancing of writable
>>> buckets. Besides, the main reason I am more into unified type of refs is
>>> that they are used solely for map-reduce. Which means they are taken on all
>>> storages. This means if you have SENDING readable bucket on one storage,
>>> you have RECEIVING non-readable bucket on another storage. Which makes
>>> such ro refs pointless for full cluster scan. Simply won't be able to take
>>> ro ref on the RECEIVING storage.
>>>
>>> If we go for unified refs, it would allow to keep the scheduler relatively
>>> simple, but requires urgent fix of 173. Otherwise you can't be sure your
>>> data is consistent. Even now you can't be sure with normal requests, which
>>> terrifies me and raises a huge question - why the fuck nobody cares? I
>>> think I already had a couple of nightmares about 173.
>> Yes, it's a problem, but rebalancing is not a frequent operation. So in some
>> cases routeall() was enough. Anyway, we didn't have any alternatives. But
>> map-reduce is a really frequent operation: any request over a secondary
>> index means you have to scan the whole cluster.
> Have you tried building secondary indexes over buckets? There is an
> algorithm, in case it is something perf sensitive.
>
> You can store secondary index in another space and shard it independently.
> And there are ways how to deal with inability to atomically update it and
> the main space together. So almost always you will have at most 2 network
> hops to at most 2 nodes to find the primary key. Regardless of cluster
> size.

K. Nazarov had such ideas and implemented such a thing. He called it a
"reverse index", but in the end this patch wasn't applied to the project
upstream.

It's an interesting idea, but I haven't seen requests for this feature from
our customers.


>>> Another issue - ro refs won't stop rebalancer from working and don't
>>> even participate in the scheduler, if we fix 173 in the simplest way -
>>> just invalidate all refs on the replica if a bucket starts moving. If
>>> the rebalancer works actively, nearly all your ro map-reduces will return
>>> errors during data migration because buckets will move constantly. No
>>> throttling via the scheduler.
>>>
>>> One way to go - switch all replica weights to defaults for the time of
>>> data migration manually. So map_callro() will go to master nodes and
>>> will participate in the scheduling.
>>>
>>> Another way to go - don't care and fix 173 in the simplest way. Weights
>>> anyway are used super rare AFAIK.
>>>
>>> Third way to go - implement some smarter fix of 173. I have many ideas here.
>>> But neither of them are quick.
>>>
>>>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>>>> index 97bcb0a..8abd77f 100644
>>>>> --- a/vshard/router/init.lua
>>>>> +++ b/vshard/router/init.lua
>>>>> @@ -44,6 +44,11 @@ if not M then
>>>>>             module_version = 0,
>>>>>             -- Number of router which require collecting lua garbage.
>>>>>             collect_lua_garbage_cnt = 0,
>>>>> +
>>>>> +        ----------------------- Map-Reduce -----------------------
>>>>> +        -- Storage Ref ID. It must be unique for each ref request
>>>>> +        -- and therefore is global and monotonically growing.
>>>>> +        ref_id = 0,
>>>> Maybe 0ULL?
>>> Wouldn't break anyway - doubles are precise until 2^53. But an integer
>>> should be faster I hope.
>>>
>>> Changed to 0ULL.
> I reverted it back. Asked Igor and he reminded me it is cdata. So it
> involves heavy stuff with metatables and shit. It is cheaper to simply
> increment the plain number. I didn't measure though.

Well, I hope it won't cause any problems. However I'm not sure that 
summing the numbers can lead to any noticeable load.


>>>>> +router_map_callrw = function(router, func, args, opts)
>>>>> +    local replicasets = router.replicasets
>>>> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.
>>> Information on the router may be outdated. Even if bucket_count is 0,
>>> it may still mean the discovery didn't get to there yet. Or the
>>> discovery is simply disabled or dead due to a bug.
>>>
>>> Weight 0 also does not say anything certain about buckets on the
>>> storage. It can be that config on the router is outdated, or it is
>>> outdated on the storage. Or someone simply does not set the weights
>>> for router configs because they are not used here. Or the weights are
>>> really 0 and set everywhere, but rebalancing is in progress - buckets
>>> move from the storage slowly, and the scheduler allows to squeeze some
>>> map-reduces in the meantime.
>>>
>>>> In case if such "dummy" replicasets are disabled we get an error "connection refused".
>>> I need more info here. What is 'disabled' replicaset? And why would
>>> 0 discovered buckets or 0 weight lead to the refused connection?
>>>
>>> In case this is some cartridge specific shit, this is bad. As you
>>> can see above, I can't ignore such replicasets. I need to send requests
>>> to them anyway.
>>>
>>> There is a workaround though - even if an error has occurred, continue
>>> execution if even without the failed storages I still cover all the buckets.
>>> Then having 'disabled' replicasets in the config would result in some
>>> unnecessary faulty requests for each map call, but they would work.
>>>
>>> Although I don't know how to 'return' such errors. I don't want to log them
>>> on each request, and don't have a concept of a 'warning' object or something
>>> like this. Weird option - in case of this uncertain success return the
>>> result, error, uuid. So the function could return
>>>
>>> - nil, err[, uuid] - fail;
>>> - res              - success;
>>> - res, err[, uuid] - success but with a suspicious issue;
>>>
>> Seems I didn't fully state the problem. Maybe it's not even a relevant issue,
>> and users that do this just create their own problems. But there was a case
>> when our customers added a new replicaset (in fact a single instance) to the
>> cluster. This instance didn't have any data and had weight=0.
>>
>> Then they just turned this instance off after some time, and all requests
>> that perform map-reduce started to fail with a "Connection refused" error.
>>
>> It raises a question: "Why do our requests fail if we disable an instance
>> that doesn't have any data?"
>>
>> Yes, the weight=0 requirement is obviously not enough, because if
>> rebalancing is in progress, a replicaset with weight=0 could still contain
>> some data.
>>
>> As for the opportunity to finish requests if some of them failed, I'm not
>> sure it's really needed. Usually we don't need a partial result (moreover
>> if it adds some workload).
> I want to emphasize - it won't be partial. It will be a full result. In case
> the 'disabled' node does not have any data, the requests to the other nodes
> will see that they covered all the buckets. So the loss of this 'disabled'
> node is not critical.

But how can I make sure that this node actually doesn't contain any data,
rather than there being some problem with discovery, etc.? And in general we
can have more than one such replicaset.

Maybe yes, sometimes you want to have some results even if some of the
requests fail. Maybe it could be done under some special flag, e.g.
"ignore_errors"? But this approach requires a slightly different way of
processing errors: you should return a map of results and a map of errors.

Finally, feel free to ignore this suggestion - I'll file an issue if I'm
sure that we really need it.
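
Just to show what I mean by "a map of results and a map of errors", here is a
minimal sketch in plain Lua. The function `map_call_collect`, the
`ignore_errors` option and the shape of `replicasets` are all hypothetical;
nothing like this exists in the patch.

```lua
-- Hypothetical sketch of an "ignore_errors" mode.
-- `replicasets` is an assumed table {uuid = function returning res or
-- nil, err}, standing in for real network calls.
local function map_call_collect(replicasets, opts)
    local results, errors = {}, {}
    for uuid, call in pairs(replicasets) do
        local res, err = call()
        if res ~= nil then
            results[uuid] = res
        elseif opts.ignore_errors then
            errors[uuid] = err
        else
            -- Default behaviour: the first error fails the whole request.
            return nil, err, uuid
        end
    end
    return results, errors
end

local replicasets = {
    ['rs-a'] = function() return 10 end,
    ['rs-b'] = function() return nil, 'Connection refused' end,
}
local results, errors = map_call_collect(replicasets, {ignore_errors = true})
local strict_res, strict_err, strict_uuid = map_call_collect(replicasets, {})
```

The caller would then have to inspect `errors` on its own to decide whether
the result is complete enough.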

> Map-Reduce, at least how I understood it, is about providing access to all
> buckets. You can succeed at this even if you didn't manage to look into each
> node, as long as the nodes that did not respond didn't have any data.
>
> But it sounds like a crutch for an outdated config. If this is not something
> regular I need to support, I'd better leave it as is for now. A normal fail.


* Re: [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
@ 2021-03-04 21:02   ` Oleg Babin via Tarantool-patches
  2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-04 21:02 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! I've looked at your patch again. See one comment below.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> +--
> +-- Return the remaining timeout in case there was a yield. This helps to save
> +-- current clock get in the caller code if there were no yields.
> +--
> +local function sched_ref_start(timeout)
> +    local deadline = fiber_clock() + timeout
> +    local ok, err
> +    -- Fast-path. Moves are extremely rare. No need to inc-dec the ref queue
> +    -- then nor try to start some loops.
> +    if M.move_count == 0 and M.move_queue == 0 then
> +        goto success
> +    end
> +
> +    M.ref_queue = M.ref_queue + 1
> +
> +::retry::
> +    if M.move_count > 0 then
> +        goto wait_and_retry
> +    end
> +    -- Even if move count is zero, must ensure the time usage is fair. Does not
> +    -- matter in case the moves have no quota at all. That allows to ignore them
> +    -- infinitely until all refs end voluntarily.
> +    if M.move_queue > 0 and M.ref_strike >= M.ref_quota and
> +       M.move_quota > 0 then

Is it reasonable to check `move_quota > 0`? According to the tests, it should
always be positive.

I see a similar check for `ref_quota` as well.
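
To show what the check changes, here is the condition from the hunk above
pulled out into a standalone predicate. The function name
`ref_must_wait_for_moves` is made up for the illustration, and `M` is just a
plain table with the same field names as in the patch.

```lua
-- Sketch: should a new ref wait for bucket moves?
local function ref_must_wait_for_moves(M)
    if M.move_count > 0 then
        return true
    end
    return M.move_queue > 0 and M.ref_strike >= M.ref_quota and
           M.move_quota > 0
end

-- Refs have used up their quota and a move is waiting in the queue.
local state = {
    move_count = 0, move_queue = 1,
    ref_strike = 300, ref_quota = 300,
    move_quota = 1,
}
local with_quota = ref_must_wait_for_moves(state)

-- Same state, but the moves have no quota at all: refs are never delayed.
state.move_quota = 0
local without_quota = ref_must_wait_for_moves(state)
```

So the check only matters if a zero quota is a supported configuration, which
is what I am asking about.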




* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
@ 2021-03-04 21:22   ` Oleg Babin via Tarantool-patches
  2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-21 18:49   ` Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 1 reply; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-04 21:22 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! I've looked again. See 3 comments/questions below.

On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
> +local function ref_session_new(sid)
> +    -- Session object does not store its internal hot attributes in a table.
> +    -- Because it would mean access to any session attribute would cost at
> +    -- least one table indexing operation. Instead, all internal fields are
> +    -- stored as upvalues referenced by the methods defined as closures.
> +    --
> +    -- This means session creation may not be very suitable for jitting, but it
> +    -- is very rare and attempts to optimize the most common case.
> +    --
> +    -- Still the public functions take 'self' object to make it look normal.
> +    -- They even use it a bit.
> +
> +    -- Ref map to get ref object by its ID.
> +    local ref_map = {}
> +    -- Ref heap sorted by their deadlines.
> +    local ref_heap = lheap.new(heap_min_deadline_cmp)
> +    -- Total number of refs of the session. Is used to drop the session without
> +    -- fullscan of the ref map. Heap size can't be used because not all refs are
> +    -- stored here. See more on that below.
> +    local count = 0

Maybe it's better to rename it to "global_count". Sometimes it's quite
confusing to see `M.count +=` near `count +=`.

Also you have "global_map" and "global_heap", so there is no reason to call
it just "count".

> +    -- Cache global session storages as upvalues to save on M indexing.
> +    local global_heap = M.session_heap
> +    local global_map = M.session_map
> +
> +    local function ref_session_discount(self, del_count)
> +        local new_count = M.count - del_count
> +        assert(new_count >= 0)
> +        M.count = new_count
> +
> +        new_count = count - del_count
> +        assert(new_count >= 0)
> +        count = new_count
> +    end
> +
> +    local function ref_session_update_deadline(self)
> +        local ref = ref_heap:top()
> +        if not ref then
> +            self.deadline = DEADLINE_INFINITY
> +            global_heap:update(self)
> +        else
> +            local deadline = ref.deadline
> +            if deadline ~= self.deadline then
> +                self.deadline = deadline
> +                global_heap:update(self)
> +            end
> +        end
> +    end
> +
> +    --
> +    -- Garbage collect at most 2 expired refs. The idea is that there is no
> +    -- dedicated fiber for expired refs collection. It would be too expensive to
> +    -- wakeup a fiber on each added or removed or updated ref.
> +    --
> +    -- Instead, ref GC is mostly incremental and works by the principle "remove
> +    -- more than add". On each new ref added, two old refs try to expire. This
> +    -- way refs don't stack infinitely, and the expired refs are eventually
> +    -- removed. Because removal is faster than addition: -2 for each +1.
> +    --
> +    local function ref_session_gc_step(self, now)
> +        -- This is inlined 2 iterations of the more general GC procedure. The
> +        -- latter is not called in order to save on not having a loop,
> +        -- additional branches and variables.
> +        if self.deadline > now then
> +            return
> +        end
> +        local top = ref_heap:top()
> +        ref_heap:remove_top()
> +        ref_map[top.id] = nil
> +        top = ref_heap:top()
> +        if not top then
> +            self.deadline = DEADLINE_INFINITY
> +            global_heap:update(self)
> +            ref_session_discount(self, 1)
> +            return
> +        end
> +        local deadline = top.deadline
> +        if deadline >= now then
> +            self.deadline = deadline
> +            global_heap:update(self)
> +            ref_session_discount(self, 1)
> +            return
> +        end
> +        ref_heap:remove_top()
> +        ref_map[top.id] = nil
> +        top = ref_heap:top()
> +        if not top then
> +            self.deadline = DEADLINE_INFINITY
> +        else
> +            self.deadline = top.deadline
> +        end
> +        global_heap:update(self)
> +        ref_session_discount(self, 2)
> +    end
> +
> +    --
> +    -- GC expired refs until they end or the limit on the number of iterations
> +    -- is exhausted. The limit is supposed to prevent too long GC which would
> +    -- occupy TX thread unfairly.
> +    --
> +    -- Returns false if nothing to GC, or number of iterations left from the
> +    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
> +    -- until it returns false.
> +    -- The function itself does not yield, because it is used from a more
> +    -- generic function GCing all sessions. It would not ever yield if all
> +    -- sessions would have less than limit refs, even if total ref count would
> +    -- be much bigger.
> +    --
> +    -- Besides, the session might be killed during general GC. There must not be
> +    -- any yields in session methods so as not to introduce a support of dead
> +    -- sessions.
> +    --
> +    local function ref_session_gc(self, limit, now)
> +        if self.deadline >= now then
> +            return false
> +        end

Here you mix "booleans" and "numbers" as return values. Maybe it's 
better to return "nil" here?
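
A simplified sketch of the contract from the comment above, to show why it
does not matter much for the caller. The `gc` function here is a stand-in
over a sorted array of deadlines, not the real `ref_session_gc`, and
`gc_all` is a made-up driver that only counts where a real caller would
yield.

```lua
-- Stand-in for the bounded GC: `deadlines` is an array sorted ascending.
-- Removes at most `limit` expired entries. Returns false if there was
-- nothing to GC, or the number of iterations left from the limit.
local function gc(deadlines, limit, now)
    if #deadlines == 0 or deadlines[1] >= now then
        return false
    end
    local del = 0
    while del < limit and #deadlines > 0 and deadlines[1] < now do
        table.remove(deadlines, 1)
        del = del + 1
    end
    return limit - del
end

-- Driver: retry until gc() reports that nothing is left. Note that 0 is
-- truthy in Lua, so `not rest` is true only for false (or nil).
local function gc_all(deadlines, limit, now)
    local yields = 0
    while true do
        local rest = gc(deadlines, limit, now)
        if not rest then
            return yields
        end
        if rest == 0 then
            -- The limit is exhausted; a real caller would yield here.
            yields = yields + 1
        end
    end
end

local deadlines = {1, 2, 3, 4, 5, 10}
local yields = gc_all(deadlines, 2, 6)
```

With the `not rest` check the driver works the same way whether "nothing to
GC" is reported as `false` or as `nil`.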


> +        local top = ref_heap:top()
> +        local del = 1
> +        local rest = 0
> +        local deadline
> +        repeat
> +            ref_heap:remove_top()
> +            ref_map[top.id] = nil
> +            top = ref_heap:top()
> +            if not top then
> +                self.deadline = DEADLINE_INFINITY
> +                rest = limit - del
> +                break
> +            end
> +            deadline = top.deadline
> +            if deadline >= now then
> +                self.deadline = deadline
> +                rest = limit - del
> +                break
> +            end
> +            del = del + 1
> +        until del >= limit
> +        ref_session_discount(self, del)
> +        global_heap:update(self)
> +        return rest
> +    end
> +
> +    local function ref_session_add(self, rid, deadline, now)
> +        if ref_map[rid] then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_ADD,
> +                                      'duplicate ref')
> +        end
> +        local ref = {
> +            deadline = deadline,
> +            id = rid,
> +            -- Used by the heap.
> +            index = -1,
> +        }
> +        ref_session_gc_step(self, now)
> +        ref_map[rid] = ref
> +        ref_heap:push(ref)
> +        if deadline < self.deadline then
> +            self.deadline = deadline
> +            global_heap:update(self)
> +        end
> +        count = count + 1
> +        M.count = M.count + 1
> +        return true
> +    end
> +
> +    --
> +    -- Ref use means it can't be expired until deleted explicitly. Should be
> +    -- done when the request affecting the whole storage starts. After use it is
> +    -- important to call del afterwards - GC won't delete it automatically now.
> +    -- Unless the entire session is killed.
> +    --
> +    local function ref_session_use(self, rid)
> +        local ref = ref_map[rid]
> +        if not ref then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no ref')
> +        end
> +        ref_heap:remove(ref)
> +        ref_session_update_deadline(self)
> +        return true
> +    end
> +
> +    local function ref_session_del(self, rid)
> +        local ref = ref_map[rid]
> +        if not ref then
> +            return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no ref')
> +        end
> +        ref_heap:remove_try(ref)
> +        ref_map[rid] = nil
> +        ref_session_update_deadline(self)
> +        ref_session_discount(self, 1)
> +        return true
> +    end
> +
> +    local function ref_session_kill(self)
> +        global_map[sid] = nil
> +        global_heap:remove(self)
> +        ref_session_discount(self, count)
> +    end
> +
> +    -- Don't use __index. It is useless since all sessions use closures as
> +    -- methods. Also it is probably slower because on each method call would
> +    -- need to get the metatable, get __index, find the method here. While now
> +    -- it is only an index operation on the session object.

Side note: for the heap you still use "__index", even though the heap uses
closures as methods.
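
For clarity, the two styles I am comparing look roughly like this. Both
counters are toy objects written for the illustration; neither is code from
the heap or from the ref module, and I did not measure the difference.

```lua
-- Closure-based object: the method is stored in the object itself and
-- keeps its state in an upvalue.
local function counter_new_closure()
    local value = 0
    local function inc(self)
        value = value + 1
        return value
    end
    return {inc = inc}
end

-- Metatable-based object: the method is found through __index on a shared
-- table, and the state lives in the object's own fields.
local counter_mt = {
    __index = {
        inc = function(self)
            self.value = self.value + 1
            return self.value
        end,
    },
}
local function counter_new_mt()
    return setmetatable({value = 0}, counter_mt)
end

local a = counter_new_closure()
local b = counter_new_mt()
a:inc()
b:inc()
```

In the first case `a.inc` is a plain field of the object; in the second case
the lookup on `b` misses and goes through the metatable.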



* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-03-04 21:22   ` Oleg Babin via Tarantool-patches
@ 2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-05 22:06 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for the review!

>> +local function ref_session_new(sid)
>> +    -- Session object does not store its internal hot attributes in a table.
>> +    -- Because it would mean access to any session attribute would cost at
>> +    -- least one table indexing operation. Instead, all internal fields are
>> +    -- stored as upvalues referenced by the methods defined as closures.
>> +    --
>> +    -- This means session creation may not be very suitable for jitting, but it
>> +    -- is very rare and attempts to optimize the most common case.
>> +    --
>> +    -- Still the public functions take 'self' object to make it look normal.
>> +    -- They even use it a bit.
>> +
>> +    -- Ref map to get ref object by its ID.
>> +    local ref_map = {}
>> +    -- Ref heap sorted by their deadlines.
>> +    local ref_heap = lheap.new(heap_min_deadline_cmp)
>> +    -- Total number of refs of the session. Is used to drop the session without
>> +    -- fullscan of the ref map. Heap size can't be used because not all refs are
>> +    -- stored here. See more on that below.
>> +    local count = 0
> 
> Maybe it's better to rename it to "global_count". Sometimes it's quite confusing to see `M.count +=` near `count += `.
> 
> Also you have "global_map" and "global_heap" so no reasons to call it just "count".

I have global_map and global_heap variables because I also have normal map and
heap, local to the session. To distinguish between them I added 'global_'
prefix to the global ones.

The count here is not global. It is local to the session. But I see the point.
I renamed it to `ref_count` to be consistent with `ref_map` and `ref_heap`.

====================
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
index 7589cb9..27f7804 100644
--- a/vshard/storage/ref.lua
+++ b/vshard/storage/ref.lua
@@ -84,7 +84,7 @@ local function ref_session_new(sid)
     -- Total number of refs of the session. Is used to drop the session without
     -- fullscan of the ref map. Heap size can't be used because not all refs are
     -- stored here. See more on that below.
-    local count = 0
+    local ref_count = 0
     -- Cache global session storages as upvalues to save on M indexing.
     local global_heap = M.session_heap
     local global_map = M.session_map
@@ -94,9 +94,9 @@ local function ref_session_new(sid)
         assert(new_count >= 0)
         M.count = new_count
 
-        new_count = count - del_count
+        new_count = ref_count - del_count
         assert(new_count >= 0)
-        count = new_count
+        ref_count = new_count
     end
 
     local function ref_session_update_deadline(self)
@@ -224,7 +224,7 @@ local function ref_session_new(sid)
             self.deadline = deadline
             global_heap:update(self)
         end
-        count = count + 1
+        ref_count = ref_count + 1
         M.count = M.count + 1
         return true
     end
@@ -260,7 +260,7 @@ local function ref_session_new(sid)
     local function ref_session_kill(self)
         global_map[sid] = nil
         global_heap:remove(self)
-        ref_session_discount(self, count)
+        ref_session_discount(self, ref_count)
     end
 
     -- Don't use __index. It is useless since all sessions use closures as
====================

>> +
>> +    --
>> +    -- GC expired refs until they end or the limit on the number of iterations
>> +    -- is exhausted. The limit is supposed to prevent too long GC which would
>> +    -- occupy TX thread unfairly.
>> +    --
>> +    -- Returns false if nothing to GC, or number of iterations left from the
>> +    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
>> +    -- until it returns false.
>> +    -- The function itself does not yield, because it is used from a more
>> +    -- generic function GCing all sessions. It would not ever yield if all
>> +    -- sessions would have less than limit refs, even if total ref count would
>> +    -- be much bigger.
>> +    --
>> +    -- Besides, the session might be killed during general GC. There must not be
>> +    -- any yields in session methods so as not to introduce a support of dead
>> +    -- sessions.
>> +    --
>> +    local function ref_session_gc(self, limit, now)
>> +        if self.deadline >= now then
>> +            return false
>> +        end
> 
> Here you mix "booleans" and "numbers" as return values. Maybe it's better to return "nil" here?

No problem:

====================
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
index 27f7804..d31e3ed 100644
--- a/vshard/storage/ref.lua
+++ b/vshard/storage/ref.lua
@@ -164,9 +164,9 @@ local function ref_session_new(sid)
     -- is exhausted. The limit is supposed to prevent too long GC which would
     -- occupy TX thread unfairly.
     --
-    -- Returns false if nothing to GC, or number of iterations left from the
+    -- Returns nil if nothing to GC, or number of iterations left from the
     -- limit. The caller is supposed to yield when 0 is returned, and retry GC
-    -- until it returns false.
+    -- until it returns nil.
     -- The function itself does not yield, because it is used from a more
     -- generic function GCing all sessions. It would not ever yield if all
     -- sessions would have less than limit refs, even if total ref count would
@@ -178,7 +178,7 @@ local function ref_session_new(sid)
     --
     local function ref_session_gc(self, limit, now)
         if self.deadline >= now then
-            return false
+            return nil
         end
         local top = ref_heap:top()
         local del = 1
====================
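
For illustration only, a caller that follows the contract from the comment
(nil means nothing to GC, 0 means the limit is exhausted and the caller should
yield) could look roughly like the sketch below. It is not the code from the
patch: `session_new`, `gc_all` and the `yield` callback are hypothetical
stand-ins, and the session here only counts expired refs instead of keeping a
heap.

```lua
-- Hypothetical stand-in for a session: it holds a number of expired refs and
-- collects at most 'limit' of them per call.
local function session_new(expired)
    local self = {expired = expired}
    -- Returns nil if there is nothing to GC, otherwise the number of
    -- iterations left from the limit.
    function self.gc(limit)
        if self.expired == 0 then
            return nil
        end
        local n = math.min(self.expired, limit)
        self.expired = self.expired - n
        return limit - n
    end
    return self
end

-- Retry GC until it returns nil. Yield each time the limit is exhausted.
local function gc_all(session, limit, yield)
    local yields = 0
    while true do
        local left = session.gc(limit)
        if left == nil then
            return yields
        end
        if left == 0 then
            yield()
            yields = yields + 1
        end
    end
end

local s = session_new(5)
local yields = gc_all(s, 2, function() end)
```

Since 0 is a truthy value in Lua, both nil and false can be told apart from
"limit exhausted" with a plain `not left` check, so the change to nil only makes
the return type uniform (number or nothing).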

>> +
>> +    -- Don't use __index. It is useless since all sessions use closures as
>> +    -- methods. Also it is probably slower because on each method call would
>> +    -- need to get the metatable, get __index, find the method here. While now
>> +    -- it is only an index operation on the session object.
> 
> Side note: for the heap you still use "__index" even though the heap uses closures as methods.

Indeed, I should have thought of this. I updated the part1 branch, and rebased the
part2 branch. See the part1 email thread for the diff.
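
A minimal sketch of the difference being discussed, assuming nothing about the
real heap or session code (all names below are made up for the example): with
`__index` the method lookup goes through the metatable, while with closures the
method is a plain field of the object and the state lives in upvalues.

```lua
-- Variant 1: methods are shared and found via the metatable's __index.
local mt_methods = {}
function mt_methods.get(self)
    return self.value
end
local mt = {__index = mt_methods}
local function obj_mt_new(value)
    return setmetatable({value = value}, mt)
end

-- Variant 2: methods are closures stored directly in the object. The state
-- is kept in an upvalue, so 'self' is accepted only to keep the call syntax.
local function obj_closure_new(value)
    local function get(self)
        return value
    end
    return {get = get}
end

local a = obj_mt_new(1)
local b = obj_closure_new(2)
local a_value = a:get()
local b_value = b:get()
```

In the second variant a metatable with `__index` would add nothing, because
every object already carries its own methods.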

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-03-04 21:02   ` Oleg Babin via Tarantool-patches
@ 2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-05 22:06 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Thanks for the review!

On 04.03.2021 22:02, Oleg Babin wrote:
> Hi! I've looked at your patch again. See one comment below.
> 
> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>> +--
>> +-- Return the remaining timeout in case there was a yield. This helps to save
>> +-- current clock get in the caller code if there were no yields.
>> +--
>> +local function sched_ref_start(timeout)
>> +    local deadline = fiber_clock() + timeout
>> +    local ok, err
>> +    -- Fast-path. Moves are extremely rare. No need to inc-dec the ref queue
>> +    -- then nor try to start some loops.
>> +    if M.move_count == 0 and M.move_queue == 0 then
>> +        goto success
>> +    end
>> +
>> +    M.ref_queue = M.ref_queue + 1
>> +
>> +::retry::
>> +    if M.move_count > 0 then
>> +        goto wait_and_retry
>> +    end
>> +    -- Even if move count is zero, must ensure the time usage is fair. Does not
>> +    -- matter in case the moves have no quota at all. That allows to ignore them
>> +    -- infinitely until all refs end voluntarily.
>> +    if M.move_queue > 0 and M.ref_strike >= M.ref_quota and
>> +       M.move_quota > 0 then
> 
> Is it reasonable to check `move_quota > 0`? According to the tests it should always be positive.
> 
> I see similar check for `ref_quota` as well.

These are special cases covered by the tests test_move_zero_quota() and
test_ref_zero_quota() in unit-tap/scheduler.test.lua.

Zero quota means the operation can be suppressed by the other operation
indefinitely.
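
To illustrate, the quoted condition can be isolated into a small predicate. This
is only a sketch: `ref_must_wait_for_moves` is a hypothetical helper, `m` stands
in for the scheduler state table `M`, and it is assumed that a true result means
the ref has to wait for the queued moves.

```lua
-- Hypothetical helper mirroring the quoted condition. The field names are the
-- ones used in the patch, the function itself is not part of it.
local function ref_must_wait_for_moves(m)
    return m.move_queue > 0 and m.ref_strike >= m.ref_quota and
           m.move_quota > 0
end

-- A move is queued and the refs have used up their quota.
local state = {move_queue = 1, ref_strike = 5, ref_quota = 5, move_quota = 1}
local waits_with_quota = ref_must_wait_for_moves(state)

-- With zero move quota the queued move never takes priority over the refs.
state.move_quota = 0
local waits_with_zero_quota = ref_must_wait_for_moves(state)
```

With a positive move quota the ref waits. With zero move quota it does not, so
the queued move stays suppressed until the refs end on their own.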

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-09  8:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for the fixes. LGTM.

On 06.03.2021 01:06, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>>> +local function ref_session_new(sid)
>>> +    -- Session object does not store its internal hot attributes in a table,
>>> +    -- because it would mean access to any session attribute would cost at
>>> +    -- least one table indexing operation. Instead, all internal fields are
>>> +    -- stored as upvalues referenced by the methods defined as closures.
>>> +    --
>>> +    -- This means session creation may not be very suitable for jitting, but it
>>> +    -- is very rare and attempts to optimize the most common case.
>>> +    --
>>> +    -- Still the public functions take 'self' object to make it look normal.
>>> +    -- They even use it a bit.
>>> +
>>> +    -- Ref map to get ref object by its ID.
>>> +    local ref_map = {}
>>> +    -- Ref heap sorted by their deadlines.
>>> +    local ref_heap = lheap.new(heap_min_deadline_cmp)
>>> +    -- Total number of refs of the session. Is used to drop the session without
>>> +    -- fullscan of the ref map. Heap size can't be used because not all refs are
>>> +    -- stored here. See more on that below.
>>> +    local count = 0
>> Maybe it's better to rename it to "global_count". Sometimes it's quite confusing to see `M.count +=` near `count += `.
>>
>> Also you have "global_map" and "global_heap", so there is no reason to call it just "count".
> I have global_map and global_heap variables because I also have normal map and
> heap, local to the session. To distinguish between them I added 'global_'
> prefix to the global ones.
>
> The count here is not global. It is local to the session. But I see the point.
> I renamed it to `ref_count` to be consistent with `ref_map` and `ref_heap`.
>
> ====================
> diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
> index 7589cb9..27f7804 100644
> --- a/vshard/storage/ref.lua
> +++ b/vshard/storage/ref.lua
> @@ -84,7 +84,7 @@ local function ref_session_new(sid)
>       -- Total number of refs of the session. Is used to drop the session without
>       -- fullscan of the ref map. Heap size can't be used because not all refs are
>       -- stored here. See more on that below.
> -    local count = 0
> +    local ref_count = 0
>       -- Cache global session storages as upvalues to save on M indexing.
>       local global_heap = M.session_heap
>       local global_map = M.session_map
> @@ -94,9 +94,9 @@ local function ref_session_new(sid)
>           assert(new_count >= 0)
>           M.count = new_count
>   
> -        new_count = count - del_count
> +        new_count = ref_count - del_count
>           assert(new_count >= 0)
> -        count = new_count
> +        ref_count = new_count
>       end
>   
>       local function ref_session_update_deadline(self)
> @@ -224,7 +224,7 @@ local function ref_session_new(sid)
>               self.deadline = deadline
>               global_heap:update(self)
>           end
> -        count = count + 1
> +        ref_count = ref_count + 1
>           M.count = M.count + 1
>           return true
>       end
> @@ -260,7 +260,7 @@ local function ref_session_new(sid)
>       local function ref_session_kill(self)
>           global_map[sid] = nil
>           global_heap:remove(self)
> -        ref_session_discount(self, count)
> +        ref_session_discount(self, ref_count)
>       end
>   
>       -- Don't use __index. It is useless since all sessions use closures as
> ====================
>
>>> +
>>> +    --
>>> +    -- GC expired refs until they end or the limit on the number of iterations
>>> +    -- is exhausted. The limit is supposed to prevent too long GC which would
>>> +    -- occupy TX thread unfairly.
>>> +    --
>>> +    -- Returns false if nothing to GC, or number of iterations left from the
>>> +    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
>>> +    -- until it returns false.
>>> +    -- The function itself does not yield, because it is used from a more
>>> +    -- generic function GCing all sessions. It would not ever yield if all
>>> +    -- sessions would have less than limit refs, even if total ref count would
>>> +    -- be much bigger.
>>> +    --
>>> +    -- Besides, the session might be killed during general GC. There must not be
>>> +    -- any yields in session methods so as not to introduce a support of dead
>>> +    -- sessions.
>>> +    --
>>> +    local function ref_session_gc(self, limit, now)
>>> +        if self.deadline >= now then
>>> +            return false
>>> +        end
>> Here you mix "booleans" and "numbers" as return values. Maybe it's better to return "nil" here?
> No problem:
>
> ====================
> diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
> index 27f7804..d31e3ed 100644
> --- a/vshard/storage/ref.lua
> +++ b/vshard/storage/ref.lua
> @@ -164,9 +164,9 @@ local function ref_session_new(sid)
>       -- is exhausted. The limit is supposed to prevent too long GC which would
>       -- occupy TX thread unfairly.
>       --
> -    -- Returns false if nothing to GC, or number of iterations left from the
> +    -- Returns nil if nothing to GC, or number of iterations left from the
>       -- limit. The caller is supposed to yield when 0 is returned, and retry GC
> -    -- until it returns false.
> +    -- until it returns nil.
>       -- The function itself does not yield, because it is used from a more
>       -- generic function GCing all sessions. It would not ever yield if all
>       -- sessions would have less than limit refs, even if total ref count would
> @@ -178,7 +178,7 @@ local function ref_session_new(sid)
>       --
>       local function ref_session_gc(self, limit, now)
>           if self.deadline >= now then
> -            return false
> +            return nil
>           end
>           local top = ref_heap:top()
>           local del = 1
> ====================
>
>>> +
>>> +    -- Don't use __index. It is useless since all sessions use closures as
>>> +    -- methods. Also it is probably slower because on each method call would
>>> +    -- need to get the metatable, get __index, find the method here. While now
>>> +    -- it is only an index operation on the session object.
>> Side note: for the heap you still use "__index" even though the heap uses closures as methods.
> Indeed, I should have thought of this. I updated the part1 branch, and rebased the
> part2 branch. See the part1 email thread for the diff.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module
  2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-09  8:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for the explanation. LGTM.

On 06.03.2021 01:06, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 04.03.2021 22:02, Oleg Babin wrote:
>> Hi! I've looked at your patch again. See one comment below.
>>
>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>> +--
>>> +-- Return the remaining timeout in case there was a yield. This helps to save
>>> +-- current clock get in the caller code if there were no yields.
>>> +--
>>> +local function sched_ref_start(timeout)
>>> +    local deadline = fiber_clock() + timeout
>>> +    local ok, err
>>> +    -- Fast-path. Moves are extremely rare. No need to inc-dec the ref queue
>>> +    -- then nor try to start some loops.
>>> +    if M.move_count == 0 and M.move_queue == 0 then
>>> +        goto success
>>> +    end
>>> +
>>> +    M.ref_queue = M.ref_queue + 1
>>> +
>>> +::retry::
>>> +    if M.move_count > 0 then
>>> +        goto wait_and_retry
>>> +    end
>>> +    -- Even if move count is zero, must ensure the time usage is fair. Does not
>>> +    -- matter in case the moves have no quota at all. That allows to ignore them
>>> +    -- infinitely until all refs end voluntarily.
>>> +    if M.move_queue > 0 and M.ref_strike >= M.ref_quota and
>>> +       M.move_quota > 0 then
>> Is it reasonable to check `move_quota > 0`? According to the tests it should always be positive.
>>
>> I see similar check for `ref_quota` as well.
> These are special cases covered by the tests test_move_zero_quota() and
> test_ref_zero_quota() in unit-tap/scheduler.test.lua.
>
> Zero quota means the operation can be suppressed by the other operation
> indefinitely.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (10 preceding siblings ...)
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-12 23:13 ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-15  7:05   ` Oleg Babin via Tarantool-patches
  2021-03-28 18:17 ` Vladislav Shpilevoy via Tarantool-patches
  12 siblings, 1 reply; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-12 23:13 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Are the patches ok, or should I expect more comments?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map
  2021-03-12 23:13 ` [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-15  7:05   ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-03-15  7:05 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi, Vlad! Thanks a lot for your patchsets. Both of them LGTM.

On 13.03.2021 02:13, Vladislav Shpilevoy wrote:
> Are the patches ok, or should I expect more comments?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module
  2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
  2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
  2021-03-04 21:22   ` Oleg Babin via Tarantool-patches
@ 2021-03-21 18:49   ` Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 0 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-21 18:49 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

After another self-review I found a bug here: a disconnected session was deleted
even if it had running requests, because on_disconnect triggers are called at the
moment of disconnect, not when the session is deleted. My bad.

That must not happen, because there still might be a read-write request changing
the data, and it needs the data to stay consistent.

It was also possible to keep the session alive even if it has only unused refs,
but I decided that it would be safer to delete it when the used ref count becomes
0. This should help in cases when users set too big timeouts but do not get to
the "map" stage, which would otherwise prevent session deletion and block
rebalancing for too long.

It still might happen, but at least we can tell such users to reconnect/restart
their routers instead of restarting the storage.
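
The intended behaviour can be shown with a simplified, self-contained sketch. It
is not the code of the patch: `registry` is a hypothetical stand-in for the
global session map, and deadlines, heaps and error handling are left out.

```lua
-- Hypothetical stand-in for the global session map.
local registry = {}

local function session_new(sid)
    -- Counters live in upvalues: all refs, and the refs currently in use.
    local total, used, is_disconnected = 0, 0, false
    local refs = {}
    local self = {}

    -- A disconnected session is dropped only when no ref is in use anymore.
    -- Unused refs are dropped together with the session.
    local function delete_if_not_used()
        if is_disconnected and used == 0 then
            registry[sid] = nil
            refs = {}
            total = 0
        end
    end

    function self.add(rid)
        refs[rid] = {is_used = false}
        total = total + 1
    end

    function self.use(rid)
        refs[rid].is_used = true
        used = used + 1
    end

    function self.del(rid)
        local ref = refs[rid]
        refs[rid] = nil
        total = total - 1
        if ref.is_used then
            used = used - 1
            delete_if_not_used()
        end
    end

    function self.kill()
        is_disconnected = true
        delete_if_not_used()
    end

    function self.count()
        return total
    end

    registry[sid] = self
    return self
end

local s = session_new(1)
s.add(0)
s.add(1)
s.use(0)
s.kill()
-- The used ref keeps the disconnected session and both refs alive.
local count_after_kill = s.count()
s.del(0)
-- The last used ref is gone, so the unused ref goes away with the session.
local count_after_del = s.count()
local is_deleted = registry[1] == nil
```

The sequence at the end follows the same scenario as the new unit test: two refs
in a dead session, then zero after the used one is deleted.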

I updated the patch. Below is the incremental diff, followed by the full diff of
the patch at the end.

====================
diff --git a/test/storage/ref.result b/test/storage/ref.result
index d5f4166..c115d99 100644
--- a/test/storage/ref.result
+++ b/test/storage/ref.result
@@ -371,18 +371,66 @@ _ = test_run:switch('storage_1_a')
  | ...
 
 --
--- Session disconnect removes its refs.
+-- Session disconnect keeps the refs, but the session is deleted when
+-- used ref count becomes 0. Unused refs don't prevent session deletion.
 --
-c:call('make_ref', {3, big_timeout})
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+keep_long_ref = true
+ | ---
+ | ...
+function long_ref_request(rid)                                                  \
+    local sid = box.session.id()                                                \
+    assert(lref.add(rid, sid, big_timeout))                                     \
+    assert(lref.use(rid, sid))                                                  \
+    while keep_long_ref do                                                      \
+        fiber.sleep(small_timeout)                                              \
+    end                                                                         \
+    assert(lref.del(rid, sid))                                                  \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+_ = c:call('long_ref_request', {3}, {is_async = true})
+ | ---
+ | ...
+c:call('make_ref', {4, big_timeout})
  | ---
  | - true
  | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 2 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
 c:close()
  | ---
  | ...
+
 _ = test_run:switch('storage_2_a')
  | ---
  | ...
+-- Still 2 refs.
+assert(lref.count == 2)
+ | ---
+ | - true
+ | ...
+-- The long request ends and the session must be deleted - that was the last
+-- used ref.
+keep_long_ref = false
+ | ---
+ | ...
 test_run:wait_cond(function() return lref.count == 0 end)
  | ---
  | - true
diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
index b34a294..5b57ea4 100644
--- a/test/storage/ref.test.lua
+++ b/test/storage/ref.test.lua
@@ -154,11 +154,37 @@ assert(lref.count == 1)
 _ = test_run:switch('storage_1_a')
 
 --
--- Session disconnect removes its refs.
+-- Session disconnect keeps the refs, but the session is deleted when
+-- used ref count becomes 0. Unused refs don't prevent session deletion.
 --
-c:call('make_ref', {3, big_timeout})
+_ = test_run:switch('storage_2_a')
+keep_long_ref = true
+function long_ref_request(rid)                                                  \
+    local sid = box.session.id()                                                \
+    assert(lref.add(rid, sid, big_timeout))                                     \
+    assert(lref.use(rid, sid))                                                  \
+    while keep_long_ref do                                                      \
+        fiber.sleep(small_timeout)                                              \
+    end                                                                         \
+    assert(lref.del(rid, sid))                                                  \
+end
+
+_ = test_run:switch('storage_1_a')
+_ = c:call('long_ref_request', {3}, {is_async = true})
+c:call('make_ref', {4, big_timeout})
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 2 end)
+
+_ = test_run:switch('storage_1_a')
 c:close()
+
 _ = test_run:switch('storage_2_a')
+-- Still 2 refs.
+assert(lref.count == 2)
+-- The long request ends and the session must be deleted - that was the last
+-- used ref.
+keep_long_ref = false
 test_run:wait_cond(function() return lref.count == 0 end)
 
 _ = test_run:switch("default")
diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
index 99ef69f..fdd0477 100755
--- a/test/unit-tap/ref.test.lua
+++ b/test/unit-tap/ref.test.lua
@@ -190,13 +190,39 @@ local function test_ref_del(test)
     test:is(lref.count, 0, 'now all is deleted')
 end
 
+local function test_ref_dead_session(test)
+    test:plan(4)
+
+    --
+    -- Session after disconnect still might have running requests. It must
+    -- be kept alive with its refs until the requests are done.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.use(0, sid))
+    lref.kill(sid)
+    test:ok(lref.del(0, sid))
+
+    --
+    -- The dead session is kept only while the used requests are running. It is
+    -- deleted when use count becomes 0 even if there were unused refs.
+    --
+    assert(lref.add(0, sid, big_timeout))
+    assert(lref.add(1, sid, big_timeout))
+    assert(lref.use(0, sid))
+    lref.kill(sid)
+    test:is(lref.count, 2, '2 refs in a dead session')
+    test:ok(lref.del(0, sid), 'delete the used ref')
+    test:is(lref.count, 0, '0 refs - the unused ref was deleted with session')
+end
+
 local test = tap.test('ref')
-test:plan(5)
+test:plan(6)
 
 test:test('basic', test_ref_basic)
 test:test('incremental gc', test_ref_incremental_gc)
 test:test('gc', test_ref_gc)
 test:test('use', test_ref_use)
 test:test('del', test_ref_del)
+test:test('dead session use', test_ref_dead_session)
 
 os.exit(test:check() and 0 or 1)
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
index 620913e..a024d8e 100644
--- a/vshard/storage/ref.lua
+++ b/vshard/storage/ref.lua
@@ -81,10 +81,17 @@ local function ref_session_new(sid)
     local ref_map = {}
     -- Ref heap sorted by their deadlines.
     local ref_heap = lheap.new(heap_min_deadline_cmp)
-    -- Total number of refs of the session. Is used to drop the session without
-    -- fullscan of the ref map. Heap size can't be used because not all refs are
-    -- stored here. See more on that below.
-    local ref_count = 0
+    -- Total number of refs of the session. Is used to drop the session when it
+    -- is disconnected and has no refs anymore. Heap size can't be used
+    -- because not all refs are stored here.
+    local ref_count_total = 0
+    -- Number of refs in use. They are included into the total count. The used
+    -- refs are accounted explicitly in order to detect when a disconnected
+    -- session has no used refs anymore and can be deleted.
+    local ref_count_use = 0
+    -- When the session becomes disconnected, it must be deleted from the global
+    -- heap when all its used refs are gone.
+    local is_disconnected = false
     -- Cache global session storages as upvalues to save on M indexing.
     local global_heap = M.session_heap
     local global_map = M.session_map
@@ -94,9 +101,18 @@ local function ref_session_new(sid)
         assert(new_count >= 0)
         M.count = new_count
 
-        new_count = ref_count - del_count
+        new_count = ref_count_total - del_count
         assert(new_count >= 0)
-        ref_count = new_count
+        ref_count_total = new_count
+    end
+
+    local function ref_session_delete_if_not_used(self)
+        if not is_disconnected or ref_count_use > 0 then
+            return
+        end
+        ref_session_discount(self, ref_count_total)
+        global_map[sid] = nil
+        global_heap:remove(self)
     end
 
     local function ref_session_update_deadline(self)
@@ -224,7 +240,7 @@ local function ref_session_new(sid)
             self.deadline = deadline
             global_heap:update(self)
         end
-        ref_count = ref_count + 1
+        ref_count_total = ref_count_total + 1
         M.count = M.count + 1
         return true
     end
@@ -242,6 +258,7 @@ local function ref_session_new(sid)
         end
         ref_heap:remove(ref)
         ref_session_update_deadline(self)
+        ref_count_use = ref_count_use + 1
         return true
     end
 
@@ -250,17 +267,24 @@ local function ref_session_new(sid)
         if not ref then
             return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no ref')
         end
-        ref_heap:remove_try(ref)
         ref_map[rid] = nil
-        ref_session_update_deadline(self)
-        ref_session_discount(self, 1)
+        if ref.index == -1 then
+            ref_session_update_deadline(self)
+            ref_session_discount(self, 1)
+            ref_count_use = ref_count_use - 1
+            ref_session_delete_if_not_used(self)
+        else
+            ref_heap:remove(ref)
+            ref_session_update_deadline(self)
+            ref_session_discount(self, 1)
+        end
         return true
     end
 
     local function ref_session_kill(self)
-        global_map[sid] = nil
-        global_heap:remove(self)
-        ref_session_discount(self, ref_count)
+        assert(not is_disconnected)
+        is_disconnected = true
+        ref_session_delete_if_not_used(self)
     end
 
     -- Don't use __index. It is useless since all sessions use closures as

====================

Here is the full patch:

====================

diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result
index 9d30a04..c4a0cdd 100644
--- a/test/reload_evolution/storage.result
+++ b/test/reload_evolution/storage.result
@@ -227,6 +227,72 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 ---
 - 1500
 ...
+--
+-- Ensure storage refs are enabled and work from the scratch via reload.
+--
+lref = require('vshard.storage.ref')
+---
+...
+vshard.storage.rebalancer_disable()
+---
+...
+big_timeout = 1000000
+---
+...
+timeout = 0.01
+---
+...
+lref.add(0, 0, big_timeout)
+---
+- true
+...
+status_index = box.space._bucket.index.status
+---
+...
+bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
+---
+...
+ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
+                                     {timeout = timeout})
+---
+...
+assert(not ok and err.message)
+---
+- Storage is referenced
+...
+lref.del(0, 0)
+---
+- true
+...
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
+                           {timeout = big_timeout})
+---
+- true
+...
+wait_bucket_is_collected(bucket_id_to_move)
+---
+...
+test_run:switch('storage_2_a')
+---
+- true
+...
+vshard.storage.rebalancer_disable()
+---
+...
+big_timeout = 1000000
+---
+...
+bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
+---
+...
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
+                           {timeout = big_timeout})
+---
+- true
+...
+wait_bucket_is_collected(bucket_id_to_move)
+---
+...
 test_run:switch('default')
 ---
 - true
diff --git a/test/reload_evolution/storage.test.lua b/test/reload_evolution/storage.test.lua
index 639553e..c351ada 100644
--- a/test/reload_evolution/storage.test.lua
+++ b/test/reload_evolution/storage.test.lua
@@ -83,6 +83,34 @@ box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 test_run:switch('storage_1_a')
 box.space._bucket.index.status:count({vshard.consts.BUCKET.ACTIVE})
 
+--
+-- Ensure storage refs are enabled and work from the scratch via reload.
+--
+lref = require('vshard.storage.ref')
+vshard.storage.rebalancer_disable()
+
+big_timeout = 1000000
+timeout = 0.01
+lref.add(0, 0, big_timeout)
+status_index = box.space._bucket.index.status
+bucket_id_to_move = status_index:min({vshard.consts.BUCKET.ACTIVE}).id
+ok, err = vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],    \
+                                     {timeout = timeout})
+assert(not ok and err.message)
+lref.del(0, 0)
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[2],              \
+                           {timeout = big_timeout})
+wait_bucket_is_collected(bucket_id_to_move)
+
+test_run:switch('storage_2_a')
+vshard.storage.rebalancer_disable()
+
+big_timeout = 1000000
+bucket_id_to_move = test_run:eval('storage_1_a', 'return bucket_id_to_move')[1]
+vshard.storage.bucket_send(bucket_id_to_move, util.replicasets[1],              \
+                           {timeout = big_timeout})
+wait_bucket_is_collected(bucket_id_to_move)
+
 test_run:switch('default')
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/test/storage/ref.result b/test/storage/ref.result
new file mode 100644
index 0000000..c115d99
--- /dev/null
+++ b/test/storage/ref.result
@@ -0,0 +1,447 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+netbox = require('net.box')
+ | ---
+ | ...
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+ | ---
+ | ...
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+ | ---
+ | ...
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+ | ---
+ | ...
+test_run:create_cluster(REPLICASET_2, 'storage')
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+ | ---
+ | ...
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+ | ---
+ | ...
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+ | ---
+ | ...
+
+--
+-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
+-- for map-reduce functionality to pin all buckets on all storages in the
+-- cluster to execute consistent map-reduce calls on all cluster data.
+--
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.rebalancer_disable()
+ | ---
+ | ...
+vshard.storage.bucket_force_create(1501, 1500)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+
+--
+-- Bucket moves are not allowed under a ref.
+--
+util = require('util')
+ | ---
+ | ...
+sid = 0
+ | ---
+ | ...
+rid = 0
+ | ---
+ | ...
+big_timeout = 1000000
+ | ---
+ | ...
+small_timeout = 0.001
+ | ---
+ | ...
+lref.add(rid, sid, big_timeout)
+ | ---
+ | - true
+ | ...
+-- Send fails.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+lref.use(rid, sid)
+ | ---
+ | - true
+ | ...
+-- Still fails - use only makes ref undead until it is deleted explicitly.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+-- Receive (from another replicaset) also fails.
+big_timeout = 1000000
+ | ---
+ | ...
+ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
+                                     {timeout = big_timeout})
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Storage is referenced
+ | ...
+
+--
+-- After unref all the bucket moves are allowed again.
+--
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+lref.del(rid, sid)
+ | ---
+ | - true
+ | ...
+
+vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+--
+-- While bucket move is in progress, ref won't work.
+--
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
+                 {timeout = big_timeout})
+ | ---
+ | ...
+ok, err = lref.add(rid, sid, small_timeout)
+ | ---
+ | ...
+assert(not ok and err.message)
+ | ---
+ | - Timeout exceeded
+ | ...
+-- Ref will wait if timeout is big enough.
+ok, err = nil
+ | ---
+ | ...
+_ = fiber.create(function()                                                     \
+    ok, err = lref.add(rid, sid, big_timeout)                                   \
+end)
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+test_run:wait_cond(function() return ok or err end)
+ | ---
+ | - true
+ | ...
+lref.use(rid, sid)
+ | ---
+ | - true
+ | ...
+lref.del(rid, sid)
+ | ---
+ | - true
+ | ...
+assert(ok and not err)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+ | ---
+ | - true
+ | ...
+wait_bucket_is_collected(1)
+ | ---
+ | ...
+
+--
+-- Refs are bound to sessions.
+--
+box.schema.user.grant('storage', 'super')
+ | ---
+ | ...
+lref = require('vshard.storage.ref')
+ | ---
+ | ...
+small_timeout = 0.001
+ | ---
+ | ...
+function make_ref(rid, timeout)                                                 \
+    return lref.add(rid, box.session.id(), timeout)                             \
+end
+ | ---
+ | ...
+function use_ref(rid)                                                           \
+    return lref.use(rid, box.session.id())                                      \
+end
+ | ---
+ | ...
+function del_ref(rid)                                                           \
+    return lref.del(rid, box.session.id())                                      \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+netbox = require('net.box')
+ | ---
+ | ...
+remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
+ | ---
+ | ...
+c = netbox.connect(remote_uri)
+ | ---
+ | ...
+
+-- Ref is added and does not disappear anywhere on its own.
+c:call('make_ref', {1, small_timeout})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Use works.
+c:call('use_ref', {1})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Del works.
+c:call('del_ref', {1})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 0)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+-- Expiration works. Try to add a second ref when the first one is expired - the
+-- first is collected and a subsequent use and del won't work.
+c:call('make_ref', {1, small_timeout})
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+fiber.sleep(small_timeout)
+ | ---
+ | ...
+c:call('make_ref', {2, small_timeout})
+ | ---
+ | - true
+ | ...
+ok, err = c:call('use_ref', {1})
+ | ---
+ | ...
+assert(ok == nil and err.message)
+ | ---
+ | - 'Can not use a storage ref: no ref'
+ | ...
+ok, err = c:call('del_ref', {1})
+ | ---
+ | ...
+assert(ok == nil and err.message)
+ | ---
+ | - 'Can not delete a storage ref: no ref'
+ | ...
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+assert(lref.count == 1)
+ | ---
+ | - true
+ | ...
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+
+--
+-- Session disconnect keeps the refs, but the session is deleted when
+-- used ref count becomes 0. Unused refs don't prevent session deletion.
+--
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+keep_long_ref = true
+ | ---
+ | ...
+function long_ref_request(rid)                                                  \
+    local sid = box.session.id()                                                \
+    assert(lref.add(rid, sid, big_timeout))                                     \
+    assert(lref.use(rid, sid))                                                  \
+    while keep_long_ref do                                                      \
+        fiber.sleep(small_timeout)                                              \
+    end                                                                         \
+    assert(lref.del(rid, sid))                                                  \
+end
+ | ---
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+_ = c:call('long_ref_request', {3}, {is_async = true})
+ | ---
+ | ...
+c:call('make_ref', {4, big_timeout})
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 2 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch('storage_1_a')
+ | ---
+ | ...
+c:close()
+ | ---
+ | ...
+
+_ = test_run:switch('storage_2_a')
+ | ---
+ | ...
+-- Still 2 refs.
+assert(lref.count == 2)
+ | ---
+ | - true
+ | ...
+-- The long request ends and the session must be deleted - that was the last
+-- used ref.
+keep_long_ref = false
+ | ---
+ | ...
+test_run:wait_cond(function() return lref.count == 0 end)
+ | ---
+ | - true
+ | ...
+
+_ = test_run:switch("default")
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_2)
+ | ---
+ | ...
+test_run:drop_cluster(REPLICASET_1)
+ | ---
+ | ...
diff --git a/test/storage/ref.test.lua b/test/storage/ref.test.lua
new file mode 100644
index 0000000..5b57ea4
--- /dev/null
+++ b/test/storage/ref.test.lua
@@ -0,0 +1,192 @@
+test_run = require('test_run').new()
+netbox = require('net.box')
+REPLICASET_1 = { 'storage_1_a', 'storage_1_b' }
+REPLICASET_2 = { 'storage_2_a', 'storage_2_b' }
+
+test_run:create_cluster(REPLICASET_1, 'storage')
+test_run:create_cluster(REPLICASET_2, 'storage')
+util = require('util')
+util.wait_master(test_run, REPLICASET_1, 'storage_1_a')
+util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
+util.map_evals(test_run, {REPLICASET_1, REPLICASET_2}, 'bootstrap_storage()')
+
+--
+-- gh-147: refs allow to pin all the buckets on the storage at once. Is invented
+-- for map-reduce functionality to pin all buckets on all storages in the
+-- cluster to execute consistent map-reduce calls on all cluster data.
+--
+
+_ = test_run:switch('storage_1_a')
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1, 1500)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.rebalancer_disable()
+vshard.storage.bucket_force_create(1501, 1500)
+
+_ = test_run:switch('storage_1_a')
+lref = require('vshard.storage.ref')
+
+--
+-- Bucket moves are not allowed under a ref.
+--
+util = require('util')
+sid = 0
+rid = 0
+big_timeout = 1000000
+small_timeout = 0.001
+lref.add(rid, sid, big_timeout)
+-- Send fails.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+lref.use(rid, sid)
+-- Still fails - use only makes ref undead until it is deleted explicitly.
+ok, err = vshard.storage.bucket_send(1, util.replicasets[2],                    \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+
+_ = test_run:switch('storage_2_a')
+-- Receive (from another replicaset) also fails.
+big_timeout = 1000000
+ok, err = vshard.storage.bucket_send(1501, util.replicasets[1],                 \
+                                     {timeout = big_timeout})
+assert(not ok and err.message)
+
+--
+-- After unref all the bucket moves are allowed again.
+--
+_ = test_run:switch('storage_1_a')
+lref.del(rid, sid)
+
+vshard.storage.bucket_send(1, util.replicasets[2], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+--
+-- While bucket move is in progress, ref won't work.
+--
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+
+_ = test_run:switch('storage_1_a')
+fiber = require('fiber')
+_ = fiber.create(vshard.storage.bucket_send, 1, util.replicasets[2],            \
+                 {timeout = big_timeout})
+ok, err = lref.add(rid, sid, small_timeout)
+assert(not ok and err.message)
+-- Ref will wait if timeout is big enough.
+ok, err = nil
+_ = fiber.create(function()                                                     \
+    ok, err = lref.add(rid, sid, big_timeout)                                   \
+end)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
+
+_ = test_run:switch('storage_1_a')
+wait_bucket_is_collected(1)
+test_run:wait_cond(function() return ok or err end)
+lref.use(rid, sid)
+lref.del(rid, sid)
+assert(ok and not err)
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.bucket_send(1, util.replicasets[1], {timeout = big_timeout})
+wait_bucket_is_collected(1)
+
+--
+-- Refs are bound to sessions.
+--
+box.schema.user.grant('storage', 'super')
+lref = require('vshard.storage.ref')
+small_timeout = 0.001
+function make_ref(rid, timeout)                                                 \
+    return lref.add(rid, box.session.id(), timeout)                             \
+end
+function use_ref(rid)                                                           \
+    return lref.use(rid, box.session.id())                                      \
+end
+function del_ref(rid)                                                           \
+    return lref.del(rid, box.session.id())                                      \
+end
+
+_ = test_run:switch('storage_1_a')
+netbox = require('net.box')
+remote_uri = test_run:eval('storage_2_a', 'return box.cfg.listen')[1]
+c = netbox.connect(remote_uri)
+
+-- Ref is added and does not disappear anywhere on its own.
+c:call('make_ref', {1, small_timeout})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+-- Use works.
+c:call('use_ref', {1})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+-- Del works.
+c:call('del_ref', {1})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 0)
+_ = test_run:switch('storage_1_a')
+
+-- Expiration works. Try to add a second ref when the first one is expired - the
+-- first is collected and a subsequent use and del won't work.
+c:call('make_ref', {1, small_timeout})
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+fiber.sleep(small_timeout)
+c:call('make_ref', {2, small_timeout})
+ok, err = c:call('use_ref', {1})
+assert(ok == nil and err.message)
+ok, err = c:call('del_ref', {1})
+assert(ok == nil and err.message)
+_ = test_run:switch('storage_2_a')
+assert(lref.count == 1)
+_ = test_run:switch('storage_1_a')
+
+--
+-- Session disconnect keeps the refs, but the session is deleted when
+-- used ref count becomes 0. Unused refs don't prevent session deletion.
+--
+_ = test_run:switch('storage_2_a')
+keep_long_ref = true
+function long_ref_request(rid)                                                  \
+    local sid = box.session.id()                                                \
+    assert(lref.add(rid, sid, big_timeout))                                     \
+    assert(lref.use(rid, sid))                                                  \
+    while keep_long_ref do                                                      \
+        fiber.sleep(small_timeout)                                              \
+    end                                                                         \
+    assert(lref.del(rid, sid))                                                  \
+end
+
+_ = test_run:switch('storage_1_a')
+_ = c:call('long_ref_request', {3}, {is_async = true})
+c:call('make_ref', {4, big_timeout})
+
+_ = test_run:switch('storage_2_a')
+test_run:wait_cond(function() return lref.count == 2 end)
+
+_ = test_run:switch('storage_1_a')
+c:close()
+
+_ = test_run:switch('storage_2_a')
+-- Still 2 refs.
+assert(lref.count == 2)
+-- The long request ends and the session must be deleted - that was the last
+-- used ref.
+keep_long_ref = false
+test_run:wait_cond(function() return lref.count == 0 end)
+
+_ = test_run:switch("default")
+test_run:drop_cluster(REPLICASET_2)
+test_run:drop_cluster(REPLICASET_1)
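The disconnect semantics checked at the end of the test above (a dead session survives while it has used refs, then disappears together with its unused refs) can be modeled in a few lines of plain Lua. This is only a sketch for readers, not the vshard implementation: the `add`, `use`, `del`, `disconnect` and `drop_if_unused` helpers and the `total` counter are hypothetical stand-ins for `lref.add`, `lref.use`, `lref.del`, the on-disconnect trigger and `lref.count`, and deadlines are left out entirely.

```lua
-- Hypothetical model of session-bound refs; not the vshard implementation.
local total = 0       -- Refs in all sessions, plays the role of lref.count.
local sessions = {}   -- Session ID -> session object.

-- A disconnected session is dropped once none of its refs is in use.
local function drop_if_unused(sid)
    local s = sessions[sid]
    if s and s.is_disconnected and s.used == 0 then
        total = total - s.count
        sessions[sid] = nil
    end
end

local function add(sid, rid)
    local s = sessions[sid]
    if not s then
        s = {refs = {}, count = 0, used = 0, is_disconnected = false}
        sessions[sid] = s
    end
    s.refs[rid] = {is_used = false}
    s.count = s.count + 1
    total = total + 1
end

local function use(sid, rid)
    local s = sessions[sid]
    s.refs[rid].is_used = true
    s.used = s.used + 1
end

local function del(sid, rid)
    local s = sessions[sid]
    if s.refs[rid].is_used then
        s.used = s.used - 1
    end
    s.refs[rid] = nil
    s.count = s.count - 1
    total = total - 1
    drop_if_unused(sid)
end

local function disconnect(sid)
    sessions[sid].is_disconnected = true
    drop_if_unused(sid)
end

add(1, 3); use(1, 3)   -- A long request holds ref 3 in use.
add(1, 4)              -- Ref 4 is added but never used.
disconnect(1)
print(total)           -- 2: the used ref keeps the session alive.
del(1, 3)
print(total)           -- 0: the unused ref went away with the session.
```

The model keeps a per-session `used` counter separate from the total, which is the same split the patch describes for `ref_count_use` and `ref_count_total`.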
diff --git a/test/unit-tap/ref.test.lua b/test/unit-tap/ref.test.lua
new file mode 100755
index 0000000..fdd0477
--- /dev/null
+++ b/test/unit-tap/ref.test.lua
@@ -0,0 +1,228 @@
+#!/usr/bin/env tarantool
+
+local tap = require('tap')
+local fiber = require('fiber')
+local lregistry = require('vshard.registry')
+local lref = require('vshard.storage.ref')
+
+local big_timeout = 1000000
+local small_timeout = 0.000001
+local sid = 0
+local sid2 = 1
+local sid3 = 2
+
+--
+-- gh-147: refs pin all the buckets on the storage at once. They were invented
+-- for map-reduce: pinning all buckets on all storages in the cluster makes it
+-- possible to execute consistent map-reduce calls on all cluster data.
+--
+
+--
+-- Refs use storage API to get bucket space state and wait on its changes. That
+-- is not important for these unit tests, so the API is stubbed out here.
+--
+local function bucket_are_all_rw()
+    return true
+end
+
+lregistry.storage = {
+    bucket_are_all_rw = bucket_are_all_rw,
+}
+
+local function test_ref_basic(test)
+    test:plan(15)
+
+    local rid = 0
+    local ok, err
+    --
+    -- Basic ref/unref.
+    --
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(ok and not err, '+1 ref')
+    test:is(lref.count, 1, 'accounted')
+    ok, err = lref.use(rid, sid)
+    test:ok(ok and not err, 'use the ref')
+    test:is(lref.count, 1, 'but still accounted')
+    ok, err = lref.del(rid, sid)
+    test:ok(ok and not err, '-1 ref')
+    test:is(lref.count, 0, 'accounted')
+
+    --
+    -- Bad ref ID.
+    --
+    rid = 1
+    ok, err = lref.use(rid, sid)
+    test:ok(not ok and err, 'invalid RID at use')
+    ok, err = lref.del(rid, sid)
+    test:ok(not ok and err, 'invalid RID at del')
+
+    --
+    -- Bad session ID.
+    --
+    lref.kill(sid)
+    rid = 0
+    ok, err = lref.use(rid, sid)
+    test:ok(not ok and err, 'invalid SID at use')
+    ok, err = lref.del(rid, sid)
+    test:ok(not ok and err, 'invalid SID at del')
+
+    --
+    -- Duplicate ID.
+    --
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(ok and not err, 'add ref')
+    ok, err = lref.add(rid, sid, big_timeout)
+    test:ok(not ok and err, 'duplicate ref')
+    test:is(lref.count, 1, 'did not affect count')
+    test:ok(lref.use(rid, sid) and lref.del(rid, sid), 'del old ref')
+    test:is(lref.count, 0, 'accounted')
+end
+
+local function test_ref_incremental_gc(test)
+    test:plan(20)
+
+    --
+    -- Ref addition expires 2 old refs.
+    --
+    local ok, err
+    for i = 0, 2 do
+        assert(lref.add(i, sid, small_timeout))
+    end
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 3, 'expired refs are still here')
+    test:ok(lref.add(3, sid, 0), 'add new ref')
+    -- 3 + 1 new - 2 old = 2.
+    test:is(lref.count, 2, 'it collected 2 old refs')
+    -- Sleep again so the just created ref with 0 timeout becomes older than the
+    -- deadline.
+    fiber.sleep(small_timeout)
+    test:ok(lref.add(4, sid, 0), 'add new ref')
+    -- 2 + 1 new - 2 old = 1.
+    test:is(lref.count, 1, 'it collected 2 old refs')
+    test:ok(lref.del(4, sid), 'del the latest manually')
+
+    --
+    -- Incremental GC works fine if only one ref was GCed.
+    --
+    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
+    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
+    fiber.sleep(small_timeout)
+    test:ok(lref.add(2, sid, 0), 'add ref with 0 timeout')
+    test:is(lref.count, 2, 'collected 1 old ref, 1 is kept')
+    test:ok(lref.del(2, sid), 'del newest ref, it was not collected')
+    test:ok(lref.del(1, sid), 'del ref with big timeout')
+    test:is(lref.count, 0, 'all is deleted')
+
+    --
+    -- GC works fine when only one ref was left and it was expired.
+    --
+    test:ok(lref.add(0, sid, small_timeout), 'add ref with small timeout')
+    test:is(lref.count, 1, '1 ref total')
+    fiber.sleep(small_timeout)
+    test:ok(lref.add(1, sid, big_timeout), 'add ref with big timeout')
+    test:is(lref.count, 1, 'collected the old one')
+    lref.gc()
+    test:is(lref.count, 1, 'still 1 - timeout was big')
+    test:ok(lref.del(1, sid), 'delete it')
+    test:is(lref.count, 0, 'no refs')
+end
+
+local function test_ref_gc(test)
+    test:plan(7)
+
+    --
+    -- Generic GC works fine with multiple sessions.
+    --
+    assert(lref.add(0, sid, big_timeout))
+    assert(lref.add(1, sid, small_timeout))
+    assert(lref.add(0, sid3, small_timeout))
+    assert(lref.add(0, sid2, small_timeout))
+    assert(lref.add(1, sid2, big_timeout))
+    assert(lref.add(1, sid3, big_timeout))
+    test:is(lref.count, 6, 'add 6 refs total')
+    fiber.sleep(small_timeout)
+    lref.gc()
+    test:is(lref.count, 3, '3 collected')
+    test:ok(lref.del(0, sid), 'del first')
+    test:ok(lref.del(1, sid2), 'del second')
+    test:ok(lref.del(1, sid3), 'del third')
+    test:is(lref.count, 0, '3 deleted')
+    lref.gc()
+    test:is(lref.count, 0, 'gc on empty refs did not break anything')
+end
+
+local function test_ref_use(test)
+    test:plan(7)
+
+    --
+    -- Ref use updates the session heap.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.add(0, sid2, big_timeout))
+    test:is(lref.count, 2, 'add 2 refs')
+    test:ok(lref.use(0, sid), 'use one with small timeout')
+    lref.gc()
+    test:is(lref.count, 2, 'still 2 refs')
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 2, 'still 2 refs after sleep')
+    test:ok(lref.del(0, sid), 'del first')
+    test:ok(lref.del(0, sid2), 'del second')
+    test:is(lref.count, 0, 'now all is deleted')
+end
+
+local function test_ref_del(test)
+    test:plan(7)
+
+    --
+    -- Ref del updates the session heap.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.add(0, sid2, big_timeout))
+    test:is(lref.count, 2, 'add 2 refs')
+    test:ok(lref.del(0, sid), 'del with small timeout')
+    lref.gc()
+    test:is(lref.count, 1, '1 ref remains')
+    fiber.sleep(small_timeout)
+    test:is(lref.count, 1, '1 ref remains after sleep')
+    lref.gc()
+    test:is(lref.count, 1, '1 ref remains after sleep and gc')
+    test:ok(lref.del(0, sid2), 'del with big timeout')
+    test:is(lref.count, 0, 'now all is deleted')
+end
+
+local function test_ref_dead_session(test)
+    test:plan(4)
+
+    --
+    -- Session after disconnect still might have running requests. It must
+    -- be kept alive with its refs until the requests are done.
+    --
+    assert(lref.add(0, sid, small_timeout))
+    assert(lref.use(0, sid))
+    lref.kill(sid)
+    test:ok(lref.del(0, sid))
+
+    --
+    -- The dead session is kept only while the used requests are running. It is
+    -- deleted when use count becomes 0 even if there were unused refs.
+    --
+    assert(lref.add(0, sid, big_timeout))
+    assert(lref.add(1, sid, big_timeout))
+    assert(lref.use(0, sid))
+    lref.kill(sid)
+    test:is(lref.count, 2, '2 refs in a dead session')
+    test:ok(lref.del(0, sid), 'delete the used ref')
+    test:is(lref.count, 0, '0 refs - the unused ref was deleted with session')
+end
+
+local test = tap.test('ref')
+test:plan(6)
+
+test:test('basic', test_ref_basic)
+test:test('incremental gc', test_ref_incremental_gc)
+test:test('gc', test_ref_gc)
+test:test('use', test_ref_use)
+test:test('del', test_ref_del)
+test:test('dead session use', test_ref_dead_session)
+
+os.exit(test:check() and 0 or 1)
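For readers following test_ref_incremental_gc above, here is a minimal standalone sketch of the "collect at most two expired refs per added ref" idea in plain Lua, without vshard. It is a model under stated assumptions, not the patch's code: a sorted array stands in for vshard.heap, an explicit `now` argument stands in for fiber.clock(), expiry uses a strict `deadline < now` check, and the names `new_session`, `gc_step` and `add` are hypothetical.

```lua
-- Hypothetical model of incremental ref GC; not the vshard implementation.
local function new_session()
    -- refs: array sorted by ascending deadline; count: number of live refs.
    return {refs = {}, count = 0}
end

-- Drop at most `limit` refs whose deadline is strictly before `now`.
local function gc_step(session, now, limit)
    local refs = session.refs
    local dropped = 0
    while dropped < limit and refs[1] and refs[1].deadline < now do
        table.remove(refs, 1)
        dropped = dropped + 1
    end
    session.count = session.count - dropped
    return dropped
end

-- Add a ref, first collecting up to 2 expired ones: -2 for each +1.
local function add(session, id, now, timeout)
    gc_step(session, now, 2)
    local ref = {id = id, deadline = now + timeout}
    local refs = session.refs
    local pos = #refs + 1
    -- Find the insertion point that keeps the array sorted by deadline.
    while pos > 1 and refs[pos - 1].deadline > ref.deadline do
        pos = pos - 1
    end
    table.insert(refs, pos, ref)
    session.count = session.count + 1
end

local s = new_session()
for i = 0, 2 do
    add(s, i, 0, 1)      -- Three refs expiring at t = 1.
end
add(s, 3, 2, 0)          -- At t = 2: collects two expired refs, adds one.
print(s.count)           -- 3 + 1 new - 2 old = 2
```

Because every addition removes up to two expired entries, expired refs cannot pile up faster than they are collected, which is the property the unit test asserts with its "3 + 1 new - 2 old = 2" arithmetic.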
diff --git a/vshard/consts.lua b/vshard/consts.lua
index cf3f422..0ffe0e2 100644
--- a/vshard/consts.lua
+++ b/vshard/consts.lua
@@ -48,4 +48,5 @@ return {
     DISCOVERY_TIMEOUT = 10,
 
     TIMEOUT_INFINITY = 500 * 365 * 86400,
+    DEADLINE_INFINITY = math.huge,
 }
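A short sketch of why `math.huge` is a convenient "infinite" deadline: it compares greater than any finite clock value, so an object carrying it sorts last under a min-deadline comparator and never looks expired. The comparator below has the same shape as heap_min_deadline_cmp from ref.lua in this patch; the sample session tables and the `is_expired` helper are hypothetical and exist only for illustration.

```lua
local DEADLINE_INFINITY = math.huge

-- Same shape as the comparator the ref module passes to its heaps.
local function min_deadline_cmp(a, b)
    return a.deadline < b.deadline
end

-- Hypothetical helper: an object is expired once its deadline has passed.
local function is_expired(obj, now)
    return obj.deadline <= now
end

local empty_session = {deadline = DEADLINE_INFINITY}
local busy_session = {deadline = 1000.5}

print(min_deadline_cmp(busy_session, empty_session))  -- true
print(is_expired(empty_session, 1e300))               -- false
```

Unlike TIMEOUT_INFINITY, which is a large but finite duration, this value needs no arithmetic with the current clock before it can be compared.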
diff --git a/vshard/error.lua b/vshard/error.lua
index a6f46a9..b02bfe9 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -130,6 +130,25 @@ local error_message_template = {
         name = 'TOO_MANY_RECEIVING',
         msg = 'Too many receiving buckets at once, please, throttle'
     },
+    [26] = {
+        name = 'STORAGE_IS_REFERENCED',
+        msg = 'Storage is referenced'
+    },
+    [27] = {
+        name = 'STORAGE_REF_ADD',
+        msg = 'Can not add a storage ref: %s',
+        args = {'reason'},
+    },
+    [28] = {
+        name = 'STORAGE_REF_USE',
+        msg = 'Can not use a storage ref: %s',
+        args = {'reason'},
+    },
+    [29] = {
+        name = 'STORAGE_REF_DEL',
+        msg = 'Can not delete a storage ref: %s',
+        args = {'reason'},
+    },
 }
 
 --
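The three STORAGE_REF_* templates above each carry one `%s` placeholder named by `args = {'reason'}`. A minimal sketch of how such a template could be turned into a message, assuming plain `string.format` substitution: the `format_error` helper and the local `templates` table are illustrative only and do not reproduce the internals of vshard.error.

```lua
-- Illustrative copy of two of the new templates; not the vshard.error module.
local templates = {
    [26] = {name = 'STORAGE_IS_REFERENCED', msg = 'Storage is referenced'},
    [28] = {name = 'STORAGE_REF_USE', msg = 'Can not use a storage ref: %s',
            args = {'reason'}},
}

-- Hypothetical helper: substitute positional arguments into the template.
local function format_error(code, ...)
    local t = assert(templates[code], 'unknown error code')
    local nargs = t.args and #t.args or 0
    assert(select('#', ...) == nargs, 'wrong number of arguments')
    return {code = code, name = t.name, message = t.msg:format(...)}
end

print(format_error(28, 'no ref').message)
-- Can not use a storage ref: no ref
```

The printed message matches the text the functional test expects in test/storage/ref.result when an expired ref is used.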
diff --git a/vshard/storage/CMakeLists.txt b/vshard/storage/CMakeLists.txt
index 3f4ed43..7c1e97d 100644
--- a/vshard/storage/CMakeLists.txt
+++ b/vshard/storage/CMakeLists.txt
@@ -1,2 +1,2 @@
-install(FILES init.lua reload_evolution.lua
+install(FILES init.lua reload_evolution.lua ref.lua
         DESTINATION ${TARANTOOL_INSTALL_LUADIR}/vshard/storage)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index de05531..d023583 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -17,6 +17,7 @@ if rawget(_G, MODULE_INTERNALS) then
         'vshard.replicaset', 'vshard.util',
         'vshard.storage.reload_evolution',
         'vshard.lua_gc', 'vshard.rlist', 'vshard.registry',
+        'vshard.heap', 'vshard.storage.ref',
     }
     for _, module in pairs(vshard_modules) do
         package.loaded[module] = nil
@@ -30,6 +31,7 @@ local lreplicaset = require('vshard.replicaset')
 local util = require('vshard.util')
 local lua_gc = require('vshard.lua_gc')
 local lregistry = require('vshard.registry')
+local lref = require('vshard.storage.ref')
 local reload_evolution = require('vshard.storage.reload_evolution')
 local fiber_cond_wait = util.fiber_cond_wait
 local bucket_ref_new
@@ -1140,6 +1142,9 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
             return nil, lerror.vshard(lerror.code.WRONG_BUCKET, bucket_id, msg,
                                       from)
         end
+        if lref.count > 0 then
+            return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
+        end
         if is_this_replicaset_locked() then
             return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
         end
@@ -1441,6 +1446,9 @@ local function bucket_send_xc(bucket_id, destination, opts, exception_guard)
 
     local _bucket = box.space._bucket
     local bucket = _bucket:get({bucket_id})
+    if lref.count > 0 then
+        return nil, lerror.vshard(lerror.code.STORAGE_IS_REFERENCED)
+    end
     if is_this_replicaset_locked() then
         return nil, lerror.vshard(lerror.code.REPLICASET_IS_LOCKED)
     end
@@ -2528,6 +2536,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
         box.space._bucket:on_replace(nil, M.bucket_on_replace)
         M.bucket_on_replace = nil
     end
+    lref.cfg()
     if is_master then
         box.space._bucket:on_replace(bucket_generation_increment)
         M.bucket_on_replace = bucket_generation_increment
diff --git a/vshard/storage/ref.lua b/vshard/storage/ref.lua
new file mode 100644
index 0000000..a024d8e
--- /dev/null
+++ b/vshard/storage/ref.lua
@@ -0,0 +1,395 @@
+--
+-- 'Ref' module helps to ensure that all buckets on the storage stay writable
+-- while there is at least one ref on the storage.
+-- Having the storage referenced makes it possible to execute any kind of
+-- request on all the visible data in all spaces in locally stored buckets.
+-- This is useful when many buckets need to be accessed at once, especially
+-- when the exact bucket IDs are not known.
+--
+-- Refs have deadlines, so that the storage does not stay unable to move
+-- buckets until restart if a ref is not deleted due to an error in the user's
+-- code or a disconnect.
+--
+-- The disconnects and restarts mean the refs can't be global. Otherwise any
+-- kinds of global counters, uuids and so on, even paired with any ids from a
+-- client could clash between clients on their reconnects or storage restarts.
+-- Unless they establish a TCP-like session, which would be too complicated.
+--
+-- Instead, the refs are spread over the existing box sessions. This binds the
+-- refs of each client to its TCP connection, so there is no need to care how
+-- to make them unique across all sessions, how not to mess up the refs on
+-- restart, or how to drop the refs when a client disconnects.
+--
+
+local MODULE_INTERNALS = '__module_vshard_storage_ref'
+-- Update when the behaviour of anything in the file changes, to allow reload.
+local MODULE_VERSION = 1
+
+local lfiber = require('fiber')
+local lheap = require('vshard.heap')
+local lerror = require('vshard.error')
+local lconsts = require('vshard.consts')
+local lregistry = require('vshard.registry')
+local fiber_clock = lfiber.clock
+local fiber_yield = lfiber.yield
+local DEADLINE_INFINITY = lconsts.DEADLINE_INFINITY
+local LUA_CHUNK_SIZE = lconsts.LUA_CHUNK_SIZE
+
+--
+-- Binary heap sort. Object with the closest deadline should be on top.
+--
+local function heap_min_deadline_cmp(ref1, ref2)
+    return ref1.deadline < ref2.deadline
+end
+
+local M = rawget(_G, MODULE_INTERNALS)
+if not M then
+    M = {
+        module_version = MODULE_VERSION,
+        -- Total number of references in all sessions.
+        count = 0,
+        -- Heap of session objects. Each session has refs sorted by their
+        -- deadline. The sessions themselves are also sorted by deadlines.
+        -- Session deadline is defined as the closest deadline of all its refs.
+        -- Or infinity in case there are no refs in it.
+        session_heap = lheap.new(heap_min_deadline_cmp),
+        -- Map of session objects. This is used to get session object by its ID.
+        session_map = {},
+        -- On session disconnect trigger to kill the dead sessions. It is saved
+        -- here for the sake of future reload to be able to delete the old
+        -- on disconnect function before setting a new one.
+        on_disconnect = nil,
+    }
+else
+    -- No reload so far. This is a first version. Return as is.
+    return M
+end
+
+local function ref_session_new(sid)
+    -- Session object does not store its internal hot attributes in a table.
+    -- Because it would mean access to any session attribute would cost at least
+    -- one table indexing operation. Instead, all internal fields are stored as
+    -- upvalues referenced by the methods defined as closures.
+    --
+    -- This means session creation may not be very suitable for jitting, but it
+    -- is very rare, and the design optimizes the most common case.
+    --
+    -- Still the public functions take the 'self' object to look normal.
+    -- They even use it a bit.
+
+    -- Ref map to get ref object by its ID.
+    local ref_map = {}
+    -- Ref heap sorted by their deadlines.
+    local ref_heap = lheap.new(heap_min_deadline_cmp)
+    -- Total number of refs of the session. It is used to drop the session when
+    -- it is disconnected and has no refs anymore. Heap size can't be used
+    -- because not all refs are stored here.
+    local ref_count_total = 0
+    -- Number of refs in use. They are included into the total count. The used
+    -- refs are accounted explicitly in order to detect when a disconnected
+    -- session has no used refs anymore and can be deleted.
+    local ref_count_use = 0
+    -- When the session becomes disconnected, it must be deleted from the global
+    -- heap when all its used refs are gone.
+    local is_disconnected = false
+    -- Cache global session storages as upvalues to save on M indexing.
+    local global_heap = M.session_heap
+    local global_map = M.session_map
+
+    local function ref_session_discount(self, del_count)
+        local new_count = M.count - del_count
+        assert(new_count >= 0)
+        M.count = new_count
+
+        new_count = ref_count_total - del_count
+        assert(new_count >= 0)
+        ref_count_total = new_count
+    end
+
+    local function ref_session_delete_if_not_used(self)
+        if not is_disconnected or ref_count_use > 0 then
+            return
+        end
+        ref_session_discount(self, ref_count_total)
+        global_map[sid] = nil
+        global_heap:remove(self)
+    end
+
+    local function ref_session_update_deadline(self)
+        local ref = ref_heap:top()
+        if not ref then
+            self.deadline = DEADLINE_INFINITY
+            global_heap:update(self)
+        else
+            local deadline = ref.deadline
+            if deadline ~= self.deadline then
+                self.deadline = deadline
+                global_heap:update(self)
+            end
+        end
+    end
+
+    --
+    -- Garbage collect at most 2 expired refs. The idea is that there is no
+    -- dedicated fiber for expired refs collection. It would be too expensive to
+    -- wake up a fiber on each added or removed or updated ref.
+    --
+    -- Instead, ref GC is mostly incremental and works by the principle "remove
+    -- more than add". On each new ref added, two old refs try to expire. This
+    -- way refs don't stack infinitely, and the expired refs are eventually
+    -- removed. Because removal is faster than addition: -2 for each +1.
+    --
+    local function ref_session_gc_step(self, now)
+        -- These are 2 inlined iterations of the more general GC procedure.
+        -- The latter is not called in order to avoid a loop, additional
+        -- branches and variables.
+        if self.deadline > now then
+            return
+        end
+        local top = ref_heap:top()
+        ref_heap:remove_top()
+        ref_map[top.id] = nil
+        top = ref_heap:top()
+        if not top then
+            self.deadline = DEADLINE_INFINITY
+            global_heap:update(self)
+            ref_session_discount(self, 1)
+            return
+        end
+        local deadline = top.deadline
+        if deadline >= now then
+            self.deadline = deadline
+            global_heap:update(self)
+            ref_session_discount(self, 1)
+            return
+        end
+        ref_heap:remove_top()
+        ref_map[top.id] = nil
+        top = ref_heap:top()
+        if not top then
+            self.deadline = DEADLINE_INFINITY
+        else
+            self.deadline = top.deadline
+        end
+        global_heap:update(self)
+        ref_session_discount(self, 2)
+    end
+
+    --
+    -- GC expired refs until they end or the limit on the number of iterations
+    -- is exhausted. The limit is supposed to prevent too long GC which would
+    -- occupy TX thread unfairly.
+    --
+    -- Returns nil if nothing to GC, or number of iterations left from the
+    -- limit. The caller is supposed to yield when 0 is returned, and retry GC
+    -- until it returns nil.
+    -- The function itself does not yield, because it is used from a more
+    -- generic function GCing all sessions. If yielding were done here, it
+    -- would never happen when each session has fewer refs than the limit, even
+    -- if the total ref count is much bigger.
+    --
+    -- Besides, the session might be killed during general GC. There must not be
+    -- any yields in session methods so that dead sessions don't have to be
+    -- supported.
+    --
+    local function ref_session_gc(self, limit, now)
+        if self.deadline >= now then
+            return nil
+        end
+        local top = ref_heap:top()
+        local del = 1
+        local rest = 0
+        local deadline
+        repeat
+            ref_heap:remove_top()
+            ref_map[top.id] = nil
+            top = ref_heap:top()
+            if not top then
+                self.deadline = DEADLINE_INFINITY
+                rest = limit - del
+                break
+            end
+            deadline = top.deadline
+            if deadline >= now then
+                self.deadline = deadline
+                rest = limit - del
+                break
+            end
+            del = del + 1
+        until del >= limit
+        ref_session_discount(self, del)
+        global_heap:update(self)
+        return rest
+    end
+
+    local function ref_session_add(self, rid, deadline, now)
+        if ref_map[rid] then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_ADD,
+                                      'duplicate ref')
+        end
+        local ref = {
+            deadline = deadline,
+            id = rid,
+            -- Used by the heap.
+            index = -1,
+        }
+        ref_session_gc_step(self, now)
+        ref_map[rid] = ref
+        ref_heap:push(ref)
+        if deadline < self.deadline then
+            self.deadline = deadline
+            global_heap:update(self)
+        end
+        ref_count_total = ref_count_total + 1
+        M.count = M.count + 1
+        return true
+    end
+
+    --
+    -- Ref use means it can't be expired until deleted explicitly. Should be
+    -- done when the request affecting the whole storage starts. After use it is
+    -- important to call del - GC won't delete the ref automatically anymore,
+    -- unless the entire session is killed.
+    --
+    local function ref_session_use(self, rid)
+        local ref = ref_map[rid]
+        if not ref then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no ref')
+        end
+        ref_heap:remove(ref)
+        ref_session_update_deadline(self)
+        ref_count_use = ref_count_use + 1
+        return true
+    end
+
+    local function ref_session_del(self, rid)
+        local ref = ref_map[rid]
+        if not ref then
+            return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no ref')
+        end
+        ref_map[rid] = nil
+        if ref.index == -1 then
+            ref_session_update_deadline(self)
+            ref_session_discount(self, 1)
+            ref_count_use = ref_count_use - 1
+            ref_session_delete_if_not_used(self)
+        else
+            ref_heap:remove(ref)
+            ref_session_update_deadline(self)
+            ref_session_discount(self, 1)
+        end
+        return true
+    end
+
+    local function ref_session_kill(self)
+        assert(not is_disconnected)
+        is_disconnected = true
+        ref_session_delete_if_not_used(self)
+    end
+
+    -- Don't use __index. It is useless since all sessions use closures as
+    -- methods. Also it is probably slower, because each method call would
+    -- need to get the metatable, get __index, and find the method there. Now
+    -- it is only an index operation on the session object.
+    local session = {
+        deadline = DEADLINE_INFINITY,
+        -- Used by the heap.
+        index = -1,
+        -- Methods.
+        del = ref_session_del,
+        gc = ref_session_gc,
+        add = ref_session_add,
+        use = ref_session_use,
+        kill = ref_session_kill,
+    }
+    global_map[sid] = session
+    global_heap:push(session)
+    return session
+end
+
+local function ref_gc()
+    local session_heap = M.session_heap
+    local session = session_heap:top()
+    if not session then
+        return
+    end
+    local limit = LUA_CHUNK_SIZE
+    local now = fiber_clock()
+    repeat
+        limit = session:gc(limit, now)
+        if not limit then
+            return
+        end
+        if limit == 0 then
+            fiber_yield()
+            limit = LUA_CHUNK_SIZE
+            now = fiber_clock()
+        end
+        session = session_heap:top()
+    until not session
+end
+
+local function ref_add(rid, sid, timeout)
+    local now = fiber_clock()
+    local deadline = now + timeout
+    local ok, err, session
+    local storage = lregistry.storage
+    while not storage.bucket_are_all_rw() do
+        ok, err = storage.bucket_generation_wait(timeout)
+        if not ok then
+            return nil, err
+        end
+        now = fiber_clock()
+        timeout = deadline - now
+    end
+    session = M.session_map[sid]
+    if not session then
+        session = ref_session_new(sid)
+    end
+    return session:add(rid, deadline, now)
+end
+
+local function ref_use(rid, sid)
+    local session = M.session_map[sid]
+    if not session then
+        return nil, lerror.vshard(lerror.code.STORAGE_REF_USE, 'no session')
+    end
+    return session:use(rid)
+end
+
+local function ref_del(rid, sid)
+    local session = M.session_map[sid]
+    if not session then
+        return nil, lerror.vshard(lerror.code.STORAGE_REF_DEL, 'no session')
+    end
+    return session:del(rid)
+end
+
+local function ref_kill_session(sid)
+    local session = M.session_map[sid]
+    if session then
+        session:kill()
+    end
+end
+
+local function ref_on_session_disconnect()
+    ref_kill_session(box.session.id())
+end
+
+local function ref_cfg()
+    if M.on_disconnect then
+        pcall(box.session.on_disconnect, nil, M.on_disconnect)
+    end
+    box.session.on_disconnect(ref_on_session_disconnect)
+    M.on_disconnect = ref_on_session_disconnect
+end
+
+M.del = ref_del
+M.gc = ref_gc
+M.add = ref_add
+M.use = ref_use
+M.cfg = ref_cfg
+M.kill = ref_kill_session
+lregistry.storage_ref = M
+
+return M
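
Editorial sketch, not part of the patch: the "remove more than add" principle
described in the comments above can be illustrated in plain Lua. The names
below (session_new, gc_step, count) are hypothetical, a deadline-sorted list
stands in for the binary heap used by the patch, and a ref is assumed to be
expired once its deadline is strictly in the past.

```lua
-- Hypothetical stand-in for a ref session: refs are kept in a list sorted by
-- deadline instead of the heap used in the patch.
local function session_new()
    local refs = {}   -- Sorted ascending by deadline.
    local by_id = {}
    local session = {}

    -- Expire at most `max` refs whose deadline has already passed.
    local function gc_step(now, max)
        local removed = 0
        while removed < max and refs[1] and refs[1].deadline < now do
            by_id[refs[1].id] = nil
            table.remove(refs, 1)
            removed = removed + 1
        end
        return removed
    end

    -- Each addition first tries to expire two old refs: -2 for each +1.
    function session.add(id, deadline, now)
        if by_id[id] then
            return nil, 'duplicate ref'
        end
        gc_step(now, 2)
        local ref = {id = id, deadline = deadline}
        local pos = #refs + 1
        while pos > 1 and refs[pos - 1].deadline > deadline do
            pos = pos - 1
        end
        table.insert(refs, pos, ref)
        by_id[id] = ref
        return true
    end

    function session.count()
        return #refs
    end

    return session
end

local s = session_new()
-- Five refs which expire at time 1.
for id = 1, 5 do
    s.add(id, 1, 0)
end
-- Three additions at time 10 push the expired refs out without any
-- dedicated GC fiber.
for id = 6, 8 do
    s.add(id, 100, 10)
end
print(s.count())
```

With this input only the three live refs remain after the last addition, so
the expired ones never accumulate as long as new refs keep arriving.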

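Editorial sketch, not part of the patch: the contract between ref_session_gc()
and ref_gc() (return nil when there is nothing to collect, otherwise the rest
of the limit, and yield when the limit reaches 0) can be tried out with plain
Lua coroutines in place of Tarantool fibers. CHUNK, gc_some and gc_loop are
hypothetical names, and the list of numbers stands in for refs sorted by
deadline. Unlike ref_gc(), the sketch does not refresh the clock after a
yield.

```lua
-- Hypothetical stand-in for LUA_CHUNK_SIZE.
local CHUNK = 3

-- Remove expired deadlines from a sorted list, at most `limit` per call.
-- Returns nil if nothing is expired, otherwise what is left of the limit.
local function gc_some(items, limit, now)
    if not items[1] or items[1] >= now then
        return nil
    end
    local del = 0
    repeat
        table.remove(items, 1)
        del = del + 1
    until del >= limit or not items[1] or items[1] >= now
    return limit - del
end

-- Keep collecting; give up control each time a whole chunk is consumed.
local function gc_loop(items, now)
    local limit = CHUNK
    while true do
        limit = gc_some(items, limit, now)
        if not limit then
            return
        end
        if limit == 0 then
            coroutine.yield()
            limit = CHUNK
        end
    end
end

local deadlines = {1, 2, 3, 4, 5, 6, 7, 50}
local worker = coroutine.create(function()
    gc_loop(deadlines, 10)
end)
local resumes = 0
while coroutine.status(worker) ~= 'dead' do
    coroutine.resume(worker)
    resumes = resumes + 1
end
print(resumes, #deadlines)
```

Seven expired entries with a chunk of three take three resumes here: two full
chunks followed by a partial one, after which only the live deadline is left.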



* Re: [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map
  2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
                   ` (11 preceding siblings ...)
  2021-03-12 23:13 ` [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-28 18:17 ` Vladislav Shpilevoy via Tarantool-patches
  12 siblings, 0 replies; 47+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-28 18:17 UTC (permalink / raw)
  To: tarantool-patches, olegrok, yaroslav.dynnikov

Pushed to master, together with part 1.


Thread overview: 47+ messages
2021-02-23  0:15 [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 01/11] error: introduce vshard.error.timeout() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-24 21:46     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 10/11] sched: introduce vshard.storage.sched module Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
2021-02-24 21:50     ` Vladislav Shpilevoy via Tarantool-patches
2021-03-04 21:02   ` Oleg Babin via Tarantool-patches
2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
2021-02-24 22:04     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-25 12:43       ` Oleg Babin via Tarantool-patches
2021-02-26 23:58         ` Vladislav Shpilevoy via Tarantool-patches
2021-03-01 10:58           ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 02/11] storage: add helper for local functions invocation Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 03/11] storage: cache bucket count Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-24 21:47     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 04/11] registry: module for circular deps resolution Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 05/11] util: introduce safe fiber_cond_wait() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 06/11] util: introduce fiber_is_self_canceled() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 07/11] storage: introduce bucket_generation_wait() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 08/11] storage: introduce bucket_are_all_rw() Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:27   ` Oleg Babin via Tarantool-patches
2021-02-24 21:48     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-23  0:15 ` [Tarantool-patches] [PATCH vshard 09/11] ref: introduce vshard.storage.ref module Vladislav Shpilevoy via Tarantool-patches
2021-02-24 10:28   ` Oleg Babin via Tarantool-patches
2021-02-24 21:49     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-25 12:42       ` Oleg Babin via Tarantool-patches
2021-03-04 21:22   ` Oleg Babin via Tarantool-patches
2021-03-05 22:06     ` Vladislav Shpilevoy via Tarantool-patches
2021-03-09  8:03       ` Oleg Babin via Tarantool-patches
2021-03-21 18:49   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-12 23:13 ` [Tarantool-patches] [PATCH vshard 00/11] VShard Map-Reduce, part 2: Ref, Sched, Map Vladislav Shpilevoy via Tarantool-patches
2021-03-15  7:05   ` Oleg Babin via Tarantool-patches
2021-03-28 18:17 ` Vladislav Shpilevoy via Tarantool-patches
