<div dir="ltr"><div dir="ltr"><div>And one more question.</div><div><br></div><div><div><div dir="ltr" data-smartmail="gmail_signature"><div dir="ltr"><span><div><div dir="ltr">Best regards</div><div>Yaroslav Dynnikov<br></div></div></span></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 5 Aug 2020 at 01:45, Yaroslav Dynnikov <<a href="mailto:yaroslav.dynnikov@tarantool.org" target="_blank">yaroslav.dynnikov@tarantool.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>Hi, Vlad</div><div><br></div><div>Thanks for the fix a lot. In general it looks good to me, but I've got few questions. Find them below.<br></div><div><br></div><div><div><div><div dir="ltr"><div dir="ltr"><span><div><div dir="ltr">Best regards</div><div>Yaroslav Dynnikov<br></div></div></span></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 5 Aug 2020 at 00:04, Vladislav Shpilevoy <<a href="mailto:v.shpilevoy@tarantool.org" target="_blank">v.shpilevoy@tarantool.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Before the patch a replica couldn't call vshard.storage.cfg()<br>

before the master is bootstrapped. That could lead to an exception<br>

in the middle of vshard.storage.cfg().<br>

<br>

It sounds not so bad and even looks unreachable, because replica<br>

bootstrap anyway is blocked in box.cfg() until a master node is<br>

started, otherwise the replica instance is terminated.<br>

But the box.cfg bootstrap != vshard bootstrap.<br>

<br>

It could be that both replica and master nodes are already<br>

bootstrapped and working, but vshard.storage.cfg() wasn't called<br>

yet. In that case vshard.storage.cfg() on the replica wouldn't<br>

block in box.cfg(), and would happily fail a few lines later in<br>

vshard.storage.cfg() at attempt to touch not existing<br>

box.space._bucket.<br>

<br>

That was the case for cartridge. The framework configures<br>

instances in its own way before vshard.storage.cfg(). Then it<br>

calls vshard.storage.cfg() on them in arbitrary order, and<br>

sometimes replicas were configured earlier than the master.<br>

<br>

The fail is fixed by simply skipping the _bucket's trigger<br>

installation on the replica node. Because so far it is not needed<br>

here for anything. The trigger's sole purpose is to increment<br>

bucket generation used only for DML on _bucket on the master node.<br>

<br>

Now vshard.storage.cfg() on replica is able to finish before<br>

master does the same. However the replica still won't be able to<br>

handle vshard.storage.* requests until receives vshard schema<br>

from master.<br>

<br>

Closes #237<br>

---<br>

Branch: <a href="http://github.com/tarantool/vshard/tree/gerold103/gh-237-boot-replica-first" rel="noreferrer" target="_blank">http://github.com/tarantool/vshard/tree/gerold103/gh-237-boot-replica-first</a><br>

Issue: <a href="https://github.com/tarantool/vshard/issues/237" rel="noreferrer" target="_blank">https://github.com/tarantool/vshard/issues/237</a><br>

<br>

 test/lua_libs/storage_template.lua      |  55 +++++++++++-<br>

 test/reload_evolution/storage.result    |  10 +++<br>

 test/reload_evolution/storage.test.lua  |   8 ++<br>

 test/router/boot_replica_first.result   | 112 ++++++++++++++++++++++++<br>

 test/router/boot_replica_first.test.lua |  42 +++++++++<br>

 vshard/storage/init.lua                 |  18 +++-<br>

 vshard/storage/reload_evolution.lua     |  10 ++-<br>

 7 files changed, 251 insertions(+), 4 deletions(-)<br>

 create mode 100644 test/router/boot_replica_first.result<br>

 create mode 100644 test/router/boot_replica_first.test.lua<br>

<br>

diff --git a/test/lua_libs/storage_template.lua b/test/lua_libs/storage_template.lua<br>

index 29a753d..84e4180 100644<br>

--- a/test/lua_libs/storage_template.lua<br>

+++ b/test/lua_libs/storage_template.lua<br>

@@ -1,5 +1,7 @@<br>

 #!/usr/bin/env tarantool<br>

<br>

+local luri = require('uri')<br>

+<br>

 NAME = require('fio').basename(arg[0], '.lua')<br>

 fiber = require('fiber')<br>

 test_run = require('test_run').new()<br>

@@ -10,12 +12,63 @@ log = require('log')<br>

 if not cfg.shard_index then<br>

     cfg.shard_index = 'bucket_id'<br>

 end<br>

+instance_uuid = util.name_to_uuid[NAME]<br>

+<br>

+--<br>

+-- Bootstrap the instance exactly like vshard does. But don't<br>

+-- initialize any vshard-specific code.<br>

+--<br>

+local function boot_like_vshard()<br>

+    assert(type(box.cfg) == 'function')<br>

+    for rs_uuid, rs in pairs(cfg.sharding) do<br>

+        for replica_uuid, replica in pairs(rs.replicas) do<br>

+            if replica_uuid == instance_uuid then<br>

+                local box_cfg = {replication = {}}<br>

+                box_cfg.instance_uuid = replica_uuid<br>

+                box_cfg.replicaset_uuid = rs_uuid<br>

+                box_cfg.listen = replica.uri<br>

+                box_cfg.read_only = not replica.master<br>

+                box_cfg.replication_connect_quorum = 0<br>

+                box_cfg.replication_timeout = 0.1<br>

+                for _, replica in pairs(rs.replicas) do<br>

+                    table.insert(box_cfg.replication, replica.uri)<br>

+                end<br>

+                box.cfg(box_cfg)<br>

+                if not replica.master then<br>

+                    return<br>

+                end<br>

+                local uri = luri.parse(replica.uri)<br>

+                box.schema.user.create(uri.login, {<br>

+                    password = uri.password, if_not_exists = true,<br>

+                })<br>

+                box.schema.user.grant(uri.login, 'super')<br>

+                return<br>

+            end<br>

+        end<br>

+    end<br>

+    assert(false)<br></blockquote><div><br></div><div>Should there be some meaningful message?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

+end<br>

+<br>

+local omit_cfg = false<br>

+local i = 1<br>

+while arg[i] ~= nil do<br>

+    local key = arg[i]<br>

+    i = i + 1<br>

+    if key == 'boot_before_cfg' then<br>

+        boot_like_vshard()<br>

+        omit_cfg = true<br>

+    end<br>

+end<br>

<br>

 vshard = require('vshard')<br>

 echo_count = 0<br>

 cfg.replication_connect_timeout = 3<br>

 cfg.replication_timeout = 0.1<br>

-vshard.storage.cfg(cfg, util.name_to_uuid[NAME])<br>

+<br>

+if not omit_cfg then<br>

+    vshard.storage.cfg(cfg, instance_uuid)<br>

+end<br>

+<br>

 function bootstrap_storage(engine)<br>

     box.once("testapp:schema:1", function()<br>

         if rawget(_G, 'CHANGE_SPACE_IDS') then<br>

diff --git a/test/reload_evolution/storage.result b/test/reload_evolution/storage.result<br>

index ebf4fdc..4652c4f 100644<br>

--- a/test/reload_evolution/storage.result<br>

+++ b/test/reload_evolution/storage.result<br>

@@ -92,6 +92,16 @@ test_run:grep_log('storage_2_a', 'vshard.storage.reload_evolution: upgraded to')<br>

 ...<br>

 vshard.storage.internal.reload_version<br>

 ---<br>

+- 2<br>

+...<br>

+--<br>

+-- gh-237: should be only one trigger. During gh-237 the trigger installation<br>

+-- became conditional and therefore required to remember the current trigger<br>

+-- somewhere. When reloaded from the old version, the trigger needed to be<br>

+-- fetched from _bucket:on_replace().<br>

+--<br>

+#box.space._bucket:on_replace()<br>

+---<br>

 - 1<br>

 ...<br>

 -- Make sure storage operates well.<br>

diff --git a/test/reload_evolution/storage.test.lua b/test/reload_evolution/storage.test.lua<br>

index 56c1693..06f7117 100644<br>

--- a/test/reload_evolution/storage.test.lua<br>

+++ b/test/reload_evolution/storage.test.lua<br>

@@ -37,6 +37,14 @@ package.loaded['vshard.storage'] = nil<br>

 vshard.storage = require("vshard.storage")<br>

 test_run:grep_log('storage_2_a', 'vshard.storage.reload_evolution: upgraded to') ~= nil<br>

 vshard.storage.internal.reload_version<br>

+--<br>

+-- gh-237: should be only one trigger. During gh-237 the trigger installation<br>

+-- became conditional and therefore required to remember the current trigger<br>

+-- somewhere. When reloaded from the old version, the trigger needed to be<br>

+-- fetched from _bucket:on_replace().<br>

+--<br>

+#box.space._bucket:on_replace()<br>

+<br>

 -- Make sure storage operates well.<br>

 vshard.storage.bucket_force_drop(2000)<br>

 vshard.storage.bucket_force_create(2000)<br>

diff --git a/test/router/boot_replica_first.result b/test/router/boot_replica_first.result<br>

new file mode 100644<br>

index 0000000..1705230<br>

--- /dev/null<br>

+++ b/test/router/boot_replica_first.result<br>

@@ -0,0 +1,112 @@<br>

+-- test-run result file version 2<br>

+test_run = require('test_run').new()<br>

+ | ---<br>

+ | ...<br>

+REPLICASET_1 = { 'box_1_a', 'box_1_b', 'box_1_c' }<br>

+ | ---<br>

+ | ...<br>

+test_run:create_cluster(REPLICASET_1, 'router', {args = 'boot_before_cfg'})<br>

+ | ---<br>

+ | ...<br>

+util = require('util')<br>

+ | ---<br>

+ | ...<br>

+util.wait_master(test_run, REPLICASET_1, 'box_1_a')<br>

+ | ---<br>

+ | ...<br>

+_ = test_run:cmd("create server router with script='router/router_2.lua'")<br>

+ | ---<br>

+ | ...<br>

+_ = test_run:cmd("start server router")<br>

+ | ---<br>

+ | ...<br>

+<br>

+--<br>

+-- gh-237: replica should be able to boot before master. Before the issue was<br>

+-- fixed, replica always tried to install a trigger on _bucket space even when<br>

+-- it was not created on a master yet - that lead to an exception in<br>

+-- storage.cfg. Now it should not install the trigger at all, because anyway it<br>

+-- is not needed on replica for anything.<br>

+--<br>

+<br>

+test_run:switch('box_1_b')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+vshard.storage.cfg(cfg, instance_uuid)<br>

+ | ---<br>

+ | ...<br>

+-- _bucket is not created yet. Will fail.<br>

+util.check_error(vshard.storage.call, 1, 'read', 'echo', {100})<br>

+ | ---<br>

+ | - attempt to index field '_bucket' (a nil value)<br>

+ | ...<br>

+<br>

+test_run:switch('default')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+util.map_evals(test_run, {REPLICASET_1}, 'bootstrap_storage(\'memtx\')')<br>

+ | ---<br>

+ | ...<br>

+<br>

+test_run:switch('box_1_a')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+vshard.storage.cfg(cfg, instance_uuid)<br>

+ | ---<br>

+ | ...<br>

+<br>

+test_run:switch('box_1_b')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+test_run:wait_lsn('box_1_b', 'box_1_a')<br>

+ | ---<br>

+ | ...<br>

+-- Fails, but gracefully. Because the bucket is not found here.<br>

+vshard.storage.call(1, 'read', 'echo', {100})<br>

+ | ---<br>

+ | - null<br>

+ | - bucket_id: 1<br>

+ |   reason: Not found<br>

+ |   code: 1<br>

+ |   type: ShardingError<br>

+ |   message: 'Cannot perform action with bucket 1, reason: Not found'<br>

+ |   name: WRONG_BUCKET<br>

+ | ...<br>

+-- Should not have triggers.<br>

+#box.space._bucket:on_replace()<br>

+ | ---<br>

+ | - 0<br>

+ | ...<br>

+<br>

+test_run:switch('router')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+vshard.router.bootstrap()<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+vshard.router.callro(1, 'echo', {100})<br>

+ | ---<br>

+ | - 100<br>

+ | ...<br>

+<br>

+test_run:switch("default")<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+test_run:cmd('stop server router')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+test_run:cmd('delete server router')<br>

+ | ---<br>

+ | - true<br>

+ | ...<br>

+test_run:drop_cluster(REPLICASET_1)<br>

+ | ---<br>

+ | ...<br>

diff --git a/test/router/boot_replica_first.test.lua b/test/router/boot_replica_first.test.lua<br>

new file mode 100644<br>

index 0000000..7b1b3fd<br>

--- /dev/null<br>

+++ b/test/router/boot_replica_first.test.lua<br>

@@ -0,0 +1,42 @@<br>

+test_run = require('test_run').new()<br>

+REPLICASET_1 = { 'box_1_a', 'box_1_b', 'box_1_c' }<br>

+test_run:create_cluster(REPLICASET_1, 'router', {args = 'boot_before_cfg'})<br>

+util = require('util')<br>

+util.wait_master(test_run, REPLICASET_1, 'box_1_a')<br>

+_ = test_run:cmd("create server router with script='router/router_2.lua'")<br>

+_ = test_run:cmd("start server router")<br>

+<br>

+--<br>

+-- gh-237: replica should be able to boot before master. Before the issue was<br>

+-- fixed, replica always tried to install a trigger on _bucket space even when<br>

+-- it was not created on a master yet - that lead to an exception in<br>

+-- storage.cfg. Now it should not install the trigger at all, because anyway it<br>

+-- is not needed on replica for anything.<br>

+--<br>

+<br>

+test_run:switch('box_1_b')<br>

+vshard.storage.cfg(cfg, instance_uuid)<br>

+-- _bucket is not created yet. Will fail.<br>

+util.check_error(vshard.storage.call, 1, 'read', 'echo', {100})<br>

+<br>

+test_run:switch('default')<br>

+util.map_evals(test_run, {REPLICASET_1}, 'bootstrap_storage(\'memtx\')')<br>

+<br>

+test_run:switch('box_1_a')<br>

+vshard.storage.cfg(cfg, instance_uuid)<br>

+<br>

+test_run:switch('box_1_b')<br>

+test_run:wait_lsn('box_1_b', 'box_1_a')<br>

+-- Fails, but gracefully. Because the bucket is not found here.<br>

+vshard.storage.call(1, 'read', 'echo', {100})<br>

+-- Should not have triggers.<br>

+#box.space._bucket:on_replace()<br>

+<br>

+test_run:switch('router')<br>

+vshard.router.bootstrap()<br>

+vshard.router.callro(1, 'echo', {100})<br>

+<br>

+test_run:switch("default")<br>

+test_run:cmd('stop server router')<br>

+test_run:cmd('delete server router')<br>

+test_run:drop_cluster(REPLICASET_1)<br>

diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua<br>

index c6a78fe..ed577f9 100644<br>

--- a/vshard/storage/init.lua<br>

+++ b/vshard/storage/init.lua<br>

@@ -94,6 +94,17 @@ if not M then<br>

         -- detect that _bucket was not changed between yields.<br>

         --<br>

         bucket_generation = 0,<br>

+        --<br>

+        -- Reference to the function used as on_replace trigger on<br>

+        -- _bucket space. It is used to replace the trigger with<br>

+        -- a new function when reload happens. It is kept<br>

+        -- explicitly because the old function is deleted on<br>

+        -- reload from the global namespace. On the other hand, it<br>

+        -- is still stored in _bucket:on_replace() somewhere, but<br>

+        -- it is not known where. The only 100% way to be able to<br>

+        -- replace the old function is to keep its reference.<br>

+        --<br>

+        bucket_on_replace = nil,<br>

<br>

         ------------------- Garbage collection -------------------<br>

         -- Fiber to remove garbage buckets data.<br>

@@ -2435,8 +2446,11 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)<br>

     local uri = luri.parse(this_replica.uri)<br>

     schema_upgrade(is_master, uri.login, uri.password)<br>

<br>

-    local old_trigger = box.space._bucket:on_replace()[1]<br>

-    box.space._bucket:on_replace(bucket_generation_increment, old_trigger)<br>

+    if is_master then<br>

+        local old_trigger = M.bucket_on_replace<br>

+        box.space._bucket:on_replace(bucket_generation_increment, old_trigger)<br>

+        M.bucket_on_replace = bucket_generation_increment<br>

+    end<br></blockquote></div></div></blockquote><div><br></div><div>It seems that the trigger is never removed. If a replica was a master some time ago, it still has the trigger.</div><div>Is it safe?<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

     lreplicaset.rebind_replicasets(new_replicasets, M.replicasets)<br>

     lreplicaset.outdate_replicasets(M.replicasets)<br>

diff --git a/vshard/storage/reload_evolution.lua b/vshard/storage/reload_evolution.lua<br>

index 5d09b11..f38af74 100644<br>

--- a/vshard/storage/reload_evolution.lua<br>

+++ b/vshard/storage/reload_evolution.lua<br>

@@ -17,6 +17,14 @@ migrations[#migrations + 1] = function(M)<br>

     -- Code to update Lua objects.<br>

 end<br>

<br>

+migrations[#migrations + 1] = function(M)<br>

+    local bucket = box.space._bucket<br>

+    if bucket then<br>

+        assert(M.bucket_on_replace == nil)<br>

+        M.bucket_on_replace = bucket:on_replace()[1]<br>

+    end<br>

+end<br>

+<br>

 --<br>

 -- Perform an update based on a version stored in `M` (internals).<br>

 -- @param M Old module internals which should be updated.<br>

@@ -33,7 +41,7 @@ local function upgrade(M)<br>

         log.error(err_msg)<br>

         error(err_msg)<br>

     end<br>

-    for i = start_version, #migrations  do<br>

+    for i = start_version + 1, #migrations do<br></blockquote><div><br></div><div>Is it already tested somewhere else (migrations I mean)? This change looks like a fix for some other problem.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

         local ok, err = pcall(migrations[i], M)<br>

         if ok then<br>

             <a href="http://log.info" rel="noreferrer" target="_blank">log.info</a>('vshard.storage.reload_evolution: upgraded to %d version',<br>

-- <br>

2.21.1 (Apple Git-122.3)<br>

<br>

</blockquote></div></div>

</blockquote></div></div>