Tarantool development patches archive
 help / color / mirror / Atom feed
* [tarantool-patches] [PATCH 0/2][vshard] preserve route map
@ 2018-06-15 12:47 AKhatskevich
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg AKhatskevich
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: AKhatskevich @ 2018-06-15 12:47 UTC (permalink / raw)
  To: v.shpilevoy, tarantool-patches

Branch: https://github.com/tarantool/vshard/tree/kh/gh-117-save-route-map
Issue: https://github.com/tarantool/vshard/issues/117

Commits:
 * Preserve route_map on router.cfg
 * Fix discovery/reconfigure race
    fixes race condition

 test/router/router.result   | 104 ++++++++++++++++++++++++++++++++++++++++++++
 test/router/router.test.lua |  66 ++++++++++++++++++++++++++++
 vshard/router/init.lua      |  22 ++++++++--
 3 files changed, 189 insertions(+), 3 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg
  2018-06-15 12:47 [tarantool-patches] [PATCH 0/2][vshard] preserve route map AKhatskevich
@ 2018-06-15 12:47 ` AKhatskevich
  2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 2/2] Fix discovery/reconfigure race AKhatskevich
  2018-06-21 12:54 ` [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map Vladislav Shpilevoy
  2 siblings, 1 reply; 14+ messages in thread
From: AKhatskevich @ 2018-06-15 12:47 UTC (permalink / raw)
  To: v.shpilevoy, tarantool-patches

Routes in `M.route_map` can be preserved during reconfigure process.
This is necessary to prevent a dramatic performance degradation just
	after reconfugure and reload.

Closes #117
---
 test/router/router.result   | 42 ++++++++++++++++++++++++++++++++++++++++++
 test/router/router.test.lua | 24 ++++++++++++++++++++++++
 vshard/router/init.lua      |  7 +++++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/test/router/router.result b/test/router/router.result
index 2ee1bff..5643f3e 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -1057,6 +1057,48 @@ error_messages
 - - Use replica:is_connected(...) instead of replica.is_connected(...)
   - Use replica:safe_uri(...) instead of replica.safe_uri(...)
 ...
+--
+-- gh-117: Preserve route_map on router.cfg.
+--
+bucket_to_old_rs = {}
+---
+...
+bucket_cnt = 0
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for bucket, rs in pairs(vshard.router.internal.route_map) do
+    bucket_to_old_rs[bucket] = rs
+    bucket_cnt = bucket_cnt + 1
+end;
+---
+...
+bucket_cnt;
+---
+- 3000
+...
+vshard.router.cfg(cfg);
+---
+...
+for bucket, old_rs in pairs(bucket_to_old_rs) do
+    local old_uuid = old_rs.uuid
+    local rs = vshard.router.internal.route_map[bucket]
+    if not rs or not old_uuid == rs.uuid then
+        error("Bucket lost during reconfigure.")
+    end
+    if rs == old_rs then
+        error("route_map was not updataed.")
+    end
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
 _ = test_run:cmd("switch default")
 ---
 ...
diff --git a/test/router/router.test.lua b/test/router/router.test.lua
index fae8e24..106f3d8 100644
--- a/test/router/router.test.lua
+++ b/test/router/router.test.lua
@@ -389,6 +389,30 @@ end;
 test_run:cmd("setopt delimiter ''");
 error_messages
 
+--
+-- gh-117: Preserve route_map on router.cfg.
+--
+bucket_to_old_rs = {}
+bucket_cnt = 0
+test_run:cmd("setopt delimiter ';'")
+for bucket, rs in pairs(vshard.router.internal.route_map) do
+    bucket_to_old_rs[bucket] = rs
+    bucket_cnt = bucket_cnt + 1
+end;
+bucket_cnt;
+vshard.router.cfg(cfg);
+for bucket, old_rs in pairs(bucket_to_old_rs) do
+    local old_uuid = old_rs.uuid
+    local rs = vshard.router.internal.route_map[bucket]
+    if not rs or not old_uuid == rs.uuid then
+        error("Bucket lost during reconfigure.")
+    end
+    if rs == old_rs then
+        error("route_map was not updataed.")
+    end
+end;
+test_run:cmd("setopt delimiter ''");
+
 _ = test_run:cmd("switch default")
 test_run:drop_cluster(REPLICASET_2)
 
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 21093e5..7e765fa 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -475,8 +475,6 @@ local function router_cfg(cfg)
     log.info("Box has been configured")
     M.total_bucket_count = total_bucket_count
     M.collect_lua_garbage = collect_lua_garbage
-    -- TODO: update existing route map in-place
-    M.route_map = {}
     M.replicasets = new_replicasets
     -- Move connections from an old configuration to a new one.
     -- It must be done with no yields to prevent usage both of not
@@ -489,6 +487,11 @@ local function router_cfg(cfg)
     for _, replicaset in pairs(new_replicasets) do
         replicaset:connect_all()
     end
+    -- Update existing route map in-place.
+    for bucket, rs in pairs(M.route_map) do
+        M.route_map[bucket] = M.replicasets[rs.uuid]
+    end
+
     lreplicaset.wait_masters_connect(new_replicasets)
     if M.failover_fiber == nil then
         log.info('Start failover fiber')
-- 
2.14.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-15 12:47 [tarantool-patches] [PATCH 0/2][vshard] preserve route map AKhatskevich
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg AKhatskevich
@ 2018-06-15 12:47 ` AKhatskevich
  2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
  2018-06-21 12:54 ` [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map Vladislav Shpilevoy
  2 siblings, 1 reply; 14+ messages in thread
From: AKhatskevich @ 2018-06-15 12:47 UTC (permalink / raw)
  To: v.shpilevoy, tarantool-patches

This commit prevents discovery fiber from discovering old replicasets
and spoiling `route_map`.
---
 test/router/router.result   | 62 +++++++++++++++++++++++++++++++++++++++++++++
 test/router/router.test.lua | 42 ++++++++++++++++++++++++++++++
 vshard/router/init.lua      | 15 ++++++++++-
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/test/router/router.result b/test/router/router.result
index 5643f3e..e61505e 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -1095,6 +1095,68 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
 end;
 ---
 ...
+--
+-- Check route_map is not filled with old replica objects after
+-- recpnfigure.
+--
+-- Perform #replicasets phases of discovery, to update replicasets
+-- object in for loop of discovery fiber since previous cfg.
+for _, __ in pairs(vshard.router.internal.replicasets) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+---
+...
+-- Simulate long `callro`.
+-- Stuck on first rs in replicasets.
+vshard.router.internal.errinj.LONG_DISCOVERY = true;
+---
+...
+for _, __ in pairs(vshard.router.internal.replicasets) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+---
+...
+vshard.router.cfg(cfg);
+---
+...
+vshard.router.internal.errinj.LONG_DISCOVERY = nil;
+---
+...
+-- Do discovery iteration.
+vshard.router.discovery_wakeup()
+fiber.sleep(0.02)
+
+rs_cnt = 0;
+---
+...
+new_replicasets = {}
+for _, rs in pairs(vshard.router.internal.replicasets) do
+    new_replicasets[rs] = true
+    rs_cnt = rs_cnt + 1
+end;
+---
+...
+rs_cnt;
+---
+- 2
+...
+bucket_cnt = 0;
+---
+...
+for bucket_id, rs in pairs(vshard.router.internal.route_map) do
+    if not new_replicasets[rs] then
+        error('Old object added to route_map.')
+    end
+    bucket_cnt = bucket_cnt + 1
+end;
+---
+...
+bucket_cnt;
+---
+- 3000
+...
 test_run:cmd("setopt delimiter ''");
 ---
 - true
diff --git a/test/router/router.test.lua b/test/router/router.test.lua
index 106f3d8..528a84b 100644
--- a/test/router/router.test.lua
+++ b/test/router/router.test.lua
@@ -411,6 +411,48 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
         error("route_map was not updataed.")
     end
 end;
+
+--
+-- Check route_map is not filled with old replica objects after
+-- recpnfigure.
+--
+
+-- Perform #replicasets phases of discovery, to update replicasets
+-- object in for loop of discovery fiber since previous cfg.
+for _, __ in pairs(vshard.router.internal.replicasets) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+-- Simulate long `callro`.
+-- Stuck on first rs in replicasets.
+vshard.router.internal.errinj.LONG_DISCOVERY = true;
+for _, __ in pairs(vshard.router.internal.replicasets) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+
+vshard.router.cfg(cfg);
+vshard.router.internal.errinj.LONG_DISCOVERY = nil;
+-- Do discovery iteration.
+vshard.router.discovery_wakeup()
+fiber.sleep(0.02)
+
+rs_cnt = 0;
+new_replicasets = {}
+for _, rs in pairs(vshard.router.internal.replicasets) do
+    new_replicasets[rs] = true
+    rs_cnt = rs_cnt + 1
+end;
+rs_cnt;
+bucket_cnt = 0;
+for bucket_id, rs in pairs(vshard.router.internal.route_map) do
+    if not new_replicasets[rs] then
+        error('Old object added to route_map.')
+    end
+    bucket_cnt = bucket_cnt + 1
+end;
+bucket_cnt;
+
 test_run:cmd("setopt delimiter ''");
 
 _ = test_run:cmd("switch default")
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 7e765fa..df5b343 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -127,10 +127,23 @@ local function discovery_f(module_version)
     local iterations_until_lua_gc =
         consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
     while module_version == M.module_version do
-        for _, replicaset in pairs(M.replicasets) do
+        local old_replicasets = M.replicasets
+        for rs_uuid, replicaset in pairs(M.replicasets) do
             local active_buckets, err =
                 replicaset:callro('vshard.storage.buckets_discovery', {},
                                   {timeout = 2})
+            while M.errinj.LONG_DISCOVERY do
+                -- Stuck on the first replicaset.
+                if rs_uuid ~= select(1, next(M.replicasets)) then
+                    break
+                end
+                lfiber.sleep(0.01)
+            end
+            -- Renew replicasets object in case of reconfigure
+            -- and reload events.
+            if M.replicasets ~= old_replicasets then
+                break
+            end
             if not active_buckets then
                 log.error('Error during discovery %s: %s', replicaset, err)
             else
-- 
2.14.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 1/2] Preserve route_map on router.cfg
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg AKhatskevich
@ 2018-06-21 12:54   ` Vladislav Shpilevoy
  2018-06-25 12:48     ` Alex Khatskevich
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-21 12:54 UTC (permalink / raw)
  To: tarantool-patches, AKhatskevich

Thanks for the patch! See 2 comments below.

On 15/06/2018 15:47, AKhatskevich wrote:
> Routes in `M.route_map` can be preserved during reconfigure process.
> This is necessary to prevent a dramatic performance degradation just
> 	after reconfugure and reload.

1. Problems with alignment.

2. Typo.

> 
> Closes #117
> ---
>   test/router/router.result   | 42 ++++++++++++++++++++++++++++++++++++++++++
>   test/router/router.test.lua | 24 ++++++++++++++++++++++++
>   vshard/router/init.lua      |  7 +++++--
>   3 files changed, 71 insertions(+), 2 deletions(-)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 2/2] Fix discovery/reconfigure race AKhatskevich
@ 2018-06-21 12:54   ` Vladislav Shpilevoy
  2018-06-25 12:48     ` Alex Khatskevich
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-21 12:54 UTC (permalink / raw)
  To: tarantool-patches, AKhatskevich

Thanks for the patch! See 5 comments below.

On 15/06/2018 15:47, AKhatskevich wrote:
> This commit prevents discovery fiber from discovering old replicasets
> and spoiling `route_map`.
> ---
>   test/router/router.result   | 62 +++++++++++++++++++++++++++++++++++++++++++++
>   test/router/router.test.lua | 42 ++++++++++++++++++++++++++++++
>   vshard/router/init.lua      | 15 ++++++++++-
>   3 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/test/router/router.result b/test/router/router.result
> index 5643f3e..e61505e 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -1095,6 +1095,68 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
>   end;
>   ---
>   ...
> +--
> +-- Check route_map is not filled with old replica objects after
> +-- recpnfigure.

1. Typo.

> +--
> +-- Perform #replicasets phases of discovery, to update replicasets
> +-- object in for loop of discovery fiber since previous cfg.
> +for _, __ in pairs(vshard.router.internal.replicasets) do
> +    vshard.router.discovery_wakeup()
> +    fiber.sleep(0.02)
> +end;
> +---
> +...
> +-- Simulate long `callro`.
> +-- Stuck on first rs in replicasets.
> +vshard.router.internal.errinj.LONG_DISCOVERY = true;

2. I do not see this error injection in the M.errinj
declaration.

> +---
> +...
> +for _, __ in pairs(vshard.router.internal.replicasets) do
> +    vshard.router.discovery_wakeup()
> +    fiber.sleep(0.02)
> +end;

3. This cycle makes no sense. With set LONG_DISCOVERY it is
equivalent to calling router.discovery_wakeup() once.

> +---
> +...
> +vshard.router.cfg(cfg);
> +---
> +...
> +vshard.router.internal.errinj.LONG_DISCOVERY = nil;
> +---
> +...
> +-- Do discovery iteration.
> +vshard.router.discovery_wakeup()
> +fiber.sleep(0.02)

4. Concrete timeouts are the way to create an unstable test.
Please, get rid of them and replace with 'while not cond do wait end'
where necessary.

> +
> +rs_cnt = 0;
> +---
> +...
> +new_replicasets = {}
> +for _, rs in pairs(vshard.router.internal.replicasets) do
> +    new_replicasets[rs] = true
> +    rs_cnt = rs_cnt + 1
> +end;
> +---
> +...
> +rs_cnt;
> +---
> +- 2
> +...
> +bucket_cnt = 0;
> +---
> +...
> +for bucket_id, rs in pairs(vshard.router.internal.route_map) do
> +    if not new_replicasets[rs] then> +        error('Old object added to route_map.')
> +    end
> +    bucket_cnt = bucket_cnt + 1
> +end;
> +---
> +...
> +bucket_cnt;
> +---
> +- 3000
> +...
>   test_run:cmd("setopt delimiter ''");
>   ---
>   - true
> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 7e765fa..df5b343 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -127,10 +127,23 @@ local function discovery_f(module_version)
>       local iterations_until_lua_gc =
>           consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
>       while module_version == M.module_version do
> -        for _, replicaset in pairs(M.replicasets) do
> +        local old_replicasets = M.replicasets
> +        for rs_uuid, replicaset in pairs(M.replicasets) do
>               local active_buckets, err =
>                   replicaset:callro('vshard.storage.buckets_discovery', {},
>                                     {timeout = 2})
> +            while M.errinj.LONG_DISCOVERY do
> +                -- Stuck on the first replicaset.
> +                if rs_uuid ~= select(1, next(M.replicasets)) then
> +                    break
> +                end
> +                lfiber.sleep(0.01)
> +            end
> +            -- Renew replicasets object in case of reconfigure
> +            -- and reload events.

5. You do not renew here anything.

> +            if M.replicasets ~= old_replicasets then
> +                break
> +            end
>               if not active_buckets then
>                   log.error('Error during discovery %s: %s', replicaset, err)
>               else
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map
  2018-06-15 12:47 [tarantool-patches] [PATCH 0/2][vshard] preserve route map AKhatskevich
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg AKhatskevich
  2018-06-15 12:47 ` [tarantool-patches] [PATCH 2/2] Fix discovery/reconfigure race AKhatskevich
@ 2018-06-21 12:54 ` Vladislav Shpilevoy
  2018-06-25 11:52   ` Alex Khatskevich
  2 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-21 12:54 UTC (permalink / raw)
  To: tarantool-patches, AKhatskevich

Please, describe a goal of the patchset here.

On 15/06/2018 15:47, AKhatskevich wrote:
> Branch: https://github.com/tarantool/vshard/tree/kh/gh-117-save-route-map
> Issue: https://github.com/tarantool/vshard/issues/117
> 
> Commits:
>   * Preserve route_map on router.cfg
>   * Fix discovery/reconfigure race
>      fixes race condition
> 
>   test/router/router.result   | 104 ++++++++++++++++++++++++++++++++++++++++++++
>   test/router/router.test.lua |  66 ++++++++++++++++++++++++++++
>   vshard/router/init.lua      |  22 ++++++++--
>   3 files changed, 189 insertions(+), 3 deletions(-)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map
  2018-06-21 12:54 ` [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map Vladislav Shpilevoy
@ 2018-06-25 11:52   ` Alex Khatskevich
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Khatskevich @ 2018-06-25 11:52 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

This commit prevents a dramatic performance degradation just
after reconfugure and reload by preserving routes to buckets.


On 21.06.2018 15:54, Vladislav Shpilevoy wrote:
> Please, describe a goal of the patchset here.
>
> On 15/06/2018 15:47, AKhatskevich wrote:
>> Branch: 
>> https://github.com/tarantool/vshard/tree/kh/gh-117-save-route-map
>> Issue: https://github.com/tarantool/vshard/issues/117
>>
>> Commits:
>>   * Preserve route_map on router.cfg
>>   * Fix discovery/reconfigure race
>>      fixes race condition
>>
>>   test/router/router.result   | 104 
>> ++++++++++++++++++++++++++++++++++++++++++++
>>   test/router/router.test.lua |  66 ++++++++++++++++++++++++++++
>>   vshard/router/init.lua      |  22 ++++++++--
>>   3 files changed, 189 insertions(+), 3 deletions(-)
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
@ 2018-06-25 12:48     ` Alex Khatskevich
  2018-06-26 11:11       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Khatskevich @ 2018-06-25 12:48 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


>> +--
>> +-- Check route_map is not filled with old replica objects after
>> +-- recpnfigure.
>
> 1. Typo.
fixed
>
>>
>> +-- Simulate long `callro`.
>> +-- Stuck on first rs in replicasets.
>> +vshard.router.internal.errinj.LONG_DISCOVERY = true;
>
> 2. I do not see this error injection in the M.errinj
> declaration.
Fixed
+Renamed to ERRINJ_LONG_DISCOVERY
>
>> +---
>> +...
>> +for _, __ in pairs(vshard.router.internal.replicasets) do
>> +    vshard.router.discovery_wakeup()
>> +    fiber.sleep(0.02)
>> +end;
>
> 3. This cycle makes no sense. With set LONG_DISCOVERY it is
> equivalent to calling router.discovery_wakeup() once.
Disagree.
I need to stuck on the first routemap element. However I need to wakeup 
it #replicasets
times, because the discovery_fiber sleeps after each iteration.
>
>> +---
>> +...
>> +vshard.router.cfg(cfg);
>> +---
>> +...
>> +vshard.router.internal.errinj.LONG_DISCOVERY = nil;
>> +---
>> +...
>> +-- Do discovery iteration.
>> +vshard.router.discovery_wakeup()
>> +fiber.sleep(0.02)
>
> 4. Concrete timeouts are the way to create an unstable test.
> Please, get rid of them and replace with 'while not cond do wait end'
> where necessary.
It is important that discovery fiber did only one iteration of 
discovery. So, If I want some `while` mechanism I want to introduce 
several synchronization and block/unblock phases.
I get you point and suggest compromise here.
1. I have places an assert here, which checks that expected discovery 
has occurred.
2. I have launched the test 100 times and it did not fail
>
>>   - true
>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>> index 7e765fa..df5b343 100644
>> --- a/vshard/router/init.lua
>> +++ b/vshard/router/init.lua
>> @@ -127,10 +127,23 @@ local function discovery_f(module_version)
>> +            -- Renew replicasets object in case of reconfigure
>> +            -- and reload events.
>
> 5. You do not renew here anything.
>
>> +            if M.replicasets ~= old_replicasets then
>> +                break
>> +            end
I have rewritten the comment. The replicasets variable captured by the
for loop should be updated by calling `break` and exiting the for loop.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 1/2] Preserve route_map on router.cfg
  2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
@ 2018-06-25 12:48     ` Alex Khatskevich
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Khatskevich @ 2018-06-25 12:48 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


>> Routes in `M.route_map` can be preserved during reconfigure process.
>> This is necessary to prevent a dramatic performance degradation just
>>     after reconfugure and reload.
>
> 1. Problems with alignment.
>
> 2. Typo.
Fixed

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-25 12:48     ` Alex Khatskevich
@ 2018-06-26 11:11       ` Vladislav Shpilevoy
  2018-06-26 14:03         ` Alex Khatskevich
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-26 11:11 UTC (permalink / raw)
  To: Alex Khatskevich, tarantool-patches

Thanks for the patch! See 5 comments below.

>>
>> 3. This cycle makes no sense. With set LONG_DISCOVERY it is
>> equivalent to calling router.discovery_wakeup() once.
> Disagree.
> I need to stuck on the first routemap element. However I need to wakeup it #replicasets
> times, because the discovery_fiber sleeps after each iteration.

1. I removed this cycle together with any wakeups and the test passed. So you are
wrong. Please, make the test be testing that discovery really works.

This is my diff that does not break the test:

	@@ -419,17 +419,11 @@ end;
	
	 -- Perform #replicasets phases of discovery, to update replicasets
	 -- object in for loop of discovery fiber since previous cfg.
	-for _, __ in pairs(vshard.router.internal.replicasets) do
	-    vshard.router.discovery_wakeup()
	-    fiber.sleep(0.02)
	-end;
	+vshard.router.discovery_wakeup();
	 -- Simulate long `callro`.
	 -- Stuck on first rs in replicasets.
	 vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
	-for _, __ in pairs(vshard.router.internal.replicasets) do
	-    vshard.router.discovery_wakeup()
	-    fiber.sleep(0.02)
	-end;
	+fiber.sleep(0.02);
	
	 vshard.router.cfg(cfg);
	 vshard.router.internal.route_map[333] = nil

>>
>>> +---
>>> +...
>>> +vshard.router.cfg(cfg);
>>> +---
>>> +...
>>> +vshard.router.internal.errinj.LONG_DISCOVERY = nil;
>>> +---
>>> +...
>>> +-- Do discovery iteration.
>>> +vshard.router.discovery_wakeup()
>>> +fiber.sleep(0.02)
>>
>> 4. Concrete timeouts are the way to create an unstable test.
>> Please, get rid of them and replace with 'while not cond do wait end'
>> where necessary.
> It is important that discovery fiber did only one iteration of discovery. So, If I want some `while` mechanism I want to introduce several synchronization and block/unblock phases.
> I get you point and suggest compromise here.
> 1. I have places an assert here, which checks that expected discovery has occurred.
> 2. I have launched the test 100 times and it did not fail

2. Please, just remove this single sleeps and do 'while cond do fiber.sleep(0.01) end'.
The fact that the test is passed on your computer means nothing. It can always pass on your
machine, but sometimes fail on mine, easily. Or on the travis, or on a customer machine.

You do not need any 'synchronization and block/unblock phases'. You already know what do
you expect to see after the sleep, so just put this condition in the while.

If you want to test some deadlines like it is done for auto network timeouts for the
router, then just remember the current time before while, afterwards and test that
the duration was < your limit.

I have already debugged several of such '100% passing tests with hard timeouts'
and do not want to do it again.

>>
>>>   - true
>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>> index 7e765fa..df5b343 100644
>>> --- a/vshard/router/init.lua
>>> +++ b/vshard/router/init.lua
>>> @@ -127,10 +127,23 @@ local function discovery_f(module_version)
>>> +            -- Renew replicasets object in case of reconfigure
>>> +            -- and reload events.
>>
>> 5. You do not renew here anything.
>>
>>> +            if M.replicasets ~= old_replicasets then
>>> +                break
>>> +            end
> I have rewritten the comment. The replicasets variable captured by the
> for loop should be updated by calling `break` and exiting the for loop.
> 
> 

3. Please, put a new patch version here.

> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 7e765fa..5fc0f59 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -127,10 +128,23 @@ local function discovery_f(module_version)
>      local iterations_until_lua_gc =
>          consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
>      while module_version == M.module_version do
> -        for _, replicaset in pairs(M.replicasets) do
> +        local old_replicasets = M.replicasets
> +        for rs_uuid, replicaset in pairs(M.replicasets) do
>              local active_buckets, err =
>                  replicaset:callro('vshard.storage.buckets_discovery', {},
>                                    {timeout = 2})
> +            while M.errinj.ERRINJ_LONG_DISCOVERY do
> +                -- Stuck on the first replicaset.

4. When you set this errinj, it sticks on the first rs with no special
checks and blocks other replicasets too. So remove this check. Just do
sleep until the errinj is set.

To prevent your dissension: I removed the check and the test passed too.
Together with the diff above about wakeups removal.

> +                if rs_uuid ~= select(1, next(M.replicasets)) then

5. select(1... here does nothing. It is equivalent to

     rs_uuid ~= next(M.replicasets)

You can start arguing, but do not. I have already checked it.

> +                    break
> +                end
> +                lfiber.sleep(0.01)
> +            end
> +            -- Renew replicasets object captured by the for loop
> +            -- in case of reconfigure and reload events.
> +            if M.replicasets ~= old_replicasets then
> +                break
> +            end
>              if not active_buckets then
>                  log.error('Error during discovery %s: %s', replicaset, err)
>              else

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-26 11:11       ` Vladislav Shpilevoy
@ 2018-06-26 14:03         ` Alex Khatskevich
  2018-06-27 11:45           ` Vladislav Shpilevoy
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Khatskevich @ 2018-06-26 14:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


> 1. I removed this cycle together with any wakeups and the test passed. 
> So you are
> wrong. Please, make the test be testing that discovery really works.
>
> 2. Please, just remove this single sleeps and do 'while cond do 
> fiber.sleep(0.01) end'.
> The fact that the test is passed on your computer means nothing. It 
> can always pass on your
> machine, but sometimes fail on mine, easily. Or on the travis, or on a 
> customer machine.
>
> You do not need any 'synchronization and block/unblock phases'. You 
> already know what do
> you expect to see after the sleep, so just put this condition in the 
> while.
>
> If you want to test some deadlines like it is done for auto network 
> timeouts for the
> router, then just remember the current time before while, afterwards 
> and test that
> the duration was < your limit.
>
> I have already debugged several of such '100% passing tests with hard 
> timeouts'
> and do not want to do it again.
>
Fixed as discussed verbally
> 3. Please, put a new patch version here.
below
>
>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>> index 7e765fa..5fc0f59 100644
>> --- a/vshard/router/init.lua
>> +++ b/vshard/router/init.lua
>> @@ -127,10 +128,23 @@ local function discovery_f(module_version)
>>      local iterations_until_lua_gc =
>>          consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
>>      while module_version == M.module_version do
>> -        for _, replicaset in pairs(M.replicasets) do
>> +        local old_replicasets = M.replicasets
>> +        for rs_uuid, replicaset in pairs(M.replicasets) do
>>              local active_buckets, err =
>> replicaset:callro('vshard.storage.buckets_discovery', {},
>>                                    {timeout = 2})
>> +            while M.errinj.ERRINJ_LONG_DISCOVERY do
>> +                -- Stuck on the first replicaset.
>
> 4. When you set this errinj, it sticks on the first rs with no special
> checks and blocks other replicasets too. So remove this check. Just do
> sleep until the errinj is set.
>
> To prevent your dissension: I removed the check and the test passed too.
> Together with the diff above about wakeups removal.

Fixed as discussed verbally
>
> 5. select(1... here does nothing. It is equivalent to
>
>     rs_uuid ~= next(M.replicasets)
deleted



Full diff:

commit bc90c3db5f5b1b663c747b8c2829fc8528af70cf
Author: AKhatskevich <avkhatskevich@tarantool.org>
Date:   Thu Jun 14 16:03:09 2018 +0300

     Fix discovery/reconfigure race

     This commit prevents discovery fiber from discovering old replicasets
     and spoiling `route_map`.

diff --git a/test/router/router.result b/test/router/router.result
index 5643f3e..36d54bf 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -1095,6 +1095,55 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
  end;
  ---
  ...
+--
+-- Check route_map is not filled with old replica objects after
+-- reconfigure.
+--
+-- Simulate long `callro`.
+vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
+---
+...
+while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 
'waiting' do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+---
+...
+vshard.router.cfg(cfg);
+---
+...
+route_map = vshard.router.internal.route_map
+for bucket_id, _ in pairs(route_map) do
+    route_map[bucket_id] = nil
+end;
+---
+...
+vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = false;
+---
+...
+-- Do discovery iteration. Upload buckets from the
+-- first replicaset.
+while not next(vshard.router.internal.route_map) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.01)
+end;
+---
+...
+new_replicasets = {};
+---
+...
+for _, rs in pairs(vshard.router.internal.replicasets) do
+    new_replicasets[rs] = true
+end;
+---
+...
+_, rs = next(vshard.router.internal.route_map);
+---
+...
+new_replicasets[rs] == true;
+---
+- true
+...
  test_run:cmd("setopt delimiter ''");
  ---
  - true
diff --git a/test/router/router.test.lua b/test/router/router.test.lua
index 106f3d8..843db46 100644
--- a/test/router/router.test.lua
+++ b/test/router/router.test.lua
@@ -411,6 +411,35 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
          error("route_map was not updataed.")
      end
  end;
+
+--
+-- Check route_map is not filled with old replica objects after
+-- reconfigure.
+--
+-- Simulate long `callro`.
+vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
+while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 
'waiting' do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.02)
+end;
+vshard.router.cfg(cfg);
+route_map = vshard.router.internal.route_map
+for bucket_id, _ in pairs(route_map) do
+    route_map[bucket_id] = nil
+end;
+vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = false;
+-- Do discovery iteration. Upload buckets from the
+-- first replicaset.
+while not next(vshard.router.internal.route_map) do
+    vshard.router.discovery_wakeup()
+    fiber.sleep(0.01)
+end;
+new_replicasets = {};
+for _, rs in pairs(vshard.router.internal.replicasets) do
+    new_replicasets[rs] = true
+end;
+_, rs = next(vshard.router.internal.route_map);
+new_replicasets[rs] == true;
  test_run:cmd("setopt delimiter ''");

  _ = test_run:cmd("switch default")
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 7e765fa..c125125 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -13,6 +13,7 @@ if not M then
          errinj = {
              ERRINJ_FAILOVER_CHANGE_CFG = false,
              ERRINJ_RELOAD = false,
+            ERRINJ_LONG_DISCOVERY = false,
          },
          -- Bucket map cache.
          route_map = {},
@@ -127,10 +128,19 @@ local function discovery_f(module_version)
      local iterations_until_lua_gc =
          consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
      while module_version == M.module_version do
-        for _, replicaset in pairs(M.replicasets) do
+        local old_replicasets = M.replicasets
+        for rs_uuid, replicaset in pairs(M.replicasets) do
              local active_buckets, err =
replicaset:callro('vshard.storage.buckets_discovery', {},
                                    {timeout = 2})
+            while M.errinj.ERRINJ_LONG_DISCOVERY do
+                lfiber.sleep(0.01)
+            end
+            -- Renew replicasets object captured by the for loop
+            -- in case of reconfigure and reload events.
+            if M.replicasets ~= old_replicasets then
+                break
+            end
              if not active_buckets then
                  log.error('Error during discovery %s: %s', replicaset, 
err)
              else

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-26 14:03         ` Alex Khatskevich
@ 2018-06-27 11:45           ` Vladislav Shpilevoy
  2018-06-27 19:50             ` Alex Khatskevich
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-27 11:45 UTC (permalink / raw)
  To: Alex Khatskevich, tarantool-patches

Thanks for the fixes! See 2 comments below.

> commit bc90c3db5f5b1b663c747b8c2829fc8528af70cf
> Author: AKhatskevich <avkhatskevich@tarantool.org>
> Date:   Thu Jun 14 16:03:09 2018 +0300
> 
>      Fix discovery/reconfigure race
> 
>      This commit prevents discovery fiber from discovering old replicasets
>      and spoiling `route_map`.
> 
> diff --git a/test/router/router.result b/test/router/router.result
> index 5643f3e..36d54bf 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -1095,6 +1095,55 @@ for bucket, old_rs in pairs(bucket_to_old_rs) do
>   end;
>   ---
>   ...
> +--
> +-- Check route_map is not filled with old replica objects after
> +-- reconfigure.
> +--
> +-- Simulate long `callro`.
> +vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
> +---
> +...
> +while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 'waiting' do

1. Why 'not value == needed_value'? Why not 'value ~= needed_value'? And how
does it work? I checked this thing in Lua and got these results:

-- Before cycle
tarantool> not true == 'waiting'
---
- false
...


-- After 'waiting' is set
tarantool> not 'waiting' == 'waiting'
---
- false
...


So this cycle is never run. And I proved it with putting assert(false)
into the first line - the test passed as well:

	[001] +++ router/router.reject	Wed Jun 27 14:39:30 2018
	[001] @@ -1104,6 +1104,7 @@
	[001]  ---
	[001]  ...
	[001]  while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 'waiting' do
	[001] +    assert(false)
	[001]      vshard.router.discovery_wakeup()
	[001]      fiber.sleep(0.02)
	[001]  end;
	[001]

Please, investigate why your test passes even without this cycle again.

> +    vshard.router.discovery_wakeup()
> +    fiber.sleep(0.02)
> +end;
> +---
> +...
> +vshard.router.cfg(cfg);
> +---
> +...
> +route_map = vshard.router.internal.route_map
> +for bucket_id, _ in pairs(route_map) do
> +    route_map[bucket_id] = nil
> +end;

2. Why not just 'vshard.router.internal.route_map = {}' ?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-27 11:45           ` Vladislav Shpilevoy
@ 2018-06-27 19:50             ` Alex Khatskevich
  2018-06-28 19:41               ` Vladislav Shpilevoy
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Khatskevich @ 2018-06-27 19:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


>>   ...
>> +--
>> +-- Check route_map is not filled with old replica objects after
>> +-- reconfigure.
>> +--
>> +-- Simulate long `callro`.
>> +vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
>> +---
>> +...
>> +while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 
>> 'waiting' do
>
> 1. Why 'not value == needed_value'? Why not 'value ~= needed_value'? 
> And how
> does it work? I checked this thing in Lua and got these results:
>
> Please, investigate why your test passes even without this cycle again.
Priorities...
Fixed.
Test was passing because it always passes on the new implementation, and 
fails only on old implementation and  this cycle makes the probability 
of failure 100% in this case.
After your finding, I have returned the old implementation and made sure 
that it really fails.
>
>> +    vshard.router.discovery_wakeup()
>> +    fiber.sleep(0.02)
>> +end;
>> +---
>> +...
>> +vshard.router.cfg(cfg);
>> +---
>> +...
>> +route_map = vshard.router.internal.route_map
>> +for bucket_id, _ in pairs(route_map) do
>> +    route_map[bucket_id] = nil
>> +end;
>
> 2. Why not just 'vshard.router.internal.route_map = {}' ?
Fixed

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race
  2018-06-27 19:50             ` Alex Khatskevich
@ 2018-06-28 19:41               ` Vladislav Shpilevoy
  0 siblings, 0 replies; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-06-28 19:41 UTC (permalink / raw)
  To: tarantool-patches, Alex Khatskevich

Thanks for the patch! I pushed it.

On 27/06/2018 22:50, Alex Khatskevich wrote:
> 
>>>   ...
>>> +--
>>> +-- Check route_map is not filled with old replica objects after
>>> +-- reconfigure.
>>> +--
>>> +-- Simulate long `callro`.
>>> +vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
>>> +---
>>> +...
>>> +while not vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY == 'waiting' do
>>
>> 1. Why 'not value == needed_value'? Why not 'value ~= needed_value'? And how
>> does it work? I checked this thing in Lua and got these results:
>>
>> Please, investigate why your test passes even without this cycle again.
> Priorities...
> Fixed.
> Test was passing because it always passes on the new implementation, and fails only on old implementation and  this cycle makes the probability of failure 100% in this case.
> After your finding, I have returned the old implementation and made sure that it really fails.
>>
>>> +    vshard.router.discovery_wakeup()
>>> +    fiber.sleep(0.02)
>>> +end;
>>> +---
>>> +...
>>> +vshard.router.cfg(cfg);
>>> +---
>>> +...
>>> +route_map = vshard.router.internal.route_map
>>> +for bucket_id, _ in pairs(route_map) do
>>> +    route_map[bucket_id] = nil
>>> +end;
>>
>> 2. Why not just 'vshard.router.internal.route_map = {}' ?
> Fixed
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-06-28 19:41 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-15 12:47 [tarantool-patches] [PATCH 0/2][vshard] preserve route map AKhatskevich
2018-06-15 12:47 ` [tarantool-patches] [PATCH 1/2] Preserve route_map on router.cfg AKhatskevich
2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
2018-06-25 12:48     ` Alex Khatskevich
2018-06-15 12:47 ` [tarantool-patches] [PATCH 2/2] Fix discovery/reconfigure race AKhatskevich
2018-06-21 12:54   ` [tarantool-patches] " Vladislav Shpilevoy
2018-06-25 12:48     ` Alex Khatskevich
2018-06-26 11:11       ` Vladislav Shpilevoy
2018-06-26 14:03         ` Alex Khatskevich
2018-06-27 11:45           ` Vladislav Shpilevoy
2018-06-27 19:50             ` Alex Khatskevich
2018-06-28 19:41               ` Vladislav Shpilevoy
2018-06-21 12:54 ` [tarantool-patches] Re: [PATCH 0/2][vshard] preserve route map Vladislav Shpilevoy
2018-06-25 11:52   ` Alex Khatskevich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox