[tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Tue Jun 26 14:11:42 MSK 2018


Thanks for the patch! See 5 comments below.

>>
>> 3. This cycle makes no sense. With set LONG_DISCOVERY it is
>> equivalent to calling router.discovery_wakeup() once.
> Disagree.
> I need to stuck on the first routemap element. However I need to wakeup it #replicasets
> times, because the discovery_fiber sleeps after each iteration.

1. I removed this cycle together with any wakeups and the test passed. So you are
wrong. Please, make the test be testing that discovery really works.

This is my diff that does not break the test:

	@@ -419,17 +419,11 @@ end;
	
	 -- Perform #replicasets phases of discovery, to update replicasets
	 -- object in for loop of discovery fiber since previous cfg.
	-for _, __ in pairs(vshard.router.internal.replicasets) do
	-    vshard.router.discovery_wakeup()
	-    fiber.sleep(0.02)
	-end;
	+vshard.router.discovery_wakeup();
	 -- Simulate long `callro`.
	 -- Stuck on first rs in replicasets.
	 vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true;
	-for _, __ in pairs(vshard.router.internal.replicasets) do
	-    vshard.router.discovery_wakeup()
	-    fiber.sleep(0.02)
	-end;
	+fiber.sleep(0.02);
	
	 vshard.router.cfg(cfg);
	 vshard.router.internal.route_map[333] = nil

>>
>>> +---
>>> +...
>>> +vshard.router.cfg(cfg);
>>> +---
>>> +...
>>> +vshard.router.internal.errinj.LONG_DISCOVERY = nil;
>>> +---
>>> +...
>>> +-- Do discovery iteration.
>>> +vshard.router.discovery_wakeup()
>>> +fiber.sleep(0.02)
>>
>> 4. Concrete timeouts are the way to create an unstable test.
>> Please, get rid of them and replace with 'while not cond do wait end'
>> where necessary.
> It is important that discovery fiber did only one iteration of discovery. So, If I want some `while` mechanism I want to introduce several synchronization and block/unblock phases.
> I get you point and suggest compromise here.
> 1. I have places an assert here, which checks that expected discovery has occurred.
> 2. I have launched the test 100 times and it did not fail

2. Please, just remove this single sleeps and do 'while cond do fiber.sleep(0.01) end'.
The fact that the test is passed on your computer means nothing. It can always pass on your
machine, but sometimes fail on mine, easily. Or on the travis, or on a customer machine.

You do not need any 'synchronization and block/unblock phases'. You already know what do
you expect to see after the sleep, so just put this condition in the while.

If you want to test some deadlines like it is done for auto network timeouts for the
router, then just remember the current time before while, afterwards and test that
the duration was < your limit.

I have already debugged several of such '100% passing tests with hard timeouts'
and do not want to do it again.

>>
>>>   - true
>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>> index 7e765fa..df5b343 100644
>>> --- a/vshard/router/init.lua
>>> +++ b/vshard/router/init.lua
>>> @@ -127,10 +127,23 @@ local function discovery_f(module_version)
>>> +            -- Renew replicasets object in case of reconfigure
>>> +            -- and reload events.
>>
>> 5. You do not renew here anything.
>>
>>> +            if M.replicasets ~= old_replicasets then
>>> +                break
>>> +            end
> I have rewritten the comment. The replicasets variable captured by the
> for loop should be updated by calling `break` and exiting the for loop.
> 
> 

3. Please, put a new patch version here.

> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 7e765fa..5fc0f59 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -127,10 +128,23 @@ local function discovery_f(module_version)
>      local iterations_until_lua_gc =
>          consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
>      while module_version == M.module_version do
> -        for _, replicaset in pairs(M.replicasets) do
> +        local old_replicasets = M.replicasets
> +        for rs_uuid, replicaset in pairs(M.replicasets) do
>              local active_buckets, err =
>                  replicaset:callro('vshard.storage.buckets_discovery', {},
>                                    {timeout = 2})
> +            while M.errinj.ERRINJ_LONG_DISCOVERY do
> +                -- Stuck on the first replicaset.

4. When you set this errinj, it sticks on the first rs with no special
checks and blocks other replicasets too. So remove this check. Just do
sleep until the errinj is set.

To prevent your dissension: I removed the check and the test passed too.
Together with the diff above about wakeups removal.

> +                if rs_uuid ~= select(1, next(M.replicasets)) then

5. select(1... here does nothing. It is equivalent to

     rs_uuid ~= next(M.replicasets)

You can start arguing, but do not. I have already checked it.

> +                    break
> +                end
> +                lfiber.sleep(0.01)
> +            end
> +            -- Renew replicasets object captured by the for loop
> +            -- in case of reconfigure and reload events.
> +            if M.replicasets ~= old_replicasets then
> +                break
> +            end
>              if not active_buckets then
>                  log.error('Error during discovery %s: %s', replicaset, err)
>              else





More information about the Tarantool-patches mailing list