[tarantool-patches] Re: [PATCH 2/2] Complete module reload

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sun Jun 17 22:46:21 MSK 2018


Thanks for the patch! See 11 comments below.

On 09/06/2018 20:47, AKhatskevich wrote:
> In case one need to upgrade vshard to a new version, this commit
> introduces a reload mechanism which allows to do that for a
> noticeable variety of possible changes (between the two versions).

Nitpicking: it does not introduce reload mechanism. This commit
just improves the thing.

Maybe should we place here a manual how to reload?
A short example.

> 
> Reload process:
>   * reload all vshard modules
>   * create new `replicaset` and `replica` objects
>   * reuse old netbox connections in new replica objects if
>     possible
>   * update router/storage.internal table
>   * after a `reload_deprecate_timeout` disable old instances of
>     `replicaset` and `replica` objects
> 
> Reload works for modules:
>   * vshard.router
>   * vshard.storage
> 
> In case reload process fails, old router/storage module continue
> working properly.
> 
> Extra changes:
>   * add `util.async_task` method, which runs a function after a
>     delay
>   * delete replicaset:rebind_connections method as it is replaced
>     with `rebind_replicasets` which updates all replicasets at once
>   * introduce `module_reloading` for distinguishing reloaf from

1. reloaf -> reload.

>     reconfigure
> * introduce MODULE_INTERNALS which stores name of the module
>    internal data in the global namespace
> 
> Closes #112
> ---
>   test/router/reload.result    | 109 +++++++++++++++++++++++++++++++++++++++++++
>   test/router/reload.test.lua  |  41 ++++++++++++++++
>   test/router/router.result    |   3 +-
>   test/storage/reload.result   |  61 ++++++++++++++++++++++++
>   test/storage/reload.test.lua |  24 ++++++++++
>   vshard/replicaset.lua        |  91 +++++++++++++++++++++++++++++-------
>   vshard/router/init.lua       |  40 ++++++++++++----
>   vshard/storage/init.lua      |  44 ++++++++++++-----
>   vshard/util.lua              |  20 ++++++++
>   9 files changed, 392 insertions(+), 41 deletions(-)
> 

I will review the tests later, after we finish with the code.

> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index 99f59aa..48053a0 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -48,7 +48,8 @@ local lerror = require('vshard.error')
>   local fiber = require('fiber')
>   local luri = require('uri')
>   local ffi = require('ffi')
> -local gsc = require('vshard.util').generate_self_checker
> +local util = require('vshard.util')
> +local gsc = util.generate_self_checker
>   
>   --
>   -- on_connect() trigger for net.box
> @@ -338,26 +339,79 @@ local function replicaset_tostring(replicaset)
>   end
>   
>   --
> --- Rebind connections of old replicas to new ones.
> +-- Deprecate old objects in case of reload.
> +--
> +local function clean_old_objects(old_replicaset_mt, old_replica_mt,
> +                                 old_replicas)
> +    local function get_deprecated_warning(obj_name)
> +        return function(...)
> +            local finfo = debug.getinfo(2)
> +            local file = finfo.short_src
> +            local name = finfo.name
> +            local line = finfo.currentline
> +            local err_fmt = '%s.%s:%s: Object %s is outdated. Use new instance'
> +            error(string.format(err_fmt, file, name, line, obj_name))
> +        end
> +    end

2. Too complex.

2.1 Why could not you just declare this function out of clean_old_object()?
It does not depend on any upvalue here. And it is parsed every time when
clean_old_objects is called slowing it down.

2.2. Too complex error message. Lets just throw error that object is
outdated.

> +    local function replace_mt_methods(mt, method)
> +        if not mt then
> +            return
> +        end
> +        for name, _ in pairs(mt.__index) do
> +            mt.__index[name] = method
> +        end
> +    end
> +    replace_mt_methods(old_replicaset_mt, get_deprecated_warning('replicaset'))
> +    replace_mt_methods(old_replica_mt, get_deprecated_warning('replica'))

3. Why replace_mt_methods needs old mt? Why not just replace the whole old
__index?

I thought you would do something like this:

getmetatable(old_object).__index = function()
     return function() error('The object is outdated') end
end

Any attempt to call a method in an old object leads to
throwing an error.

It allows to remove both replace_mt_methods, get_deprecated_warning.

And one more remark: our convention is to throw only OOM and misusage
errors. Here we should return two values: nil, error_object. Lets
add a new error code for this object.

> +    for _, replica in pairs(old_replicas) do
> +        replica.conn = nil
> +    end
> +    log.info('Old replicaset and replica objects are cleaned up.')
> +end
> +
>   --
> -local function replicaset_rebind_connections(replicaset)
> -    for _, replica in pairs(replicaset.replicas) do
> -        local old_replica = replica.old_replica
> -        if old_replica then
> -            local conn = old_replica.conn
> -            replica.conn = conn
> -            replica.down_ts = old_replica.down_ts
> -            replica.net_timeout = old_replica.net_timeout
> -            replica.net_sequential_ok = old_replica.net_sequential_ok
> -            replica.net_sequential_fail = old_replica.net_sequential_fail
> -            if conn then
> -                conn.replica = replica
> -                conn.replicaset = replicaset
> -                old_replica.conn = nil
> +-- Copy netbox conections from old replica objects to new ones.
> +--
> +local function rebind_replicasets(replicasets, deprecate, deprecate_timeout)

4. Lets always deprecate old objects. Reconfigure is like reload called from API.
And remove 'deprecate' parameter.

On reconfiguration deprecating is even more useful because if during the
reconfiguration a cluster topology is changed, then some of buckets can be
transferred making old routes (and thus old replicaset objects that
had been got by router.route() call) outdated.

> +    -- Collect data to deprecate old objects.
> +    local old_replicaset_mt = nil
> +    local old_replica_mt = nil
> +    local old_replicas = {}
> +    for _, replicaset in pairs(replicasets) do
> +        if replicaset.old_replicaset then
> +            local old_mt = getmetatable(replicaset.old_replicaset)
> +            assert(old_replicaset_mt == nil or old_replicaset_mt == old_mt)
> +            assert(not deprecate or
> +                   old_replicaset_mt ~= getmetatable(replicaset))
> +            old_replicaset_mt = old_mt
> +            replicaset.old_replicaset = nil
> +        end
> +        for _, replica in pairs(replicaset.replicas) do
> +            local old_replica = replica.old_replica
> +            if old_replica then
> +                local old_mt = getmetatable(old_replica)
> +                assert(old_replica_mt == nil or old_replica_mt == old_mt)
> +                assert(not deprecate or old_replica_mt ~= getmetatable(replica))
> +                old_replica_mt = old_mt
> +                table.insert(old_replicas, old_replica)
> +                local conn = old_replica.conn
> +                replica.conn = conn
> +                replica.down_ts = old_replica.down_ts
> +                replica.net_timeout = old_replica.net_timeout
> +                replica.net_sequential_ok = old_replica.net_sequential_ok
> +                replica.net_sequential_fail = old_replica.net_sequential_fail
> +                if conn then
> +                    conn.replica = replica
> +                    conn.replicaset = replicaset
> +                end
> +                replica.old_replica = nil
>               end
> -            replica.old_replica = nil
>           end
>       end
> +    if deprecate then
> +        util.async_task(deprecate_timeout, clean_old_objects, old_replicaset_mt,
> +                        old_replica_mt, old_replicas)

5. Here you have lost replicas and replicasets that had been removed from
the configuration. Not updated, but exactly removed. You iterate over new
replicasets only, that has not links to removed ones.

You should not deprecate old objects via links in new ones. You should have
two different methods: to schedule old objects deprecation and to rebind
connections. Looks, like rebind_connections should be returned.

> +    end
>   end
>   
>   --
> @@ -523,6 +576,7 @@ local function buildall(sharding_cfg, old_replicasets)
>               weight = replicaset.weight,
>               bucket_count = 0,
>               lock = replicaset.lock,
> +            old_replicaset = old_replicaset,

6. You should not have this link.

>           }, replicaset_mt)
>           local priority_list = {}
> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 1dee80c..c075436 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -129,6 +143,9 @@ local function discovery_f(module_version)
>           consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL
>       while module_version == M.module_version do
>           for _, replicaset in pairs(M.replicasets) do
> +            if module_version ~= M.module_version then
> +                return
> +            end

7. As I told you verbally, this check should be done after callro.
I can reload the module during callro(), and old replicaset object
will fill the route map. Can you please extract this 3-line fix in a
separate commit with a test?

>               local active_buckets, err =
>                   replicaset:callro('vshard.storage.buckets_discovery', {},
>                                     {timeout = 2})
> @@ -457,8 +474,11 @@ end
>   -- Configuration
>   --------------------------------------------------------------------------------
>   
> +-- Distinguish reload from reconfigure.
> +local module_reloading = true
>   local function router_cfg(cfg)
>       cfg = lcfg.check(cfg)
> +    local current_cfg = table.deepcopy(cfg)

8. cfg is already deepcopy of the original cfg.

>       if not M.replicasets then
>           log.info('Starting router configuration')
>       else
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 879c7c4..b570821 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -2,14 +2,27 @@ local log = require('log')
>   local luri = require('uri')
>   local lfiber = require('fiber')
>   local netbox = require('net.box') -- for net.box:self()
> +local trigger = require('internal.trigger')
> +
> +local MODULE_INTERNALS = '__module_vshard_storage'
> +-- Reload requirements, in case this module is reloaded manually.
> +if rawget(_G, MODULE_INTERNALS) then
> +    local vshard_modules = {
> +        'vshard.consts', 'vshard.error', 'vshard.cfg',
> +        'vshard.replicaset', 'vshard.util',
> +    }
> +    for _, module in pairs(vshard_modules) do
> +        package.loaded[module] = nil
> +    end
> +end
>   local consts = require('vshard.consts')
>   local lerror = require('vshard.error')
> -local util = require('vshard.util')
>   local lcfg = require('vshard.cfg')
>   local lreplicaset = require('vshard.replicaset')
> -local trigger = require('internal.trigger')
> +local util = require('vshard.util')
>   
> -local M = rawget(_G, '__module_vshard_storage')
> +

9. Extra empty line.

> +local M = rawget(_G, MODULE_INTERNALS)
>   if not M then
>       --
>       -- The module is loaded for the first time.
> @@ -21,6 +34,8 @@ if not M then
>           -- See format in replicaset.lua.
>           --
>           replicasets = nil,
> +        -- Time to deprecate old objects on reload.
> +        reload_deprecate_timeout = nil,

10. Looks like this thing is always 'nil' until somebody does
not manually set it via storage/router.internal. It is not good.
Please, move it to configuration and create a default value.

>           -- Triggers on master switch event. They are called right
>           -- before the event occurs.
>           _on_master_enable = trigger.new('_on_master_enable'),
> diff --git a/vshard/util.lua b/vshard/util.lua
> index bb71318..f5bd78c 100644
> --- a/vshard/util.lua
> +++ b/vshard/util.lua
> @@ -75,8 +75,28 @@ local function generate_self_checker(obj_name, func_name, mt, func)
>       end
>   end
>   
> +
> +--
> +-- Run a function without interrupting current fiber.
> +-- @param delay Delay in seconds before the task should be
> +--        executed.
> +-- @param task Function to be executed.
> +-- @param ... Arguments which would be passed to the `task`.
> +--
> +local function async_task(delay, task, ...)
> +    assert(delay == nil or type(delay) == 'number')
> +    local function wait_and_call(...)
> +        if delay then
> +            fiber.sleep(delay)
> +        end
> +        task(...)
> +    end

11. Please, avoid closures when possible. Every time the function is
called the closure is parsed again.

> +    fiber.create(wait_and_call, ...)
> +end
> +
>   return {
>       tuple_extract_key = tuple_extract_key,
>       reloadable_fiber_f = reloadable_fiber_f,
>       generate_self_checker = generate_self_checker,
> +    async_task = async_task,
>   }
> 




More information about the Tarantool-patches mailing list