[Tarantool-patches] [PATCH vshard 5/6] router: introduce automatic master discovery

Fri Jul 2 14:48:20 MSK 2021

Thanks for your patch. See 8 comments below.

On 02.07.2021 01:09, Vladislav Shpilevoy wrote:
> Part of #75
>
> @TarantoolBot document
> Title: VShard router master discovery
>
> Router used not to be able to find master nodes in the configured
> replicasets on its own. It relied only on how they were specified
> in the config.
>
> This becomes a problem when master changes and the change is not
> delivered to the router's config somewhy. For instance, the router
> does not rely on a central config provider. Or it does rely, but
> the provider can't deliver a new config due to any reason.
>
> This is getting especially tricky with built-in automatic master
> elections which are not supported by vshard yet, but they will be,
> they won't depend on any config. Master role won't be pinned to
> one node then.

Is it possible to implement some simple test that shows if RAFT changes

a leader, vshard catches this changes? Of course, it will be test for 
new Tarantool versions.

> Now there is a new feature to overcome the master search problem -
> configurable automatic master discovery on the router.
>
> Router goes to the replicasets, marked as having an auto master,
> finds a master in them, and periodically checks if the master is
> still a master.
>
> When a master in a replicaset stops being a master, the router
> walks all nodes of the replicaset and finds who is the new master.
>
> To turn the feature on there is a new option in router's config:
> `master = 'auto'`. It should be specified per-replicaset, and is
> not compatible with specifying a master manually.
>
> This is how a good config looks:
> ```
> config = {
>      sharding = {
>          <replicaset uuid> = {
>              master = 'auto',
>              replicas = {...},
>          },
>          ...
>      },
>      ...
> }
> ```
>
> This is how a bad config looks:
> ```
> config = {
>      sharding = {
>          <replicaset uuid> = {
>              master = 'auto',
>              replicas = {
>                  <replica uuid1> = {
>                      master = true,
>                      ...
>                  },
>                  <replica uuid2> = {
>                      master = false,
>                      ...
>                  },
>              },
>          },
>          ...
>      },
>      ...
> }
> ```
> It will not work, because either `master = 'auto'` can be
> specified, or the master is assigned manually. Not both at the
> same time.
> ---
>   test/router/master_discovery.result   | 514 ++++++++++++++++++++++++++
>   test/router/master_discovery.test.lua | 248 +++++++++++++
>   vshard/consts.lua                     |   5 +
>   vshard/error.lua                      |   5 +
>   vshard/replicaset.lua                 | 203 +++++++++-
>   vshard/router/init.lua                |  98 ++++-
>   6 files changed, 1051 insertions(+), 22 deletions(-)
>   create mode 100644 test/router/master_discovery.result
>   create mode 100644 test/router/master_discovery.test.lua
>
>
>
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index 47a893b..66a09ae 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -52,6 +52,11 @@ return {
>       DISCOVERY_WORK_STEP = 0.01,
>       DISCOVERY_TIMEOUT = 10,
>   
> +    MASTER_SEARCH_IDLE_INTERVAL = 5,
> +    MASTER_SEARCH_WORK_INTERVAL = 0.5,
> +    MASTER_SEARCH_BACKOFF_INTERVAL = 5,
> +    MASTER_SEARCH_TIEMOUT = 5,
> +
>       TIMEOUT_INFINITY = 500 * 365 * 86400,
>       DEADLINE_INFINITY = math.huge,
>   }
> diff --git a/vshard/error.lua b/vshard/error.lua
> index bcbcd71..e2d1a31 100644
> --- a/vshard/error.lua
> +++ b/vshard/error.lua
> @@ -153,6 +153,11 @@ local error_message_template = {
>           name = 'BUCKET_RECV_DATA_ERROR',
>           msg = 'Can not receive the bucket %s data in space "%s" at tuple %s: %s',
>           args = {'bucket_id', 'space', 'tuple', 'reason'},
> +    },
> +    [31] = {
> +        name = 'MULTIPLE_MASTERS_FOUND',
> +        msg = 'Found more than one master in replicaset %s on nodes %s and %s',
> +        args = {'replicaset_uuid', 'master1', 'master2'},
>       }
>   }

Is it possible to test this error?

>   
> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index fa048c9..7747258 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -25,6 +25,10 @@
>   --          }
>   --      },
>   --      master = <master server from the array above>,
> +--      master_cond = <condition variable signaled when the replicaset finds or
> +--                     changes its master>,
> +--      is_auto_master = <true when is configured to find the master on
> +--                        its own>,
>   --      replica = <nearest available replica object>,
>   --      balance_i = <index of a next replica in priority_list to
>   --                   use for a load-balanced request>,
> @@ -55,6 +59,8 @@ local luuid = require('uuid')
>   local ffi = require('ffi')
>   local util = require('vshard.util')
>   local fiber_clock = fiber.clock
> +local fiber_yield = fiber.yield
> +local fiber_cond_wait = util.fiber_cond_wait
>   local gsc = util.generate_self_checker
>   
>   --
> @@ -159,6 +165,26 @@ local function replicaset_connect_to_replica(replicaset, replica)
>       return conn
>   end
>   
> +local function replicaset_wait_master(replicaset, timeout)
> +    local master = replicaset.master
> +    -- Fast path - master is almost always known.
> +    if master then
> +        return master, timeout
> +    end
> +    -- Slow path.
> +    local deadline = fiber_clock() + timeout
> +    repeat
> +        if not replicaset.is_auto_master or
> +           not fiber_cond_wait(replicaset.master_cond, timeout) then
> +            return nil, lerror.vshard(lerror.code.MISSING_MASTER,
> +                                      replicaset.uuid)
> +        end
> +        timeout = deadline - fiber_clock()
> +        master = replicaset.master
> +    until master
> +    return master, timeout
> +end
> +
>   --
>   -- Create net.box connection to master.
>   --
> @@ -175,7 +201,13 @@ end
>   -- Wait until the master instance is connected.
>   --
>   local function replicaset_wait_connected(replicaset, timeout)
> -    return netbox_wait_connected(replicaset_connect_master(replicaset), timeout)
> +    local master
> +    master, timeout = replicaset_wait_master(replicaset, timeout)
> +    if not master then
> +        return nil, timeout
> +    end
> +    local conn = replicaset_connect_to_replica(replicaset, master)
> +    return netbox_wait_connected(conn, timeout)
>   end
>   
>   --
> @@ -345,18 +377,30 @@ local function replicaset_master_call(replicaset, func, args, opts)
>       assert(opts == nil or type(opts) == 'table')
>       assert(type(func) == 'string', 'function name')
>       assert(args == nil or type(args) == 'table', 'function arguments')
> -    local conn, err = replicaset_connect_master(replicaset)
> -    if not conn then
> -        return nil, err
> -    end
> -    if not opts then
> -        opts = {timeout = replicaset.master.net_timeout}
> -    elseif not opts.timeout then
> -        opts = table.copy(opts)
> -        opts.timeout = replicaset.master.net_timeout
> +    local master = replicaset.master
> +    if not master then
> +        opts = opts and table.copy(opts) or {}
> +        if opts.is_async then
> +            return nil, lerror.vshard(lerror.code.MISSING_MASTER,
> +                                      replicaset.uuid)
> +        end

Could / should we here wakeup master discovery if "auto" is specified?

> +        local timeout = opts.timeout or consts.MASTER_SEARCH_TIEMOUT
> +        master, timeout = replicaset_wait_master(replicaset, timeout)
> +        if not master then
> +            return nil, timeout
> +        end
> +        opts.timeout = timeout
> +    else
> +        if not opts then
> +            opts = {timeout = master.net_timeout}
> +        elseif not opts.timeout then
> +            opts = table.copy(opts)
> +            opts.timeout = master.net_timeout
> +        end
>       end
> +    replicaset_connect_to_replica(replicaset, master)
>       local net_status, storage_status, retval, error_object =
> -        replica_call(replicaset.master, func, args, opts)
> +        replica_call(master, func, args, opts)
>       -- Ignore net_status - master does not retry requests.
>       return storage_status, retval, error_object
>   end
> @@ -425,11 +469,6 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>                   return r
>               end
>           end
> -        local conn, err = replicaset_connect_master(replicaset)
> -        if not conn then
> -            return nil, err
> -        end
> -        return master
>       end

Why don't we need this part anymore?

>       return function(replicaset, func, args, opts)
> @@ -444,9 +483,13 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>           end
>           local end_time = fiber_clock() + timeout
>           while not net_status and timeout > 0 do
> -            replica, err = pick_next_replica(replicaset)
> +            replica = pick_next_replica(replicaset)
>               if not replica then
> -                return nil, err
> +                replica, timeout = replicaset_wait_master(replicaset, timeout)
> +                if not replica then
> +                    return nil, timeout
> +                end
> +                replicaset_connect_to_replica(replicaset, replica)
>               end
>               opts.timeout = timeout
>               net_status, storage_status, retval, err =
> @@ -508,7 +551,128 @@ local function rebind_replicasets(replicasets, old_replicasets)
>                   end
>               end
>           end
> +        if old_replicaset then
> +            -- Take a hint from the old replicaset who is the master now.
> +            if replicaset.is_auto_master then
> +                local master = old_replicaset.master
> +                if master then
> +                    replicaset.master = replicaset.replicas[master.uuid]
> +                end
> +            end
> +            -- Stop waiting for master in the old replicaset. Its running
> +            -- requests won't find it anyway. Auto search works only for the
> +            -- most actual replicaset objects.
> +            if old_replicaset.is_auto_master then
> +                old_replicaset.is_auto_master = false
> +                old_replicaset.master_cond:broadcast()
> +            end
> +        end
> +    end
> +end
> +
> +--
> +-- Check if the master is still master, and find a new master if there is no a
> +-- known one.
> +--
> +local function replicaset_locate_master(replicaset)
> +    local is_done = true
> +    local is_nop = true
> +    if not replicaset.is_auto_master then
> +        return is_done, is_nop
> +    end
> +    local func = 'vshard.storage._call'
> +    local args = {'info'}
> +    local const_timeout = consts.MASTER_SEARCH_TIEMOUT
> +    local sync_opts = {timeout = const_timeout}
> +    local ok, res, err, f
> +    local master = replicaset.master
> +    if master then
> +        ok, res, err = replica_call(master, func, args, sync_opts)
> +        if not ok then
> +            return is_done, is_nop, err
> +        end
> +        if res.is_master then
> +            return is_done, is_nop
> +        end
> +        log.info('Master of replicaset %s, node %s, has resigned. Trying '..
> +                 'to find a new one', replicaset.uuid, master.uuid)
> +        replicaset.master = nil
> +    end
> +    is_nop = false
> +
> +    local last_err
> +    local futures = {}
> +    local timeout = const_timeout
> +    local deadline = fiber_clock() + timeout
> +    local async_opts = {is_async = true}
> +    local replicaset_uuid = replicaset.uuid
> +    for replica_uuid, replica in pairs(replicaset.replicas) do
> +        local conn = replica.conn
> +        timeout, err = netbox_wait_connected(conn, timeout)
> +        if not timeout then
> +            last_err = err
> +            timeout = deadline - fiber_clock()
> +        else
> +            ok, f = pcall(conn.call, conn, func, args, async_opts)
> +            if not ok then
> +                last_err = lerror.make(f)
> +            else
> +                futures[replica_uuid] = f
> +            end
> +        end
> +    end
> +    local master_uuid
> +    for replica_uuid, f in pairs(futures) do
> +        if timeout < 0 then
> +            -- Netbox uses cond var inside, whose wait throws an error when gets
> +            -- a negative timeout. Avoid that.
> +            timeout = 0
> +        end
> +        res, err = f:wait_result(timeout)
> +        timeout = deadline - fiber_clock()
> +        if not res then
> +            f:discard()
> +            last_err = err
> +            goto next_wait

If timeout will be negative we anyway go to next_wait and turn it to 0 
at the next iteration.

> +        end
> +        res = res[1]
> +        if not res.is_master then
> +            goto next_wait
> +        end
> +        if not master_uuid then
> +            master_uuid = replica_uuid
> +            goto next_wait
> +        end
> +        is_done = false
> +        last_err = lerror.vshard(lerror.code.MULTIPLE_MASTERS_FOUND,
> +                                 replicaset_uuid, master_uuid, replica_uuid)
> +        do return is_done, is_nop, last_err end
> +        ::next_wait::
> +    end
> +    master = replicaset.replicas[master_uuid]
> +    if master then
> +        log.info('Found master for replicaset %s: %s', replicaset_uuid,
> +                 master_uuid)
> +        replicaset.master = master
> +        replicaset.master_cond:broadcast()
> +    else
> +        is_done = false
> +    end
> +    return is_done, is_nop, last_err
> +end
> +
> +local function locate_masters(replicasets)
> +    local is_all_done = true
> +    local is_all_nop = true
> +    local last_err
> +    for _, replicaset in pairs(replicasets) do
> +        local is_done, is_nop, err = replicaset_locate_master(replicaset)

I think we should log result of master discovery. Especially if error 
occurred.

> +        is_all_done = is_all_done and is_done
> +        is_all_nop = is_all_nop and is_nop
> +        last_err = err or last_err
> +        fiber_yield()
>       end
> +    return is_all_done, is_all_nop, last_err
>   end
>   
>   --
> @@ -729,6 +893,8 @@ local function buildall(sharding_cfg)
>               bucket_count = 0,
>               lock = replicaset.lock,
>               balance_i = 1,
> +            is_auto_master = replicaset.master == 'auto',
> +            master_cond = fiber.cond(),
>           }, replicaset_mt)
>           local priority_list = {}
>           for replica_uuid, replica in pairs(replicaset.replicas) do
> @@ -802,4 +968,5 @@ return {
>       wait_masters_connect = wait_masters_connect,
>       rebind_replicasets = rebind_replicasets,
>       outdate_replicasets = outdate_replicasets,
> +    locate_masters = locate_masters,
>   }
> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
> index 5e2a96b..9407ccd 100644
> --- a/vshard/router/init.lua
> +++ b/vshard/router/init.lua
> @@ -68,6 +68,8 @@ local ROUTER_TEMPLATE = {
>           replicasets = nil,
>           -- Fiber to maintain replica connections.
>           failover_fiber = nil,
> +        -- Fiber to watch for master changes and find new masters.
> +        master_search_fiber = nil,
>           -- Fiber to discovery buckets in background.
>           discovery_fiber = nil,
>           -- How discovery works. On - work infinitely. Off - no
> @@ -1030,6 +1032,93 @@ local function failover_f(router)
>       end
>   end
>   
> +--------------------------------------------------------------------------------
> +-- Master search
> +--------------------------------------------------------------------------------
> +
> +local function master_search_step(router)
> +    local ok, is_done, is_nop, err = pcall(lreplicaset.locate_masters,
> +                                           router.replicasets)
> +    if not ok then
> +        err = is_done
> +        is_done = false
> +        is_nop = false
> +    end
> +    return is_done, is_nop, err
> +end
> +
> +--
> +-- Master discovery background function. It is supposed to notice master changes
> +-- and find new masters in the replicasets, which are configured for that.
> +--
> +-- XXX: due to polling the search might notice master change not right when it
Is this TODO related to https://github.com/tarantool/vshard/issues/284 ?
> +-- happens. In future it makes sense to rewrite master search using
> +-- subscriptions. The problem is that at the moment of writing the subscriptions
> +-- are not working well in all Tarantool versions.
> +--
> +local function master_search_f(router)
> +    local module_version = M.module_version
> +    local is_in_progress = false
> +    while module_version == M.module_version do
> +        local timeout
> +        local start_time = fiber_clock()
> +        local is_done, is_nop, err = master_search_step(router)
> +        if err then
> +            log.error('Error during master search: %s', lerror.make(err))
> +        end
> +        if is_done then
> +            timeout = consts.MASTER_SEARCH_IDLE_INTERVAL
> +        elseif err then
> +            timeout = consts.MASTER_SEARCH_BACKOFF_INTERVAL
> +        else
> +            timeout = consts.MASTER_SEARCH_WORK_INTERVAL
> +        end
> +        if not is_in_progress then
> +            if not is_nop and is_done then
> +                log.info('Master search happened')
> +            elseif not is_done then
> +                log.info('Master search is started')
> +                is_in_progress = true
> +            end
> +        elseif is_done then
> +            log.info('Master search is finished')
> +            is_in_progress = false
> +        end
> +        local end_time = fiber_clock()
> +        local duration = end_time - start_time
> +        if not is_nop then
> +            log.verbose('Master search step took %s seconds. Next in %s '..
> +                        'seconds', duration, timeout)
> +        end
> +        lfiber.sleep(timeout)
> +    end
> +end
> +
> +local function master_search_set(router)
> +    local enable = false
> +    for _, rs in pairs(router.replicasets) do
> +        if rs.is_auto_master then
> +            enable = true
> +            break
> +        end
> +    end
> +    local search_fiber = router.master_search_fiber
> +    if enable and search_fiber == nil then
> +        log.info('Master auto search is enabled')
> +        router.master_search_fiber = util.reloadable_fiber_create(
> +            'vshard.master_search.' .. router.name, M, 'master_search_f',
> +            router)

On 1.10 fiber name is limited with 32 symbols.

tarantool> #'vshard.master_search.'
---
- 21
...

That mean that router name is limited with 11 chars.

It could be a problem. As I see you don't use {truncate = true} when 
assign a name

https://github.com/tarantool/vshard/blob/a394e3f4008a7436f011c81028dcaadc270eebf6/vshard/util.lua#L93

Also it's possible to wakeup discovery and rebalanser fibers. Probably 
we should give a way to

wakeup master_search_fiber.

> +    elseif not enable and search_fiber ~= nil then
> +        -- Do not make users pay for what they do not use - when the search is
> +        -- disabled for all replicasets, there should not be any fiber.
> +        log.info('Master auto search is disabled')
> +        if search_fiber:status() ~= 'dead' then
> +            search_fiber:cancel()
> +        end
> +        router.master_search_fiber = nil
> +    end
> +end
> +
>   --------------------------------------------------------------------------------
>   -- Configuration
>   --------------------------------------------------------------------------------
> @@ -1100,6 +1189,7 @@ local function router_cfg(router, cfg, is_reload)
>               'vshard.failover.' .. router.name, M, 'failover_f', router)
>       end
>       discovery_set(router, vshard_cfg.discovery_mode)
> +    master_search_set(router)
>   end
>   
>   --------------------------------------------------------------------------------
> @@ -1535,6 +1625,10 @@ end
>   --------------------------------------------------------------------------------
>   -- Module definition
>   --------------------------------------------------------------------------------
> +M.discovery_f = discovery_f
> +M.failover_f = failover_f
> +M.master_search_f = master_search_f
> +M.router_mt = router_mt
>   --
>   -- About functions, saved in M, and reloading see comment in
>   -- storage/init.lua.
> @@ -1556,10 +1650,6 @@ else
>       M.module_version = M.module_version + 1
>   end
>   
> -M.discovery_f = discovery_f
> -M.failover_f = failover_f
> -M.router_mt = router_mt
> -
>   module.cfg = legacy_cfg
>   module.new = router_new
>   module.internal = M