Tarantool development patches archive
* [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1
@ 2021-12-04  0:19 Vladislav Shpilevoy via Tarantool-patches
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery Vladislav Shpilevoy via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-04  0:19 UTC (permalink / raw)
  To: tarantool-patches, olegrok

The patchset makes a couple of changes needed to properly implement the
replica backoff feature.

However, they are useful regardless of the backoff, so I am sending them right
away.

Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-298-replica-backoff-part-1
Issue: https://github.com/tarantool/vshard/issues/298

Vladislav Shpilevoy (2):
  router: drop wait_connected from master discovery
  router: don't fallback RO to master right away

 test/failover/failover.result         | 64 +++++++++++++------
 test/failover/failover.test.lua       | 30 +++++----
 test/router/master_discovery.result   | 92 +++++++++++++++++++++++----
 test/router/master_discovery.test.lua | 38 ++++++++---
 vshard/replicaset.lua                 | 27 ++++----
 5 files changed, 186 insertions(+), 65 deletions(-)

-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery
  2021-12-04  0:19 [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1 Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-04  0:19 ` Vladislav Shpilevoy via Tarantool-patches
  2021-12-05 17:44   ` Oleg Babin via Tarantool-patches
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away Vladislav Shpilevoy via Tarantool-patches
  2021-12-07 22:38 ` [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1 Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-04  0:19 UTC (permalink / raw)
  To: tarantool-patches, olegrok

Master discovery used to wait for connection establishment, up to
the discovery timeout, for each instance in the replicaset.

The problem is that if one of the replicas is dead, the discovery
wastes its entire timeout on that waiting alone. All the requests
sent to connected replicas after this one get a 0 timeout, so
their results are not properly waited for.

For example, this is how master discovery could work:

    send requests:
        replica1 wait_connected + send
        replica2 wait_connected fails on timeout
        replica3 wait_connected succeeds (already connected) + send

    collect responses:
        replica1 wait_result(0 timeout)
        replica2 skip
        replica3 wait_result(0 timeout)

The entire timeout was wasted on 'replica2 wait_connected' during
request sending. Replica1's result could still be delivered fine,
because its request was in progress while replica2 was waiting, so
a 0 timeout for it is not a problem: it had time to execute. But
replica3's request has very little chance of being delivered in
time. It was just sent and is collected almost immediately.

The worst case is when the first replica is dead. Then it is very
likely that none of the requests will be delivered, because all
the result wait timeouts are 0.
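
The budget accounting can be shown with a small standalone sketch
(the numbers and names below are made up for illustration, this is
not the actual vshard code):

    -- Shared 5 second discovery budget. The dead replica2 eats all
    -- of it in wait_connected(), so every replica after it is left
    -- with a 0 timeout for result collection.
    local timeout = 5.0
    local wait_costs = {
        {'replica1', 0.1}, -- connected, waits briefly
        {'replica2', 5.0}, -- dead, blocks until the deadline
        {'replica3', 0.0}, -- connected, but the budget is gone
    }
    for _, step in ipairs(wait_costs) do
        timeout = math.max(0, timeout - step[2])
        print(step[1], 'remaining budget:', timeout)
    end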

There is still a certain chance that the subsequent requests turn
out to be extra quick, so writing a stable test for this does not
seem possible.

The bug was discovered while working on #288. Testing it required
stopping one instance, and the master_discovery test started
failing.
---
 vshard/replicaset.lua | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 55028bd..174c761 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -682,11 +682,7 @@ local function replicaset_locate_master(replicaset)
     local replicaset_uuid = replicaset.uuid
     for replica_uuid, replica in pairs(replicaset.replicas) do
         local conn = replica.conn
-        timeout, err = netbox_wait_connected(conn, timeout)
-        if not timeout then
-            last_err = err
-            timeout = deadline - fiber_clock()
-        else
+        if conn:is_connected() then
             ok, f = pcall(conn.call, conn, func, args, async_opts)
             if not ok then
                 last_err = lerror.make(f)
-- 
2.24.3 (Apple Git-128)



* [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away
  2021-12-04  0:19 [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1 Vladislav Shpilevoy via Tarantool-patches
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-04  0:19 ` Vladislav Shpilevoy via Tarantool-patches
  2021-12-05 17:53   ` Oleg Babin via Tarantool-patches
  2021-12-07 22:38 ` [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1 Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-04  0:19 UTC (permalink / raw)
  To: tarantool-patches, olegrok

RO requests use the replica with the highest prio as specified in
the weights matrix.

If the best replica is not available and failover hasn't happened
yet, RO requests used to fall back to the master, even if other RO
replicas with better prio were available.

This patch makes an RO call first try the currently selected
highest-prio replica. If it is not available (no other replicas
are connected at all, or failover hasn't happened yet), the call
walks the prio list starting from this replica until it finds an
available one.

If that also fails, the call walks the list from the beginning,
hoping that the unavailable replica wasn't the best one and a
better option exists earlier in the prio list.
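
A condensed sketch of this two-phase walk (plain tables with boolean
fields instead of the real replica objects, for illustration only):

    local function pick_ro(replicaset, prefer_replica, master)
        local start_r = replicaset.replica
        local r = start_r
        -- Phase 1: walk from the currently selected replica down to
        -- the end of the prio list.
        while r do
            if r.is_connected and (not prefer_replica or r ~= master) then
                return r
            end
            r = r.next_by_priority
        end
        -- Phase 2: walk from the best-prio head up to the already
        -- checked part.
        for _, r2 in ipairs(replicaset.priority_list) do
            if r2 == start_r then
                break
            end
            if r2.is_connected and (not prefer_replica or r2 ~= master) then
                return r2
            end
        end
    end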

The patch was done in the scope of the replica backoff task (#298),
because the problem would also exist, and get worse, when the best
replica is in backoff rather than just disconnected.

Closes #288
Needed for #298
---
 test/failover/failover.result         | 64 +++++++++++++------
 test/failover/failover.test.lua       | 30 +++++----
 test/router/master_discovery.result   | 92 +++++++++++++++++++++++----
 test/router/master_discovery.test.lua | 38 ++++++++---
 vshard/replicaset.lua                 | 21 ++++--
 5 files changed, 185 insertions(+), 60 deletions(-)

diff --git a/test/failover/failover.result b/test/failover/failover.result
index bae57fa..31eda0c 100644
--- a/test/failover/failover.result
+++ b/test/failover/failover.result
@@ -85,8 +85,8 @@ test_run:cmd('switch default')
 --   available;
 -- * up nearest replica priority if the best one is available
 --   again;
--- * replicaset uses master connection, if the nearest's one is
---   not available before call();
+-- * replicaset uses the next available connection by prio, if the
+--   nearest one is not available before call();
 -- * current nearest connection is not down, when trying to
 --   connect to the replica with less weight.
 --
@@ -199,30 +199,17 @@ test_run:cmd('stop server box_1_d')
 - true
 ...
 -- Down_ts must be set in on_disconnect() trigger.
-while rs1.replica.down_ts == nil do fiber.sleep(0.1) end
+box_1_d = rs1.replicas[names.replica_uuid.box_1_d]
 ---
 ...
--- Try to execute read-only request - it must use master
--- connection, because a replica's one is not available.
-vshard.router.call(1, 'read', 'echo', {123})
----
-- 123
-...
-test_run:switch('box_1_a')
----
-- true
-...
-echo_count
----
-- 2
-...
-test_run:switch('router_1')
+test_run:wait_cond(function() return box_1_d.down_ts ~= nil end)
 ---
 - true
 ...
 -- New replica is box_1_b.
-while rs1.replica.name ~= 'box_1_b' do fiber.sleep(0.1) end
+test_run:wait_cond(function() return rs1.replica.name == 'box_1_b' end)
 ---
+- true
 ...
 rs1.replica.down_ts == nil
 ---
@@ -247,14 +234,49 @@ test_run:cmd('switch box_1_b')
 ...
 -- Ensure the 'read' echo was executed on box_1_b - nearest
 -- available replica.
-echo_count
+assert(echo_count == 1)
 ---
-- 1
+- true
+...
+test_run:switch('router_1')
+---
+- true
+...
+--
+-- Kill the best replica. There is no need to wait for failover before the
+-- router starts using the next best available replica.
+--
+test_run:cmd('stop server box_1_b')
+---
+- true
+...
+box_1_b = rs1.replicas[names.replica_uuid.box_1_b]
+---
+...
+test_run:wait_cond(function() return box_1_b.down_ts ~= nil end)
+---
+- true
+...
+vshard.router.callro(1, 'echo', {123})
+---
+- 123
+...
+test_run:switch('box_1_c')
+---
+- true
+...
+assert(echo_count == 1)
+---
+- true
 ...
 test_run:switch('router_1')
 ---
 - true
 ...
+test_run:cmd('start server box_1_b')
+---
+- true
+...
 -- Revive the best replica. A router must reconnect to it in
 -- FAILOVER_UP_TIMEOUT seconds.
 test_run:cmd('start server box_1_d')
diff --git a/test/failover/failover.test.lua b/test/failover/failover.test.lua
index a969e0e..b713319 100644
--- a/test/failover/failover.test.lua
+++ b/test/failover/failover.test.lua
@@ -43,8 +43,8 @@ test_run:cmd('switch default')
 --   available;
 -- * up nearest replica priority if the best one is available
 --   again;
--- * replicaset uses master connection, if the nearest's one is
---   not available before call();
+-- * replicaset uses the next available connection by prio, if the
+--   nearest one is not available before call();
 -- * current nearest connection is not down, when trying to
 --   connect to the replica with less weight.
 --
@@ -86,15 +86,10 @@ rs1.replica_up_ts - old_up_ts >= vshard.consts.FAILOVER_UP_TIMEOUT
 -- box_1_d.
 test_run:cmd('stop server box_1_d')
 -- Down_ts must be set in on_disconnect() trigger.
-while rs1.replica.down_ts == nil do fiber.sleep(0.1) end
--- Try to execute read-only request - it must use master
--- connection, because a replica's one is not available.
-vshard.router.call(1, 'read', 'echo', {123})
-test_run:switch('box_1_a')
-echo_count
-test_run:switch('router_1')
+box_1_d = rs1.replicas[names.replica_uuid.box_1_d]
+test_run:wait_cond(function() return box_1_d.down_ts ~= nil end)
 -- New replica is box_1_b.
-while rs1.replica.name ~= 'box_1_b' do fiber.sleep(0.1) end
+test_run:wait_cond(function() return rs1.replica.name == 'box_1_b' end)
 rs1.replica.down_ts == nil
 rs1.replica_up_ts ~= nil
 test_run:grep_log('router_1', 'New replica box_1_b%(storage%@')
@@ -103,8 +98,21 @@ vshard.router.callro(1, 'echo', {123})
 test_run:cmd('switch box_1_b')
 -- Ensure the 'read' echo was executed on box_1_b - nearest
 -- available replica.
-echo_count
+assert(echo_count == 1)
+test_run:switch('router_1')
+
+--
+-- gh-288: kill the best replica. There is no need to wait for failover before
+-- the router starts using the next best available replica.
+--
+test_run:cmd('stop server box_1_b')
+box_1_b = rs1.replicas[names.replica_uuid.box_1_b]
+test_run:wait_cond(function() return box_1_b.down_ts ~= nil end)
+vshard.router.callro(1, 'echo', {123})
+test_run:switch('box_1_c')
+assert(echo_count == 1)
 test_run:switch('router_1')
+test_run:cmd('start server box_1_b')
 
 -- Revive the best replica. A router must reconnect to it in
 -- FAILOVER_UP_TIMEOUT seconds.
diff --git a/test/router/master_discovery.result b/test/router/master_discovery.result
index 7ebb67d..94a848b 100644
--- a/test/router/master_discovery.result
+++ b/test/router/master_discovery.result
@@ -325,6 +325,22 @@ end)
 start_aggressive_master_search()
  | ---
  | ...
+test_run:cmd('stop server storage_1_b')
+ | ---
+ | - true
+ | ...
+rs1 = vshard.router.static.replicasets[util.replicasets[1]]
+ | ---
+ | ...
+replica = rs1.replicas[util.name_to_uuid.storage_1_b]
+ | ---
+ | ...
+-- Ensure the replica is not available. Otherwise RO requests sneak into it
+-- instead of waiting for master.
+test_run:wait_cond(function() return not replica:is_connected() end)
+ | ---
+ | - true
+ | ...
 
 forget_masters()
  | ---
@@ -337,8 +353,6 @@ vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)
 forget_masters()
  | ---
  | ...
--- XXX: this should not depend on master so much. RO requests should be able to
--- go to replicas.
 vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)
  | ---
  | - 1
@@ -357,8 +371,6 @@ vshard.router.route(1501):callrw('echo', {1}, opts_big_timeout)
 forget_masters()
  | ---
  | ...
--- XXX: the same as above - should not really wait for master. Regardless of it
--- being auto or not.
 vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
  | ---
  | - 1
@@ -369,6 +381,10 @@ vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
 stop_aggressive_master_search()
  | ---
  | ...
+test_run:cmd('start server storage_1_b')
+ | ---
+ | - true
+ | ...
 
 test_run:switch('storage_1_a')
  | ---
@@ -411,7 +427,7 @@ vshard.storage.cfg(cfg, instance_uuid)
  | ---
  | ...
 
--- Try to make RW and RO requests but then turn of the auto search.
+-- Try to make an RW request but then turn off the auto search.
 test_run:switch('router_1')
  | ---
  | - true
@@ -425,14 +441,6 @@ f1 = fiber.create(function()
 end)
  | ---
  | ...
--- XXX: should not really wait for master since this is an RO request. It could
--- use a replica.
-f2 = fiber.create(function()                                                    \
-    fiber.self():set_joinable(true)                                             \
-    return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
-end)
- | ---
- | ...
 fiber.sleep(0.01)
  | ---
  | ...
@@ -449,6 +457,31 @@ f1:join()
  |   replicaset_uuid: <replicaset_1>
  |   message: Master is not configured for replicaset <replicaset_1>
  | ...
+
+-- Try to make an RO request but then turn off the auto search.
+test_run:cmd('stop server storage_1_a')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server storage_1_b')
+ | ---
+ | - true
+ | ...
+forget_masters()
+ | ---
+ | ...
+f2 = fiber.create(function()                                                    \
+    fiber.self():set_joinable(true)                                             \
+    return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
+end)
+ | ---
+ | ...
+fiber.sleep(0.01)
+ | ---
+ | ...
+disable_auto_masters()
+ | ---
+ | ...
 f2:join()
  | ---
  | - true
@@ -459,6 +492,14 @@ f2:join()
  |   replicaset_uuid: <replicaset_1>
  |   message: Master is not configured for replicaset <replicaset_1>
  | ...
+test_run:cmd('start server storage_1_a')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server storage_1_b')
+ | ---
+ | - true
+ | ...
 
 --
 -- Multiple masters logging.
@@ -473,6 +514,9 @@ replicas = cfg.sharding[util.replicasets[1]].replicas
 replicas[util.name_to_uuid.storage_1_a].master = true
  | ---
  | ...
+replicas[util.name_to_uuid.storage_1_b].master = false
+ | ---
+ | ...
 vshard.storage.cfg(cfg, instance_uuid)
  | ---
  | ...
@@ -484,6 +528,9 @@ test_run:switch('storage_1_b')
 replicas = cfg.sharding[util.replicasets[1]].replicas
  | ---
  | ...
+replicas[util.name_to_uuid.storage_1_a].master = false
+ | ---
+ | ...
 replicas[util.name_to_uuid.storage_1_b].master = true
  | ---
  | ...
@@ -495,6 +542,25 @@ test_run:switch('router_1')
  | ---
  | - true
  | ...
+-- Ensure both replicas are connected. Otherwise the router can go to only one,
+-- find it is master, and won't go to the second one until the first one resigns
+-- or dies.
+rs1 = vshard.router.static.replicasets[util.replicasets[1]]
+ | ---
+ | ...
+replica1 = rs1.replicas[util.name_to_uuid.storage_1_a]
+ | ---
+ | ...
+replica2 = rs1.replicas[util.name_to_uuid.storage_1_b]
+ | ---
+ | ...
+test_run:wait_cond(function()                                                   \
+    return replica1:is_connected() and replica2:is_connected()                  \
+end)
+ | ---
+ | - true
+ | ...
+
 forget_masters()
  | ---
  | ...
diff --git a/test/router/master_discovery.test.lua b/test/router/master_discovery.test.lua
index 6276dc9..8cacd6d 100644
--- a/test/router/master_discovery.test.lua
+++ b/test/router/master_discovery.test.lua
@@ -181,24 +181,27 @@ end)
 -- Call tries to wait for master if has enough time left.
 --
 start_aggressive_master_search()
+test_run:cmd('stop server storage_1_b')
+rs1 = vshard.router.static.replicasets[util.replicasets[1]]
+replica = rs1.replicas[util.name_to_uuid.storage_1_b]
+-- Ensure the replica is not available. Otherwise RO requests sneak into it
+-- instead of waiting for master.
+test_run:wait_cond(function() return not replica:is_connected() end)
 
 forget_masters()
 vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)
 
 forget_masters()
--- XXX: this should not depend on master so much. RO requests should be able to
--- go to replicas.
 vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)
 
 forget_masters()
 vshard.router.route(1501):callrw('echo', {1}, opts_big_timeout)
 
 forget_masters()
--- XXX: the same as above - should not really wait for master. Regardless of it
--- being auto or not.
 vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
 
 stop_aggressive_master_search()
+test_run:cmd('start server storage_1_b')
 
 test_run:switch('storage_1_a')
 assert(echo_count == 4)
@@ -218,23 +221,30 @@ replicas = cfg.sharding[util.replicasets[1]].replicas
 replicas[util.name_to_uuid.storage_1_a].master = false
 vshard.storage.cfg(cfg, instance_uuid)
 
--- Try to make RW and RO requests but then turn of the auto search.
+-- Try to make an RW request but then turn off the auto search.
 test_run:switch('router_1')
 forget_masters()
 f1 = fiber.create(function()                                                    \
     fiber.self():set_joinable(true)                                             \
     return vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)            \
 end)
--- XXX: should not really wait for master since this is an RO request. It could
--- use a replica.
+fiber.sleep(0.01)
+disable_auto_masters()
+f1:join()
+
+-- Try to make an RO request but then turn off the auto search.
+test_run:cmd('stop server storage_1_a')
+test_run:cmd('stop server storage_1_b')
+forget_masters()
 f2 = fiber.create(function()                                                    \
     fiber.self():set_joinable(true)                                             \
     return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
 end)
 fiber.sleep(0.01)
 disable_auto_masters()
-f1:join()
 f2:join()
+test_run:cmd('start server storage_1_a')
+test_run:cmd('start server storage_1_b')
 
 --
 -- Multiple masters logging.
@@ -242,14 +252,26 @@ f2:join()
 test_run:switch('storage_1_a')
 replicas = cfg.sharding[util.replicasets[1]].replicas
 replicas[util.name_to_uuid.storage_1_a].master = true
+replicas[util.name_to_uuid.storage_1_b].master = false
 vshard.storage.cfg(cfg, instance_uuid)
 
 test_run:switch('storage_1_b')
 replicas = cfg.sharding[util.replicasets[1]].replicas
+replicas[util.name_to_uuid.storage_1_a].master = false
 replicas[util.name_to_uuid.storage_1_b].master = true
 vshard.storage.cfg(cfg, instance_uuid)
 
 test_run:switch('router_1')
+-- Ensure both replicas are connected. Otherwise the router can go to only one,
+-- find it is master, and won't go to the second one until the first one resigns
+-- or dies.
+rs1 = vshard.router.static.replicasets[util.replicasets[1]]
+replica1 = rs1.replicas[util.name_to_uuid.storage_1_a]
+replica2 = rs1.replicas[util.name_to_uuid.storage_1_b]
+test_run:wait_cond(function()                                                   \
+    return replica1:is_connected() and replica2:is_connected()                  \
+end)
+
 forget_masters()
 start_aggressive_master_search()
 test_run:wait_log('router_1', 'Found more than one master', nil, 10)
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 174c761..93f210e 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -455,18 +455,25 @@ local function replicaset_template_multicallro(prefer_replica, balance)
                     return r
                 end
             end
-        elseif prefer_replica then
-            r = replicaset.replica
+        else
+            local start_r = replicaset.replica
+            r = start_r
             while r do
-                if r:is_connected() and r ~= master then
+                if r:is_connected() and (not prefer_replica or r ~= master) then
                     return r
                 end
                 r = r.next_by_priority
             end
-        else
-            r = replicaset.replica
-            if r and r:is_connected() then
-                return r
+            -- The iteration above might not start from the best prio replica.
+            -- Check the beginning of the list too.
+            for _, r in pairs(replicaset.priority_list) do
+                if r == start_r then
+                    -- Reached already checked part.
+                    break
+                end
+                if r:is_connected() and (not prefer_replica or r ~= master) then
+                    return r
+                end
             end
         end
     end
-- 
2.24.3 (Apple Git-128)



* Re: [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-05 17:44   ` Oleg Babin via Tarantool-patches
  2021-12-06 23:34     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-12-05 17:44 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Thanks for your patch. Looks reasonable.

However, I have one question. Before this patch, if some
connection was down, we performed wait_connected, and in some
cases it could lead to a successful reconnection. Currently we
just skip broken connections and don't try to reconnect.

Could that be a problem, or do we perform reconnection in
another place? What will happen if all connections are down?


On 04.12.2021 03:19, Vladislav Shpilevoy wrote:
> Master discovery used to wait for connection establishment, up to
> the discovery timeout, for each instance in the replicaset.
>
> The problem is that if one of the replicas is dead, the discovery
> wastes its entire timeout on that waiting alone. All the requests
> sent to connected replicas after this one get a 0 timeout, so
> their results are not properly waited for.
>
> For example, this is how master discovery could work:
>
>      send requests:
>          replica1 wait_connected + send
>          replica2 wait_connected fails on timeout
>          replica3 wait_connected succeeds (already connected) + send
>
>      collect responses:
>          replica1 wait_result(0 timeout)
>          replica2 skip
>          replica3 wait_result(0 timeout)
>
> The entire timeout was wasted on 'replica2 wait_connected' during
> request sending. Replica1's result could still be delivered fine,
> because its request was in progress while replica2 was waiting, so
> a 0 timeout for it is not a problem: it had time to execute. But
> replica3's request has very little chance of being delivered in
> time. It was just sent and is collected almost immediately.
>
> The worst case is when the first replica is dead. Then it is very
> likely that none of the requests will be delivered, because all
> the result wait timeouts are 0.
>
> There is still a certain chance that the subsequent requests turn
> out to be extra quick, so writing a stable test for this does not
> seem possible.
>
> The bug was discovered while working on #288. Testing it required
> stopping one instance, and the master_discovery test started
> failing.
> ---
>   vshard/replicaset.lua | 6 +-----
>   1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index 55028bd..174c761 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -682,11 +682,7 @@ local function replicaset_locate_master(replicaset)
>       local replicaset_uuid = replicaset.uuid
>       for replica_uuid, replica in pairs(replicaset.replicas) do
>           local conn = replica.conn
> -        timeout, err = netbox_wait_connected(conn, timeout)
> -        if not timeout then
> -            last_err = err
> -            timeout = deadline - fiber_clock()
> -        else
> +        if conn:is_connected() then
>               ok, f = pcall(conn.call, conn, func, args, async_opts)
>               if not ok then
>                   last_err = lerror.make(f)


* Re: [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-05 17:53   ` Oleg Babin via Tarantool-patches
  2021-12-06 23:56     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-12-05 17:53 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Thanks for your patch. See one comment below.

Also, as I can see, CI is red on your branch. Is it related to your changes?


On 04.12.2021 03:19, Vladislav Shpilevoy wrote:
> RO requests use the replica with the highest prio as specified in
> the weights matrix.
>
> If the best replica is not available and failover hasn't happened
> yet, RO requests used to fall back to the master, even if other RO
> replicas with better prio were available.
>
> This patch makes an RO call first try the currently selected
> highest-prio replica. If it is not available (no other replicas
> are connected at all, or failover hasn't happened yet), the call
> walks the prio list starting from this replica until it finds an
> available one.
>
> If that also fails, the call walks the list from the beginning,
> hoping that the unavailable replica wasn't the best one and a
> better option exists earlier in the prio list.
>
> The patch was done in the scope of the replica backoff task (#298),
> because the problem would also exist, and get worse, when the best
> replica is in backoff rather than just disconnected.
>
> Closes #288
> Needed for #298
> ---
>   test/failover/failover.result         | 64 +++++++++++++------
>   test/failover/failover.test.lua       | 30 +++++----
>   test/router/master_discovery.result   | 92 +++++++++++++++++++++++----
>   test/router/master_discovery.test.lua | 38 ++++++++---
>   vshard/replicaset.lua                 | 21 ++++--
>   5 files changed, 185 insertions(+), 60 deletions(-)
>
> diff --git a/test/failover/failover.result b/test/failover/failover.result
> index bae57fa..31eda0c 100644
> --- a/test/failover/failover.result
> +++ b/test/failover/failover.result
> @@ -85,8 +85,8 @@ test_run:cmd('switch default')
>   --   available;
>   -- * up nearest replica priority if the best one is available
>   --   again;
> --- * replicaset uses master connection, if the nearest's one is
> ---   not available before call();
> +-- * replicaset uses the next available connection by prio, if the
> +--   nearest one is not available before call();
>   -- * current nearest connection is not down, when trying to
>   --   connect to the replica with less weight.
>   --
> @@ -199,30 +199,17 @@ test_run:cmd('stop server box_1_d')
>   - true
>   ...
>   -- Down_ts must be set in on_disconnect() trigger.
> -while rs1.replica.down_ts == nil do fiber.sleep(0.1) end
> +box_1_d = rs1.replicas[names.replica_uuid.box_1_d]
>   ---
>   ...
> --- Try to execute read-only request - it must use master
> --- connection, because a replica's one is not available.
> -vshard.router.call(1, 'read', 'echo', {123})
> ----
> -- 123
> -...
> -test_run:switch('box_1_a')
> ----
> -- true
> -...
> -echo_count
> ----
> -- 2
> -...
> -test_run:switch('router_1')
> +test_run:wait_cond(function() return box_1_d.down_ts ~= nil end)
>   ---
>   - true
>   ...
>   -- New replica is box_1_b.
> -while rs1.replica.name ~= 'box_1_b' do fiber.sleep(0.1) end
> +test_run:wait_cond(function() return rs1.replica.name == 'box_1_b' end)
>   ---
> +- true
>   ...
>   rs1.replica.down_ts == nil
>   ---
> @@ -247,14 +234,49 @@ test_run:cmd('switch box_1_b')
>   ...
>   -- Ensure the 'read' echo was executed on box_1_b - nearest
>   -- available replica.
> -echo_count
> +assert(echo_count == 1)
>   ---
> -- 1
> +- true
> +...
> +test_run:switch('router_1')
> +---
> +- true
> +...
> +--
> +-- Kill the best replica. There is no need to wait for failover before the
> +-- router starts using the next best available replica.
> +--
> +test_run:cmd('stop server box_1_b')
> +---
> +- true
> +...
> +box_1_b = rs1.replicas[names.replica_uuid.box_1_b]
> +---
> +...
> +test_run:wait_cond(function() return box_1_b.down_ts ~= nil end)
> +---
> +- true
> +...
> +vshard.router.callro(1, 'echo', {123})
> +---
> +- 123
> +...
> +test_run:switch('box_1_c')
> +---
> +- true
> +...
> +assert(echo_count == 1)
> +---
> +- true
>   ...
>   test_run:switch('router_1')
>   ---
>   - true
>   ...
> +test_run:cmd('start server box_1_b')
> +---
> +- true
> +...
>   -- Revive the best replica. A router must reconnect to it in
>   -- FAILOVER_UP_TIMEOUT seconds.
>   test_run:cmd('start server box_1_d')
> diff --git a/test/failover/failover.test.lua b/test/failover/failover.test.lua
> index a969e0e..b713319 100644
> --- a/test/failover/failover.test.lua
> +++ b/test/failover/failover.test.lua
> @@ -43,8 +43,8 @@ test_run:cmd('switch default')
>   --   available;
>   -- * up nearest replica priority if the best one is available
>   --   again;
> --- * replicaset uses master connection, if the nearest's one is
> ---   not available before call();
> +-- * replicaset uses the next available connection by prio, if the
> +--   nearest one is not available before call();
>   -- * current nearest connection is not down, when trying to
>   --   connect to the replica with less weight.
>   --
> @@ -86,15 +86,10 @@ rs1.replica_up_ts - old_up_ts >= vshard.consts.FAILOVER_UP_TIMEOUT
>   -- box_1_d.
>   test_run:cmd('stop server box_1_d')
>   -- Down_ts must be set in on_disconnect() trigger.
> -while rs1.replica.down_ts == nil do fiber.sleep(0.1) end
> --- Try to execute read-only request - it must use master
> --- connection, because a replica's one is not available.
> -vshard.router.call(1, 'read', 'echo', {123})
> -test_run:switch('box_1_a')
> -echo_count
> -test_run:switch('router_1')
> +box_1_d = rs1.replicas[names.replica_uuid.box_1_d]
> +test_run:wait_cond(function() return box_1_d.down_ts ~= nil end)
>   -- New replica is box_1_b.
> -while rs1.replica.name ~= 'box_1_b' do fiber.sleep(0.1) end
> +test_run:wait_cond(function() return rs1.replica.name == 'box_1_b' end)
>   rs1.replica.down_ts == nil
>   rs1.replica_up_ts ~= nil
>   test_run:grep_log('router_1', 'New replica box_1_b%(storage%@')
> @@ -103,8 +98,21 @@ vshard.router.callro(1, 'echo', {123})
>   test_run:cmd('switch box_1_b')
>   -- Ensure the 'read' echo was executed on box_1_b - nearest
>   -- available replica.
> -echo_count
> +assert(echo_count == 1)
> +test_run:switch('router_1')
> +
> +--
> +-- gh-288: kill the best replica. There is no need to wait for failover before
> +-- the router starts using the next best available replica.
> +--
> +test_run:cmd('stop server box_1_b')
> +box_1_b = rs1.replicas[names.replica_uuid.box_1_b]
> +test_run:wait_cond(function() return box_1_b.down_ts ~= nil end)
> +vshard.router.callro(1, 'echo', {123})
> +test_run:switch('box_1_c')
> +assert(echo_count == 1)
>   test_run:switch('router_1')
> +test_run:cmd('start server box_1_b')
>   
>   -- Revive the best replica. A router must reconnect to it in
>   -- FAILOVER_UP_TIMEOUT seconds.
> diff --git a/test/router/master_discovery.result b/test/router/master_discovery.result
> index 7ebb67d..94a848b 100644
> --- a/test/router/master_discovery.result
> +++ b/test/router/master_discovery.result
> @@ -325,6 +325,22 @@ end)
>   start_aggressive_master_search()
>    | ---
>    | ...
> +test_run:cmd('stop server storage_1_b')
> + | ---
> + | - true
> + | ...
> +rs1 = vshard.router.static.replicasets[util.replicasets[1]]
> + | ---
> + | ...
> +replica = rs1.replicas[util.name_to_uuid.storage_1_b]
> + | ---
> + | ...
> +-- Ensure the replica is not available. Otherwise RO requests sneak into it
> +-- instead of waiting for master.
> +test_run:wait_cond(function() return not replica:is_connected() end)
> + | ---
> + | - true
> + | ...
>   
>   forget_masters()
>    | ---
> @@ -337,8 +353,6 @@ vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)
>   forget_masters()
>    | ---
>    | ...
> --- XXX: this should not depend on master so much. RO requests should be able to
> --- go to replicas.
>   vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)
>    | ---
>    | - 1
> @@ -357,8 +371,6 @@ vshard.router.route(1501):callrw('echo', {1}, opts_big_timeout)
>   forget_masters()
>    | ---
>    | ...
> --- XXX: the same as above - should not really wait for master. Regardless of it
> --- being auto or not.
>   vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
>    | ---
>    | - 1
> @@ -369,6 +381,10 @@ vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
>   stop_aggressive_master_search()
>    | ---
>    | ...
> +test_run:cmd('start server storage_1_b')
> + | ---
> + | - true
> + | ...
>   
>   test_run:switch('storage_1_a')
>    | ---
> @@ -411,7 +427,7 @@ vshard.storage.cfg(cfg, instance_uuid)
>    | ---
>    | ...
>   
> --- Try to make RW and RO requests but then turn of the auto search.
> +-- Try to make an RW request but then turn off the auto search.
>   test_run:switch('router_1')
>    | ---
>    | - true
> @@ -425,14 +441,6 @@ f1 = fiber.create(function()
>   end)
>    | ---
>    | ...
> --- XXX: should not really wait for master since this is an RO request. It could
> --- use a replica.
> -f2 = fiber.create(function()                                                    \
> -    fiber.self():set_joinable(true)                                             \
> -    return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
> -end)
> - | ---
> - | ...
>   fiber.sleep(0.01)
>    | ---
>    | ...
> @@ -449,6 +457,31 @@ f1:join()
>    |   replicaset_uuid: <replicaset_1>
>    |   message: Master is not configured for replicaset <replicaset_1>
>    | ...
> +
> +-- Try to make an RO request but then turn off the auto search.
> +test_run:cmd('stop server storage_1_a')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server storage_1_b')
> + | ---
> + | - true
> + | ...
> +forget_masters()
> + | ---
> + | ...
> +f2 = fiber.create(function()                                                    \
> +    fiber.self():set_joinable(true)                                             \
> +    return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
> +end)
> + | ---
> + | ...
> +fiber.sleep(0.01)
> + | ---
> + | ...
> +disable_auto_masters()
> + | ---
> + | ...
>   f2:join()
>    | ---
>    | - true
> @@ -459,6 +492,14 @@ f2:join()
>    |   replicaset_uuid: <replicaset_1>
>    |   message: Master is not configured for replicaset <replicaset_1>
>    | ...
> +test_run:cmd('start server storage_1_a')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server storage_1_b')
> + | ---
> + | - true
> + | ...
>   
>   --
>   -- Multiple masters logging.
> @@ -473,6 +514,9 @@ replicas = cfg.sharding[util.replicasets[1]].replicas
>   replicas[util.name_to_uuid.storage_1_a].master = true
>    | ---
>    | ...
> +replicas[util.name_to_uuid.storage_1_b].master = false
> + | ---
> + | ...
>   vshard.storage.cfg(cfg, instance_uuid)
>    | ---
>    | ...
> @@ -484,6 +528,9 @@ test_run:switch('storage_1_b')
>   replicas = cfg.sharding[util.replicasets[1]].replicas
>    | ---
>    | ...
> +replicas[util.name_to_uuid.storage_1_a].master = false
> + | ---
> + | ...
>   replicas[util.name_to_uuid.storage_1_b].master = true
>    | ---
>    | ...
> @@ -495,6 +542,25 @@ test_run:switch('router_1')
>    | ---
>    | - true
>    | ...
> +-- Ensure both replicas are connected. Otherwise the router can go to only one,
> +-- find it is master, and won't go to the second one until the first one resigns
> +-- or dies.
> +rs1 = vshard.router.static.replicasets[util.replicasets[1]]
> + | ---
> + | ...
> +replica1 = rs1.replicas[util.name_to_uuid.storage_1_a]
> + | ---
> + | ...
> +replica2 = rs1.replicas[util.name_to_uuid.storage_1_b]
> + | ---
> + | ...
> +test_run:wait_cond(function()                                                   \
> +    return replica1:is_connected() and replica2:is_connected()                  \
> +end)
> + | ---
> + | - true
> + | ...
> +
>   forget_masters()
>    | ---
>    | ...
> diff --git a/test/router/master_discovery.test.lua b/test/router/master_discovery.test.lua
> index 6276dc9..8cacd6d 100644
> --- a/test/router/master_discovery.test.lua
> +++ b/test/router/master_discovery.test.lua
> @@ -181,24 +181,27 @@ end)
>   -- Call tries to wait for master if has enough time left.
>   --
>   start_aggressive_master_search()
> +test_run:cmd('stop server storage_1_b')
> +rs1 = vshard.router.static.replicasets[util.replicasets[1]]
> +replica = rs1.replicas[util.name_to_uuid.storage_1_b]
> +-- Ensure the replica is not available. Otherwise RO requests sneak into it
> +-- instead of waiting for master.
> +test_run:wait_cond(function() return not replica:is_connected() end)
>   
>   forget_masters()
>   vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)
>   
>   forget_masters()
> --- XXX: this should not depend on master so much. RO requests should be able to
> --- go to replicas.
>   vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)
>   
>   forget_masters()
>   vshard.router.route(1501):callrw('echo', {1}, opts_big_timeout)
>   
>   forget_masters()
> --- XXX: the same as above - should not really wait for master. Regardless of it
> --- being auto or not.
>   vshard.router.route(1501):callro('echo', {1}, opts_big_timeout)
>   
>   stop_aggressive_master_search()
> +test_run:cmd('start server storage_1_b')
>   
>   test_run:switch('storage_1_a')
>   assert(echo_count == 4)
> @@ -218,23 +221,30 @@ replicas = cfg.sharding[util.replicasets[1]].replicas
>   replicas[util.name_to_uuid.storage_1_a].master = false
>   vshard.storage.cfg(cfg, instance_uuid)
>   
> --- Try to make RW and RO requests but then turn of the auto search.
> +-- Try to make an RW request but then turn off the auto search.
>   test_run:switch('router_1')
>   forget_masters()
>   f1 = fiber.create(function()                                                    \
>       fiber.self():set_joinable(true)                                             \
>       return vshard.router.callrw(1501, 'echo', {1}, opts_big_timeout)            \
>   end)
> --- XXX: should not really wait for master since this is an RO request. It could
> --- use a replica.
> +fiber.sleep(0.01)
> +disable_auto_masters()
> +f1:join()
> +
> +-- Try to make an RO request but then turn off the auto search.
> +test_run:cmd('stop server storage_1_a')
> +test_run:cmd('stop server storage_1_b')
> +forget_masters()
>   f2 = fiber.create(function()                                                    \
>       fiber.self():set_joinable(true)                                             \
>       return vshard.router.callro(1501, 'echo', {1}, opts_big_timeout)            \
>   end)
>   fiber.sleep(0.01)
>   disable_auto_masters()
> -f1:join()
>   f2:join()
> +test_run:cmd('start server storage_1_a')
> +test_run:cmd('start server storage_1_b')
>   
>   --
>   -- Multiple masters logging.
> @@ -242,14 +252,26 @@ f2:join()
>   test_run:switch('storage_1_a')
>   replicas = cfg.sharding[util.replicasets[1]].replicas
>   replicas[util.name_to_uuid.storage_1_a].master = true
> +replicas[util.name_to_uuid.storage_1_b].master = false
>   vshard.storage.cfg(cfg, instance_uuid)
>   
>   test_run:switch('storage_1_b')
>   replicas = cfg.sharding[util.replicasets[1]].replicas
> +replicas[util.name_to_uuid.storage_1_a].master = false
>   replicas[util.name_to_uuid.storage_1_b].master = true
>   vshard.storage.cfg(cfg, instance_uuid)
>   
>   test_run:switch('router_1')
> +-- Ensure both replicas are connected. Otherwise the router can go to only one,
> +-- find it is master, and won't go to the second one until the first one resigns
> +-- or dies.
> +rs1 = vshard.router.static.replicasets[util.replicasets[1]]
> +replica1 = rs1.replicas[util.name_to_uuid.storage_1_a]
> +replica2 = rs1.replicas[util.name_to_uuid.storage_1_b]
> +test_run:wait_cond(function()                                                   \
> +    return replica1:is_connected() and replica2:is_connected()                  \
> +end)
> +
>   forget_masters()
>   start_aggressive_master_search()
>   test_run:wait_log('router_1', 'Found more than one master', nil, 10)
> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
> index 174c761..93f210e 100644
> --- a/vshard/replicaset.lua
> +++ b/vshard/replicaset.lua
> @@ -455,18 +455,25 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>                       return r
>                   end
>               end
> -        elseif prefer_replica then
> -            r = replicaset.replica
> +        else
> +            local start_r = replicaset.replica
> +            r = start_r
>               while r do
> -                if r:is_connected() and r ~= master then
> +                if r:is_connected() and (not prefer_replica or r ~= master) then
>                       return r
>                   end
>                   r = r.next_by_priority
>               end
> -        else
> -            r = replicaset.replica
> -            if r and r:is_connected() then
> -                return r
> +            -- The iteration above might not start from the best prio replica.
> +            -- Check the beginning of the list too.
> +            for _, r in pairs(replicaset.priority_list) do

pairs() can't be JIT-compiled (at least for now, until
https://github.com/tarantool/tarantool/issues/6475).

Also, pairs() doesn't give any guarantees about iteration order.
Could it be replaced with ipairs()?
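
For illustration, a tiny standalone example of the difference (the
list contents are made up):

    local priority_list = {'replica_a', 'replica_b', 'replica_c'}
    -- ipairs() walks indices 1..n in a guaranteed order and can be
    -- compiled by LuaJIT; the order of pairs() is unspecified.
    for i, r in ipairs(priority_list) do
        print(i, r)
    end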


> +                if r == start_r then
> +                    -- Reached already checked part.
> +                    break
> +                end
> +                if r:is_connected() and (not prefer_replica or r ~= master) then
> +                    return r
> +                end
>               end
>           end
>       end


* Re: [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery
  2021-12-05 17:44   ` Oleg Babin via Tarantool-patches
@ 2021-12-06 23:34     ` Vladislav Shpilevoy via Tarantool-patches
  2021-12-07  6:49       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-06 23:34 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches

Hi! Thanks for the review!

On 05.12.2021 18:44, Oleg Babin wrote:
> Thanks for your patch. Looks reasonable.
> 
> However, I have one question. Before this patch, if some
> connection was down, we performed wait_connected, and in some
> cases it could lead to a successful reconnection. Currently we
> just skip broken connections and don't try to reconnect.
>
> Could that be a problem, or do we perform reconnection in
> another place? What will happen if all connections are down?

All connections use the reconnect_after option in netbox, so netbox
handles reconnection itself. If all the connections are down,
netbox_wait_connected won't help either: several master discovery
iterations will be needed anyway until a master is found.
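
For reference, a minimal net.box sketch of such a connection (the
address and the numbers are made up for illustration):

    local netbox = require('net.box')
    -- With reconnect_after set, netbox keeps retrying in the
    -- background on its own; no explicit waiting loop is needed.
    local conn = netbox.connect('localhost:3301', {
        reconnect_after = 0.5,   -- seconds between reconnect attempts
        wait_connected = false,  -- do not block on the first connect
    })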

Also, before this patch netbox_wait_connected() couldn't speed up a
successful reconnection in any way. It was just waiting; it didn't
affect the connection state.

After this patch the waiting is dropped, and the disconnected nodes
are retried later, on the next master discovery iteration.


* Re: [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away
  2021-12-05 17:53   ` Oleg Babin via Tarantool-patches
@ 2021-12-06 23:56     ` Vladislav Shpilevoy via Tarantool-patches
  2021-12-07  6:49       ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-06 23:56 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches

Thanks for the review!

On 05.12.2021 18:53, Oleg Babin wrote:
> Thanks for your patch. See one comment below.
> 
> Also, as I can see, CI is red on your branch. Is it related to your changes?

Yes, I didn't update the .result file in one of the tests. Done now.

>> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
>> index 174c761..93f210e 100644
>> --- a/vshard/replicaset.lua
>> +++ b/vshard/replicaset.lua
>> @@ -455,18 +455,25 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>>                       return r
>>                   end
>>               end
>> -        elseif prefer_replica then
>> -            r = replicaset.replica
>> +        else
>> +            local start_r = replicaset.replica
>> +            r = start_r
>>               while r do
>> -                if r:is_connected() and r ~= master then
>> +                if r:is_connected() and (not prefer_replica or r ~= master) then
>>                       return r
>>                   end
>>                   r = r.next_by_priority
>>               end
>> -        else
>> -            r = replicaset.replica
>> -            if r and r:is_connected() then
>> -                return r
>> +            -- The iteration above might not start from the best prio replica.
>> +            -- Check the beginning of the list too.
>> +            for _, r in pairs(replicaset.priority_list) do
> 
> pairs() can't be JIT-compiled (at least for now, until
> https://github.com/tarantool/tarantool/issues/6475).
>
> Also, pairs() doesn't give any guarantees about iteration order.
> Could it be replaced with ipairs()?

It can be. Replaced with ipairs on the branch.


* Re: [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery
  2021-12-06 23:34     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-07  6:49       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-12-07  6:49 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Thanks for the explanation. LGTM.

On 07.12.2021 02:34, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 05.12.2021 18:44, Oleg Babin wrote:
>> Thanks for your patch. Looks reasonable.
>>
>> However, I have one question. Before this patch, if some
>> connection was down, we performed wait_connected, and in some
>> cases it could lead to a successful reconnection. Currently we
>> just skip broken connections and don't try to reconnect.
>>
>> Could that be a problem, or do we perform reconnection in
>> another place? What will happen if all connections are down?
> All connections use the reconnect_after option in netbox, so netbox
> handles reconnection itself. If all the connections are down,
> netbox_wait_connected won't help either: several master discovery
> iterations will be needed anyway until a master is found.
>
> Also, before this patch netbox_wait_connected() couldn't speed up a
> successful reconnection in any way. It was just waiting; it didn't
> affect the connection state.
>
> After this patch the waiting is dropped, and the disconnected nodes
> are retried later, on the next master discovery iteration.


* Re: [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away
  2021-12-06 23:56     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-07  6:49       ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-12-07  6:49 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Thanks for your fixes. LGTM.

On 07.12.2021 02:56, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 05.12.2021 18:53, Oleg Babin wrote:
>> Thanks for your patch. See one comment below.
>>
>> Also, as I can see, CI is red on your branch. Is it related to your changes?
> Yes, I didn't update the .result file in one of the tests. Done now.
>
>>> diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
>>> index 174c761..93f210e 100644
>>> --- a/vshard/replicaset.lua
>>> +++ b/vshard/replicaset.lua
>>> @@ -455,18 +455,25 @@ local function replicaset_template_multicallro(prefer_replica, balance)
>>>                        return r
>>>                    end
>>>                end
>>> -        elseif prefer_replica then
>>> -            r = replicaset.replica
>>> +        else
>>> +            local start_r = replicaset.replica
>>> +            r = start_r
>>>                while r do
>>> -                if r:is_connected() and r ~= master then
>>> +                if r:is_connected() and (not prefer_replica or r ~= master) then
>>>                        return r
>>>                    end
>>>                    r = r.next_by_priority
>>>                end
>>> -        else
>>> -            r = replicaset.replica
>>> -            if r and r:is_connected() then
>>> -                return r
>>> +            -- The iteration above might not start from the best prio replica.
>>> +            -- Check the beginning of the list too.
>>> +            for _, r in pairs(replicaset.priority_list) do
>> pairs() can't be JIT-compiled (at least for now, until
>> https://github.com/tarantool/tarantool/issues/6475).
>>
>> Also, pairs() doesn't give any guarantees about iteration order.
>> Could it be replaced with ipairs()?
> It can be. Replaced with ipairs on the branch.


* Re: [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1
  2021-12-04  0:19 [Tarantool-patches] [PATCH vshard 0/2] VShard replica backoff, part 1 Vladislav Shpilevoy via Tarantool-patches
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery Vladislav Shpilevoy via Tarantool-patches
  2021-12-04  0:19 ` [Tarantool-patches] [PATCH vshard 2/2] router: don't fallback RO to master right away Vladislav Shpilevoy via Tarantool-patches
@ 2021-12-07 22:38 ` Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 0 replies; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-12-07 22:38 UTC (permalink / raw)
  To: tarantool-patches, olegrok

Pushed to master.
