[Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sat Dec 4 03:19:37 MSK 2021


Master discovery used to wait for each instance in the replicaset
to connect, spending up to the whole discovery timeout on it.

The problem is that if one of the replicas is dead, the discovery
wastes its entire timeout on this wait alone. The requests sent to
the connected replicas after the dead one get a 0 timeout, so
their results are not properly waited for.

For example, this is how master discovery could work:

    send requests:
        replica1 wait_connected + send
        replica2 wait_connected fails on timeout
        replica3 wait_connected succeeds if already connected + send

    collect responses:
        replica1 wait_result(0 timeout)
        replica2 skip
        replica3 wait_result(0 timeout)

The entire timeout was wasted on 'replica2 wait_connected' during
request sending. Replica1's result could still be delivered fine
because its request was in progress while replica2 was being
waited for. A 0 timeout is not a problem there: the request had
time to be executed. But replica3's request has very little chance
of being delivered in time. It has just been sent and is collected
almost immediately.

The worst case is when the first replica is dead. Then it is very
likely that none of the requests will be delivered, because all
the result wait timeouts are 0.
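
The timelines above can be reproduced with a small model. This is
only a sketch in plain Lua on a virtual clock, not vshard code: the
simulate() function, its arguments and the replica fields are
hypothetical, and the 500 ms timeout and 100 ms latency are made-up
numbers. It assumes that waiting for a dead replica consumes the
whole remaining timeout.

```lua
-- Hypothetical model of master discovery timing. Time is in
-- milliseconds on a virtual clock, nothing really sleeps.
local function simulate(replicas, timeout, wait_for_connection)
    local now = 0
    local deadline = now + timeout
    local sent = {}
    -- Send phase.
    for _, r in ipairs(replicas) do
        if r.connected then
            table.insert(sent, {
                name = r.name,
                ready_at = now + r.latency,
            })
        elseif wait_for_connection then
            -- Old behaviour: a dead replica eats the rest of the
            -- timeout before the next request is sent.
            now = deadline
        end
        -- New behaviour: a not connected replica is just skipped.
    end
    -- Collect phase: each wait gets what is left until the deadline.
    local results = {}
    for _, req in ipairs(sent) do
        local budget = math.max(deadline - now, 0)
        if req.ready_at <= now + budget then
            now = math.max(now, req.ready_at)
            results[req.name] = true
        else
            results[req.name] = false
        end
    end
    return results
end

local replicas = {
    {name = 'replica1', connected = true, latency = 100},
    {name = 'replica2', connected = false},
    {name = 'replica3', connected = true, latency = 100},
}
local old_res = simulate(replicas, 500, true)
local new_res = simulate(replicas, 500, false)

-- Worst case: the dead replica is visited first.
local dead_first = {replicas[2], replicas[1], replicas[3]}
local worst_res = simulate(dead_first, 500, true)

print(old_res.replica1, old_res.replica3)
print(new_res.replica1, new_res.replica3)
print(worst_res.replica1, worst_res.replica3)
```

In this model the old flow collects only replica1's result, the
dead-first order collects nothing, and skipping the not connected
replica lets both live replicas answer within the timeout.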

There is still a chance that the subsequent requests complete
extra quickly, so writing a stable test for this does not seem
possible.

The bug was discovered while working on #288. Testing it required
stopping one instance, after which the master_discovery test
started failing.
---
 vshard/replicaset.lua | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 55028bd..174c761 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -682,11 +682,7 @@ local function replicaset_locate_master(replicaset)
     local replicaset_uuid = replicaset.uuid
     for replica_uuid, replica in pairs(replicaset.replicas) do
         local conn = replica.conn
-        timeout, err = netbox_wait_connected(conn, timeout)
-        if not timeout then
-            last_err = err
-            timeout = deadline - fiber_clock()
-        else
+        if conn:is_connected() then
             ok, f = pcall(conn.call, conn, func, args, async_opts)
             if not ok then
                 last_err = lerror.make(f)
-- 
2.24.3 (Apple Git-128)