[Tarantool-patches] [PATCH vshard 1/2] router: drop wait_connected from master discovery
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sat Dec 4 03:19:37 MSK 2021
Master discovery tried to wait for connection establishment for
the discovery timeout for each instance in the replicaset.
The problem is that if one of replicas is dead, the discovery will
waste its entire timeout on just this waiting. For all the
requests sent to connected replicas after this one it will have 0
timeout and won't properly wait for their results.
For example, this is how master discovery could work:
send requests:
replica1 wait_connected + send,
replica2 wait_connected fails on timeout
replica3 wait_connected works if was connected + send
collect responses:
replica1 wait_result(0 timeout)
replica2 skip
replica3 wait_result(0 timeout)
The entire timeout was wasted on 'replica2 wait_connected' during
request sending. Replica1 result could be delivered fine because
it was in progress while replica2 was waiting. So having 0 timeout
in it is not a problem. It had time to be executed. But replica3's
request has very few chances to be delivered in time. It was just
sent and is collected almost immediately.
The worst case is when the first replica is dead. Then it is very
likely neither of requests will be delivered. Due to all result
wait timeouts being 0.
Although there is a certain chance that the next requests will be
extra quick, so writing a stable test for that does not seem
possible.
The bug was discovered while working on #288. For its testing it
was needed to stop one instance and master_discovery test started
failing.
---
vshard/replicaset.lua | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 55028bd..174c761 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -682,11 +682,7 @@ local function replicaset_locate_master(replicaset)
local replicaset_uuid = replicaset.uuid
for replica_uuid, replica in pairs(replicaset.replicas) do
local conn = replica.conn
- timeout, err = netbox_wait_connected(conn, timeout)
- if not timeout then
- last_err = err
- timeout = deadline - fiber_clock()
- else
+ if conn:is_connected() then
ok, f = pcall(conn.call, conn, func, args, async_opts)
if not ok then
last_err = lerror.make(f)
--
2.24.3 (Apple Git-128)
More information about the Tarantool-patches
mailing list