From: Vladimir Davydov
Subject: [PATCH 2/2] replication: don't stop syncing on configuration errors
Date: Sun, 23 Sep 2018 18:31:21 +0300
To: kostja@tarantool.org
Cc: tarantool-patches@freelists.org

When replication is restarted with the same replica set configuration
(i.e. box.cfg{replication = box.cfg.replication}), there's a chance
that an old relay will still be running on the master at the time when
a new applier tries to subscribe. In this case the applier will get an
error:

  main/152/applier/localhost:62649 I> can't join/subscribe
  main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for
    option 'replication': duplicate connection with the same replica UUID

Such an error won't stop the applier - it will keep trying to
reconnect:

  main/152/applier/localhost:62649 I> will retry every 1.00 second

However, it will stop synchronization, so box.cfg() will return without
an error but leave the replica in orphan mode:

  main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
  main/151/console/::1:42606 C> entering orphan mode
  main/151/console/::1:42606 I> set 'replication' configuration option to "localhost:62649"

In a second, the stray relay on the master will probably exit and the
applier will manage to subscribe, so the replica will leave orphan
mode:

  main/152/applier/localhost:62649 C> leaving orphan mode

This is very annoying, because there's no need to enter orphan mode in
this case - we could just as well keep trying to synchronize until the
applier finally succeeds in subscribing or replication_sync_timeout is
triggered.

So this patch makes appliers enter the "loading" state on configuration
errors, the same state they enter if they detect that bootstrap hasn't
finished yet. This guarantees that configuration errors, like the one
above, won't break synchronization and leave the user gaping at an
unprovoked orphan mode.

Apart from the issue in question (#3636), this patch also fixes
spurious replication-py/multi test failures that happened for exactly
the same reason (#3692).

Closes #3636
Closes #3692
---
 src/box/applier.cc                |  2 +-
 test/replication/sync.result      | 82 +++++++++++++++++++++++++++++++++++++--
 test/replication/sync.test.lua    | 35 ++++++++++++++++-
 test/replication/wal_off.result   |  9 +++++
 test/replication/wal_off.test.lua |  3 ++
 5 files changed, 125 insertions(+), 6 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 7304e83f..7da278e6 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -619,7 +619,7 @@ applier_f(va_list ap)
 			   e->errcode() == ER_NO_SUCH_USER) {
 				/* Invalid configuration */
 				applier_log_error(applier, e);
-				applier_disconnect(applier, APPLIER_DISCONNECTED);
+				applier_disconnect(applier, APPLIER_LOADING);
 				goto reconnect;
 			} else if (e->errcode() == ER_SYSTEM) {
 				/* System error from master instance. */
diff --git a/test/replication/sync.result b/test/replication/sync.result
index 23978e48..81de6075 100644
--- a/test/replication/sync.result
+++ b/test/replication/sync.result
@@ -213,21 +213,95 @@ box.info.ro -- false
 ---
 - false
 ...
+-- gh-3636: Check that replica set sync doesn't stop on cfg errors.
+-- To do that we inject an error on the master to temporarily block
+-- the relay thread from exiting, then reconfigure replication on
+-- the slave using the same configuration. Since the relay is still
+-- running when replication is reconfigured, the replica will get
+-- ER_CFG "duplicate connection with the same replica UUID" error.
+-- It should print it to the log, but keep trying to synchronize.
+-- Eventually, it should leave box.cfg() following the master.
+box.cfg{replication_timeout = 0.1}
+---
+...
+box.cfg{replication_sync_lag = 1}
+---
+...
+box.cfg{replication_sync_timeout = 10}
+---
+...
 test_run:cmd("switch default")
 ---
 - true
 ...
-test_run:cmd("stop server replica")
+box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+---
+- ok
+...
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+---
+- ok
+...
+test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
-test_run:cmd("cleanup server replica")
+_ = fiber.create(function()
+    box.space.test:replace{123456789}
+end);
+---
+...
+_ = fiber.create(function()
+    fiber.sleep(0.1)
+    box.error.injection.set('ERRINJ_WAL_DELAY', false)
+end);
+---
+...
+test_run:cmd("setopt delimiter ''");
 ---
 - true
 ...
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+test_run:cmd("switch replica")
 ---
-- ok
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {}}
+---
+...
+box.cfg{replication = replication}
+---
+...
+box.info.status -- running
+---
+- running
+...
+box.info.ro -- false
+---
+- false
+...
+box.info.replication[1].upstream.status -- follow
+---
+- follow
+...
+test_run:grep_log('replica', 'ER_CFG.*')
+---
+- 'ER_CFG: Incorrect value for option ''replication'': duplicate connection with the
+  same replica UUID'
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
 ...
 box.space.test:drop()
 ---
diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
index 7d7ee757..a5cfab8d 100644
--- a/test/replication/sync.test.lua
+++ b/test/replication/sync.test.lua
@@ -108,10 +108,43 @@ repeat fiber.sleep(0.01) until box.info.status ~= 'orphan'
 box.info.status -- running
 box.info.ro -- false
 
+-- gh-3636: Check that replica set sync doesn't stop on cfg errors.
+-- To do that we inject an error on the master to temporarily block
+-- the relay thread from exiting, then reconfigure replication on
+-- the slave using the same configuration. Since the relay is still
+-- running when replication is reconfigured, the replica will get
+-- ER_CFG "duplicate connection with the same replica UUID" error.
+-- It should print it to the log, but keep trying to synchronize.
+-- Eventually, it should leave box.cfg() following the master.
+box.cfg{replication_timeout = 0.1}
+box.cfg{replication_sync_lag = 1}
+box.cfg{replication_sync_timeout = 10}
+
+test_run:cmd("switch default")
+box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+test_run:cmd("setopt delimiter ';'")
+_ = fiber.create(function()
+    box.space.test:replace{123456789}
+end);
+_ = fiber.create(function()
+    fiber.sleep(0.1)
+    box.error.injection.set('ERRINJ_WAL_DELAY', false)
+end);
+test_run:cmd("setopt delimiter ''");
+test_run:cmd("switch replica")
+
+replication = box.cfg.replication
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+box.info.status -- running
+box.info.ro -- false
+box.info.replication[1].upstream.status -- follow
+test_run:grep_log('replica', 'ER_CFG.*')
+
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
 
 box.space.test:drop()
 box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/wal_off.result b/test/replication/wal_off.result
index f50bfc2f..e3b5709e 100644
--- a/test/replication/wal_off.result
+++ b/test/replication/wal_off.result
@@ -78,9 +78,18 @@ test_run:cmd('switch default')
 ---
 - true
 ...
+replication_sync_timeout = box.cfg.replication_sync_timeout
+---
+...
+box.cfg { replication_sync_timeout = 0.01 }
+---
+...
 box.cfg { replication = wal_off_uri }
 ---
 ...
+box.cfg { replication_sync_timeout = replication_sync_timeout }
+---
+...
 check = "Read access to universe"
 ---
 ...
diff --git a/test/replication/wal_off.test.lua b/test/replication/wal_off.test.lua
index 43cd0aed..81fcf0b3 100644
--- a/test/replication/wal_off.test.lua
+++ b/test/replication/wal_off.test.lua
@@ -27,7 +27,10 @@ test_run:cmd('switch wal_off')
 box.schema.user.revoke('guest', 'replication')
 
 test_run:cmd('switch default')
+replication_sync_timeout = box.cfg.replication_sync_timeout
+box.cfg { replication_sync_timeout = 0.01 }
 box.cfg { replication = wal_off_uri }
+box.cfg { replication_sync_timeout = replication_sync_timeout }
 check = "Read access to universe"
 while string.find(box.info.replication[wal_off_id].upstream.message, check) == nil do fiber.sleep(0.01) end
 box.cfg { replication = "" }
-- 
2.11.0
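For reviewers, a minimal console sketch of the scenario the new sync test
automates (illustrative only, not part of the patch; it assumes a replica
that has already been bootstrapped from the master via box.cfg{replication}):

    -- Re-apply the same replica set configuration. An old relay may still
    -- be running on the master, so the applier can hit the transient
    -- ER_CFG "duplicate connection with the same replica UUID" error.
    box.cfg{replication_sync_timeout = 10}
    replication = box.cfg.replication
    box.cfg{replication = {}}          -- drop the old appliers
    box.cfg{replication = replication} -- reconnect; may hit transient ER_CFG
    -- With this patch the error is only logged and the applier keeps trying,
    -- so box.cfg() returns with the instance still synchronized and writable
    -- instead of dropping into orphan mode.
    box.info.status -- running
    box.info.ro     -- false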