Tarantool development patches archive
* [tarantool-patches] [PATCH] replication: display correct status at upstream
@ 2018-04-27 10:39 Konstantin Belyavskiy
  2018-04-28 20:30 ` [tarantool-patches] " Konstantin Osipov
  0 siblings, 1 reply; 4+ messages in thread
From: Konstantin Belyavskiy @ 2018-04-27 10:39 UTC (permalink / raw)
  To: georgy; +Cc: tarantool-patches

This fix improves the 'box.info.replication' output.
If the downstream fails and thus disconnects from the upstream,
report it by printing 'status: disconnected'.
Add relay_state { NONE, CONNECTED, DISCONNECTED } to track replica
presence: once connected, it stays either CONNECTED or DISCONNECTED
until the master is reset.

Closes #3365
---
Ticket: https://github.com/tarantool/tarantool/issues/3365
Branch: https://github.com/tarantool/tarantool/compare/gh-3365-display-an-error-at-downstream-on-replica-failure-or-disconnect
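
For illustration: with this patch, the peer whose relay has lost its
replica reports it in box.info.replication roughly as follows (a sketch of
the expected output, matching the new test below; the replica id and the
remaining fields depend on the actual cluster):

  tarantool> box.info.replication[2].downstream
  ---
  - status: disconnected
  ...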

 src/box/lua/info.c                        |   7 ++
 src/box/replication.cc                    |   2 +
 src/box/replication.h                     |  17 +++++
 test/replication/recovery_quorum.result   | 115 ++++++++++++++++++++++++++++++
 test/replication/recovery_quorum.test.lua |  36 ++++++++++
 5 files changed, 177 insertions(+)
 create mode 100644 test/replication/recovery_quorum.result
 create mode 100644 test/replication/recovery_quorum.test.lua

diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 8e8fd9d97..25bce8565 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -149,6 +149,13 @@ lbox_pushreplica(lua_State *L, struct replica *replica)
 		lua_pushstring(L, "downstream");
 		lbox_pushrelay(L, relay);
 		lua_settable(L, -3);
+	} else if (replica->relay_state == RELAY_DISCONNECTED) {
+		lua_pushstring(L, "downstream");
+		lua_newtable(L);
+		lua_pushstring(L, "status");
+		lua_pushstring(L, "disconnected");
+		lua_settable(L, -3);
+		lua_settable(L, -3);
 	}
 }
 
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 0b770c913..3ae6e739b 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -661,6 +661,7 @@ replica_set_relay(struct replica *replica, struct relay *relay)
 {
 	assert(replica->id != REPLICA_ID_NIL);
 	assert(replica->relay == NULL);
+	replica->relay_state = RELAY_CONNECTED;
 	replica->relay = relay;
 }
 
@@ -669,6 +670,7 @@ replica_clear_relay(struct replica *replica)
 {
 	assert(replica->relay != NULL);
 	replica->relay = NULL;
+	replica->relay_state = RELAY_DISCONNECTED;
 	if (replica_is_orphan(replica)) {
 		replica_hash_remove(&replicaset.hash, replica);
 		replica_delete(replica);
diff --git a/src/box/replication.h b/src/box/replication.h
index 8a9d57543..e1d4aab6d 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -226,6 +226,21 @@ enum replica_state {
 	REPLICA_SYNCED,
 };
 
+enum relay_state {
+	/**
+	 * Applier has not connected to the master, or is not expected to.
+	 */
+	RELAY_NONE,
+	/**
+	 * Applier has connected to the master.
+	 */
+	RELAY_CONNECTED,
+	/**
+	 * Applier disconnected from the master.
+	 */
+	RELAY_DISCONNECTED,
+};
+
 /**
  * Summary information about a replica in the replica set.
  */
@@ -256,6 +271,8 @@ struct replica {
 	struct trigger on_applier_state;
 	/** Replica sync state. */
 	enum replica_state state;
+	/** Relay sync state. */
+	enum relay_state relay_state;
 };
 
 enum {
diff --git a/test/replication/recovery_quorum.result b/test/replication/recovery_quorum.result
new file mode 100644
index 000000000..fef4df9de
--- /dev/null
+++ b/test/replication/recovery_quorum.result
@@ -0,0 +1,115 @@
+--
+-- gh-3365: display an error in upstream on downstream failure.
+-- Create a gap in LSN to cause replica's failure.
+--
+test_run = require('test_run').new()
+---
+...
+SERVERS = {'master_quorum1', 'master_quorum2'}
+---
+...
+-- Deploy a cluster.
+test_run:create_cluster(SERVERS)
+---
+...
+test_run:wait_fullmesh(SERVERS)
+---
+...
+test_run:cmd("switch master_quorum1")
+---
+- true
+...
+repl = box.cfg.replication
+---
+...
+box.cfg{replication = ""}
+---
+...
+test_run:cmd("switch master_quorum2")
+---
+- true
+...
+box.space.test:insert{1}
+---
+- [1]
+...
+box.snapshot()
+---
+- ok
+...
+box.space.test:insert{2}
+---
+- [2]
+...
+box.snapshot()
+---
+- ok
+...
+test_run:cmd("switch default")
+---
+- true
+...
+fio = require('fio')
+---
+...
+fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master_quorum2/%020d.xlog', 5)))
+---
+- true
+...
+test_run:cmd("switch master_quorum1")
+---
+- true
+...
+box.cfg{replication = repl}
+---
+...
+require('fiber').sleep(0.1)
+---
+...
+box.space.test:select()
+---
+- []
+...
+other_id = box.info.id % 2 + 1
+---
+...
+box.info.replication[other_id].upstream.status
+---
+- stopped
+...
+box.info.replication[other_id].upstream.message:match("Missing")
+---
+- Missing
+...
+test_run:cmd("switch master_quorum2")
+---
+- true
+...
+box.space.test:select()
+---
+- - [1]
+  - [2]
+...
+other_id = box.info.id % 2 + 1
+---
+...
+box.info.replication[other_id].upstream.status
+---
+- follow
+...
+box.info.replication[other_id].upstream.message
+---
+- null
+...
+box.info.replication[other_id].downstream
+---
+- status: disconnected
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Cleanup.
+test_run:drop_cluster(SERVERS)
+---
+...
diff --git a/test/replication/recovery_quorum.test.lua b/test/replication/recovery_quorum.test.lua
new file mode 100644
index 000000000..f91ca1ca2
--- /dev/null
+++ b/test/replication/recovery_quorum.test.lua
@@ -0,0 +1,36 @@
+--
+-- gh-3365: display an error in upstream on downstream failure.
+-- Create a gap in LSN to cause replica's failure.
+--
+test_run = require('test_run').new()
+SERVERS = {'master_quorum1', 'master_quorum2'}
+-- Deploy a cluster.
+test_run:create_cluster(SERVERS)
+test_run:wait_fullmesh(SERVERS)
+test_run:cmd("switch master_quorum1")
+repl = box.cfg.replication
+box.cfg{replication = ""}
+test_run:cmd("switch master_quorum2")
+box.space.test:insert{1}
+box.snapshot()
+box.space.test:insert{2}
+box.snapshot()
+test_run:cmd("switch default")
+fio = require('fio')
+fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master_quorum2/%020d.xlog', 5)))
+test_run:cmd("switch master_quorum1")
+box.cfg{replication = repl}
+require('fiber').sleep(0.1)
+box.space.test:select()
+other_id = box.info.id % 2 + 1
+box.info.replication[other_id].upstream.status
+box.info.replication[other_id].upstream.message:match("Missing")
+test_run:cmd("switch master_quorum2")
+box.space.test:select()
+other_id = box.info.id % 2 + 1
+box.info.replication[other_id].upstream.status
+box.info.replication[other_id].upstream.message
+box.info.replication[other_id].downstream
+test_run:cmd("switch default")
+-- Cleanup.
+test_run:drop_cluster(SERVERS)
-- 
2.14.3 (Apple Git-98)


* [tarantool-patches] Re: [PATCH] replication: display correct status at upstream
  2018-04-27 10:39 [tarantool-patches] [PATCH] replication: display correct status at upstream Konstantin Belyavskiy
@ 2018-04-28 20:30 ` Konstantin Osipov
  2018-04-28 20:45   ` Konstantin Osipov
  2018-05-04 14:09   ` [tarantool-patches] " Konstantin Belyavskiy
  0 siblings, 2 replies; 4+ messages in thread
From: Konstantin Osipov @ 2018-04-28 20:30 UTC (permalink / raw)
  To: tarantool-patches; +Cc: georgy

* Konstantin Belyavskiy <k.belyavskiy@tarantool.org> [18/04/27 13:43]:
> This fix improves the 'box.info.replication' output.
> If the downstream fails and thus disconnects from the upstream,
> report it by printing 'status: disconnected'.
> Add relay_state { NONE, CONNECTED, DISCONNECTED } to track replica
> presence: once connected, it stays either CONNECTED or DISCONNECTED
> until the master is reset.
> 
> Closes #3365

Hi,

the patch is very good and almost ready to push, but I would like to
request some more work.

First, a very minor reason why I'm not pushing it right away is
that I believe relay states should be matched with applier states.

The matching applier states are OFF, FOLLOW and STOPPED.
I think it would be easier for users if we don't invent new states
on the relay side.

Second, we allocate the relay object on the stack; this seems to be a
historical artifact: we had struct relay before we got struct replica.
The relay has a diagnostics area, so by keeping the relay around we
will be able to display the last error in a way similar to the
applier.

I'm not talking about pushing the message back from the applier to
the relay - this seems to be an unnecessary hassle and would complicate
things quite a bit.
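
To illustrate why matching the applier states matters for users: with
OFF/FOLLOW/STOPPED on both sides, and the relay's diag kept around, a
monitoring snippet could treat upstream and downstream uniformly. A
minimal sketch in Lua (the downstream 'status' values and the 'message'
field here are assumptions about the proposed change, not the current API):

  -- collect downstreams that are not following, keyed by replica id;
  -- 'stopped' and 'message' are the proposed, applier-like fields
  function broken_downstreams()
      local broken = {}
      for id, peer in pairs(box.info.replication) do
          local d = peer.downstream
          if d ~= nil and d.status == 'stopped' then
              broken[id] = d.message
          end
      end
      return broken
  end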


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


* [tarantool-patches] Re: [PATCH] replication: display correct status at upstream
  2018-04-28 20:30 ` [tarantool-patches] " Konstantin Osipov
@ 2018-04-28 20:45   ` Konstantin Osipov
  2018-05-04 14:09   ` [tarantool-patches] " Konstantin Belyavskiy
  1 sibling, 0 replies; 4+ messages in thread
From: Konstantin Osipov @ 2018-04-28 20:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: georgy

* Konstantin Osipov <kostja@tarantool.org> [18/04/28 23:30]:

> * Konstantin Belyavskiy <k.belyavskiy@tarantool.org> [18/04/27 13:43]:
> > This fix improves the 'box.info.replication' output.
> > If the downstream fails and thus disconnects from the upstream,
> > report it by printing 'status: disconnected'.
> > Add relay_state { NONE, CONNECTED, DISCONNECTED } to track replica
> > presence: once connected, it stays either CONNECTED or DISCONNECTED
> > until the master is reset.
> > 
> > Closes #3365
> 
> Hi,
> 
> the patch is very good and almost ready to push, but I would like to
> request some more work.
> 
> First, a very minor reason why I'm not pushing it right away is
> that I believe relay states should be matched with applier states.
> 
> The matching applier states are OFF, FOLLOW and STOPPED.
> I think it would be easier for users if we don't invent new states
> on the relay side.
> 
> Second, we allocate the relay object on the stack; this seems to be a
> historical artifact: we had struct relay before we got struct replica.
> The relay has a diagnostics area, so by keeping the relay around we
> will be able to display the last error in a way similar to the
> applier.
> 
> I'm not talking about pushing the message back from the applier to
> the relay - this seems to be an unnecessary hassle and would complicate
> things quite a bit.

One more thing. As I wrote earlier, it is difficult or impossible
to reliably deliver the error message from the applier to the relay.
But what we could do on the relay side: if we see that the state is
"stopped" and there is no message in relay->diag, we could set a
helper error ER_RELAY_STOPPED, "Please check the peer's upstream status
in box.info.replication for the reason". This would direct users
towards finding the cause of the problem.
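
For example, after such a change the relay side could report something
like the following (purely illustrative: the 'stopped' status, the
'message' field, and the exact wording of ER_RELAY_STOPPED are only
proposals here, not existing behaviour):

  tarantool> box.info.replication[2].downstream
  ---
  - status: stopped
    message: Please check the peer upstream status in box.info.replication
  ...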

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


* [tarantool-patches] Re: [tarantool-patches] Re: [PATCH] replication: display correct status at upstream
  2018-04-28 20:30 ` [tarantool-patches] " Konstantin Osipov
  2018-04-28 20:45   ` Konstantin Osipov
@ 2018-05-04 14:09   ` Konstantin Belyavskiy
  1 sibling, 0 replies; 4+ messages in thread
From: Konstantin Belyavskiy @ 2018-05-04 14:09 UTC (permalink / raw)
  To: Konstantin Osipov, georgy; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 1988 bytes --]

Ok, so we can keep the relay state and get the last error from it.
For example, in case of a disconnect (broken pipe) we get a SocketError,
and in case of an LSN gap, an XlogGapError.
So for diagnostics, if the state is STOPPED, just print the last error
from the relay.

So the changes to do are:
1. Align the relay states to match the applier states (OFF, FOLLOW and STOPPED).
2. Do not destroy the relay on error.
3. To simplify diagnostics, make box.info.replication print the last relay
   error if the state is STOPPED (see the sketch below).
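
A sketch of the intended output for the XlogGapError case (illustrative:
the downstream 'message' field is part of the proposal, and the text shown
is just an abbreviated example of what the relay's diag would hold):

  tarantool> box.info.replication[2].downstream
  ---
  - status: stopped
    message: 'Missing .xlog file between LSN ...'
  ...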

Kostya, Georgy, are you OK with this?


>Saturday, April 28, 2018, 23:30 +03:00 from Konstantin Osipov <kostja@tarantool.org>:
>
>* Konstantin Belyavskiy <k.belyavskiy@tarantool.org> [18/04/27 13:43]:
>> This fix improves the 'box.info.replication' output.
>> If the downstream fails and thus disconnects from the upstream,
>> report it by printing 'status: disconnected'.
>> Add relay_state { NONE, CONNECTED, DISCONNECTED } to track replica
>> presence: once connected, it stays either CONNECTED or DISCONNECTED
>> until the master is reset.
>> 
>> Closes #3365
>
>Hi,
>
>the patch is very good and almost ready to push, but I would like to
>request some more work.
>
>First, a very minor reason why I'm not pushing it right away is
>that I believe relay states should be matched with applier states.
>
>The matching applier states are OFF, FOLLOW and STOPPED.
>I think it would be easier for users if we don't invent new states
>on the relay side.
>
>Second, we allocate the relay object on the stack; this seems to be a
>historical artifact: we had struct relay before we got struct replica.
>The relay has a diagnostics area, so by keeping the relay around we
>will be able to display the last error in a way similar to the
>applier.
>
>I'm not talking about pushing the message back from the applier to
>the relay - this seems to be an unnecessary hassle and would complicate
>things quite a bit.
>
>
>-- 
>Konstantin Osipov, Moscow, Russia,  +7 903 626 22 32
>http://tarantool.io -  www.twitter.com/kostja_osipov
>


Best regards,
Konstantin Belyavskiy
k.belyavskiy@tarantool.org

[-- Attachment #2: Type: text/html, Size: 2878 bytes --]

