* [tarantool-patches] [PATCH v2] replication: do not fetch records twice @ 2018-06-14 16:11 Konstantin Belyavskiy 2018-07-10 15:16 ` [tarantool-patches] " Vladislav Shpilevoy 0 siblings, 1 reply; 4+ messages in thread From: Konstantin Belyavskiy @ 2018-06-14 16:11 UTC (permalink / raw) To: georgy; +Cc: tarantool-patches This is a draft paper covering the following topics: 1. A draft protocol for discovering and maintaining the network topology in the case of a large arbitrary network. 2. A list of required changes to support this feature. 3. Open questions and alternatives. RFC for #3294 --- Ticket: https://github.com/tarantool/tarantool/issues/3294 Branch: https://github.com/tarantool/tarantool/tree/kbelyavs/gh-3294-do-not-fetch-records-twice-rfc RFC: https://github.com/tarantool/tarantool/blob/kbelyavs/gh-3294-do-not-fetch-records-twice-rfc/doc/rfc/topology_discovering_protocol.md doc/rfc/topology_discovering_protocol.md | 66 ++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 doc/rfc/topology_discovering_protocol.md diff --git a/doc/rfc/topology_discovering_protocol.md b/doc/rfc/topology_discovering_protocol.md new file mode 100644 index 000000000..f99bbaba2 --- /dev/null +++ b/doc/rfc/topology_discovering_protocol.md @@ -0,0 +1,66 @@ +# Topology Discovering Protocol + +* **Status**: In progress +* **Start date**: 25-04-2018 +* **Authors**: Konstantin Belyavskiy @kbelyavs k.belyavskiy@tarantool.org, Georgy Kirichenko @georgy georgy@tarantool.org, Konstantin Osipov @kostja kostja@tarantool.org +* **Issues**: [#3294](https://github.com/tarantool/tarantool/issues/3294) + +## Summary + +Introduce a new space **_routing** to store the topology - a directed graph of routes between a master and its connected replicas. +Each node is responsible for keeping its current list of subscriptions in this table. For example, on subscribe a new node inserts records into this table representing its current list of subscriptions. If the connection to some peer is dropped, the node should delete the associated records (the subscription to this peer and to all nodes subscribed through it, if any). Each time a node reconnects, the table should be updated as well. Thus, for each change in topology, each affected downstream node (replica) should update the associated records in this table. +Every change in this table should trigger a specific logic, which is responsible for subscriptions and could issue a resubscribe request if, for example, a new node becomes available or a shorter path is found. If more than one path is available, the path with the smaller number of intermediate peers should be preferred. + +This draft covers the following topics: +- Discovering and maintaining the current network topology. Propose a protocol describing how an individual peer can observe the topology, and define each node's responsibility. +- Selective Subscribe. Extend the SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. In a full mesh configuration, only download records originating from the immediate peer. Do not download the records from other peers twice. +- Implement the trigger and subscription logic, maintaining subscriptions based on the currently known network topology. + +## Background and motivation + +Currently each Tarantool instance downloads from every peer all records in that peer's WAL except those whose instance id equals its own instance id. For example, in a full mesh of 3 replicas every record will be fetched twice. 
Instead, it could send each peer a subscribe request listing only the server ids that are not present in the subscribe requests sent to the other peers. +In a more complex case, where there is no direct connection between two nodes, subscribing through intermediate peers requires knowledge of the network topology. So the first task is to build the topology, and the second is the subscription logic based on the observed topology. + +## Detailed design + +Building such a topology is possible based on the following principles: +- Each node is required to notify all its downstream peers (replicas) about changes in its upstream subscription configuration. This can be done by adding/updating/deleting records in the **_routing** table. +- The connection with the smallest number of intermediate nodes has the highest priority. Let's define the number of edges between two peers as the Depth. So if A has a direct connection with B, the Depth is 1, and if A is connected to C through B, the Depth is 2. So if a direct path between two nodes exists, it should be used for the downstream peer's subscription. +- In case of equal Depth, the first connection wins. But if a shorter path is found, the node should first reconnect and then notify its downstream peers with the updated paths. + +**_routing** table details. + +| Subscriber | Replication Source | Subscribed via | Depth | +| :--------- | :----------------- | :------------- | :------ | +| UUID | UUID | UUID | Integer | + +So peer notifies his downstream peers (replicas) with updates in **_cluster** table which replicates to peers. When transmitting updated path to next level, increment Depth by one (if B connected to A with depth X and B is has only one upstream C then record will be {UUID_C, UUID_A, UUID_B, X+1}). + +### List of changes + +1. Extend the IPROTO_SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. Store these UUIDs within the applier's internal data structure. By default SUBSCRIBE is issued with an empty list, which means no filtering at all. +2. Implement white-list filtering in the relay. After processing a SUBSCRIBE request, the relay has a list of UUIDs. Extract the associated peer ids and fill in a filter. By default transmit all records, unless SUBSCRIBE was issued with at least one server UUID; in the latter case drop all records except those originating from the replicas in this list. +3. After issuing REQUEST_VOTE to all peers, the subscription logic knows a map of server UUIDs, their peers and their vclocks. For each reachable UUID select the shortest path and assign the UUID to the direct peer through which this path goes. Issue the subscribe requests. Notify downstream peers with the new topology. +4. Rebalancing. Connect/disconnect events should trigger the logic that starts the reassigning process. + - On disconnect, first find the "orphan" UUIDs and then reassign all reachable UUIDs to the direct peers through which the shortest paths go. Notify downstream peers. + - On connect, by iterating through the applier list, find UUIDs for which a shorter path has appeared, reassign them to the correct peers and issue SUBSCRIBE both for the recently connected applier and for the one from which these UUIDs were taken back. + +### Details and open questions + +On connect (a new client or an old one reconnecting) two options are available: +1. SUBSCRIBE only to the direct peer and wait for updates in **_cluster** to initiate further subscriptions. +2. SUBSCRIBE without any UUIDs (that means subscribe to all). + +## Rationale and alternatives + +### Topology Discovering + +Instead of **_cluster** table updates, encoded _iproto_ messages could be used. 
In this case, on every change in a peer's upstream topology, it should send a Map of *{UUID: depth}* representing its current list of subscriptions to all downstream peers (excluding the subset of subscriptions obtained from this peer in a master-master configuration). + +### On network configuration change + +On a network configuration change, what comes first: notifying peers or trying to resubscribe? +1. If the peer is a direct peer, then we have the most recent information about this node based on the connection status. If it is available, subscribe and immediately notify downstream peers. +2. On disconnect it is more complex: if some node in a connected subset is disconnected, the others can try to reconnect to this dead node through other nodes, but they make decisions based on old information, resulting in massive resubscribe requests (node A thinks that it has a connection to C through B, but B thinks that it is connected through A). So I think we first need to notify the replicas that the connection is dropped, and if another path is available, try to resubscribe and then notify all downstream peers again. Or we need to think about some kind of acknowledgement, since decisions could be based on outdated information. +3. When a shorter path is found, first resubscribe, then notify downstream peers. +4. Balancing. It is possible to slightly extend the topology with the number of subscribed peers for balancing, but is it really needed? -- 2.14.3 (Apple Git-98) ^ permalink raw reply [flat|nested] 4+ messages in thread
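To make item 2 of the list of changes above more concrete, the sketch below models the proposed relay-side white list in plain Lua. All names here (`make_filter`, `row.replica_id`) are invented for illustration; the real relay lives in Tarantool's C code, so this is only a model of the intended behaviour under stated assumptions, not the actual implementation.

```lua
-- A minimal model of the proposed relay white-list: an empty UUID list
-- keeps today's behaviour (send everything), a non-empty list lets only
-- rows originating from the listed replicas through.
local function make_filter(subscribed_replica_ids)
    if #subscribed_replica_ids == 0 then
        -- default: no filtering at all
        return function(row) return true end
    end
    local allowed = {}
    for _, id in ipairs(subscribed_replica_ids) do
        allowed[id] = true
    end
    return function(row)
        -- drop every record that does not originate from a white-listed replica
        return allowed[row.replica_id] == true
    end
end

-- A SUBSCRIBE that asked only for records originating from replica id 2:
local keep = make_filter({2})
assert(keep({replica_id = 2, lsn = 10}))
assert(not keep({replica_id = 3, lsn = 11}))
```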
* [tarantool-patches] Re: [PATCH v2] replication: do not fetch records twice 2018-06-14 16:11 [tarantool-patches] [PATCH v2] replication: do not fetch records twice Konstantin Belyavskiy @ 2018-07-10 15:16 ` Vladislav Shpilevoy 2018-08-06 11:03 ` [tarantool-patches] " Konstantin Belyavskiy 0 siblings, 1 reply; 4+ messages in thread From: Vladislav Shpilevoy @ 2018-07-10 15:16 UTC (permalink / raw) To: tarantool-patches, Konstantin Belyavskiy, georgy Hello. Thanks for the RFC! See my comments below. > +Building such topology is possible based on following principles: > +- Each node is required to notify all his downstream peers (replicas) in case of changes with his upstream subscription configuration. It could be done by add/update/delete records in **_routing** table. > +- The connection with lesser count of intermediate nodes has the highest priority. Lets define the number of edges between two peers as a Depth. So if A has direct connection with B, then Depth is 1 and if A connected with C through B, then Depth is 2. So if direct path between two nodes exists then it should be used in downstream peer subscription. > +- In case of equal Depth connections first wins. But if shorter path is found, then node first should reconnect and then notify downstream peers with updated paths. > + > +**_routing** table details. > + > +| Subscriber | Replication Source | Subscribed via | Depth | > +| :--------- | :----------------- | :------------- | :------ | > +| UUID | UUID | UUID | Integer | > + > +So peer notifies his downstream peers (replicas) with updates in **_cluster** table which replicates to peers. Why _cluster? I thought that you use _routing table. And by the way, please, say why we can not reuse _cluster for this? To be honest, I very upset at how much new system spaces we are creating now: _routing here, _promotion in my patch. Can't we find a way how to make them optional? How to create such system replicaset spaces on demand? > When transmitting updated path to next level, increment Depth by one (if B connected to A with > depth X and B is has only one upstream C then record will be {UUID_C, UUID_A, UUID_B, X+1}). Typo 'is has'. And I can not understand. Please, explain in layman's terms. Here C has replication source A and B, right? And A has replication source B. So C subscribed on B via A? I would like to see here concrete box.cfg examples for each instance. > + > +### List of changes > + > +1. Extend IPROTO_SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. Store this UUIDs within applier's internal data structure. By default issuing SUBSCRIBE with empty list what means no filtering at all. > +2. Implement white-list filtering in relay. After processing SUBSCRIBE request, relay has a list of UUIDs. Extract associated peer ids and fill in a filter. By default transmit all records, unless SUBSCRIBE was done with at least one server UUID. In latter case drop all records except originating from replicas in this list. > +3. After issuing REQUEST_VOTE to all peers, subscription logic knows a map of server UUIDs, their peers and their vclocks. For each reachable UUID select shortest path and assign UUIDs to direct peer through it this pass goes. Issue the subscribe requests. Notify downstream peers with new topology. > +4. Rebalancing. Connect/disconnect should trigger logic to start reassigning process. > + - On disconnect first find "orphan" and then reassigned all reachable UUIDs to direct peers through who shortest path goes. 
Notify downstream peers. > + - On connect, by iterating through appliers list, find UUIDs with shorter path found, reassign them to correct peers and issue SUBSCRIBE for recently connected applier and for the one from whom we get these UUIDs back. > + Why is so different connect/disconnect events processing? As I understand, in both cases you should recalculate optimal routes in the same way, from scratch. > +### Details and open questions > + > +On connect (new client or the old one reconnects) two options are available: > +1. SUBSCRIBE only to direct peer and wait for updates in **_cluster** to initiate further subscriptions. > +2. SUBSCRIBE without any UUIDs (that means subscribe to all). > + > +## Rationale and alternatives > + > +### Topology Discovering > + > +Instead of **_cluster** table updates, Again: _cluster or _routing? > encoded _iproto_ messages could be used. In this case, on every change in peer upstream topology, it should send a Map of *{UUID: depth}* representing its current list of subscriptions to all downstream peers (excluding subset of subscriptions obtaining from this peer in master-master configuration). > + > +### On network configuration change > + > +On network configuration change what first, to notify peers or try to resubscribe? > +1. If the peer is a direct peer, then we have most recent information about this node based on connection status. If available subscribe and immediately notify downstream peers. > +2. On disconnect, it's more complex since in a connected subset if some node is disconnected, others can try to reconnect to this dead node through other nodes, but they do decisions based on old information resulting to massive resubscribe request (node A thinks that it has connection to C through B, but B thinks that it is connected through A). So I think, first need to notify replicas, that connection is dropped, and if other path is available try to resubscribe and then notify all downstream again. Or need to think about some kind of acknowledgement since it could be based on outdated information. > +3. On shorter path found, first resubscribe, then notify downstream peers. > +4. Balancing. It's possible to slightly extend topology with number of peers subscribed for balancing but does it really needed? > ^ permalink raw reply [flat|nested] 4+ messages in thread
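Since the review above asks for concrete box.cfg examples, here is a rough sketch of the chained topology the discussion keeps returning to (A <- B <- C with no direct link between A and C). The URIs, ports and credentials are invented, and the **_routing** behaviour mentioned in the comments is only the proposal, not something these options enable today.

```lua
-- instance A: the only node C cannot reach directly
box.cfg{listen = 3301}

-- instance B: replicates from A
box.cfg{
    listen = 3302,
    replication = {'replicator:password@host_a:3301'},
}

-- instance C: replicates from B only; under the proposal it would learn
-- about A from B's _routing records and end up storing
-- {C_UUID, A_UUID, B_UUID, 2}, receiving A's rows through B.
box.cfg{
    listen = 3303,
    replication = {'replicator:password@host_b:3302'},
}
```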
* [tarantool-patches] Re: [tarantool-patches] Re: [PATCH v2] replication: do not fetch records twice 2018-07-10 15:16 ` [tarantool-patches] " Vladislav Shpilevoy @ 2018-08-06 11:03 ` Konstantin Belyavskiy 2018-08-08 19:35 ` Vladislav Shpilevoy 0 siblings, 1 reply; 4+ messages in thread From: Konstantin Belyavskiy @ 2018-08-06 11:03 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches [-- Attachment #1.1: Type: text/plain, Size: 6754 bytes --] Hello, Vlad. Thank you for the review. I fixed issues you had mentioned. In comments below I will try to explain my position regarding to several design proposal. Please find an updated RFC in attachment. >Вторник, 10 июля 2018, 18:16 +03:00 от Vladislav Shpilevoy <v.shpilevoy@tarantool.org>: > >Hello. Thanks for the RFC! See my comments below. > >> +Building such topology is possible based on following principles: >> +- Each node is required to notify all his downstream peers (replicas) in case of changes with his upstream subscription configuration. It could be done by add/update/delete records in **_routing** table. >> +- The connection with lesser count of intermediate nodes has the highest priority. Lets define the number of edges between two peers as a Depth. So if A has direct connection with B, then Depth is 1 and if A connected with C through B, then Depth is 2. So if direct path between two nodes exists then it should be used in downstream peer subscription. >> +- In case of equal Depth connections first wins. But if shorter path is found, then node first should reconnect and then notify downstream peers with updated paths. >> + >> +**_routing** table details. >> + >> +| Subscriber | Replication Source | Subscribed via | Depth | >> +| :--------- | :----------------- | :------------- | :------ | >> +| UUID | UUID | UUID | Integer | >> + >> +So peer notifies his downstream peers (replicas) with updates in **_cluster** table which replicates to peers. > >Why _cluster? I thought that you use _routing table. And by the way, please, say >why we can not reuse _cluster for this? To be honest, I very upset at how much >new system spaces we are creating now: _routing here, _promotion in my patch. > >Can't we find a way how to make them optional? How to create such system >replicaset spaces on demand? Using system spaces seems the easiest option since it's guarantees luck of conflicts. Since it's only a draft proposal, final format of such space is not yet defined. After finishing this I will look for possibilities to not occupy extra system space slot, but reuse existing one or create on demand if it's necessary. But right now let me assume that this is a stand alone table defined within system space. > >> When transmitting updated path to next level, increment Depth by one (if B connected to A with >> depth X and B is has only one upstream C then record will be {UUID_C, UUID_A, UUID_B, X+1}). > >Typo 'is has'. > >And I can not understand. Please, explain in layman's terms. Here >C has replication source A and B, right? >And A has replication source B. >So C subscribed on B via A? > >I would like to see here concrete box.cfg examples for each instance. Rewrite this section to make it more clear. >> + >> +### List of changes >> + >> +1. Extend IPROTO_SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. Store this UUIDs within applier's internal data structure. By default issuing SUBSCRIBE with empty list what means no filtering at all. >> +2. Implement white-list filtering in relay. 
After processing SUBSCRIBE request, relay has a list of UUIDs. Extract associated peer ids and fill in a filter. By default transmit all records, unless SUBSCRIBE was done with at least one server UUID. In latter case drop all records except originating from replicas in this list. >> +3. After issuing REQUEST_VOTE to all peers, subscription logic knows a map of server UUIDs, their peers and their vclocks. For each reachable UUID select shortest path and assign UUIDs to direct peer through it this pass goes. Issue the subscribe requests. Notify downstream peers with new topology. >> +4. Rebalancing. Connect/disconnect should trigger logic to start reassigning process. >> + - On disconnect first find "orphan" and then reassigned all reachable UUIDs to direct peers through who shortest path goes. Notify downstream peers. >> + - On connect, by iterating through appliers list, find UUIDs with shorter path found, reassign them to correct peers and issue SUBSCRIBE for recently connected applier and for the one from whom we get these UUIDs back. >> + > >Why is so different connect/disconnect events processing? As I understand, in both cases you should >recalculate optimal routes in the same way, from scratch. Updated this section. My motivation to separate connect and disconnect is because disconnect could cause a race condition if, for example A has a downstreams B and C, which also connected together. Now imagine situation, that A disconnects and both B and C try to subscribe to A through each other. In Rationale and alternatives section I provide more details with two alternative solutions for this possible conflict. > >> +### Details and open questions >> + >> +On connect (new client or the old one reconnects) two options are available: >> +1. SUBSCRIBE only to direct peer and wait for updates in **_cluster** to initiate further subscriptions. >> +2. SUBSCRIBE without any UUIDs (that means subscribe to all). >> + >> +## Rationale and alternatives >> + >> +### Topology Discovering >> + >> +Instead of **_cluster** table updates, > >Again: _cluster or _routing? _routing, fixed. > >> encoded _iproto_ messages could be used. In this case, on every change in peer upstream topology, it should send a Map of *{UUID: depth}* representing its current list of subscriptions to all downstream peers (excluding subset of subscriptions obtaining from this peer in master-master configuration). >> + >> +### On network configuration change >> + >> +On network configuration change what first, to notify peers or try to resubscribe? >> +1. If the peer is a direct peer, then we have most recent information about this node based on connection status. If available subscribe and immediately notify downstream peers. >> +2. On disconnect, it's more complex since in a connected subset if some node is disconnected, others can try to reconnect to this dead node through other nodes, but they do decisions based on old information resulting to massive resubscribe request (node A thinks that it has connection to C through B, but B thinks that it is connected through A). So I think, first need to notify replicas, that connection is dropped, and if other path is available try to resubscribe and then notify all downstream again. Or need to think about some kind of acknowledgement since it could be based on outdated information. >> +3. On shorter path found, first resubscribe, then notify downstream peers. >> +4. Balancing. 
It's possible to slightly extend topology with number of peers subscribed for balancing but does it really needed? >> > Best regards, Konstantin Belyavskiy k.belyavskiy@tarantool.org [-- Attachment #1.2: Type: text/html, Size: 8310 bytes --] [-- Attachment #2: topology_discovering_protocol.md --] [-- Type: application/octet-stream, Size: 8052 bytes --] # Topology Discovering Protocol * **Status**: In progress * **Start date**: 25-04-2018 * **Authors**: Konstantin Belyavskiy @kbelyavs k.belyavskiy@tarantool.org, Georgy Kirichenko @georgy georgy@tarantool.org, Konstantin Osipov @kostja kostja@tarantool.org * **Issues**: [#3294](https://github.com/tarantool/tarantool/issues/3294) ## Summary Introduce a new space **_routing** to store topology - a directed graph of routes between master and connected replicas. Each node is responsible to keep its current list of subscriptions in this table. For example, on subscribe a new node inserts new records to this table representing its current list of subscriptions. If connection to some peer is dropped, node should delete associated records (subscription to this node and all nodes subscribed through it if any). Each time a node is connected again, a table also should be updated. Thus, for each change in topology, each affected downstream node (replica) should update associated records in this table. Every change in this table should trigger a specific logic, which is responsible for subscriptions and could issue a resubscribe request if, for example, a new node is available or shorter path is found. If there are more than one path available, when a path with shorter number of intermediate peers should be preferred. This Draft covers following topics: - Discovering and maintaining current network topology. Propose a protocol describing how individual peer can observe topology and defining each node responsibility. - Selective Subscribe. Extend SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. In a full mesh configuration, only download records originating from the immediate peer. Do not download the records from other peers twice. - Implement trigger and subscription logic, maintaining subscriptions based on known current network topology. ## Background and motivation Currently each Tarantool instance will download from all peers all records in their WAL except records with instance id equal to the self instance id. For example, in a full mesh of 3 replicas all record will be fetched twice. Instead, it could send a subscribe request to its peers with server ids which are not present in other subscribe requests. In more complex case, if there is no direct connection between two nodes, to subscribe through intermediate peers we should know network topology. So the first task is to build a topology and then a subscription logic based on observed topology. ## Detailed design Building such topology is possible based on following principles: - Each node is required to notify all his downstream peers (replicas) in case of changes with his upstream subscription configuration. It could be done by add/update/delete records in **_routing** table. - The connection with lesser count of intermediate nodes has the highest priority. Lets define the number of edges between two peers as a Depth. So if A has direct connection with B, then Depth is 1 and if A connected with C through B, then Depth is 2. So if direct path between two nodes exists then it should be used in downstream peer subscription. 
- In case of equal Depth, the first connection wins. But if a shorter path is found, the node should first reconnect and then notify its downstream peers with the updated paths. **_routing** table details. | Subscriber | Replication Source | Subscribed via | Depth | | :--------- | :----------------- | :------------- | :------ | | UUID | UUID | UUID | Integer | So a peer notifies its downstream peers (replicas) with updates in the **_routing** table, which replicates to peers. For example, imagine that A has a downstream B, and B has a downstream C (there is no direct path between A and C). In such a case B will have only one record in its table: {B_UUID, A_UUID, A_UUID, 1}. Initially C will have two records: 1. {C_UUID, B_UUID, B_UUID, 1} - its own connection to B; 2. {B_UUID, A_UUID, A_UUID, 1} - the one replicated from B. Now it sees an opportunity to connect to A through B, so it will insert a third record: 3. {C_UUID, A_UUID, B_UUID, 2}. ### List of changes 1. Extend the IPROTO_SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. Store these UUIDs within the applier's internal data structure. By default SUBSCRIBE is issued with an empty list, which means no filtering at all. 2. Implement white-list filtering in the relay. After processing a SUBSCRIBE request, the relay has a list of UUIDs. Extract the associated peer ids and fill in a filter. By default transmit all records, unless SUBSCRIBE was issued with at least one server UUID; in the latter case drop all records except those originating from the replicas in this list. 3. After issuing REQUEST_VOTE to all peers, the subscription logic knows a map of server UUIDs, their peers and their vclocks. For each reachable UUID select the shortest path and assign the UUID to the direct peer through which this path goes. Issue the subscribe requests. Notify downstream peers with the new topology. 4. On connect/disconnect actions and rebalancing. Every update in the **_routing** table, as well as a connect/disconnect of a direct upstream peer, should trigger the logic which may start the reassigning process. - On disconnect from a direct master, a peer should first remove the associated records from its table and replicate this removal to its downstreams, then reassign all reachable UUIDs it was subscribed to via this peer to other direct peers using the shortest-path rule. After a successful connection, notify downstream peers again by inserting new records into the table. - On connect, by iterating through the applier list, find UUIDs for which a shorter path has appeared, reassign them to the correct peers and issue SUBSCRIBE both for the recently connected applier and for the one from which these UUIDs were taken back. - On every change in the table that affects this peer (either a shorter path is found or a connection to some peer through one of the upstreams is no longer available), do the same as for a disconnect/connect of a direct peer. ### Details and open questions On connect (a new client or an old one reconnecting) two options are available: 1. SUBSCRIBE only to the direct peer and wait for updates in **_cluster** to initiate further subscriptions. 2. SUBSCRIBE without any UUIDs (that means subscribe to all). ## Rationale and alternatives ### Topology Discovering Instead of **_routing** table updates, encoded _iproto_ messages could be used. In this case, on every change in a peer's upstream topology, it should send a Map of *{UUID: depth}* representing its current list of subscriptions to all downstream peers (excluding the subset of subscriptions obtained from this peer in a master-master configuration). 
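As an illustration of the *{UUID: depth}* alternative, the sketch below shows how a downstream node could fold a map received from a direct upstream into its own view, applying the increment-Depth-by-one rule and keeping the shortest known path. The function and field names are invented for this example; the proposal does not fix any implementation.

```lua
-- Merge a {uuid = depth} map received from the direct upstream
-- `upstream_uuid` into `my_view`, which maps each reachable replication
-- source to the direct peer to subscribe via and the resulting Depth.
local function merge_upstream_map(my_view, upstream_uuid, upstream_map)
    for source_uuid, depth in pairs(upstream_map) do
        local candidate = depth + 1
        local known = my_view[source_uuid]
        if known == nil or candidate < known.depth then
            -- the first path wins on ties, a strictly shorter one replaces it
            my_view[source_uuid] = {via = upstream_uuid, depth = candidate}
        end
    end
    return my_view
end

-- B reports that it is subscribed to A directly (depth 1), so C now knows
-- it can reach A via B with depth 2.
local view_of_C = merge_upstream_map({}, 'B_UUID', {A_UUID = 1})
assert(view_of_C.A_UUID.via == 'B_UUID' and view_of_C.A_UUID.depth == 2)
```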
### On network configuration change On a network configuration change, what comes first: notifying peers or trying to resubscribe? 1. If the peer is a direct peer, then we have the most recent information about this node based on the connection status. If it is available, subscribe and immediately notify downstream peers. 2. On disconnect it is more complex: if some node in a connected subset is disconnected, the others can try to reconnect to this dead node through other nodes, but they make decisions based on old information, resulting in massive resubscribe requests (node A thinks that it has a connection to C through B, but B thinks that it is connected through A). So I think we first need to notify the replicas that the connection is dropped, and if another path is available, try to resubscribe and then notify all downstream peers again. Or we need to think about some kind of acknowledgement, since decisions could be based on outdated information. To avoid such loops, one of two techniques could be used: a search for loops in the directed graph, or the use of a validated Subscriber (based on the fact that a direct peer receives heartbeat messages from the master, so the subscription chain should end with a validated peer). 3. When a shorter path is found, first resubscribe, then notify downstream peers. 4. Balancing. It is possible to slightly extend the topology with the number of subscribed peers for balancing, but is it really needed? ^ permalink raw reply [flat|nested] 4+ messages in thread
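To connect the pieces of the attached RFC above, here is a small sketch of the planning step from item 3 of its list of changes: given this node's **_routing**-style records, pick for every reachable source UUID the direct peer offering the shortest path, and group the UUIDs into one SUBSCRIBE list per direct peer. Everything here (record fields, function name) is invented for illustration and is not part of the patch.

```lua
-- records: {subscriber = ..., source = ..., via = ..., depth = ...}
-- direct_peers: set of UUIDs of peers this node is directly connected to.
local function plan_subscriptions(records, my_uuid, direct_peers)
    local best = {}  -- source UUID -> {via = direct peer, depth = N}
    for _, r in ipairs(records) do
        if r.subscriber == my_uuid and direct_peers[r.via] then
            local cur = best[r.source]
            if cur == nil or r.depth < cur.depth then  -- first wins on equal Depth
                best[r.source] = {via = r.via, depth = r.depth}
            end
        end
    end
    local plan = {}  -- direct peer -> list of source UUIDs to put into SUBSCRIBE
    for source, choice in pairs(best) do
        plan[choice.via] = plan[choice.via] or {}
        table.insert(plan[choice.via], source)
    end
    return plan
end

-- C is directly connected to B and reaches A through it, as in the example
-- from the attachment: one SUBSCRIBE to B then covers both A and B.
local plan = plan_subscriptions({
    {subscriber = 'C', source = 'B', via = 'B', depth = 1},
    {subscriber = 'C', source = 'A', via = 'B', depth = 2},
}, 'C', {B = true})
assert(#plan.B == 2)
```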
* [tarantool-patches] Re: [PATCH v2] replication: do not fetch records twice 2018-08-06 11:03 ` [tarantool-patches] " Konstantin Belyavskiy @ 2018-08-08 19:35 ` Vladislav Shpilevoy 0 siblings, 0 replies; 4+ messages in thread From: Vladislav Shpilevoy @ 2018-08-08 19:35 UTC (permalink / raw) To: Konstantin Belyavskiy; +Cc: tarantool-patches Hi! Thanks for the fixes! See 4 comments below. 1. Please, name the file according to the template described here: https://github.com/tarantool/tarantool/blob/kbelyavs/gh-3294-do-not-fetch-records-twice-rfc/doc/rfc/template.md#summary Your file name should match "NNNN-name_in_snake_case.md" where NNNN - is an issue number. > # Topology Discovering Protocol > > * **Status**: In progress > * **Start date**: 25-04-2018 > * **Authors**: Konstantin Belyavskiy @kbelyavs k.belyavskiy@tarantool.org, Georgy Kirichenko @georgy georgy@tarantool.org, Konstantin Osipov @kostja kostja@tarantool.org > * **Issues**: [#3294](https://github.com/tarantool/tarantool/issues/3294) > > ## Summary > > Introduce a new space **_routing** to store topology - a directed graph of routes between master and connected replicas. > Each node is responsible to keep its current list of subscriptions in this table. For example, on subscribe a new node inserts new records to this table representing its current list of subscriptions. If connection to some peer is dropped, node should delete associated records (subscription to this node and all nodes subscribed through it if any). Each time a node is connected again, a table also should be updated. Thus, for each change in topology, each affected downstream node (replica) should update associated records in this table. > Every change in this table should trigger a specific logic, which is responsible for subscriptions and could issue a resubscribe request if, for example, a new node is available or shorter path is found. If there are more than one path available, when a path with shorter number of intermediate peers should be preferred. 2. "when a path" -> "then a path". > > This Draft covers following topics: > - Discovering and maintaining current network topology. Propose a protocol describing how individual peer can observe topology and defining each node responsibility. > - Selective Subscribe. Extend SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. In a full mesh configuration, only download records originating from the immediate peer. Do not download the records from other peers twice. > - Implement trigger and subscription logic, maintaining subscriptions based on known current network topology. > > ## Background and motivation > > Currently each Tarantool instance will download from all peers all records in their WAL except records with instance id equal to the self instance id. For example, in a full mesh of 3 replicas all record will be fetched twice. Instead, it could send a subscribe request to its peers with server ids which are not present in other subscribe requests. > In more complex case, if there is no direct connection between two nodes, to subscribe through intermediate peers we should know network topology. So the first task is to build a topology and then a subscription logic based on observed topology. > > ## Detailed design > > Building such topology is possible based on following principles: > - Each node is required to notify all his downstream peers (replicas) in case of changes with his upstream subscription configuration. 
It could be done by add/update/delete records in **_routing** table. > - The connection with lesser count of intermediate nodes has the highest priority. Lets define the number of edges between two peers as a Depth. So if A has direct connection with B, then Depth is 1 and if A connected with C through B, then Depth is 2. So if direct path between two nodes exists then it should be used in downstream peer subscription. > - In case of equal Depth connections first wins. But if shorter path is found, then node first should reconnect and then notify downstream peers with updated paths. > > **_routing** table details. > > | Subscriber | Replication Source | Subscribed via | Depth | > | :--------- | :----------------- | :------------- | :------ | > | UUID | UUID | UUID | Integer | > > So peer notifies his downstream peers (replicas) with updates in **_routing** table which replicates to peers. > For example, imagine that A has a downstream B and B has C (there is no direct path between A and C). In such case B will have only one record in his table: {B_UUID, A_UUID, A_UUID, 1}. Initially C will have two records: > 1. {C_UUID, B_UUID, B_UUID, 1} - its own connection to B; > 2. {B_UUID, A_UUID, A_UUID, 1} - the one replicated from B; > Now it sees opportunity to connect to A through B, so it will insert third record: > 3. {C_UUID, A_UUID, B_UUID, 2}. > > ### List of changes > > 1. Extend IPROTO_SUBSCRIBE command with a list of server UUIDs for which SUBSCRIBE should fetch changes. Store this UUIDs within applier's internal data structure. By default issuing SUBSCRIBE with empty list what means no filtering at all. > 2. Implement white-list filtering in relay. After processing SUBSCRIBE request, relay has a list of UUIDs. Extract associated peer ids and fill in a filter. By default transmit all records, unless SUBSCRIBE was done with at least one server UUID. In latter case drop all records except originating from replicas in this list. > 3. After issuing REQUEST_VOTE to all peers, subscription logic knows a map of server UUIDs, their peers and their vclocks. For each reachable UUID select shortest path and assign UUIDs to direct peer through it this pass goes. Issue the subscribe requests. Notify downstream peers with new topology. > 4. On connect/disconnect actions and rebalancing. > Every update in **_routing** table or connect/disconnect with direct upstream peers should trigger logic which may start reassigning process. > - On disconnect from direct master, a peer should first remove associated records from his table and replicates it to his downstreams, then reassigned all reachable UUIDs he was subsrcibed via this peer to other direct peers using shortest path rule. After successful connection notify downstream peers again by inserting new records to table. > - On connect, by iterating through appliers list, find UUIDs with shorter path found, reassign them to correct peers and issue SUBSCRIBE for recently connected applier and for the one from whom we get these UUIDs back. > - On every change in table if this change affected this peer (either shorter path is found or connection to some peer through one of upstreams is no longer available) do the same action as for disconnect/connect from direct peer. > > ### Details and open questions > > On connect (new client or the old one reconnects) two options are available: > 1. SUBSCRIBE only to direct peer and wait for updates in **_cluster** to initiate further subscriptions. 3. Again the same typo? _cluster -> _routing? > 2. 
SUBSCRIBE without any UUIDs (that means subscribe to all). > > ## Rationale and alternatives > > ### Topology Discovering > > Instead of **_routing** table updates, encoded _iproto_ messages could be used. In this case, on every change in peer upstream topology, it should send a Map of *{UUID: depth}* representing its current list of subscriptions to all downstream peers (excluding subset of subscriptions obtaining from this peer in master-master configuration). > > ### On network configuration change > > On network configuration change what first, to notify peers or try to resubscribe? > 1. If the peer is a direct peer, then we have most recent information about this node based on connection status. If available subscribe and immediately notify downstream peers. > 2. On disconnect, it's more complex since in a connected subset if some node is disconnected, others can try to reconnect to this dead node through other nodes, but they do decisions based on old information resulting to massive resubscribe request (node A thinks that it has connection to C through B, but B thinks that it is connected through A). So I think, first need to notify replicas, that connection is dropped, and if other path is available try to resubscribe and then notify all downstream again. Or need to think about some kind of acknowledgement since it could be based on outdated information. To avoid such loops, one of two technique could be used: a search for loops in directed graph algorithm or use of validated Subscriber (based on fact that direct peer receives heartbeat messages from master and subscription chain should ends with validated peer). > 3. On shorter path found, first resubscribe, then notify downstream peers. > 4. Balancing. It's possible to slightly extend topology with number of peers subscribed for balancing but does it really needed? 4. Other things are ok for me, but I am not sure still that I've comprehended it. ^ permalink raw reply [flat|nested] 4+ messages in thread
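The review's second point keeps coming back to the loop problem (A believes it reaches a node via B while B believes the opposite). A sketch of the "search for loops in the directed graph" option mentioned in the RFC could look like the following; the table layout and names are invented, and in practice such a check would run over the _routing records before a new route is accepted.

```lua
-- via[x] = the peer through which x is currently subscribed; nil means x
-- talks to the replication source directly (a validated subscriber).
local function creates_loop(via, start)
    local seen = {[start] = true}
    local node = via[start]
    while node ~= nil do
        if seen[node] then
            return true   -- walked into a node we have already visited
        end
        seen[node] = true
        node = via[node]
    end
    return false          -- the chain ends at a directly connected peer
end

-- A thinks it reaches the source via B while B proposes to go via A:
assert(creates_loop({A = 'B', B = 'A'}, 'A'))
-- A healthy chain C -> B -> A -> (direct) has no loop:
assert(not creates_loop({C = 'B', B = 'A'}, 'C'))
```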