[Tarantool-patches] [PATCH v2 1/5] box: introduce matrix clock

Serge Petrenko sergepetrenko at tarantool.org
Thu Mar 19 14:28:06 MSK 2020

> On 19 Mar 2020, at 12:17, Sergey Ostanevich <sergos at tarantool.org> wrote:
> 
> On 19 Mar 11:41, 'Konstantin Osipov' wrote:
>> * Timur Safin <tsafin at tarantool.org> [20/03/19 11:11]:
>>> : > A matrix clock which allows to maintain a set of vclocks and
>>> : > their components' order. The main target is to be able to
>>> : > build a vclock which contains lsns, each less than or equal
>>> : > to the corresponding lsn from the matrix clock.
>>> : >
>>> : > The purpose of the matrix clock is to evaluate a vclock
>>> : > which has already been processed by WAL consumers, such as
>>> : > relays, or to obtain a majority vclock to commit journal
>>> : > entries in case of synchronous replication.
>>> : >
>>> : > @sergepetrenko: refactoring & rewrite comments to doxygen style.
>>> : 
>>> : I think we have discussed that matrix clock should not be pushed.
>>> : 
>>> : It's a huge overcomplication.
>>> : 
>>> : Why are you committing this?
>>> : 
>>> : --
>>> : Konstantin Osipov, Moscow, Russia
>>> 
>>> That's a very good question. There is a smell of some miscommunication
>>> between the parties involved; hopefully we will resolve it soon.
>>> Last time we gathered to discuss sync replication, the consensus was
>>> that we do not want a matrix clock, as it overcomplicates the conflict
>>> resolution process (so, at least, it did not look like a prerequisite
>>> for the sync replication mechanism).

Hi, Timur. It’s not about conflict resolution or anything user-visible. I guess the structure naming
is misleading here. Please see a more detailed answer below.

>> 
>> George should clarify, but AFAIU his original design introduced
>> the matrix clock to both GC and sync replication. This series only
>> touches the GC. George reported there was an issue with the current
>> GC tracker: basically, it becomes non-functional when sync
>> replication is in place - I don't know what the issue is.
>> 
>> I'd love to discuss the problem first, and then see alternatives.
>> 
>> The thing is, I'd like our vclock to become sparse one day and be
>> able to contain thousands of entries. We could use a dynamic data structure
>> which changes the underlying structure depending on the actual
>> member count.
>> To get there and stay efficient, we need to make sure we never
>> copy an entire vclock by value, and instead begin passing around
>> objects representing a "vclock diff". Maintaining a sparse matrix
>> would be hell in this case.
>> 
>>> Serge, if I miss some important details here, I'd love to be corrected.
>>> I do feel there are some other reasons involved, which I'm probably
>>> simply not aware of.
> 
> The discussion was with MRG users of Tarantool, and their point was: they
> are facing problems with consistent state restore and root cause
> analysis in case of a Tarantool failure.
> It's a huge burden for them to debug and fix their applications even with
> a vector clock, while the introduction of a matrix one would make this
> task impossible.

The patches do not make anything harder for the customers. The ‘matrix clock’ is just an
in-memory structure which cannot be accessed by the database user.
It is used to build the minimal possible vclock which is still needed by the replicas.
We may rename it to whatever you wish, just like Georgy said. Let it be a vclock sorted set
(although we already have a vclock set), or something else, like a vclock sorter, whatever.
This structure is not persisted anywhere and no one sees it.

We do need to rework our GC, and Georgy’s patches regarding matrix clock perform exactly
this task.

First of all, let me describe the issue with the current WAL GC:
it tracks consumers (i.e. remote replicas) by their vclock signature,
which is the sum of all vclock components.
This has always been wrong (since the introduction of vclocks, at least).
Say you have 2 masters, A and B, with ids 1 and 2 respectively, and a replica C with id 3.
The example is a little synthetic, but it illustrates the problem.
Say C replicates from both A and B, and there is no replication between A and B (say, the
instances were reconfigured to not replicate from each other).
Now say replica C has followed A and B up to vclock {1:5, 2:13}. At the same time, A has lsn 10
and B has lsn 15. A and B do not know about each other’s changes, so A’s vclock is {1:10} and
B’s vclock is {2:15}. Now imagine A makes a snapshot and creates a new xlog with signature 10.
A’s directory will look like: 00…000.xlog 00…010.snap 00…010.xlog
Replica C reports its vclock {1:5, 2:13} to A, and A uses this vclock to update the corresponding GC
consumer. Since signatures are used, the GC consumer is assigned the signature 13 + 5 = 18.
This is greater than the signature of the last xlog on A (10), so A assumes the previous xlog
(00…000.xlog) can be deleted. Actually, the replica still needs 00…000.xlog, because it contains
the rows corresponding to vclocks {1:6} - {1:10}, which haven’t been replicated yet.

If GC consumers used vclocks instead of vclock signatures, such a problem wouldn’t arise.
The replica would report its vclock {1:5, 2:13}. This vclock is NOT component-wise greater than
or equal to the vclock of A’s most recent xlog ({1:10}), so the previous log is kept until the
replica reports a vclock {1:10, 2:something} (or {1:11, …}, and so on).
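
To make the comparison concrete, here is a minimal standalone sketch of both checks (just an
illustration, not Tarantool’s actual GC code; the struct layout, VCLOCK_MAX and the helper
names are made up):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/* Signature: the sum of all components, as the current GC uses. */
static int64_t
vclock_sum(const struct vclock *v)
{
	int64_t sum = 0;
	for (int i = 0; i < VCLOCK_MAX; i++)
		sum += v->lsn[i];
	return sum;
}

/* Component-wise comparison: every lsn of @a is at least the one of @b. */
static bool
vclock_ge(const struct vclock *a, const struct vclock *b)
{
	for (int i = 0; i < VCLOCK_MAX; i++)
		if (a->lsn[i] < b->lsn[i])
			return false;
	return true;
}

int
main(void)
{
	/* Replica C reported {1:5, 2:13}; A's last xlog starts at {1:10}. */
	struct vclock replica = {{0}}, xlog = {{0}};
	replica.lsn[1] = 5;
	replica.lsn[2] = 13;
	xlog.lsn[1] = 10;

	/* Signature check: 18 >= 10, so the older xlog is wrongly dropped. */
	printf("by signature: drop = %d\n",
	       vclock_sum(&replica) >= vclock_sum(&xlog));
	/* Vclock check: {1:5, 2:13} is not >= {1:10}, so the xlog is kept. */
	printf("by vclock:    drop = %d\n",
	       vclock_ge(&replica, &xlog));
	return 0;
}

It prints drop = 1 for the signature check and drop = 0 for the component-wise one.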

The local space rework performed in this patchset in the scope of issue #4114 makes the problem even
more obvious: I want to use the 0-th vclock component to track local space changes.
Now the master has its own 0-th vclock component, and the replica has its own one.
If I used the replica clock "as is" with the current GC consumer, it wouldn’t work at all: see example 1 below.
If I substituted the replica’s clock[0] with the master’s clock[0], it also wouldn’t work: see example 2 below.

Example 1:
master vclock {1:10}, replica vclock {0:5, 1:5}. The replica still needs the rows in range {1:6} - {1:10}, but its
vclock signature is the same as the master’s, so the master deletes the logs still needed by the replica.

Example 2:
master vclock {0:100500, 1:10}, replica vclock {0:1, 1:5}.
The master substitutes the replica’s 0-th vclock component, so the replica’s vclock becomes {0:100500, 1:5}.
Now the replica’s vclock signature is 100505, which is much greater than the signature 10 of the xlog
containing the rows in range {1:1} - {1:10}, including the range still needed by the replica: {1:6} - {1:10}.

It might seem that ignoring the 0-th vclock component is an option, but it’s not: we may ignore the 0-th
vclock component for GC consumers corresponding to remote replicas, but local instance checkpoints (snapshots)
are also tracked together with GC consumers, and the 0-th vclock component cannot be ignored for them.

OK, so now it’s clear that we need to use vclocks to track GC consumers. How do we find the minimal
vclock that is still needed? We need the biggest vclock that is less than or equal to the vclock of
every GC consumer. This vclock can be built using the matrix clock Georgy introduced. That’s it.
The matrix clock works as follows:
say you have consumer clocks {1:5, 2:10, 3:15}, {1:10, 2:6, 3:10} and {1:10, 2:15, 3:9}. The matrix
clock will build the vclock {1:5, 2:6, 3:9}, which holds, per replica id, the minimum lsn that is
still needed.
Now all the logs with vclocks strictly less than the built one may be deleted.
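
Here is a sketch of that computation, reusing the same made-up vclock struct as in the previous
sketch (the real structure maintains the per-component order incrementally instead of rescanning
all consumers, so this only shows the idea):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Build the component-wise minimum over a set of consumer vclocks:
 * the biggest vclock which is <= the vclock of every consumer.
 */
static void
vclock_min(const struct vclock *consumers, int count, struct vclock *min)
{
	for (int i = 0; i < VCLOCK_MAX; i++) {
		min->lsn[i] = consumers[0].lsn[i];
		for (int c = 1; c < count; c++) {
			if (consumers[c].lsn[i] < min->lsn[i])
				min->lsn[i] = consumers[c].lsn[i];
		}
	}
}

int
main(void)
{
	/* The three consumer clocks from the example above. */
	struct vclock c[3];
	memset(c, 0, sizeof(c));
	c[0].lsn[1] = 5;  c[0].lsn[2] = 10; c[0].lsn[3] = 15;
	c[1].lsn[1] = 10; c[1].lsn[2] = 6;  c[1].lsn[3] = 10;
	c[2].lsn[1] = 10; c[2].lsn[2] = 15; c[2].lsn[3] = 9;

	struct vclock min;
	vclock_min(c, 3, &min);
	/* Prints {1:5, 2:6, 3:9}. */
	printf("{1:%lld, 2:%lld, 3:%lld}\n", (long long)min.lsn[1],
	       (long long)min.lsn[2], (long long)min.lsn[3]);
	return 0;
}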

I hope my explanation answers most of the questions. Feel free to ask.

--
Serge Petrenko
sergepetrenko at tarantool.org


