is misleading here. Please see a more detailed answer below.
George should clarify, but AFAIU his original design introduced
the matrix clock both to GC and to sync replication. This series only
touches the GC. George reported there was an issue with the current
GC tracker: basically, it becomes non-functional when sync
replication is in place; I don't know what the issue is.
I'd love to discuss the problem first, and then see alternatives.
The thing is, I'd like our vclock to become sparse one day and be
able to contain thousands of entries. We could use a dynamic data
structure which changes its underlying representation depending on
the actual member count.
To get there and stay efficient, we need to make sure we never
copy the entire vclock by value, and instead begin passing around
objects representing a "vclock diff". Maintaining a sparse matrix
would be hell in this case.
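For illustration only, a rough C sketch of what a diff-based sparse
representation might look like (all names here are hypothetical,
nothing like this exists in the code yet):

    #include <stdint.h>

    /* Hypothetical sketch: a sparse vclock kept as a sorted array of
     * (replica id, lsn) pairs, plus a "diff" object carrying only the
     * components that changed, so a full vclock is never copied by value. */
    struct vclock_entry {
        uint32_t replica_id;
        int64_t lsn;
    };

    struct sparse_vclock {
        uint32_t size;                /* count of non-zero components */
        struct vclock_entry *entries; /* sorted by replica_id */
    };

    struct vclock_diff {
        uint32_t size;                /* count of changed components */
        struct vclock_entry *changes; /* new values for those components */
    };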
Serge, if I'm missing some important details here, I'd love to be
corrected. I do feel there are some other reasons, which I'm probably
simply not aware of.
The discussion was with MRG users of Tarantool, and their point was: they
are facing problems with consistent state restore and root cause analysis
in case of a Tarantool failure. It's a huge burden for them to debug and fix
their applications even with the vector clock, and the introduction of a
matrix one would make this task impossible.
The patches do not make anything harder for the customers. The "matrix clock" is just an
in-memory structure which cannot be accessed by the database user.
It is used to build the minimal possible vclock that is still needed by the replicas.
We may rename it to whatever you wish, just like Georgy said. Let it be a vclock sorted set
(although we already have a vclock set), or something else, like a vclock sorter, whatever.
This structure is not persisted anywhere and no one sees it.
We do need to rework our GC, and Georgy's patches regarding the matrix clock perform exactly
this task.
First of all, let me describe the issue with the current WAL GC:
it tracks consumers (i.e. remote replicas) by their vclock signature,
which is the sum of all vclock components.
It has always been wrong (since the introduction of vclocks, at least):
Say you have 2 masters, A and B, with ids 1 and 2 respectively, and a replica C with id 3.
The example is a little synthetic, but it illustrates the problem:
say C replicates from both A and B, and there is no replication between A and B (e.g. the
instances were reconfigured to not replicate from each other).
Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the same time, A has lsn 10
and B has lsn 15. A and B do not know about each other’s changes, so A’s vclock is {1:10} and
B’s vclock is {2:15}. Now imagine A does a snapshot and creates a new xlog with signature 10.
A's directory will look like: 00…000.xlog 00…010.snap 00…010.xlog
Replica C reports its vclock {1:5, 2:13} to A, and A uses this vclock to update the corresponding GC
consumer. Since signatures are used, the GC consumer is assigned a signature = 13 + 5 = 18.
This is greater than the signature of the last xlog on A (10), so the previous xlog (00…000.xlog) can be
deleted (at least A assumes it can be). Actually, the replica still needs 00…000.xlog, because it contains
rows corresponding to vclocks {1:6} - {1:10}, which haven't been replicated yet.
If GC consumers used vclocks instead of vclock signatures, such a problem wouldn't arise.
The replica would report its vclock {1:5, 2:13}. This vclock is NOT greater than or equal to A's most
recent xlog vclock ({1:10}) component-wise, so the previous log is kept until the replica reports a
vclock {1:10, 2:something} (or {1:11, …} and so on).
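Here is a toy C model of the two checks (a hypothetical simplification,
not Tarantool's actual vclock code; it merely reproduces the arithmetic
above):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VCLOCK_MAX 8
    typedef int64_t toy_vclock[VCLOCK_MAX]; /* lsn indexed by replica id */

    static int64_t signature(const toy_vclock v)
    {
        int64_t sum = 0;
        for (int i = 0; i < VCLOCK_MAX; i++)
            sum += v[i];
        return sum;
    }

    /* True if every component of a is >= the matching component of b. */
    static bool vclock_ge(const toy_vclock a, const toy_vclock b)
    {
        for (int i = 0; i < VCLOCK_MAX; i++)
            if (a[i] < b[i])
                return false;
        return true;
    }

    int main(void)
    {
        toy_vclock xlog_a = {0, 10, 0};    /* A's last xlog: {1:10} */
        toy_vclock replica_c = {0, 5, 13}; /* C's report: {1:5, 2:13} */
        /* Signature check wrongly allows deletion: 18 >= 10. */
        printf("by signature: %s\n",
               signature(replica_c) >= signature(xlog_a) ? "drop" : "keep");
        /* Component-wise check correctly keeps the log: 5 < 10. */
        printf("by vclock: %s\n",
               vclock_ge(replica_c, xlog_a) ? "drop" : "keep");
        return 0;
    }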
The local space rework performed in this patchset in the scope of issue #4114 makes the problem even
more obvious: I want to use the 0-th vclock component to track local space changes.
Now the master has its own 0-th vclock component, and the replica has its own one.
If I used the replica clock "as is" with the current GC consumer, it wouldn't work at all: see example 1 below.
If I substituted the replica's clock[0] with the master's clock[0], it also wouldn't work: see example 2 below.
Example 1:
master vclock {1:10}, replica vclock {0:5, 1:5}. The replica still needs rows in the range {1:6} - {1:10}, but its
vclock signature is the same as the master's, so the master deletes logs still needed by the replica.
Example 2:
master vclock {0:100500, 1:10}, replica vclock {0:1, 1:5}.
The master substitutes the replica's 0-th vclock component, so it becomes {0:100500, 1:5}. Now the replica's
vclock signature is 100505, which is much greater than the signature 10 of the xlog containing rows in the
range {1:1} - {1:10}, and in particular the range still needed by the replica: {1:6} - {1:10}.
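Both examples are easy to reproduce in the same toy signature model
(again, purely illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model: index 0 is the local vclock component. */
    static int64_t sig2(const int64_t v[2]) { return v[0] + v[1]; }

    int main(void)
    {
        /* Example 1: equal signatures, yet the replica lags on component 1. */
        int64_t master1[2]  = {0, 10};  /* {1:10} */
        int64_t replica1[2] = {5, 5};   /* {0:5, 1:5} */
        printf("ex1: master sig %lld, replica sig %lld\n",
               (long long)sig2(master1), (long long)sig2(replica1));

        /* Example 2: substituting the master's component 0 inflates the
         * replica's signature far beyond the xlog it still needs. */
        int64_t replica2[2] = {100500, 5}; /* {0:100500, 1:5} */
        printf("ex2: replica sig %lld vs xlog sig 10\n",
               (long long)sig2(replica2));
        return 0;
    }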
It might seem that ignoring the 0-th vclock component is an option, but it's not: we may ignore the 0-th
vclock component for GC consumers corresponding to remote replicas, but local instance checkpoints
(snapshots) are also tracked together with GC consumers, and the 0-th vclock component cannot be
ignored for them.
OK, so now it's clear that we need to use vclocks to track GC consumers. How do we find the minimal vclock
that is still needed? We need the biggest vclock that is less than or equal to the vclock of every GC consumer.
This vclock can be built using the matrix clock Georgy introduced. That's it. The matrix clock works as follows:
say you have consumer clocks {1:5, 2:10, 3:15}, {1:10, 2:6, 3:10} and {1:10, 2:15, 3:9}. The matrix clock will
build the vclock {1:5, 2:6, 3:9}. This vclock holds, for each component, the minimum lsn that is still needed.
Now all the logs with vclocks strictly less than the built one may be deleted.
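A sketch (not the actual implementation) of how such a minimal vclock
can be computed as the per-component minimum over all consumer vclocks:

    #include <stdint.h>
    #include <stdio.h>

    #define VCLOCK_MAX 4

    /* Per-component minimum over all consumer vclocks: the result is the
     * greatest vclock that is <= every consumer's vclock. Illustrative only. */
    static void gc_min_vclock(const int64_t consumers[][VCLOCK_MAX], int n,
                              int64_t min[VCLOCK_MAX])
    {
        for (int i = 0; i < VCLOCK_MAX; i++) {
            min[i] = consumers[0][i];
            for (int j = 1; j < n; j++)
                if (consumers[j][i] < min[i])
                    min[i] = consumers[j][i];
        }
    }

    int main(void)
    {
        const int64_t consumers[3][VCLOCK_MAX] = {
            {0, 5, 10, 15},  /* {1:5, 2:10, 3:15} */
            {0, 10, 6, 10},  /* {1:10, 2:6, 3:10} */
            {0, 10, 15, 9},  /* {1:10, 2:15, 3:9} */
        };
        int64_t min[VCLOCK_MAX];
        gc_min_vclock(consumers, 3, min);
        /* Prints "1:5 2:6 3:9"; xlogs strictly below it may be deleted. */
        for (int i = 1; i < VCLOCK_MAX; i++)
            printf("%d:%lld ", i, (long long)min[i]);
        printf("\n");
        return 0;
    }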
I hope my explanation answers most of the questions. Feel free to ask.
--
Konstantin Osipov, Moscow, Russia