From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp59.i.mail.ru (smtp59.i.mail.ru [217.69.128.39]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 41A56469719 for ; Thu, 19 Mar 2020 14:28:08 +0300 (MSK) From: Serge Petrenko Message-Id: <022BA1E5-1F84-400F-BF09-D363338CC296@tarantool.org> Content-Type: multipart/alternative; boundary="Apple-Mail=_66CF2734-4891-43AE-80F2-9044E1BC00E5" Mime-Version: 1.0 (Mac OS X Mail 13.0 \(3608.40.2.2.4\)) Date: Thu, 19 Mar 2020 14:28:06 +0300 In-Reply-To: <20200319091706.GH188@tarantool.org> References: <680467d22cb2864fb4c2d18ac33c4cccb272ebbb.1584558067.git.sergepetrenko@tarantool.org> <20200318200846.GB17681@atlas> <09ba01d5fdc5$f5e1fab0$e1a5f010$@tarantool.org> <20200319084144.GD5707@atlas> <20200319091706.GH188@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v2 1/5] box: introduce matrix clock List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Sergey Ostanevich , Timur Safin , Konstantin Osipov Cc: kirichenkoga@gmail.com, tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy --Apple-Mail=_66CF2734-4891-43AE-80F2-9044E1BC00E5 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > 19 =D0=BC=D0=B0=D1=80=D1=82=D0=B0 2020 =D0=B3., =D0=B2 12:17, Sergey = Ostanevich =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB= (=D0=B0): >=20 > On 19 =D0=BC=D0=B0=D1=80 11:41, 'Konstantin Osipov' wrote: >> * Timur Safin [20/03/19 11:11]: >>> : > A matrix clock which allows to maintain a set of vclocks and >>> : > their components order. The main target is to be able to >>> : > build a vclock which contains lsns each one is less or equal >>> : > that n corresponding lsn from a matrix clock. >>> : > >>> : > The purpose of the matrix clock is to evaluate a vclock >>> : > which is already processed by wal consumers like relays >>> : > or to obtain a majority vclock to commit journal entries >>> : > in case of synchronous replication. >>> : > >>> : > @sergepetrenko: refactoring & rewrite comments to doxygen style. >>> :=20 >>> : I think we have discussed that matrix clock should not be pushed. >>> :=20 >>> : It's a huge over complication. >>> :=20 >>> : Why are you committing this? >>> :=20 >>> : -- >>> : Konstantin Osipov, Moscow, Russia >>>=20 >>> That's the very good question. There is smell of some = miscommunication >>> between parties involved, hopefully we will resolve it soon.=20 >>> Last time we gathered to discuss sync replications, the consensus = was=20 >>> That we do not want matrix clock as they overcomplicate conflict = resolution process (so, at least, it was not looking like prerequisite = to sync >>> replications mechanism). Hi, Timur. It=E2=80=99s not about conflict resolution or anything = user-visible. I guess the structure naming is misguiding here. Please see a more detailed answer below. >>=20 >> George should clarify, but AFAIU his original design introduced >> matrix clock to GC and to sync replication. These series only >> touch the GC. George reported there was an issue with the current >> GC tracker, basically it becomes non-function when sync >> replication is in place -I don't know what the issue is.=20 >>=20 >> I'd love to discuss the problem first, and then see alternatives. >>=20 >> The thing is, I'd like our vclock to become sparse one day and be >> able to contain thousands of entries. We could use a dynamic data = structure >> which changes the underlying structure depending on the actual >> member count. >> To get there and stay efficient, we need to make sure we never >> copy entire vclock by value, and instead begin passing objects >> representing a "vclock diff" around. Maintaining a sparse matrix = would be >> hell in this case. >>=20 >>> Serge, if I miss some important detains here, I'd love to get = corrected=20 >>> here. I do feel there are some other reasons needed, which I = probably >>> simply not aware of. >=20 > The discussion was with MRG users of Tarantool and their point was: = they > are facing problems with consistent state restore and root cause > analysis in case of Tarantool failure.=20 > It's a huge burden for them to debug and fix their applications even = with=20 > vector clock, while introduction of matrix one will make this task=20 > impossible.=20 The patches are not making it harder for the customers. The =E2=80=98matri= x clock=E2=80=99 is just an in-memory structure, which cannot be accessed by the database user. It is used to build the minimal possible clock, which is still needed by = the replicas. We may rename it to whatever you wish, just like Georgy said. Let it be = vclock sorted set (although we already have a vclock set), or something else, like vclock = sorter, whatever. This structure is not persisted anywhere and no one sees it. We do need to rework our GC, and Georgy=E2=80=99s patches regarding = matrix clock perform exactly this task. First of all, let me describe the issue with the current WAL GC: it tracks consumers (i.e. remote replicas) by their vclock signature, which is the sum of all vclock components. It has always been wrong (since the introduction of vclocks, at least): Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a = replica C with id 3. The example will be a little synthetic, but it illustrates the problem: Say =D0=A1 replicates from both A and B, and there is no replication = between A and B (say, the instances were reconfigured to not replicate from each other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the = same time, A has lsn 10 and B has lsn 15. A and B do not know about each other=E2=80=99s = changes, so A=E2=80=99s vclock is {1:10} and B=E2=80=99s vclock is {2:15}. Now imagine A does a snapshot and creates = a new xlog with signature 10. A=E2=80=99s directory will look like: 00=E2=80=A6000.xlog 00=E2=80=A6010.s= nap 00=E2=80=A6.010.xlog Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to = update the corresponding GC consumer. Since signatures are used, GC consumer is assigned a signature = =3D 13 + 5 =3D 18. This is greater than the signature of the last xlog on A (10), so the = previous xlog (00=E2=80=A600.xlog) can be deleted (at least A assumes it can be). Actually, replica still needs = 00=E2=80=A600.xlog, because it contains rows corresponding to vclocks {1:6} - {1:10}, which haven=E2=80=99t been = replicated yet. If instead of using vclock signatures, gc consumers used vclocks, such a = problem wouldn=E2=80=99t arise. Replia would report its vclock {1:5, 2:13}. The vclock is NOT strictly = greater than A=E2=80=99s most recent xlog vclock ({1:10}), so the previous log is kept until replica reports = a vclock {1:10, 2:something}. (or {1:11, =E2=80=A6} and so on). Local space rework performed in this patchset in scope of the issue = #4114 makes the problem even more obvious: I want to use 0-th vclock component to track local space = changes. Now master has it=E2=80=99s own 0-th vclock component, and replica has = its own one. If I used the replica clock =C2=ABas is=C2=BB with current GC consumer, = it wouldn=E2=80=99t work at all: see example 1 below. If I substituted the replica clock[0] with masters clock[0] it also = wouldn=E2=80=99t work, see example 2 below. Example 1: master vclock {1:10}, replica vclock {0:5, 1:5}. Replica still needs = rows in range {1:5} - {1:10}, but its vclock signature is the same as master=E2=80=99s one, so master deletes logs = needed by replica. Example 2: master vclock {0:100500, 1:10}, replica vclock {0:1, 1:5}. master substitutes replica=E2=80=99s 0 vclock component, so it becomes = {0:100500, 1:5}. Now replica vclock signature is 100505, which is much greater than signature 10 of the xlog = containing rows in range {1:1} - {1:10} and especially the range still needed by replica: {1:6} - = {1:10}. It might seem than ignoring 0 vclock component is an option, but it=E2=80=99= s not: we may ignore 0-th vclock components for GC consumers corresponding to remote replicas, but local instance = checkpoints(snapshots) are also tracked together with GC consumers, and 0 vclock component cannot be ignored for = them. OK, so now it=E2=80=99s clear that we need to use vclocks to track GC = consumers. How do we find the minimal vclock that is still needed? We need the biggest vclock, that is less than = vclock of every GC consumer. This vclock can be built using the matrix clock Georgy introduced. That=E2=80=99s = it. Matrix clock works as follows: say you have consumer clocks: {1:5, 2:10, 3:15}, {1:10, 2:6, 3:10} and = {1:10, 2:15, 3:9}. The matrix clock will build a vclock {1:5, 2:6, 3:9}. This vclock holds the minimum component = per lsn that is still needed. Now all the logs with vclocks strictly less, than the built one, may be = deleted. I hope my explanation answers most of the questions. Feel free to ask. -- Serge Petrenko sergepetrenko@tarantool.org >=20 >>=20 >> --=20 >> Konstantin Osipov, Moscow, Russia --Apple-Mail=_66CF2734-4891-43AE-80F2-9044E1BC00E5 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8



19 =D0=BC=D0=B0=D1=80=D1=82=D0=B0 2020 =D0=B3., =D0=B2 12:17, = Sergey Ostanevich <sergos@tarantool.org> =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0= =D0=BB(=D0=B0):

On 19 =D0=BC=D0= =B0=D1=80 11:41, 'Konstantin Osipov' wrote:
* = Timur Safin <tsafin@tarantool.org> [20/03/19 11:11]:
: > A matrix clock = which allows to maintain a set of vclocks and
: > their = components order. The main target is to be able to
: > = build a vclock which contains lsns each one is less or equal
: > that n corresponding lsn from a matrix clock.
: >
: > The purpose of the matrix clock = is to evaluate a vclock
: > which is already processed = by wal consumers like relays
: > or to obtain a = majority vclock to commit journal entries
: > in case = of synchronous replication.
: >
: > = @sergepetrenko: refactoring & rewrite comments to doxygen style.
: 
: I think we have discussed that matrix clock should not be = pushed.
: 
: It's a = huge over complication.
: 
: Why are = you committing this?
: 
: --
: Konstantin Osipov, Moscow, Russia

That's the very good question. There is smell of some = miscommunication
between parties involved, hopefully we = will resolve it soon. 
Last time we = gathered to discuss sync replications, the consensus was 
That we do = not want matrix clock as they overcomplicate conflict resolution process = (so, at least, it was not looking like prerequisite to sync
replications mechanism).

Hi, Timur. It=E2=80=99s not about conflict resolution = or anything user-visible. I guess the structure naming
is = misguiding here. Please see a more detailed answer below.


George should clarify, but AFAIU his original design = introduced
matrix clock to GC and to sync replication. = These series only
touch the GC. George reported there was = an issue with the current
GC tracker, basically it becomes = non-function when sync
replication is in place -I don't = know what the issue is. 

I'd love to discuss the problem first, and then see = alternatives.

The thing is, I'd like our = vclock to become sparse one day and be
able to contain = thousands of entries. We could use a dynamic data structure
which changes the underlying structure depending on the = actual
member count.
To get there and stay = efficient, we need to make sure we never
copy entire = vclock by value, and instead begin passing objects
representing a "vclock diff" around. Maintaining a sparse = matrix would be
hell in this case.

Serge, if I miss some = important detains here, I'd love to get corrected 
here. I do = feel there are some other reasons needed, which I probably
simply not aware of.

The discussion was with MRG users of Tarantool and their = point was: they
are facing problems with consistent state restore and root = cause
analysis in = case of Tarantool failure. 
It's a huge burden for them to = debug and fix their applications even with 
vector clock, while introduction = of matrix one will make this task 
impossible. 

The patches = are not making it harder for the customers. The =E2=80=98matrix clock=E2=80= =99 is just an
in-memory structure, which cannot be accessed = by the database user.
It is used to build the minimal possible = clock, which is still needed by the replicas.
We may rename it = to whatever you wish, just like Georgy said. Let it be vclock sorted = set
(although we already have a vclock set), or something = else, like vclock sorter, whatever.
This structure is not = persisted anywhere and no one sees it.

We do need to rework our GC, and Georgy=E2=80=99s = patches regarding matrix clock perform exactly
this = task.

First of all, let me describe = the issue with the current WAL GC:
it tracks consumers (i.e. = remote replicas) by their vclock signature,
which is the sum = of all vclock components.
It has always been wrong (since the = introduction of vclocks, at least):
Say, you have 2 masters, A = and B with ids 1 and 2 respectively, and a replica C with id = 3.
The example will be a little synthetic, but it illustrates = the problem:
Say =D0=A1 replicates from both A and B, and = there is no replication between A and B (say, the
instances = were reconfigured to not replicate from each other).
Now, say = replica C has followed A and B to vclock {1:5, 2:13}. At the same time, = A has lsn 10
and B has lsn 15. A and B do not know about each = other=E2=80=99s changes, so A=E2=80=99s vclock is {1:10} = and
B=E2=80=99s vclock is {2:15}. Now imagine A does a = snapshot and creates a new xlog with signature 10.
A=E2=80=99s = directory will look like: 00=E2=80=A6000.xlog 00=E2=80=A6010.snap = 00=E2=80=A6.010.xlog
Replica C reports its vclock {1:5, 2:13} = to A, A uses the vclock to update the corresponding = GC
consumer. Since signatures are used, GC consumer is = assigned a signature =3D 13 + 5 =3D 18.
This is greater than = the signature of the last xlog on A (10), so the previous xlog = (00=E2=80=A600.xlog) can be
deleted (at least A assumes it can = be). Actually, replica still needs 00=E2=80=A600.xlog, because it = contains
rows corresponding to vclocks {1:6} - {1:10}, which = haven=E2=80=99t been replicated yet.

If instead of using vclock signatures, gc = consumers used vclocks, such a problem wouldn=E2=80=99t = arise.
Replia would report its vclock {1:5, 2:13}. The vclock = is NOT strictly greater than A=E2=80=99s most recent
xlog = vclock ({1:10}), so the previous log is kept until replica reports a = vclock {1:10, 2:something}.
(or {1:11, =E2=80=A6} and so = on).

Local space rework performed in = this patchset in scope of the issue #4114 makes the problem = even
more obvious: I want to use 0-th vclock component to = track local space changes.
Now master has it=E2=80=99s own = 0-th vclock component, and replica has its own one.
If I used = the replica clock =C2=ABas is=C2=BB with current GC consumer, it = wouldn=E2=80=99t work at all: see example 1 below.
If I = substituted the replica clock[0] with masters clock[0] it also = wouldn=E2=80=99t work, see example 2 below.

Example 1:
master vclock {1:10}, replica = vclock {0:5, 1:5}. Replica still needs rows in range {1:5} - {1:10}, but = its vclock
signature is the same as master=E2=80=99s one, so = master deletes logs needed by replica.

Example 2:
master vclock {0:100500, = 1:10}, replica vclock {0:1, 1:5}.
master substitutes = replica=E2=80=99s 0 vclock component, so it becomes {0:100500, 1:5}. Now = replica vclock
signature is 100505, which is much greater than = signature 10 of the xlog containing rows in range
{1:1} - = {1:10} and especially the range still needed by replica: {1:6} - = {1:10}.

It might seem than ignoring = 0 vclock component is an option, but it=E2=80=99s not: we may ignore = 0-th vclock components
for GC consumers corresponding to = remote replicas, but local instance checkpoints(snapshots) are also = tracked
together with GC consumers, and 0 vclock component = cannot be ignored for them.

OK, so = now it=E2=80=99s clear that we need to use vclocks to track GC = consumers. How do we find the minimal vclock
that is still = needed? We need the biggest vclock, that is less than vclock of every GC = consumer. This vclock
can be built using the matrix clock = Georgy introduced. That=E2=80=99s it. Matrix clock works as = follows:
say you have consumer clocks: {1:5, 2:10, 3:15}, = {1:10, 2:6, 3:10} and {1:10, 2:15, 3:9}. The matrix clock = will
build a vclock {1:5, 2:6, 3:9}. This vclock holds the = minimum component per lsn that is still needed.
Now all the = logs with vclocks strictly less, than the built one, may be = deleted.

I hope my explanation = answers most of the questions. Feel free to ask.




-- 
Konstantin = Osipov, Moscow, Russia

= --Apple-Mail=_66CF2734-4891-43AE-80F2-9044E1BC00E5--