Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 2/7] replication: forbid implicit limbo owner transition
Date: Fri, 18 Jun 2021 00:00:07 +0300	[thread overview]
Message-ID: <67248926-d264-9d01-a24e-d0fa108708e9@tarantool.org> (raw)
In-Reply-To: <797fdc67-944d-67f2-cd44-ff8ce79ce342@tarantool.org>



15.06.2021 23:55, Vladislav Shpilevoy пишет:
> Good job on the patch!
>
> See 4 comments below.

Thanks for the review!

> On 10.06.2021 15:32, Serge Petrenko via Tarantool-patches wrote:
>> Forbid limbo ownership transition without an explicit promote.
>> Make it so that synchronous transactions may be committed only after it
>> is claimed by some instance via a PROMOTE request.
>>
>> Make everyone but the limbo owner read-only even when the limbo is
>> empty.
>>
>> Part-of #6034
>>
>> @TarantoolBot document
>> Title: synchronous replication changes
>>
>> `box.info.synchro.queue` receives a new field: `owner`. It's a replica
>> id of the instance owning the synchronous transaction limbo.
>>
>> Once some instance owns the limbo, every other instance becomes
>> read-only. When the limbo is unclaimed, e.g.
>> `box.info.synchro.queue.owner` is `0`, everyone may be writeable, but
>> cannot create synchronous transactions.
>>
>> In order to claim or re-claim the limbo, you have to issue
>> `box.ctl.promote()` on the instance you wish to promote.
>>
>> When elections are enabled, the instance issues `box.ctl.promote()`
>> automatically once it wins the elections.
> 1. It might be not a good idea to mention the limbo in the public
> documentation. It is not described there anyhow, and raises the
> question "what is limbo?".
>
> Maybe better replace "limbo" with "queue" everywhere in the
> public texts such as doc requests, changelogs, and error
> messages.

Ok. Let it be 'synchronous transaction queue', or simply 'the queue' then.

>> diff --git a/src/box/errcode.h b/src/box/errcode.h
>> index d93820e96..e75f54a01 100644
>> --- a/src/box/errcode.h
>> +++ b/src/box/errcode.h
>> @@ -277,6 +277,7 @@ struct errcode_record {
>>   	/*222 */_(ER_QUORUM_WAIT,		"Couldn't wait for quorum %d: %s") \
>>   	/*223 */_(ER_INTERFERING_PROMOTE,	"Instance with replica id %u was promoted first") \
>>   	/*224 */_(ER_RAFT_DISABLED,		"Elections were turned off while running box.ctl.promote()")\
>> +	/*225 */_(ER_LIMBO_UNCLAIMED,		"Synchronous transaction limbo doesn't belong to any instance")\
> 2. The same as above - lets not mention limbo in the public
> space. Might be ER_SYNC_UNCLAIMED, or ER_SYNC_QUEUE_UNCLAIMED,
> or a similar option. And in the error message:
>
> 	transaction limbo -> transaction queue

No problem.

>> diff --git a/test/replication/qsync_basic.test.lua b/test/replication/qsync_basic.test.lua
>> index 75c9b222b..6a49e2b01 100644
>> --- a/test/replication/qsync_basic.test.lua
>> +++ b/test/replication/qsync_basic.test.lua
>> @@ -248,29 +249,6 @@ for i = 1, 100 do box.space.sync:delete{i} end
>>   test_run:cmd('switch replica')
>>   box.space.sync:count()
>>   
>> ---
>> --- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
>> --- instances, but also works for local rows.
> 3. Please, try to keep this test somehow. You can try to perform
> these nops locally on the default node. Otherwise is_fully_nop in
> txn_journal_entry_new() remains untested.

Ok, I've reworked the test, so please find an incremental diff below.

> 4. When I tried to run the tests, I got a crash:

I believe you've run the tests on this exact commit?
If yes, then all the crashes are fixed in the following commits.
This is quite confusing, indeed, so let me reorder everything
so that fixes come earlier than the commit which has some issues to fix.

> [010] replication/gh-5195-qsync-replica-write.test.l> memtx
> [010]
> [010] [Instance "replica" killed by signal: 6 (SIGABRT)]
> [010]
> [010] Last 15 lines of Tarantool Log file [Instance "replica"][/Users/gerold/Work/Repositories/tarantool/test/var/010_replication/replica.log]:
> [010] 2021-06-15 22:13:04.821 [40345] main/112/applier/unix/:/Users/gerold/Work/Repositories/tarantool/test/var/010_replication/master.socket-iproto I> RAFT: message {term: 1, state: follower} from 1
> [010] 2021-06-15 22:13:04.824 [40345] main/113/applierw/unix/:/Users/gerold/Work/Repositories/tarantool/test/var/010_replication/master.socket-iproto C> leaving orphan mode
> [010] 2021-06-15 22:13:04.824 [40345] main/103/replica I> replica set sync complete
> [010] 2021-06-15 22:13:04.824 [40345] main/103/replica C> leaving orphan mode
> [010] 2021-06-15 22:13:04.825 [40345] main/103/replica I> set 'log_level' configuration option to 5
> [010] 2021-06-15 22:13:04.825 [40345] main/107/checkpoint_daemon I> scheduled next checkpoint for Tue Jun 15 23:20:29 2021
> [010] 2021-06-15 22:13:04.825 [40345] main/103/replica I> set 'replication_timeout' configuration option to 0.1
> [010] 2021-06-15 22:13:04.825 [40345] main/103/replica I> set 'memtx_memory' configuration option to 107374182
> [010] 2021-06-15 22:13:04.826 [40345] main/103/replica I> set 'replication_sync_timeout' configuration option to 100
> [010] 2021-06-15 22:13:04.826 [40345] main/103/replica I> set 'listen' configuration option to "\/Users\/gerold\/Work\/Repositories\/tarantool\/test\/var\/010_replication\/replica.socket-iproto"
> [010] 2021-06-15 22:13:04.826 [40345] main/103/replica I> set 'replication' configuration option to ["\/Users\/gerold\/Work\/Repositories\/tarantool\/test\/var\/010_replication\/master.socket-iproto"]
> [010] 2021-06-15 22:13:04.826 [40345] main/103/replica I> set 'log_format' configuration option to "plain"
> [010] 2021-06-15 22:13:04.827 [40345] main C> entering the event loop
> [010] 2021-06-15 22:13:04.903 [40345] main/128/console/unix/: I> set 'replication_synchro_timeout' configuration option to 0.001
> [010] Assertion failed: (txn_limbo_is_empty(&txn_limbo)), function box_promote, file /Users/gerold/Work/Repositories/tarantool/src/box/box.cc, line 1678.
>
> And a wrong result:
>
> [005] replication/gh-5298-qsync-recovery-snap.test.l> vinyl           [ fail ]
> [005]
> [005] Test failed! Result content mismatch:
> [005] --- replication/gh-5298-qsync-recovery-snap.result	Tue Jun 15 21:48:57 2021
> [005] +++ var/rejects/replication/gh-5298-qsync-recovery-snap.reject	Tue Jun 15 22:13:44 2021
> [005] @@ -46,7 +46,7 @@
> [005]  -- Could hang if the limbo would incorrectly handle the snapshot end.
> [005]  box.space.sync:replace{11}
> [005]   | ---
> [005] - | - [11]
> [005] + | - error: Synchronous transaction limbo doesn't belong to any instance
> [005]   | ...
> [005]
> [005]  old_synchro_quorum = box.cfg.replication_synchro_quorum
> [005] @@ -64,7 +64,7 @@
> [005]   | ...
> [005]  box.space.sync:replace{12}
> [005]   | ---
> [005] - | - error: Quorum collection for a synchronous transaction is timed out
> [005] + | - error: Synchronous transaction limbo doesn't belong to any instance
> [005]   | ...
> [005]
> [005]  box.cfg{                                                                        \
> [005] @@ -75,18 +75,16 @@
> [005]   | ...
> [005]  box.space.sync:replace{13}
> [005]   | ---
> [005] - | - [13]
> [005] + | - error: Synchronous transaction limbo doesn't belong to any instance
> [005]   | ...
> [005]  box.space.sync:get({11})
> [005]   | ---
> [005] - | - [11]
> [005]   | ...
> [005]  box.space.sync:get({12})
> [005]   | ---
> [005]   | ...
> [005]  box.space.sync:get({13})
> [005]   | ---
> [005] - | - [13]
> [005]   | ...
> [005]
> [005]  box.cfg{                                                                        \
> [005]
>
> Then I stopped.


=========================================================

diff --git a/src/box/errcode.h b/src/box/errcode.h
index 3c42ae502..e3943c01d 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -278,7 +278,7 @@ struct errcode_record {
      /*223 */_(ER_INTERFERING_PROMOTE,    "Instance with replica id %u 
was promoted first") \
      /*224 */_(ER_RAFT_DISABLED,        "Elections were turned off 
while running box.ctl.promote()")\
      /*225 */_(ER_TXN_ROLLBACK,        "Transaction was rolled back") \
-    /*226 */_(ER_LIMBO_UNCLAIMED,        "Synchronous transaction limbo 
doesn't belong to any instance")\
+    /*226 */_(ER_SYNCHRO_QUEUE_UNCLAIMED,    "The synchronous 
transaction queue doesn't belong to any instance")\

  /*
   * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 7f1c5c017..ce701ead4 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -97,7 +97,7 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, 
struct txn *txn)
      if (id == 0)
          id = instance_id;
      if  (limbo->owner_id == REPLICA_ID_NIL) {
-        diag_set(ClientError, ER_LIMBO_UNCLAIMED);
+        diag_set(ClientError, ER_SYNCHRO_QUEUE_UNCLAIMED);
          return NULL;
      } else if (limbo->owner_id != id) {
          diag_set(ClientError, ER_UNCOMMITTED_FOREIGN_SYNC_TXNS,
diff --git a/test/replication/qsync_basic.result 
b/test/replication/qsync_basic.result
index 08474269e..f1689f875 100644
--- a/test/replication/qsync_basic.result
+++ b/test/replication/qsync_basic.result
@@ -640,6 +640,56 @@ box.space.sync:count()
   | - 0
   | ...

+--
+-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
+-- instances, but also works for local rows.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.ctl.demote()
+ | ---
+ | ...
+box.space.sync:replace{1}
+ | ---
+ | - error: Synchronous transaction limbo doesn't belong to any instance
+ | ...
+function skip_row() return nil end
+ | ---
+ | ...
+old_lsn = box.info.lsn
+ | ---
+ | ...
+_ = box.space.sync:before_replace(skip_row)
+ | ---
+ | ...
+box.space.sync:replace{2}
+ | ---
+ | ...
+box.space.sync:before_replace(nil, skip_row)
+ | ---
+ | ...
+assert(box.space.sync:get{1} == nil)
+ | ---
+ | - true
+ | ...
+assert(box.space.sync:get{2} == nil)
+ | ---
+ | - true
+ | ...
+assert(box.info.lsn == old_lsn + 1)
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+
+box.space.sync:truncate()
+ | ---
+ | ...
+
  --
  -- gh-5191: test box.info.synchro interface. For
  -- this sake we stop the replica and initiate data
diff --git a/test/replication/qsync_basic.test.lua 
b/test/replication/qsync_basic.test.lua
index 6a49e2b01..6f0810afc 100644
--- a/test/replication/qsync_basic.test.lua
+++ b/test/replication/qsync_basic.test.lua
@@ -249,6 +249,25 @@ for i = 1, 100 do box.space.sync:delete{i} end
  test_run:cmd('switch replica')
  box.space.sync:count()

+--
+-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
+-- instances, but also works for local rows.
+--
+test_run:switch('default')
+box.ctl.demote()
+box.space.sync:replace{1}
+function skip_row() return nil end
+old_lsn = box.info.lsn
+_ = box.space.sync:before_replace(skip_row)
+box.space.sync:replace{2}
+box.space.sync:before_replace(nil, skip_row)
+assert(box.space.sync:get{1} == nil)
+assert(box.space.sync:get{2} == nil)
+assert(box.info.lsn == old_lsn + 1)
+box.ctl.promote()
+
+box.space.sync:truncate()
+
  --
  -- gh-5191: test box.info.synchro interface. For
  -- this sake we stop the replica and initiate data

=========================================================

-- 
Serge Petrenko


  reply	other threads:[~2021-06-17 21:00 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-10 13:32 [Tarantool-patches] [PATCH 0/7] forbid implicit limbo ownership transition Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 1/7] replication: always send raft state to subscribers Serge Petrenko via Tarantool-patches
2021-06-10 16:47   ` Cyrill Gorcunov via Tarantool-patches
2021-06-11  8:43     ` Serge Petrenko via Tarantool-patches
2021-06-11  8:44       ` Cyrill Gorcunov via Tarantool-patches
2021-06-15 20:53   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 2/7] replication: forbid implicit limbo owner transition Serge Petrenko via Tarantool-patches
2021-06-15 20:55   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches [this message]
2021-06-18 22:49   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-21 10:13     ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 3/7] txn_limbo: fix promote term filtering Serge Petrenko via Tarantool-patches
2021-06-15 20:57   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches
2021-06-18 22:49       ` Vladislav Shpilevoy via Tarantool-patches
2021-06-21  8:55         ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 4/7] txn_limbo: persist the latest effective promote in snapshot Serge Petrenko via Tarantool-patches
2021-06-15 20:59   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 5/7] replication: send latest effective promote in initial join Serge Petrenko via Tarantool-patches
2021-06-15 21:00   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches
2021-06-18 22:52       ` Vladislav Shpilevoy via Tarantool-patches
2021-06-21 10:12         ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 6/7] box: introduce `box.ctl.demote` Serge Petrenko via Tarantool-patches
2021-06-18 22:52   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-21 14:56     ` Serge Petrenko via Tarantool-patches
2021-06-10 13:32 ` [Tarantool-patches] [PATCH 7/7] box: make promote/demote always bump the term Serge Petrenko via Tarantool-patches
2021-06-15 21:00   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-17 21:00     ` Serge Petrenko via Tarantool-patches
2021-06-18 22:53   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-21 15:02     ` Serge Petrenko via Tarantool-patches
2021-06-15 20:53 ` [Tarantool-patches] [PATCH 0/7] forbid implicit limbo ownership transition Vladislav Shpilevoy via Tarantool-patches

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=67248926-d264-9d01-a24e-d0fa108708e9@tarantool.org \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=sergepetrenko@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 2/7] replication: forbid implicit limbo owner transition' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox