To: tml
Date: Fri, 8 Oct 2021 20:58:09 +0300
Message-Id: <20211008175809.349501-4-gorcunov@gmail.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20211008175809.349501-1-gorcunov@gmail.com>
References: <20211008175809.349501-1-gorcunov@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH v21 3/3] test: add gh-6036-qsync-order test
List-Id: Tarantool development patches
From: Cyrill Gorcunov via Tarantool-patches
Reply-To: Cyrill Gorcunov
Cc: Vladislav Shpilevoy

Test that promotion requests are handled only after the corresponding
write to WAL completes, because we update in-memory data before the
write finishes.

Note that without the patch this test fires an assertion:

> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.

Part-of #6036

Signed-off-by: Cyrill Gorcunov
---
 test/replication/gh-6036-qsync-order.result   | 204 ++++++++++++++++++
 test/replication/gh-6036-qsync-order.test.lua |  97 +++++++++
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 4 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-6036-qsync-order.result
 create mode 100644 test/replication/gh-6036-qsync-order.test.lua

diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
new file mode 100644
index 000000000..eb3e808cb
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.result
@@ -0,0 +1,204 @@
+-- test-run result file version 2
+--
+-- gh-6036: verify that terms are locked when we're inside the journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written, leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+SERVERS={"election_replica1", "election_replica2", "election_replica3"}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication", {args='1 nil manual 1'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write is processed just fine.
+test_run:switch("election_replica1")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+s = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+test_run:switch("election_replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('election_replica2', 'election_replica1')
+ | ---
+ | ...
+
+test_run:switch("election_replica3")
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('election_replica3', 'election_replica1')
+ | ---
+ | ...
+
+--
+-- Drop connection between election_replica1 and election_replica2.
+test_run:switch("election_replica1")
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication = {                             \
+        "unix/:./election_replica1.sock",       \
+        "unix/:./election_replica3.sock",       \
+    },                                          \
+})
+ | ---
+ | ...
+--
+-- Drop connection between election_replica2 and election_replica1.
+test_run:switch("election_replica2")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication = {                             \
+        "unix/:./election_replica2.sock",       \
+        "unix/:./election_replica3.sock",       \
+    },                                          \
+})
+ | ---
+ | ...
+
+--
+-- Here we have the following scheme
+--
+--              election_replica3 (will be delayed)
+--              /                \
+--    election_replica1    election_replica2
+
+--
+-- Initiate disk delay in a slightly tricky way: the next write will
+-- fall into an endless sleep.
+test_run:switch("election_replica3")
+ | ---
+ | - true
+ | ...
+write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
+ | ---
+ | ...
+--
+-- Make election_replica2 a leader and start writing data,
+-- the PROMOTE request gets queued on election_replica3 and is not
+-- yet processed, at the same time INSERT won't complete either,
+-- waiting for PROMOTE completion first. Note that we
+-- enter election_replica3 as well just to be sure the PROMOTE
+-- has reached it.
+test_run:switch("election_replica2")
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+test_run:switch("election_replica3")
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") > write_cnt end)
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+ | ---
+ | - ok
+ | ...
+test_run:switch("election_replica2")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+ | ---
+ | ...
+
+--
+-- The election_replica1 node has no clue that there is a new leader
+-- and continues writing data with an obsolete term. Since election_replica3
+-- is delayed now, the INSERT won't proceed yet but gets queued.
+test_run:switch("election_replica1")
+ | ---
+ | - true
+ | ...
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+ | ---
+ | ...
+
+--
+-- Finally enable election_replica3 back. Make sure the data from the new election_replica2
+-- leader gets written while the old leader's data is ignored.
+test_run:switch("election_replica3")
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.space.test:select{}
+ | ---
+ | - - [1]
+ |   - [2]
+ | ...
+
+test_run:switch("default")
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server election_replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server election_replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server election_replica3')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('delete server election_replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server election_replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server election_replica3')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
new file mode 100644
index 000000000..b8df170b8
--- /dev/null
+++ b/test/replication/gh-6036-qsync-order.test.lua
@@ -0,0 +1,97 @@
+--
+-- gh-6036: verify that terms are locked when we're inside the journal
+-- write routine, because parallel appliers may ignore the fact that
+-- the term is updated already but not yet written, leading to data
+-- inconsistency.
+--
+test_run = require('test_run').new()
+
+SERVERS={"election_replica1", "election_replica2", "election_replica3"}
+test_run:create_cluster(SERVERS, "replication", {args='1 nil manual 1'})
+test_run:wait_fullmesh(SERVERS)
+
+--
+-- Create a synchro space on the master node and make
+-- sure the write is processed just fine.
+test_run:switch("election_replica1")
+box.ctl.promote()
+s = box.schema.create_space('test', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+
+test_run:switch("election_replica2")
+test_run:wait_lsn('election_replica2', 'election_replica1')
+
+test_run:switch("election_replica3")
+test_run:wait_lsn('election_replica3', 'election_replica1')
+
+--
+-- Drop connection between election_replica1 and election_replica2.
+test_run:switch("election_replica1")
+box.cfg({                                       \
+    replication = {                             \
+        "unix/:./election_replica1.sock",       \
+        "unix/:./election_replica3.sock",       \
+    },                                          \
+})
+--
+-- Drop connection between election_replica2 and election_replica1.
+test_run:switch("election_replica2")
+test_run:wait_cond(function() return box.space.test:get{1} ~= nil end)
+box.cfg({                                       \
+    replication = {                             \
+        "unix/:./election_replica2.sock",       \
+        "unix/:./election_replica3.sock",       \
+    },                                          \
+})
+
+--
+-- Here we have the following scheme
+--
+--              election_replica3 (will be delayed)
+--              /                \
+--    election_replica1    election_replica2
+
+--
+-- Initiate disk delay in a slightly tricky way: the next write will
+-- fall into an endless sleep.
+test_run:switch("election_replica3")
+write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
+--
+-- Make election_replica2 a leader and start writing data,
+-- the PROMOTE request gets queued on election_replica3 and is not
+-- yet processed, at the same time INSERT won't complete either,
+-- waiting for PROMOTE completion first. Note that we
+-- enter election_replica3 as well just to be sure the PROMOTE
+-- has reached it.
+test_run:switch("election_replica2")
+box.ctl.promote()
+test_run:switch("election_replica3")
+test_run:wait_cond(function() return box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") > write_cnt end)
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+test_run:switch("election_replica2")
+_ = require('fiber').create(function() box.space.test:insert{2} end)
+
+--
+-- The election_replica1 node has no clue that there is a new leader
+-- and continues writing data with an obsolete term. Since election_replica3
+-- is delayed now, the INSERT won't proceed yet but gets queued.
+test_run:switch("election_replica1")
+_ = require('fiber').create(function() box.space.test:insert{3} end)
+
+--
+-- Finally enable election_replica3 back. Make sure the data from the new election_replica2
+-- leader gets written while the old leader's data is ignored.
+test_run:switch("election_replica3")
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.space.test:select{}
+
+test_run:switch("default")
+test_run:cmd('stop server election_replica1')
+test_run:cmd('stop server election_replica2')
+test_run:cmd('stop server election_replica3')
+
+test_run:cmd('delete server election_replica1')
+test_run:cmd('delete server election_replica2')
+test_run:cmd('delete server election_replica3')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 3eee0803c..ed09b2087 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -59,6 +59,7 @@
     "gh-6094-rs-uuid-mismatch.test.lua": {},
     "gh-6127-election-join-new.test.lua": {},
     "gh-6035-applier-filter.test.lua": {},
+    "gh-6036-qsync-order.test.lua": {},
     "election-candidate-promote.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 77eb95f49..080e4fbf4 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script = master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.31.1