From mboxrd@z Thu Jan 1 00:00:00 1970
To: Cyrill Gorcunov, tml
References: <20210607155519.109626-1-gorcunov@gmail.com> <20210607155519.109626-3-gorcunov@gmail.com>
Date: Mon, 7 Jun 2021 21:21:09 +0200
In-Reply-To: <20210607155519.109626-3-gorcunov@gmail.com>
Subject: Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
List-Id: Tarantool development patches
From: Vladislav Shpilevoy via Tarantool-patches
Reply-To: Vladislav Shpilevoy

Thanks for the patch! See 6 comments below.

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index b1571b361..cdd1383e8 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -158,6 +158,18 @@ struct relay {
>  	struct stailq pending_gc;
>  	/** Time when last row was sent to peer. */
>  	double last_row_time;
> +	/**
> +	 * A time difference between the moment when we
> +	 * wrote a transaction to the local WAL and when
> +	 * this transaction has been replicated to remote
> +	 * node (ie written to node's WAL).
> +	 */
> +	double txn_lag;
> +	/**
> +	 * Last timestamp observed from remote node to
> +	 * persist @a txn_lag value.
> +	 */
> +	double txn_acked_tm;
>  	/** Relay sync state. */
>  	enum relay_state state;
>
> @@ -217,6 +229,12 @@ relay_last_row_time(const struct relay *relay)
>  	return relay->last_row_time;
>  }
>
> +double
> +relay_txn_lag(const struct relay *relay)
> +{
> +	return relay->txn_lag;

1. As I said in the previous review, you can't read a variable from
another thread without any protection. Please use the approach I
proposed last time. The relay has a 'tx' struct inside, which is
updated on each received ACK. You need to deliver the lag value to
the TX thread the same way the acked vclock is delivered, preferably
in the same message.

> @@ -629,6 +659,26 @@ relay_reader_f(va_list ap)
>  			/* vclock is followed while decoding, zeroing it. */
>  			vclock_create(&relay->recv_vclock);
>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> +			/*
> +			 * Replica send us last replicated transaction
> +			 * timestamp which is needed for relay lag
> +			 * monitoring. Note that this transaction has
> +			 * been written to WAL with our current realtime
> +			 * clock value, thus when it get reported back we
> +			 * can compute time spent regardless of the clock
> +			 * value on remote replica.
> +			 *
> +			 * An interesting moment is replica restart - it will
> +			 * send us value 0 after that but we can preserve
> +			 * old reported value here since we *assume* that
> +			 * timestamp is not going backwards on properly
> +			 * set up nodes, otherwise the lag get raised.
> +			 * After all this is a not tamper-proof value.

2. I don't understand. Why does it send value 0? And if it does, why
can't you ignore only the zeros? The non-zero values must be valid
anyway.

> +			 */
> +			if (relay->txn_acked_tm < xrow.tm) {
> +				relay->txn_acked_tm = xrow.tm;
> +				relay->txn_lag = ev_now(loop()) - xrow.tm;
> +			}
> diff --git a/test/replication/gh-5447-downstream-lag.result b/test/replication/gh-5447-downstream-lag.result
> new file mode 100644
> index 000000000..8586d0ed3
> --- /dev/null
> +++ b/test/replication/gh-5447-downstream-lag.result
> @@ -0,0 +1,93 @@
> +-- test-run result file version 2
> +--
> +-- gh-5447: Test for box.info.replication[n].downstream.lag.
> +-- We need to be sure that if replica start been back of
> +-- master node reports own lagging and cluster admin would
> +-- be able to detect such situation.

3. I couldn't parse the last sentence. Could you add some
punctuation? It might help.

> +--
> +
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica with rpl_master=default, \
> +              script="replication/replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +
> +s = box.schema.space.create('test', {engine = engine})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +--
> +-- The replica should wait some time (wal delay is 1 second
> +-- by default) so we would be able to detect the lag, since
> +-- on local instances the lag is minimal and usually transactions
> +-- are handled instantly.

4. But it is not 1 second. usleep(1000) means 1 millisecond, and it
happens in a loop, so the exact value does not matter much. The delay
lasts until you set the injection back to false, which keeps the WAL
thread blocked until you release it. It is not a fixed delay.

> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set("ERRINJ_WAL_DELAY", true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.space.test:insert({1})
> + | ---
> + | - [1]
> + | ...
> +test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)

5. This condition is true even before you do the insert. And it can't
change during the insert, because there are no ACKs: the replica
can't write to WAL because of the delay, it is blocked in a busy
loop.

> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set("ERRINJ_WAL_DELAY", false)
> + | ---
> + | - ok
> + | ...
> +--
> +-- Cleanup everything.

6. You need to revoke the granted rights and drop the space.

> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('cleanup server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
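
P.S. To make comments 1 and 2 more concrete, here is a minimal
standalone sketch. It is not the actual relay code. All names in it
(relay_lag_state, relay_status_msg, relay_tx_state, relay_lag_on_ack,
relay_status_fill, tx_status_apply, tx_txn_lag) are hypothetical
stand-ins; a real patch would presumably reuse the existing 'tx'
struct and the status message that already carries the acked vclock.
The sketch assumes that a zero timestamp means "nothing to report
yet" and skips only those.

```c
#include <stdbool.h>

/* Hypothetical: lag state owned and modified only by the relay thread. */
struct relay_lag_state {
	double txn_acked_tm;
	double txn_lag;
};

/* Hypothetical: status message sent to TX together with the acked vclock. */
struct relay_status_msg {
	double txn_lag;
};

/* Hypothetical: TX-side copy, read and written only by the TX thread. */
struct relay_tx_state {
	double txn_lag;
};

/*
 * Relay thread: handle the timestamp received in an ACK.
 * Only zero timestamps are ignored (assumed to mean "no data yet");
 * every non-zero value updates the lag. Returns true if it changed.
 */
static bool
relay_lag_on_ack(struct relay_lag_state *st, double ack_tm, double now)
{
	if (ack_tm == 0)
		return false;
	st->txn_acked_tm = ack_tm;
	st->txn_lag = now - ack_tm;
	return true;
}

/* Relay thread: copy the lag into the outgoing status message. */
static void
relay_status_fill(const struct relay_lag_state *st,
		  struct relay_status_msg *msg)
{
	msg->txn_lag = st->txn_lag;
}

/* TX thread: apply a delivered status message to the TX-side copy. */
static void
tx_status_apply(struct relay_tx_state *tx, const struct relay_status_msg *msg)
{
	tx->txn_lag = msg->txn_lag;
}

/* TX thread: the getter reads only the TX-side copy, never relay state. */
static double
tx_txn_lag(const struct relay_tx_state *tx)
{
	return tx->txn_lag;
}
```

The point of the split is that each struct is touched by exactly one
thread, and the value crosses the thread boundary only inside the
message, so the getter needs no locking. How the message is actually
delivered between threads is out of scope of this sketch.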