From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id BD5CA6EC58; Sun, 20 Jun 2021 17:37:25 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org BD5CA6EC58 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1624199845; bh=m5i2zMiTzvlO9SuXXqqfRzeosiekH7ZzD3++GL/pZAo=; h=To:References:Date:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=Kq35WgRT5zXq6162Z0eAT+RE3Et2lOeWJSBygD5yEREP59zHQAVINaHsnIRrgoPlk SEiucl+NLCoYh1Jw/mLLrSV/a5xWtB7/Bh0pIhuFd9ppNBJN6sXAP7/BIfdVnQ4MCp iFSczlQB2JN9FdPdBlZ63Oc0YOkygP2LxSTcvLVU= Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 17BB36EC58 for ; Sun, 20 Jun 2021 17:37:23 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 17BB36EC58 Received: by smtpng1.m.smailru.net with esmtpa (envelope-from ) id 1luyZe-00078w-7e; Sun, 20 Jun 2021 17:37:22 +0300 To: Cyrill Gorcunov , tml References: <20210617154835.315576-1-gorcunov@gmail.com> <20210617154835.315576-3-gorcunov@gmail.com> Message-ID: <909ea97e-9b56-d327-860b-65aba685fce3@tarantool.org> Date: Sun, 20 Jun 2021 16:37:21 +0200 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 MIME-Version: 1.0 In-Reply-To: <20210617154835.315576-3-gorcunov@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD91C2C07775F13263AD4A2A3BFBC817E640615893838B9A94F00894C459B0CD1B9161C2707CC1B932B61FCEA329578A4A109CCECD298C6A646F8A48EF99929F69E X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7B114C2C2C20B7E62EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637367CCE42412B8BE38638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D881DE5B16056BCBAE09DC203C9CFCFBB0117882F4460429724CE54428C33FAD305F5C1EE8F4F765FCF1175FABE1C0F9B6A471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352033AC447995A7AD182CC0D3CB04F14752D2E47CDBA5A96583BA9C0B312567BB231DD303D21008E29813377AFFFEAFD269A417C69337E82CC2E827F84554CEF50127C277FBC8AE2E8BA83251EDC214901ED5E8D9A59859A8B65D56369A3576CBA5089D37D7C0E48F6C5571747095F342E88FB05168BE4CE3AF X-C1DE0DAB: 0D63561A33F958A5C0E3EE8D470C1F79038C4596EE9AB8C17E44862DB2CA9F6DD59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA75448CF9D3A7B2C848410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34087368DEEB028F177999714690D4A4AD95DF1859623FDB7B7ED4FB08ED973F7AEBF4A48DAD062D9E1D7E09C32AA3244C6F6DC887B5D17E3F82C12203F0074C36A995755A1445935E729B2BEF169E0186 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojRDhoFVeXyJMKZmvoCR8jSQ== X-Mailru-Sender: 689FA8AB762F73936BC43F508A06382242770E30765DD65EE3FD6156A01740043841015FED1DE5223CC9A89AB576DD93FB559BB5D741EB963CF37A108A312F5C27E8A8C3839CE0E267EA787935ED9F1B X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH v9 2/2] relay: provide information about downstream lag X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Vladislav Shpilevoy via Tarantool-patches Reply-To: Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" Hi! Thanks for the patch! The test fails when I run it multiple times: [014] Test failed! Result content mismatch: [014] --- replication/gh-5447-downstream-lag.result Sun Jun 20 16:10:26 2021 [014] +++ var/rejects/replication/gh-5447-downstream-lag.reject Sun Jun 20 16:33:01 2021 [014] @@ -37,7 +37,7 @@ [014] -- Upon replica startup there is no ACKs to process. [014] assert(box.info.replication[replica_id].downstream.lag == 0) [014] | --- [014] - | - true [014] + | - error: assertion failed! See 4 comments below. > @@ -629,6 +673,19 @@ relay_reader_f(va_list ap) > /* vclock is followed while decoding, zeroing it. */ > vclock_create(&relay->recv_vclock); > xrow_decode_vclock_xc(&xrow, &relay->recv_vclock); > + /* > + * Replica send us last replicated transaction > + * timestamp which is needed for relay lag > + * monitoring. Note that this transaction has > + * been written to WAL with our current realtime > + * clock value, thus when it get reported back we > + * can compute time spent regardless of the clock > + * value on remote replica. > + */ > + if (relay->txn_acked_tm < xrow.tm) { 1. Why do you need this `if`? Why xrow.tm != 0 is not enough? (It is 0 when replicate to old versions.) ACKs are sent in the same order as the rows, so there can't be any reordering. If it is intended to be used against time changes, this check won't work it seems. If time is moved far into the future, the check passes and the lag will be huge for some time. No protection. And there can't be. If the time is moved far into the past, the check will freeze for the time shift size. Even when all the old transactions are acked and new ones are coming. Because you cached txn_acked_tm in the old time system. No protection either. Looks even like a bug, because the lag freezes regardless of whether there are new transactions ACKed with the new time system or not. It will wait for the new time system to catch up with the old txn_acked_tm. If the timestamp is not needed, you can drop txn_acked_tm member from struct relay. > diff --git a/test/replication/gh-5447-downstream-lag.result b/test/replication/gh-5447-downstream-lag.result > new file mode 100644 > index 000000000..2cc020451 > --- /dev/null > +++ b/test/replication/gh-5447-downstream-lag.result > @@ -0,0 +1,128 @@ <...> > +-- Insert a record and wakeup replica's WAL to process data. > +test_run:switch('default') > + | --- > + | - true > + | ... > +lsn = box.info.lsn > + | --- > + | ... > +box.space.test:insert({1}) > + | --- > + | - [1] > + | ... > +test_run:wait_cond(function() return box.info.lsn > lsn end) 2. You don't need it. You did blocking insert(), which returns only after the WAL write is done. > + | --- > + | - true > + | ... > +-- The record is written on the master node. > +test_run:switch('replica') > + | --- > + | - true > + | ... > +box.error.injection.set("ERRINJ_WAL_DELAY", false) > + | --- > + | - ok > + | ... > + > +-- > +-- Wait the record to be ACKed, the lag value should be nonzero. > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:wait_lsn('replica', 'default') > + | --- > + | ... > +assert(box.info.replication[replica_id].downstream.lag > 0) > + | --- > + | - true > + | ... > + > +-- > +-- Cleanup everything. > +test_run:switch('default') > + | --- > + | - true > + | ... > +box.schema.user.revoke('guest', 'replication') 3. You didn't drop the space. > + | --- > + | ... > +test_run:cmd('stop server replica') > + | --- > + | - true > + | ... > +test_run:cmd('cleanup server replica') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica') > + | --- > + | - true > + | ... 4. Your test uses error injections. It means it must be configured not to run in Release build. See replication/suite.ini.