From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id ED4706EC58; Mon, 21 Jun 2021 11:44:35 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org ED4706EC58 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1624265076; bh=NkLFbjZreIG2DBDTfu6qjQ+tpGpXxYlnDImR+z4FFYI=; h=Date:To:Cc:References:In-Reply-To:Subject:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe: From:Reply-To:From; b=EeYfM8gh5dY6a9SDuFfZ8BIyt4IdghtBqVwBeOBlU7mniGVOC0eE8i3vau0cuXOxc 7o666LiP9yWiqPxZG3KdAWPQyZ5pGTfpFBhjYxU/0XEsxuaHDbVN+4D+L/yIN3K1co USDBgk4NpeNoNkb0M3OAWFd/NuktjPOdErir8hnM= Received: from mail-lj1-f177.google.com (mail-lj1-f177.google.com [209.85.208.177]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 1F44C6EC58 for ; Mon, 21 Jun 2021 11:44:34 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 1F44C6EC58 Received: by mail-lj1-f177.google.com with SMTP id k8so970152ljk.7 for ; Mon, 21 Jun 2021 01:44:34 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=yRe/YlWmSMaPT0kVZZuqcIqrR8yIPdmlMZqqSGR/40w=; b=fCJORr+fEDJ27LSGN6WSjPOshcTULF9f4dLhmDlijTxcCEXeUNwN5kRYI4ZHIVjmKM OoFf66WH6+aXa3uLVGJ3qbUqUYpZmv0/PuWK+fzPGwdv4v03gLz3UXTkXwhgW9e7wgxu kNfCqZ9phM0IOYlfz+zPpmKFvHqcWBvrFbuZytNHwpAlA7wiWP+ri1NP7YG675qD/25D TEpDzX3SWjaWSvzCuZy4WI8XHBs5EHHbWYwcCzh1wZ66dIx05elm3MzGUln8E4uki4Ur RC8dYpBZaTwYdOTS9F2AmaFOLxEeKmPMHSgtNnWZCtBOHx+2GZ+jj1pkphXOpVnGhZfF /Q7w== X-Gm-Message-State: AOAM530NgcLwXZp/Fn44Z2eqCW9VD2so4muOUYRCC0bLu+ZVy3HuJP6y bEPHolgJT21wGQUd62PXfllOvQaEgHQ= X-Google-Smtp-Source: ABdhPJw9ZH6/6KwU8Ryj1+IiejSxCZe8JmWN1vI9Jxdcvhmp8/EgO159HklJevvfQyf6DVGySl4wYA== X-Received: by 2002:a05:651c:1181:: with SMTP id w1mr20894147ljo.116.1624265072966; Mon, 21 Jun 2021 01:44:32 -0700 (PDT) Received: from grain.localdomain ([5.18.199.94]) by smtp.gmail.com with ESMTPSA id t184sm1801424lff.2.2021.06.21.01.44.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 21 Jun 2021 01:44:31 -0700 (PDT) Received: by grain.localdomain (Postfix, from userid 1000) id E9CF05A0020; Mon, 21 Jun 2021 11:44:30 +0300 (MSK) Date: Mon, 21 Jun 2021 11:44:30 +0300 To: Vladislav Shpilevoy Cc: tml Message-ID: References: <20210617154835.315576-1-gorcunov@gmail.com> <20210617154835.315576-3-gorcunov@gmail.com> <909ea97e-9b56-d327-860b-65aba685fce3@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <909ea97e-9b56-d327-860b-65aba685fce3@tarantool.org> User-Agent: Mutt/2.0.7 (2021-05-04) Subject: Re: [Tarantool-patches] [PATCH v9 2/2] relay: provide information about downstream lag X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Cyrill Gorcunov via Tarantool-patches Reply-To: Cyrill Gorcunov Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" On Sun, Jun 20, 2021 at 04:37:21PM +0200, Vladislav Shpilevoy wrote: > Hi! Thanks for the patch! > > The test fails when I run it multiple times: > > [014] Test failed! Result content mismatch: > [014] --- replication/gh-5447-downstream-lag.result Sun Jun 20 16:10:26 2021 > [014] +++ var/rejects/replication/gh-5447-downstream-lag.reject Sun Jun 20 16:33:01 2021 > [014] @@ -37,7 +37,7 @@ > [014] -- Upon replica startup there is no ACKs to process. > [014] assert(box.info.replication[replica_id].downstream.lag == 0) > [014] | --- > [014] - | - true > [014] + | - error: assertion failed! > > See 4 comments below. Hmm, didn't trigger on my machine. Gimme some time I'll try to hit this problem. > > > @@ -629,6 +673,19 @@ relay_reader_f(va_list ap) > > /* vclock is followed while decoding, zeroing it. */ > > vclock_create(&relay->recv_vclock); > > xrow_decode_vclock_xc(&xrow, &relay->recv_vclock); > > + /* > > + * Replica send us last replicated transaction > > + * timestamp which is needed for relay lag > > + * monitoring. Note that this transaction has > > + * been written to WAL with our current realtime > > + * clock value, thus when it get reported back we > > + * can compute time spent regardless of the clock > > + * value on remote replica. > > + */ > > + if (relay->txn_acked_tm < xrow.tm) { > > 1. Why do you need this `if`? Why xrow.tm != 0 is not enough? (It is 0 > when replicate to old versions.) ACKs are sent in the same order as the > rows, so there can't be any reordering. The main purpose is to prevent the case where a peer sends us negative value, for some reason. Nit sure though maybe we should not hide such case but rather point out that there some weird node spamming us. I tend to agree that comparision with zero might be more straightforward, will do. > If it is intended to be used against time changes, this check won't work > it seems. If time is moved far into the future, the check passes and the > lag will be huge for some time. No protection. And there can't be. > > If the time is moved far into the past, the check will freeze for the > time shift size. Even when all the old transactions are acked and new > ones are coming. Because you cached txn_acked_tm in the old time system. > No protection either. Looks even like a bug, because the lag freezes > regardless of whether there are new transactions ACKed with the new time > system or not. It will wait for the new time system to catch up with the > old txn_acked_tm. > > If the timestamp is not needed, you can drop txn_acked_tm member from > struct relay. I'll drop this variable, and yes, there is no protection from clocks adjustments, and it can't be fixed (for current code base). > > +-- Insert a record and wakeup replica's WAL to process data. > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +lsn = box.info.lsn > > + | --- > > + | ... > > +box.space.test:insert({1}) > > + | --- > > + | - [1] > > + | ... > > +test_run:wait_cond(function() return box.info.lsn > lsn end) > > 2. You don't need it. You did blocking insert(), which returns only > after the WAL write is done. Yeah, it's redundant but harmless check, thanks. > > +-- Cleanup everything. > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +box.schema.user.revoke('guest', 'replication') > > 3. You didn't drop the space. Good catch, thanks!