From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id C12DA7F626; Fri, 6 Aug 2021 22:01:29 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org C12DA7F626 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1628276489; bh=orzzDr1SesnsHif6sxk+FMQ3i9aBBvNRHUDzCS3Ygj4=; h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=OHNOiK9K6OPCDl8EffVKwZL5Kv/2FTJgRv7xz11w1VRwTO5eqPFltBx13iv2Ixaji 0lHmfMUXU06b/FWxeBL8KtZ0KzpecOtHQSQt1c6xFHZMzRqnL53kOVxsj5Ac8U+TgI 3x/J1lEbB/CIgF412Dn61htXgrENpbnYBKJb/7sM= Received: from mail-lj1-f180.google.com (mail-lj1-f180.google.com [209.85.208.180]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 855F17F626 for ; Fri, 6 Aug 2021 22:01:28 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 855F17F626 Received: by mail-lj1-f180.google.com with SMTP id u13so13224609lje.5 for ; Fri, 06 Aug 2021 12:01:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=0a7w6vYb5FExkAA9YVoLpVMG7on9vTRfWxKIikHVwBA=; b=QBxWq6BhuOhJVFjD3ZvH1cibeu7ZeF8TRQMmkuWxDOBNoFljQSuLVRfyV3FfcAbtQ1 9H7DP3sm7KKNOtH+N8dokXFvFCDMvVlgnhYzOA/L1K27vmIdipVSzEMgifknjzux7Cai QK6TW22ycgwWtJ+gAlRCb2IiXU+BVwGfgfzHjyYmTRU2jZlPwGgo1WeLg5ybv15RXUI5 8LlrJszuaJme2cNddWl0cgO/fiBGPbJ1ud1bO/3El7Re8UE3AiHt+NIBylU296BlbSAj Fu/Tt871a8tCBO+W/E5lLZbMKuKuJfbS0LYLPd2xaszWABIGdKzeW7LKbj5//5NUZO/W VuHw== X-Gm-Message-State: AOAM530nbLLQdVtSaS5Q19ai7I8vS09orimpcxecWSrmBWObhKrIqe+D ZPaqx96D1RJZtupCAMBb/W/UIZkH6o8OOA== X-Google-Smtp-Source: ABdhPJyg+6nNwXKkhCCzGdSy9jLmRmy59kJt1xijz7gwGD0REVP1U59b7FhOgHk88kA5+Y1KCJXbyQ== X-Received: by 2002:a05:651c:12c8:: with SMTP id 8mr7539612lje.151.1628276485890; Fri, 06 Aug 2021 12:01:25 -0700 (PDT) Received: from grain.localdomain ([5.18.255.97]) by smtp.gmail.com with ESMTPSA id t30sm908639lfg.289.2021.08.06.12.01.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 06 Aug 2021 12:01:24 -0700 (PDT) Received: by grain.localdomain (Postfix, from userid 1000) id 154725A001E; Fri, 6 Aug 2021 22:01:24 +0300 (MSK) Date: Fri, 6 Aug 2021 22:01:23 +0300 To: Vladislav Shpilevoy Message-ID: References: <20210804190752.488147-1-gorcunov@gmail.com> <20210804190752.488147-4-gorcunov@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/2.0.7 (2021-05-04) Subject: Re: [Tarantool-patches] [PATCH v10 3/4] limbo: filter incoming synchro requests X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Cyrill Gorcunov via Tarantool-patches Reply-To: Cyrill Gorcunov Cc: tml Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" On Fri, Aug 06, 2021 at 01:33:39AM +0200, Vladislav Shpilevoy wrote: > > FILTER_PROMOTE > > FILTER_DEMOTE > > 1) The requests should come in with nonzero term, otherwise > > the packet is corrupted. > > 2) The request's term should not be less than maximal known > > one, iow it should not come in from nodes which didn't notice > > raft epoch changes and living in the past. > > 3) If LSN of the request matches current confirmed LSN the packet > > is obviously correct to process. > > 4) If LSN is less than confirmed LSN then the request is wrong, > > we have processed the requested LSN already. > > 5) If LSN is less than confirmed LSN then > > 1. You already said "If LSN is less than confirmed LSN then" in (4). > In (4) it must be 'greater'. Yup, typo, thanks! > > a) If limbo is empty we can't do anything, since data is already > > processed and should issue an error; > > b) If there is some data in the limbo then requested LSN should > > be in range of limbo's [first; last] LSNs, thus the request > > will be able to commit and rollback limbo queue. > > > > Closes #6036 > > 2. You need to add a changelog file. And change error code as well, as Serge suggested. I'll update it once we settle down on overall code structure. > > diff --git a/src/box/applier.cc b/src/box/applier.cc > > index 9db286ae2..f64b6fa35 100644 > > --- a/src/box/applier.cc > > +++ b/src/box/applier.cc > > @@ -514,6 +515,11 @@ applier_fetch_snapshot(struct applier *applier) > > struct ev_io *coio = &applier->io; > > struct xrow_header row; > > > > + txn_limbo_filter_disable(&txn_limbo); > > + auto filter_guard = make_scoped_guard([&]{ > > + txn_limbo_filter_enable(&txn_limbo); > > + }); > > 3. Why do you need to enable/disabled the filter here? Shouldn't snapshot > contain only valid data? Moreover, AFAIU it can't contain any limbo > rows at all. The limbo snapshot is sent separately, but the data flow > does not have anything except pure data. The same for the > join. The idea is that snapshot/recovery has valid data which forms the initial limbo state versus which we will be apply filtering. > > And how is it related to applier_register below? It does not download > any data at all, does it? After register stage is complete we catch up with lates not yet downloaded data (final join stage) where we still assume that the data received is valid and do not verify it. Actually this is a good question. I've to recheck this moment because in previous series when I ran join/recovery with filtering enabled sometime I've an issues where filter didnt pass. Gimme some time, maybe we will all this and manage to keep filtering all the time. > > diff --git a/src/box/box.cc b/src/box/box.cc > > index 8dc3b130b..c3516b7a4 100644 > > --- a/src/box/box.cc > > +++ b/src/box/box.cc > > @@ -1675,7 +1675,8 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn) > > .lsn = promote_lsn, > > .term = raft->term, > > }; > > - txn_limbo_process(&txn_limbo, &req); > > + if (txn_limbo_process(&txn_limbo, &req) != 0) > > + diag_raise(); > > 4. box_issue_promote() is used without try-catches anywhere. Please, > don't use exceptions in the new code. The same for demote. OK > > > assert(txn_limbo_is_empty(&txn_limbo)); > > } > > @@ -3284,6 +3286,11 @@ local_recovery(const struct tt_uuid *instance_uuid, > > > > say_info("instance uuid %s", tt_uuid_str(&INSTANCE_UUID)); > > > > + txn_limbo_filter_disable(&txn_limbo); > > 5. Why do you need it turned off for recovery? If the data was > able to be applied in the first place, why can't it be replayed > in the same way during recovery? I'll retest this moment, thanks! > > > + > > +/** > > + * Common chain for any incoming packet. > > + */ > > +static int > > +filter_in(struct txn_limbo *limbo, const struct synchro_request *req) > > +{ > > + (void)limbo; > > 6. So you have the filtering enabled dynamically in the limbo, but > you do not use the limbo here? Why? Maybe at least add an assertion > that the filter is enabled? All chains are having same interface it is just happen that for common filter I don't need to use limbo. I could add some operations here but not sure if it worth it. As far as I see leave unused args is pretty fine in our code base. > > + > > + /* > > + * Zero LSN are allowed for PROMOTE > > + * and DEMOTE requests only. > > + */ > > + if (req->lsn == 0) { > > + if (!iproto_type_is_promote_request(req->type)) { > > + say_info("%s. Zero lsn detected", > > + reject_str(req)); > > 7. It should be say_error(). The same in the other places. OK, will switch to error mode. > > > + > > + diag_set(ClientError, ER_UNSUPPORTED, > > + "Replication", > > + "zero LSN on promote/demote"); > > + return -1; > > 8. Please, try to be more compact. Even with the current indetation > level you don't need to wrap the lines so early. But the indentation > can be reduced even further easily: ... > > > The same in 2 places below. More compact. ok > > > +/** > > + * Filter CONFIRM and ROLLBACK packets. > > + */ > > +static int > > +filter_confirm_rollback(struct txn_limbo *limbo, > > + const struct synchro_request *req) > > +{ > > + /* > > + * When limbo is empty we have nothing to > > + * confirm/commit and if this request comes > > + * in it means the split brain has happened. > > + */ > > + if (!txn_limbo_is_empty(limbo)) > > + return 0; > > 9. What if rollback is for LSN > limbo's last LSN? It > also means nothing to do. The same for confirm LSN < limbo's > first LSN. static void txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn) { --> assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo)); txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) { --> assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo)); Currently we're allowed to process empty limbo if only owner is not nil, I think I should add this case here. ... > > + > > + if (txn_limbo_is_empty(limbo)) { > > + /* > > + * Transactions are rolled back already, > > + * since the limbo is empty. > > + */ > > + say_info("%s. confirmed_lsn %lld < promote_lsn %lld " > > + "and empty limbo", reject_str(req), > > + (long long)limbo->confirmed_lsn, > > + (long long)promote_lsn); > > + > > + diag_set(ClientError, ER_UNSUPPORTED, > > + "Replication", > > + "forward promote LSN " > > + "(empty limbo, split brain)"); > > + return -1; > > + } else { > > 10. You don't need 'else' - the main branch already made 'return'. > This should help to reduce the indentation below. OK, will do, though explicit if\else here helps to gather code context, > > + /* > > + * Some entries are present in the limbo, > > + * we need to make sure the @a promote_lsn > > + * lays inside limbo [first; last] range. > > + * So that the promote request has some > > + * queued data to process, otherwise it > > + * means the request comes from split > > + * brained node. > > + */ > > + struct txn_limbo_entry *first, *last; > > + > > + first = txn_limbo_first_entry(limbo); > > + last = txn_limbo_last_entry(limbo); > > + > > + if (first->lsn < promote_lsn || > > + last->lsn > promote_lsn) { > > 11. This seems to be broken. In the comment you said the > error is when > > promote < first or promote > last > > And here in the condition you return an error when > > promote > first or promote < last > > Why? Good catch, typo. Actually I've updated this hunk locally but didn't pushed out. We need "first <= promote <= last" > > + > > +static int (*filter_req[FILTER_MAX]) > > +(struct txn_limbo *limbo, const struct synchro_request *req) = { > > + [FILTER_IN] = filter_in, > > + [FILTER_CONFIRM] = filter_confirm_rollback, > > + [FILTER_ROLLBACK] = filter_confirm_rollback, > > + [FILTER_PROMOTE] = filter_promote_demote, > > + [FILTER_DEMOTE] = filter_promote_demote, > > 12. What is this? Wouldn't it be much much much simpler to just > call filter_in() always + make a switch case for the request type + > call the needed functions? > > What is worse, you already have the switch-case anyway, but you > also added some strange loop, masks, and virtual functions ... . > I don't think I could make it more complex even if I wanted to, > sorry. Why so complicated? It might be look easier but won't allow to extend filtering in future without rewritting too much. I'm pretty sure this number of packet types is not finished and we will have more. Using bitmap routing you can easily hook in any call sequence you need while using explicit if\elses or direct calls via case-by-request-type won't allow to make it so. So no, this is not complicated at all but rather close to real packet filtering code. Anyway, which form would you prefer? txn_limbo_filter_locked() { int rc = filter_in(); if (rc != 0) return -1; swicth (req->type) { case IPROTO_CONFIRM: case IPROTO_ROLLBACK: rc = filter_confirm_rollback(); break; ... } return rc; } Kind of this? > > +int > > txn_limbo_process(struct txn_limbo *limbo, > > const struct synchro_request *req) > > { > > + int rc; > > + > > txn_limbo_term_lock(limbo); > > - txn_limbo_process_locked(limbo, req); > > + rc = txn_limbo_filter_locked(limbo, req); > > 13. We use relatively modern C. You can make int rc = ...; ok > 14. Process can fail now, but still its result is ignored in > wal_stream_apply_synchro_row(). +1, thanks!