From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id D79BD6EC5B; Wed, 12 May 2021 14:29:42 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org D79BD6EC5B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1620818983; bh=PZgSgryzxe2zGWWysNvMV+c8z508f8zUPB2ANm4TGi4=; h=To:Date:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:Cc:From; b=Z2hmaVCGcq2r9RiXYee1PSVMn4f5vz4xNGHqub9LnE7Sha93GHdgzIxTb5vVFn7y1 FPSJPnDfGXIkVTO1PlLQ8npoKBnZ631BU27xGvc5yU9e1DYYnuIvIfx+02NVrWl5NO uEGu8MwfWQNeLezqwKcU3dsN3u6GD6WXZjJSn04c= Received: from smtp40.i.mail.ru (smtp40.i.mail.ru [94.100.177.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id E6B626EC5B for ; Wed, 12 May 2021 14:29:40 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org E6B626EC5B Received: by smtp40.i.mail.ru with esmtpa (envelope-from ) id 1lgn3b-0004UM-Ro; Wed, 12 May 2021 14:29:40 +0300 To: v.shpilevoy@tarantool.org, gorcunov@gmail.com Date: Wed, 12 May 2021 14:29:28 +0300 Message-Id: <20210512112928.12509-1-sergepetrenko@tarantool.org> X-Mailer: git-send-email 2.30.1 (Apple Git-130) MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD95978C26455E69BE0B2C7F7C3B0039F1303CE517DE434612D182A05F538085040DCFCA43DDEE9FA3155674F841CB14E29878612B2B2ED6F270B8D6FD088B483D4 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7933EE24EF7732AE4EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F790063781E06E36B00F36668638F802B75D45FF914D58D5BE9E6BC1A93B80C6DEB9DEE97C6FB206A91F05B2DDD30FCE81127B99FB0BDEC3811113731DC1F2ED7A42611AD2E47CDBA5A96583C09775C1D3CA48CF543F342AA183667F117882F4460429724CE54428C33FAD30A8DF7F3B2552694AC26CFBAC0749D213D2E47CDBA5A9658378DA827A17800CE7437303CEBCE4AF319FA2833FD35BB23DF004C906525384302BEBFE083D3B9BA71A620F70A64A45A98AA50765F79006372E808ACE2090B5E1725E5C173C3A84C3C5EA940A35A165FF2DBA43225CD8A89FB26E97DCB74E6252A91E23F1B6B78B78B5C8C57E37DE458BEDA766A37F9254B7 X-C1DE0DAB: C20DE7B7AB408E4181F030C43753B8183A4AFAF3EA6BDC44E1F4276B8099419608CF033E463A5088745F2896E254FAA9C951260181C3C9509C2B6934AE262D3EE7EAB7254005DCED864F6051C41E3F5092E1F3B950BCE216D137639923C070ABC67D8CAA8ED015DFBDC6A1CF3F042BAD6DF99611D93F60EFA70DDFFB3186CBC5699F904B3F4130E343918A1A30D5E7FCCB5012B2E24CD356 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34A503FBFE8BE8FC494E447947A6CCD69ED90A3324C1717CADC9B3E4B40EC5F613CF64B8434D52CA4D1D7E09C32AA3244C4976069045B556D31F51E90E67FF1749C3B3ADDA61883BB5927AC6DF5659F194 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojbL9S8ysBdXgbunPnEXPVa2Zf5pXIeUv4 X-Mailru-Sender: 3B9A0136629DC9125D61937A2360A446CD81505BC639401F0C744E41442A32A963361D86640981DF424AE0EB1F3D1D21E2978F233C3FAE6EE63DB1732555E4A8EE80603BA4A5B0BC112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: [Tarantool-patches] [PATCH v2] recovery: make it yield when positioning in a WAL X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Serge Petrenko via Tarantool-patches Reply-To: Serge Petrenko Cc: tarantool-patches@dev.tarantool.org Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" We had various places in box.cc and relay.cc which counted processed rows and yielded every now and then. These yields didn't cover cases, when recovery has to position inside a long WAL file: For example, when tarantool exits without leaving an empty WAL file which'll be used to recover instance vclock on restart. In this case the instance freezes while processing the last availabe WAL in order to recover the vclock. Another issue is with replication. If a replica connects and needs data from the end of a really long WAL, recovery will read up to the needed position without yields, making relay disconnect by timeout. In order to fix the issue, make recovery decide when a yield should happen. Once recovery decides so, it calls a xstream callback, schedule_yield. Currently schedule_yield is fired once recovery processes (either skips or writes) WAL_ROWS_PER_YIELD rows. schedule_yield either yields right away, in case of relay, or saves the yield for later, in case of local recovery, because it might be in the middle of a transaction. Closes #5979 --- https://github.com/tarantool/tarantool/tree/sp/gh-5979-recovery-yield https://github.com/tarantool/tarantool/issues/5979 Changes in v2: - move schedule_yield to xstream .../unreleased/gh-5979-recovery-ligs.md | 11 ++++++ src/box/box.cc | 37 +++++++++---------- src/box/memtx_engine.c | 1 + src/box/recovery.cc | 23 +++++++----- src/box/recovery.h | 2 +- src/box/relay.cc | 16 ++++---- src/box/vinyl.c | 1 + src/box/xstream.h | 26 ++++++++++++- 8 files changed, 79 insertions(+), 38 deletions(-) create mode 100644 changelogs/unreleased/gh-5979-recovery-ligs.md diff --git a/changelogs/unreleased/gh-5979-recovery-ligs.md b/changelogs/unreleased/gh-5979-recovery-ligs.md new file mode 100644 index 000000000..86abfd66a --- /dev/null +++ b/changelogs/unreleased/gh-5979-recovery-ligs.md @@ -0,0 +1,11 @@ +# bugfix/core + +* Now tarantool yields when scanning `.xlog` files for the latest applied vclock + and when finding the right place in `.xlog`s to start recovering. This means + that the instance is responsive right after `box.cfg` call even when an empty + `.xlog` was not created on previous exit. + Also this prevents relay from timing out when a freshly subscribed replica + needs rows from the end of a relatively long (hundreds of MBs) `.xlog` + (gh-5979). +* The counter in `x.yM rows processed` log messages does not reset on each new + recovered `xlog` anymore. diff --git a/src/box/box.cc b/src/box/box.cc index 59925962d..8a7b8593d 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -335,8 +335,6 @@ struct wal_stream { * matches LSN of a first global row. */ bool has_global_row; - /** How many rows have been recovered so far. */ - size_t rows; }; /** @@ -570,12 +568,6 @@ end_diag_request: static void wal_stream_try_yield(struct wal_stream *stream) { - /* - * Save the yield. Otherwise it would happen only on rows which - * are a multiple of WAL_ROWS_PER_YIELD and are last in their - * transaction, which is probably a very rare coincidence. - */ - stream->has_yield |= (stream->rows % WAL_ROWS_PER_YIELD == 0); if (wal_stream_has_tx(stream) || !stream->has_yield) return; stream->has_yield = false; @@ -587,11 +579,6 @@ wal_stream_apply_row(struct xstream *base, struct xrow_header *row) { struct wal_stream *stream = container_of(base, struct wal_stream, base); - /* - * Account all rows, even non-DML, and even leading to an error. Because - * still need to yield sometimes. - */ - ++stream->rows; if (iproto_type_is_synchro_request(row->type)) { if (wal_stream_apply_synchro_row(stream, row) != 0) goto end_error; @@ -610,15 +597,28 @@ end_error: diag_raise(); } +static struct wal_stream wal_stream; + +/** + * Plan a yield in recovery stream. Wal stream will execute it as soon as it's + * ready. + */ +static void +wal_stream_schedule_yield(void) +{ + wal_stream.has_yield = true; + wal_stream_try_yield(&wal_stream); +} + static void wal_stream_create(struct wal_stream *ctx) { - xstream_create(&ctx->base, wal_stream_apply_row); + xstream_create(&ctx->base, wal_stream_apply_row, + wal_stream_schedule_yield); ctx->tsn = 0; ctx->first_row_lsn = 0; ctx->has_yield = false; ctx->has_global_row = false; - ctx->rows = 0; } /* {{{ configuration bindings */ @@ -3124,7 +3124,6 @@ local_recovery(const struct tt_uuid *instance_uuid, say_info("instance uuid %s", tt_uuid_str(&INSTANCE_UUID)); - struct wal_stream wal_stream; wal_stream_create(&wal_stream); auto stream_guard = make_scoped_guard([&]{ wal_stream_abort(&wal_stream); @@ -3132,8 +3131,7 @@ local_recovery(const struct tt_uuid *instance_uuid, struct recovery *recovery; bool is_force_recovery = cfg_geti("force_recovery"); - recovery = recovery_new(wal_dir(), is_force_recovery, - checkpoint_vclock); + recovery = recovery_new(wal_dir(), is_force_recovery, checkpoint_vclock); /* * Make sure we report the actual recovery position @@ -3151,7 +3149,8 @@ local_recovery(const struct tt_uuid *instance_uuid, * so we must reflect this in replicaset vclock to * not attempt to apply these rows twice. */ - recovery_scan(recovery, &replicaset.vclock, &gc.vclock); + recovery_scan(recovery, &replicaset.vclock, &gc.vclock, + &wal_stream.base); say_info("instance vclock %s", vclock_to_string(&replicaset.vclock)); if (wal_dir_lock >= 0) { diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c index e076cd71d..6c4982b9f 100644 --- a/src/box/memtx_engine.c +++ b/src/box/memtx_engine.c @@ -956,6 +956,7 @@ memtx_engine_join(struct engine *engine, void *arg, struct xstream *stream) memtx->replica_join_cord = &cord; int res = cord_cojoin(&cord); memtx->replica_join_cord = NULL; + xstream_reset(stream); return res; } diff --git a/src/box/recovery.cc b/src/box/recovery.cc index cd33e7635..e5b88fcc2 100644 --- a/src/box/recovery.cc +++ b/src/box/recovery.cc @@ -119,7 +119,7 @@ recovery_new(const char *wal_dirname, bool force_recovery, void recovery_scan(struct recovery *r, struct vclock *end_vclock, - struct vclock *gc_vclock) + struct vclock *gc_vclock, struct xstream *stream) { xdir_scan_xc(&r->wal_dir, true); @@ -139,9 +139,14 @@ recovery_scan(struct recovery *r, struct vclock *end_vclock, if (xdir_open_cursor(&r->wal_dir, vclock_sum(end_vclock), &cursor) != 0) return; struct xrow_header row; - while (xlog_cursor_next(&cursor, &row, true) == 0) + while (xlog_cursor_next(&cursor, &row, true) == 0) { vclock_follow_xrow(end_vclock, &row); + if (++stream->row_count % WAL_ROWS_PER_YIELD == 0) + xstream_schedule_yield(stream); + } xlog_cursor_close(&cursor, false); + + xstream_reset(stream); } static inline void @@ -242,9 +247,14 @@ recover_xlog(struct recovery *r, struct xstream *stream, const struct vclock *stop_vclock) { struct xrow_header row; - uint64_t row_count = 0; while (xlog_cursor_next_xc(&r->cursor, &row, r->wal_dir.force_recovery) == 0) { + if (++stream->row_count % WAL_ROWS_PER_YIELD == 0) { + xstream_schedule_yield(stream); + } + if (stream->row_count % 100000 == 0) + say_info("%.1fM rows processed", + stream->row_count / 1000000.); /* * Read the next row from xlog file. * @@ -273,12 +283,7 @@ recover_xlog(struct recovery *r, struct xstream *stream, * failed row anyway. */ vclock_follow_xrow(&r->vclock, &row); - if (xstream_write(stream, &row) == 0) { - ++row_count; - if (row_count % 100000 == 0) - say_info("%.1fM rows processed", - row_count / 1000000.); - } else { + if (xstream_write(stream, &row) != 0) { if (!r->wal_dir.force_recovery) diag_raise(); diff --git a/src/box/recovery.h b/src/box/recovery.h index c8ccaa553..987932903 100644 --- a/src/box/recovery.h +++ b/src/box/recovery.h @@ -74,7 +74,7 @@ recovery_delete(struct recovery *r); */ void recovery_scan(struct recovery *r, struct vclock *end_vclock, - struct vclock *gc_vclock); + struct vclock *gc_vclock, struct xstream *stream); void recovery_follow_local(struct recovery *r, struct xstream *stream, diff --git a/src/box/relay.cc b/src/box/relay.cc index ff43c2fc7..81ac35bf2 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -260,11 +260,18 @@ relay_new(struct replica *replica) return relay; } +/** A callback recovery calls every now and then to unblock the event loop. */ +static void +relay_yield(void) +{ + fiber_sleep(0); +} + static void relay_start(struct relay *relay, int fd, uint64_t sync, void (*stream_write)(struct xstream *, struct xrow_header *)) { - xstream_create(&relay->stream, stream_write); + xstream_create(&relay->stream, stream_write, relay_yield); /* * Clear the diagnostics at start, in case it has the old * error message which we keep around to display in @@ -928,13 +935,6 @@ relay_send(struct relay *relay, struct xrow_header *packet) coio_write_xrow(&relay->io, packet); fiber_gc(); - /* - * It may happen that the socket is always ready for write, so yield - * explicitly every now and then to not block the event loop. - */ - if (++relay->row_count % WAL_ROWS_PER_YIELD == 0) - fiber_sleep(0); - struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE); if (inj != NULL && inj->dparam > 0) fiber_sleep(inj->dparam); diff --git a/src/box/vinyl.c b/src/box/vinyl.c index b53e97593..647566623 100644 --- a/src/box/vinyl.c +++ b/src/box/vinyl.c @@ -3033,6 +3033,7 @@ vinyl_engine_join(struct engine *engine, void *arg, struct xstream *stream) if (++loops % VY_YIELD_LOOPS == 0) fiber_sleep(0); } + xstream_reset(stream); return 0; } diff --git a/src/box/xstream.h b/src/box/xstream.h index d29ff4213..d27de09a3 100644 --- a/src/box/xstream.h +++ b/src/box/xstream.h @@ -41,16 +41,40 @@ extern "C" { struct xrow_header; struct xstream; +/** + * A type for a callback invoked by recovery after some batch of rows is + * processed. Is used mostly to unblock the event loop every now and then. + */ +typedef void (*schedule_yield_f)(void); + typedef void (*xstream_write_f)(struct xstream *, struct xrow_header *); struct xstream { xstream_write_f write; + schedule_yield_f schedule_yield; + uint64_t row_count; }; static inline void -xstream_create(struct xstream *xstream, xstream_write_f write) +xstream_create(struct xstream *xstream, xstream_write_f write, + schedule_yield_f schedule_yield) { xstream->write = write; + xstream->schedule_yield = schedule_yield; + xstream->row_count = 0; +} + +static inline void +xstream_schedule_yield(struct xstream *stream) +{ + stream->schedule_yield(); +} + +static inline void +xstream_reset(struct xstream *stream) +{ + stream->row_count = 0; + xstream_schedule_yield(stream); } int -- 2.30.1 (Apple Git-130)