From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-f67.google.com (mail-lf1-f67.google.com [209.85.167.67]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 4155B46970E for ; Tue, 4 Feb 2020 18:47:44 +0300 (MSK) Received: by mail-lf1-f67.google.com with SMTP id c23so12526802lfi.7 for ; Tue, 04 Feb 2020 07:47:44 -0800 (PST) Date: Tue, 4 Feb 2020 18:47:42 +0300 From: Konstantin Osipov Message-ID: <20200204154742.GA32754@atlas> References: <20200115170524.20471-1-gorcunov@gmail.com> <20200115170524.20471-3-gorcunov@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200115170524.20471-3-gorcunov@gmail.com> Subject: Re: [Tarantool-patches] [PATCH v2 2/2] fiber: exit with panic if we unable to revert guard page List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Cyrill Gorcunov Cc: tml * Cyrill Gorcunov [20/01/15 20:07]: > At the moment we setup fiber's stack with a guard page > which is used to detect stack overrun. This page is just > a regular page taken from a slab with PROT_NONE attribute. > > Once fiber is no longer needed we try to revert this > attribute back to PROT_READ | PROT_WRITE. Still there > is a pretty small chance that this attempt get failed. > > Thus in such case we should not allow to proceed but rather > lets panic, otherwise the slab won't longer be solid r/w memory > area and attempt to write into this page will cause > an unpredictable exception. > > Signed-off-by: Cyrill Gorcunov > --- > src/lib/core/fiber.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c > index b51f46f2f..fdad7607c 100644 > --- a/src/lib/core/fiber.c > +++ b/src/lib/core/fiber.c > @@ -1041,13 +1041,17 @@ fiber_stack_destroy(struct fiber *fiber, struct slab_cache *slabc) > * to setup the original protection back in > * background. > * > + * For now lets exit with panic: if mprotect > + * failed we must not allow to reuse such slab > + * with PROT_NONE'ed page somewhere inside. > + * somewhere inside its stack area would be more clear. > * Note that in case if we're called from > * fiber_stack_create() the @mprotect_flags is > * the same as the slab been created with, so > * calling mprotect for VMA with same flags > * won't fail. > */ > - diag_log(); > + panic_syserror("fiber: Can't put guard page to slab"); While the patch itself is LGTM, we need to nail down the cause of the failure even at the cost of crash, I suspect what we're getting here is ENOMEM from the kernel. I suspect we have too many mprotect regions and the kernel runs out of some internal resources for them. I think adding better diagnostics could help us identify the issue: the failed address, its slab, the total number of fibers (and by induction mprotected pages). I have also discussed the issue with @xemul, and he suggests that wrong slab alignment could be causing this. Finally, we could try to clear mprotect() first, and if it fails, avoid destroying the fiber and keep it cached for a while more. We could retry destroying it when kernel has more memory. These are of course ideas for follow ups. -- Konstantin Osipov, Moscow, Russia