[Tarantool-patches] [PATCH v2 2/2] fiber: exit with panic if we unable to revert guard page

Konstantin Osipov kostja.osipov at gmail.com
Tue Feb 4 18:47:42 MSK 2020


* Cyrill Gorcunov <gorcunov at gmail.com> [20/01/15 20:07]:
> At the moment we setup fiber's stack with a guard page
> which is used to detect stack overrun. This page is just
> a regular page taken from a slab with PROT_NONE attribute.
> 
> Once fiber is no longer needed we try to revert this
> attribute back to PROT_READ | PROT_WRITE. Still there
> is a pretty small chance that this attempt get failed.
> 
> Thus in such case we should not allow to proceed but rather
> lets panic, otherwise the slab won't longer be solid r/w memory
> area and attempt to write into this page will cause
> an unpredictable exception.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov at gmail.com>
> ---
>  src/lib/core/fiber.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
> index b51f46f2f..fdad7607c 100644
> --- a/src/lib/core/fiber.c
> +++ b/src/lib/core/fiber.c
> @@ -1041,13 +1041,17 @@ fiber_stack_destroy(struct fiber *fiber, struct slab_cache *slabc)
>  			 * to setup the original protection back in
>  			 * background.
>  			 *
> +			 * For now lets exit with panic: if mprotect
> +			 * failed we must not allow to reuse such slab
> +			 * with PROT_NONE'ed page somewhere inside.
> +			 *

somewhere inside its stack area would be more clear.

>  			 * Note that in case if we're called from
>  			 * fiber_stack_create() the @mprotect_flags is
>  			 * the same as the slab been created with, so
>  			 * calling mprotect for VMA with same flags
>  			 * won't fail.
>  			 */
> -			diag_log();
> +			panic_syserror("fiber: Can't put guard page to slab");

While the patch itself is LGTM, we need to nail down the cause of
the failure even at the cost of crash, I suspect what we're getting
here is ENOMEM from the kernel. I suspect we have too many
mprotect regions and the kernel runs out of some internal
resources for them.

I think adding better diagnostics could help us identify the
issue: the failed address, its slab, the total number of fibers (and by
induction mprotected pages). 

I have also discussed the issue with @xemul, and he suggests that 
wrong slab alignment could be causing this.

Finally, we could try to clear mprotect() first, and if it fails,
avoid destroying the fiber and keep it cached for a while more.
We could retry destroying it when kernel has more memory.

These are of course ideas for follow ups.

-- 
Konstantin Osipov, Moscow, Russia


More information about the Tarantool-patches mailing list