[Tarantool-patches] [PATCH luajit] ARM64: Avoid side-effects of constant rematerialization.

sergos sergos at tarantool.org
Tue Sep 6 11:43:56 MSK 2022


Hi!

Thanks for the patch!
As I can’t say much about the patch from Mike, LGTM.
Just some nits in the comment.

Sergos


> On 31 Aug 2022, at 12:52, Sergey Kaplun <skaplun at tarantool.org> wrote:
> 
> From: Mike Pall <mike>
> 
> Thanks to Patrick Galizia.
> 
> (cherry picked from commit b33e3f2d441590f4de0d189bd9a65661824a48f6)
> 
> Constant rematerialization must not use other registers that contain
> constants, if the register is in-flight. When we have the high
                                ^^^^^^
                                in use?                          
> regitster pressure we can face the following issue:
> 
> The assembly of an IR instruction allocates a constant into a free
> register. Then it spills another register (due to high register
> pressure), which is rematerialized using the same constant (which it
> assumes is now in the allocated register). In case when the first
> register also happens to be the destination register, the constant value
> is modified before the rematerialization.
> 
> For the code in the test for this commit we get the following register
> allocation order (read from top to bottom (DBG RA reversed)):
> | current IR | operation | IR ref | register
> |  0048         alloc       0038     x0
> |  0048         remat       K038     x0
> |  0048         alloc       K023     x4
> 
> Which leads to the following asembly:
> | ...
> | add   x4, x4, x0    # x4 modified before x0 rematerialization
> | ldrb  w4, [x4, #24]
> | add   x0, x4, #24   # constant x0 rematerialization
> | ...
> As a result, the value register x0 holding is incorrect.
> 
> This patch moves allocation of constants for earlier to be sure that the
                                           ^^^ remove it

> rematerialization can not make use of the same constant as one of the
> sources of the IR instruction.
> 
> After the patch register allocation order is the following:
> | current IR | operation | IR ref | register
> |  0048         alloc       K023     x4
> |  0048         alloc       0038     x0
> |  0048         remat       K038     x0
> 
> Also, this patch fixes the `asm_fusexref()` logic for the `IR_STRREF` in
> case, when both operands don't fit in 32-bit constants (`asm_isk32()`
> fails). We want to use the IR operand holds the referenced value in
                                       holding

> `ra_alloc1()` as one having the hint set (`ra_hashint()` check passes).
> It is set for the operand with a non constant value (`irref_isk()`
> fails). The code assumes that this is always the `ir->op1` operand, so
                                 it

> for cases when this value holds `ir->op2` operand register allocator
    the case                                      the

> misses the aforementioned hint in `ir->op2`. As the result the wrong
> register is selected. This patch adds the corresponding `irref_isk()`
> check for the `ir->op1` to detect which operand contains the value with
> the hint.
> 
> After the patch the resulting assembly is the following:
> | ...
> | add   x4, x0, x4
> | ldrb  w4, [x4, #24]
> | add   x0, x1, #112
> | ...
> 
> As we can see the constant is rematerialized from another, non-modified
> register.
> 
> Sergey Kaplun:
> * added the description and the test for the problem
> 
> Part of tarantool/tarantool#7230
> ---
> 
> The test case leads to the coredump when compile with
> -DCMAKE_BUILD_TYPE=[Release, RelWithDebInfo].
> 
> Issue: https://github.com/tarantool/tarantool/issues/7230
> PRs:
> * https://github.com/LuaJIT/LuaJIT/pull/438
> * https://github.com/LuaJIT/LuaJIT/pull/479
> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-438-arm64-constant-rematerialization-full-ci
> Tarantool PR: https://github.com/tarantool/tarantool/pull/7628
> 
> src/lj_asm_arm64.h                            |  46 +++++---
> ...-arm64-constant-rematerialization.test.lua | 102 ++++++++++++++++++
> 2 files changed, 131 insertions(+), 17 deletions(-)
> create mode 100644 test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
> 
> diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
> index da0ee4bb..a4de187f 100644
> --- a/src/lj_asm_arm64.h
> +++ b/src/lj_asm_arm64.h
> @@ -295,8 +295,10 @@ static void asm_fusexref(ASMState *as, A64Ins ai, Reg rd, IRRef ref,
>       } else if (asm_isk32(as, ir->op1, &ofs)) {
> 	ref = ir->op2;
>       } else {
> -	Reg rn = ra_alloc1(as, ir->op1, allow);
> -	IRIns *irr = IR(ir->op2);
> +	Reg refk = irref_isk(ir->op1) ? ir->op1 : ir->op2;
> +	Reg refv = irref_isk(ir->op1) ? ir->op2 : ir->op1;
> +	Reg rn = ra_alloc1(as, refv, allow);
> +	IRIns *irr = IR(refk);
> 	uint32_t m;
> 	if (irr+1 == ir && !ra_used(irr) &&
> 	    irr->o == IR_ADD && irref_isk(irr->op2)) {
> @@ -307,7 +309,7 @@ static void asm_fusexref(ASMState *as, A64Ins ai, Reg rd, IRRef ref,
> 	    goto skipopm;
> 	  }
> 	}
> -	m = asm_fuseopm(as, 0, ir->op2, rset_exclude(allow, rn));
> +	m = asm_fuseopm(as, 0, refk, rset_exclude(allow, rn));
> 	ofs = sizeof(GCstr);
>       skipopm:
> 	emit_lso(as, ai, rd, rd, ofs);
> @@ -722,6 +724,7 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
>   Reg dest = ra_dest(as, ir, allow);
>   Reg tab = ra_alloc1(as, ir->op1, rset_clear(allow, dest));
>   Reg key = 0, tmp = RID_TMP;
> +  Reg ftmp = RID_NONE, type = RID_NONE, scr = RID_NONE, tisnum = RID_NONE;
>   IRRef refkey = ir->op2;
>   IRIns *irkey = IR(refkey);
>   int isk = irref_isk(ir->op2);
> @@ -751,6 +754,28 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
>     }
>   }
> 
> +  /* Allocate constants early. */
> +  if (irt_isnum(kt)) {
> +    if (!isk) {
> +      tisnum = ra_allock(as, LJ_TISNUM << 15, allow);
> +      ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
> +      rset_clear(allow, tisnum);
> +    }
> +  } else if (irt_isaddr(kt)) {
> +    if (isk) {
> +      int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
> +      scr = ra_allock(as, kk, allow);
> +    } else {
> +      scr = ra_scratch(as, allow);
> +    }
> +    rset_clear(allow, scr);
> +  } else {
> +    lua_assert(irt_ispri(kt) && !irt_isnil(kt));
> +    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
> +    scr = ra_scratch(as, rset_clear(allow, type));
> +    rset_clear(allow, scr);
> +  }
> +
>   /* Key not found in chain: jump to exit (if merged) or load niltv. */
>   l_end = emit_label(as);
>   as->invmcp = NULL;
> @@ -780,9 +805,6 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
> 	emit_nm(as, A64I_CMPx, key, tmp);
>       emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
>     } else {
> -      Reg tisnum = ra_allock(as, LJ_TISNUM << 15, allow);
> -      Reg ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
> -      rset_clear(allow, tisnum);
>       emit_nm(as, A64I_FCMPd, key, ftmp);
>       emit_dn(as, A64I_FMOV_D_R, (ftmp & 31), (tmp & 31));
>       emit_cond_branch(as, CC_LO, l_next);
> @@ -790,31 +812,21 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
>       emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.n));
>     }
>   } else if (irt_isaddr(kt)) {
> -    Reg scr;
>     if (isk) {
> -      int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
> -      scr = ra_allock(as, kk, allow);
>       emit_nm(as, A64I_CMPx, scr, tmp);
>       emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
>     } else {
> -      scr = ra_scratch(as, allow);
>       emit_nm(as, A64I_CMPx, tmp, scr);
>       emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key.u64));
>     }
> -    rset_clear(allow, scr);
>   } else {
> -    Reg type, scr;
> -    lua_assert(irt_ispri(kt) && !irt_isnil(kt));
> -    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
> -    scr = ra_scratch(as, rset_clear(allow, type));
> -    rset_clear(allow, scr);
>     emit_nm(as, A64I_CMPw, scr, type);
>     emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));
>   }
> 
>   *l_loop = A64I_BCC | A64F_S19(as->mcp - l_loop) | CC_NE;
>   if (!isk && irt_isaddr(kt)) {
> -    Reg type = ra_allock(as, (int32_t)irt_toitype(kt), allow);
> +    type = ra_allock(as, (int32_t)irt_toitype(kt), allow);
>     emit_dnm(as, A64I_ADDx | A64F_SH(A64SH_LSL, 47), tmp, key, type);
>     rset_clear(allow, type);
>   }
> diff --git a/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua b/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
> new file mode 100644
> index 00000000..ffc449bc
> --- /dev/null
> +++ b/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
> @@ -0,0 +1,102 @@
> +local tap = require('tap')
> +
> +-- Test file to demonstrate LuaJIT bug with constant
> +-- rematerialization on arm64.
> +-- See also https://github.com/LuaJIT/LuaJIT/pull/438.
> +local test = tap.test('lj-438-arm64-constant-rematerialization')
> +test:plan(1)
> +
> +-- This test file demonstrates the following problem:
> +-- The assembly of an IR instruction allocates a constant into a
> +-- free register. Then it spills another register (due to high
> +-- register pressure), which is rematerialized using the same
> +-- constant (which it assumes is now in the allocated register).
> +-- In case when the first register also happens to be the
> +-- destination register, the constant value is modified before the
> +-- rematerialization.
> +--
> +-- For the code below we get the following register allocation
> +-- order (read from top to bottom (DBG RA reversed)):
> +-- | current IR | operation | IR ref | register
> +-- |  0048         alloc       0038     x0
> +-- |  0048         remat       K038     x0
> +-- |  0048         alloc       K023     x4
> +--
> +-- Which leads to the following asembly:
> +-- | ...
> +-- | add   x4, x4, x0    # x4 modified before x0 rematerialization
> +-- | ldrb  w4, [x4, #24]
> +-- | add   x0, x4, #24   # constant x0 rematerialization
> +-- | ...
> +-- As a result, the value register x0 holding is incorrect.
> +
> +local empty = {}
> +
> +jit.off()
> +jit.flush()
> +
> +-- XXX: The example below is very fragile. Even the names of
> +-- the variables matter.
> +local function scan(vs)
> +  -- The code below is needed to generate high register pressure
> +  -- and specific register allocations.
> +  for _, v in ipairs(vs) do
> +    -- XXX: Just more usage of registers. Nothing significant.
> +    local sep = v:find('@')
> +    -- Recording of yielding `string.byte()` result encodes XLOAD
> +    -- IR. Its assembly modifies x4 register, that is chosen as
> +    -- a destination register.
> +    -- IR_NE, that using `asm_href()` uses the modified x4
> +    -- register as a source for constant x0 rematerialization.
> +    -- As far as it is modified before, the result value is
> +    -- incorrect.
> +    -- luacheck: ignore
> +    if v:sub(sep + 2, -2):byte() == 0x3f then -- 0x3f == '?'
> +    end
> +
> +    -- XXX: Just more usage of registers. Nothing significant.
> +    local _ = empty[v]
> +
> +    -- Here the `str` strdata value (rematerialized x0 register)
> +    -- given to the `lj_str_find()` is invalid on the trace,
> +    -- that as a result leading to the core dump.
> +    v:find(':')
> +  end
> +end
> +
> +jit.on()
> +jit.opt.start('hotloop=1', 'loopunroll=1')
> +
> +-- This wrapper function is needed to avoid excess errors 'leaving
> +-- loop in the root trace'.
> +local function wrap()
> +  -- XXX: There are four failing attemts to compile trace for this
> +  -- code:
> +  -- * The first trace trying to record starts with the ITERL BC
> +  --   in `scan()` function. The compilation failed, because
> +  --   recording starts at the second iteration, when the loop is
> +  --   left.
> +  -- * The second trace starts with UGET (scan) in the cycle
> +  --   below. Entering calling the `scan` function compilation
> +  --   failed, when sees the inner ITERL loop.
> +  -- * The third trace starts with GGET (ipairs) in the `scan()`
> +  --   function trying to record the hot function. The compilation
> +  --   is failed due to facing the inner ITERL loop.
> +  -- * At 19th iteration the ITERL trying to be recorded again
> +  --   after this instruction become hot again.
> +  --
> +  -- And, finally, at 39th iteration the `for` loop below is
> +  -- recorded after becoming hot again. Now the compiler inlining
> +  -- the inner loop and recording doesn't fail.
> +  -- The 40th iteration is needed to be sure the compiled mcode is
> +  -- correct.
> +  for _ = 1, 40 do
> +    scan({'ab at xyz'})
> +  end
> +end
> +
> +wrap()
> +
> +test:ok(true, 'the resulting trace is correct')
> +
> +os.exit(test:check() and 0 or 1)
> -- 
> 2.34.1
> 



More information about the Tarantool-patches mailing list