[Tarantool-patches] [PATCH luajit] ARM64: Avoid side-effects of constant rematerialization.
sergos
sergos at tarantool.org
Tue Sep 6 11:43:56 MSK 2022
Hi!
Thanks for the patch!
As I can’t say much about the patch from Mike, LGTM.
Just some nits in the comment.
Sergos
> On 31 Aug 2022, at 12:52, Sergey Kaplun <skaplun at tarantool.org> wrote:
>
> From: Mike Pall <mike>
>
> Thanks to Patrick Galizia.
>
> (cherry picked from commit b33e3f2d441590f4de0d189bd9a65661824a48f6)
>
> Constant rematerialization must not use other registers that contain
> constants, if the register is in-flight. When we have the high
^^^^^^
in use?
> regitster pressure we can face the following issue:
>
> The assembly of an IR instruction allocates a constant into a free
> register. Then it spills another register (due to high register
> pressure), which is rematerialized using the same constant (which it
> assumes is now in the allocated register). In case when the first
> register also happens to be the destination register, the constant value
> is modified before the rematerialization.
>
> For the code in the test for this commit we get the following register
> allocation order (read from top to bottom (DBG RA reversed)):
> | current IR | operation | IR ref | register
> | 0048 alloc 0038 x0
> | 0048 remat K038 x0
> | 0048 alloc K023 x4
>
> Which leads to the following asembly:
> | ...
> | add x4, x4, x0 # x4 modified before x0 rematerialization
> | ldrb w4, [x4, #24]
> | add x0, x4, #24 # constant x0 rematerialization
> | ...
> As a result, the value register x0 holding is incorrect.
>
> This patch moves allocation of constants for earlier to be sure that the
^^^ remove it
> rematerialization can not make use of the same constant as one of the
> sources of the IR instruction.
>
> After the patch register allocation order is the following:
> | current IR | operation | IR ref | register
> | 0048 alloc K023 x4
> | 0048 alloc 0038 x0
> | 0048 remat K038 x0
>
> Also, this patch fixes the `asm_fusexref()` logic for the `IR_STRREF` in
> case, when both operands don't fit in 32-bit constants (`asm_isk32()`
> fails). We want to use the IR operand holds the referenced value in
holding
> `ra_alloc1()` as one having the hint set (`ra_hashint()` check passes).
> It is set for the operand with a non constant value (`irref_isk()`
> fails). The code assumes that this is always the `ir->op1` operand, so
it
> for cases when this value holds `ir->op2` operand register allocator
the case the
> misses the aforementioned hint in `ir->op2`. As the result the wrong
> register is selected. This patch adds the corresponding `irref_isk()`
> check for the `ir->op1` to detect which operand contains the value with
> the hint.
>
> After the patch the resulting assembly is the following:
> | ...
> | add x4, x0, x4
> | ldrb w4, [x4, #24]
> | add x0, x1, #112
> | ...
>
> As we can see the constant is rematerialized from another, non-modified
> register.
>
> Sergey Kaplun:
> * added the description and the test for the problem
>
> Part of tarantool/tarantool#7230
> ---
>
> The test case leads to the coredump when compile with
> -DCMAKE_BUILD_TYPE=[Release, RelWithDebInfo].
>
> Issue: https://github.com/tarantool/tarantool/issues/7230
> PRs:
> * https://github.com/LuaJIT/LuaJIT/pull/438
> * https://github.com/LuaJIT/LuaJIT/pull/479
> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-438-arm64-constant-rematerialization-full-ci
> Tarantool PR: https://github.com/tarantool/tarantool/pull/7628
>
> src/lj_asm_arm64.h | 46 +++++---
> ...-arm64-constant-rematerialization.test.lua | 102 ++++++++++++++++++
> 2 files changed, 131 insertions(+), 17 deletions(-)
> create mode 100644 test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
>
> diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
> index da0ee4bb..a4de187f 100644
> --- a/src/lj_asm_arm64.h
> +++ b/src/lj_asm_arm64.h
> @@ -295,8 +295,10 @@ static void asm_fusexref(ASMState *as, A64Ins ai, Reg rd, IRRef ref,
> } else if (asm_isk32(as, ir->op1, &ofs)) {
> ref = ir->op2;
> } else {
> - Reg rn = ra_alloc1(as, ir->op1, allow);
> - IRIns *irr = IR(ir->op2);
> + Reg refk = irref_isk(ir->op1) ? ir->op1 : ir->op2;
> + Reg refv = irref_isk(ir->op1) ? ir->op2 : ir->op1;
> + Reg rn = ra_alloc1(as, refv, allow);
> + IRIns *irr = IR(refk);
> uint32_t m;
> if (irr+1 == ir && !ra_used(irr) &&
> irr->o == IR_ADD && irref_isk(irr->op2)) {
> @@ -307,7 +309,7 @@ static void asm_fusexref(ASMState *as, A64Ins ai, Reg rd, IRRef ref,
> goto skipopm;
> }
> }
> - m = asm_fuseopm(as, 0, ir->op2, rset_exclude(allow, rn));
> + m = asm_fuseopm(as, 0, refk, rset_exclude(allow, rn));
> ofs = sizeof(GCstr);
> skipopm:
> emit_lso(as, ai, rd, rd, ofs);
> @@ -722,6 +724,7 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
> Reg dest = ra_dest(as, ir, allow);
> Reg tab = ra_alloc1(as, ir->op1, rset_clear(allow, dest));
> Reg key = 0, tmp = RID_TMP;
> + Reg ftmp = RID_NONE, type = RID_NONE, scr = RID_NONE, tisnum = RID_NONE;
> IRRef refkey = ir->op2;
> IRIns *irkey = IR(refkey);
> int isk = irref_isk(ir->op2);
> @@ -751,6 +754,28 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
> }
> }
>
> + /* Allocate constants early. */
> + if (irt_isnum(kt)) {
> + if (!isk) {
> + tisnum = ra_allock(as, LJ_TISNUM << 15, allow);
> + ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
> + rset_clear(allow, tisnum);
> + }
> + } else if (irt_isaddr(kt)) {
> + if (isk) {
> + int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
> + scr = ra_allock(as, kk, allow);
> + } else {
> + scr = ra_scratch(as, allow);
> + }
> + rset_clear(allow, scr);
> + } else {
> + lua_assert(irt_ispri(kt) && !irt_isnil(kt));
> + type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
> + scr = ra_scratch(as, rset_clear(allow, type));
> + rset_clear(allow, scr);
> + }
> +
> /* Key not found in chain: jump to exit (if merged) or load niltv. */
> l_end = emit_label(as);
> as->invmcp = NULL;
> @@ -780,9 +805,6 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
> emit_nm(as, A64I_CMPx, key, tmp);
> emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
> } else {
> - Reg tisnum = ra_allock(as, LJ_TISNUM << 15, allow);
> - Reg ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
> - rset_clear(allow, tisnum);
> emit_nm(as, A64I_FCMPd, key, ftmp);
> emit_dn(as, A64I_FMOV_D_R, (ftmp & 31), (tmp & 31));
> emit_cond_branch(as, CC_LO, l_next);
> @@ -790,31 +812,21 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
> emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.n));
> }
> } else if (irt_isaddr(kt)) {
> - Reg scr;
> if (isk) {
> - int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
> - scr = ra_allock(as, kk, allow);
> emit_nm(as, A64I_CMPx, scr, tmp);
> emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
> } else {
> - scr = ra_scratch(as, allow);
> emit_nm(as, A64I_CMPx, tmp, scr);
> emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key.u64));
> }
> - rset_clear(allow, scr);
> } else {
> - Reg type, scr;
> - lua_assert(irt_ispri(kt) && !irt_isnil(kt));
> - type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
> - scr = ra_scratch(as, rset_clear(allow, type));
> - rset_clear(allow, scr);
> emit_nm(as, A64I_CMPw, scr, type);
> emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));
> }
>
> *l_loop = A64I_BCC | A64F_S19(as->mcp - l_loop) | CC_NE;
> if (!isk && irt_isaddr(kt)) {
> - Reg type = ra_allock(as, (int32_t)irt_toitype(kt), allow);
> + type = ra_allock(as, (int32_t)irt_toitype(kt), allow);
> emit_dnm(as, A64I_ADDx | A64F_SH(A64SH_LSL, 47), tmp, key, type);
> rset_clear(allow, type);
> }
> diff --git a/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua b/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
> new file mode 100644
> index 00000000..ffc449bc
> --- /dev/null
> +++ b/test/tarantool-tests/lj-438-arm64-constant-rematerialization.test.lua
> @@ -0,0 +1,102 @@
> +local tap = require('tap')
> +
> +-- Test file to demonstrate LuaJIT bug with constant
> +-- rematerialization on arm64.
> +-- See also https://github.com/LuaJIT/LuaJIT/pull/438.
> +local test = tap.test('lj-438-arm64-constant-rematerialization')
> +test:plan(1)
> +
> +-- This test file demonstrates the following problem:
> +-- The assembly of an IR instruction allocates a constant into a
> +-- free register. Then it spills another register (due to high
> +-- register pressure), which is rematerialized using the same
> +-- constant (which it assumes is now in the allocated register).
> +-- In case when the first register also happens to be the
> +-- destination register, the constant value is modified before the
> +-- rematerialization.
> +--
> +-- For the code below we get the following register allocation
> +-- order (read from top to bottom (DBG RA reversed)):
> +-- | current IR | operation | IR ref | register
> +-- | 0048 alloc 0038 x0
> +-- | 0048 remat K038 x0
> +-- | 0048 alloc K023 x4
> +--
> +-- Which leads to the following asembly:
> +-- | ...
> +-- | add x4, x4, x0 # x4 modified before x0 rematerialization
> +-- | ldrb w4, [x4, #24]
> +-- | add x0, x4, #24 # constant x0 rematerialization
> +-- | ...
> +-- As a result, the value register x0 holding is incorrect.
> +
> +local empty = {}
> +
> +jit.off()
> +jit.flush()
> +
> +-- XXX: The example below is very fragile. Even the names of
> +-- the variables matter.
> +local function scan(vs)
> + -- The code below is needed to generate high register pressure
> + -- and specific register allocations.
> + for _, v in ipairs(vs) do
> + -- XXX: Just more usage of registers. Nothing significant.
> + local sep = v:find('@')
> + -- Recording of yielding `string.byte()` result encodes XLOAD
> + -- IR. Its assembly modifies x4 register, that is chosen as
> + -- a destination register.
> + -- IR_NE, that using `asm_href()` uses the modified x4
> + -- register as a source for constant x0 rematerialization.
> + -- As far as it is modified before, the result value is
> + -- incorrect.
> + -- luacheck: ignore
> + if v:sub(sep + 2, -2):byte() == 0x3f then -- 0x3f == '?'
> + end
> +
> + -- XXX: Just more usage of registers. Nothing significant.
> + local _ = empty[v]
> +
> + -- Here the `str` strdata value (rematerialized x0 register)
> + -- given to the `lj_str_find()` is invalid on the trace,
> + -- that as a result leading to the core dump.
> + v:find(':')
> + end
> +end
> +
> +jit.on()
> +jit.opt.start('hotloop=1', 'loopunroll=1')
> +
> +-- This wrapper function is needed to avoid excess errors 'leaving
> +-- loop in the root trace'.
> +local function wrap()
> + -- XXX: There are four failing attemts to compile trace for this
> + -- code:
> + -- * The first trace trying to record starts with the ITERL BC
> + -- in `scan()` function. The compilation failed, because
> + -- recording starts at the second iteration, when the loop is
> + -- left.
> + -- * The second trace starts with UGET (scan) in the cycle
> + -- below. Entering calling the `scan` function compilation
> + -- failed, when sees the inner ITERL loop.
> + -- * The third trace starts with GGET (ipairs) in the `scan()`
> + -- function trying to record the hot function. The compilation
> + -- is failed due to facing the inner ITERL loop.
> + -- * At 19th iteration the ITERL trying to be recorded again
> + -- after this instruction become hot again.
> + --
> + -- And, finally, at 39th iteration the `for` loop below is
> + -- recorded after becoming hot again. Now the compiler inlining
> + -- the inner loop and recording doesn't fail.
> + -- The 40th iteration is needed to be sure the compiled mcode is
> + -- correct.
> + for _ = 1, 40 do
> + scan({'ab at xyz'})
> + end
> +end
> +
> +wrap()
> +
> +test:ok(true, 'the resulting trace is correct')
> +
> +os.exit(test:check() and 0 or 1)
> --
> 2.34.1
>
More information about the Tarantool-patches
mailing list