* [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
@ 2025-08-27  9:17 Sergey Kaplun via Tarantool-patches
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  0 siblings, 2 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-08-27  9:17 UTC
To: Sergey Bronnikov; +Cc: tarantool-patches

From: Mike Pall <mike>

Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.

(cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)

Assume we have stores/loads from the pointer with offset +488 and -16.
The lower bits of the offset are the same as for the offset (488 + 8).
This leads to the incorrect fusion of these instructions:
| str x20, [x21, 488]
| stur x20, [x21, -16]
to the following instruction:
| stp x20, x20, [x21, 488]

This patch prevents this fusion by more accurate offset comparison.

Sergey Kaplun:
* added the description and the test for the problem

Part of tarantool/tarantool#11691
---

Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
Related issues:
* https://github.com/tarantool/tarantool/issues/11691
* https://github.com/LuaJIT/LuaJIT/issues/1075

 src/lj_emit_arm64.h                           |  17 ++-
 ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
 2 files changed, 142 insertions(+), 4 deletions(-)
 create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua

diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
index 5c1bc372..9dd92c40 100644
--- a/src/lj_emit_arm64.h
+++ b/src/lj_emit_arm64.h
@@ -121,6 +121,17 @@ static int emit_checkofs(A64Ins ai, int64_t ofs)
   }
 }
 
+static LJ_AINLINE uint32_t emit_lso_pair_candidate(A64Ins ai, int ofs, int sc)
+{
+  if (ofs >= 0) {
+    return ai | A64F_U12(ofs>>sc);  /* Subsequent lj_ror checks ofs. */
+  } else if (ofs >= -256) {
+    return (ai^A64I_LS_U) | A64F_S9(ofs & 0x1ff);
+  } else {
+    return A64F_D(31);  /* Will mismatch prev. */
+  }
+}
+
 static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
 {
   int ot = emit_checkofs(ai, ofs), sc = (ai >> 30) & 3;
@@ -132,11 +143,9 @@ static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
     uint32_t prev = *as->mcp & ~A64F_D(31);
     int ofsm = ofs - (1<<sc), ofsp = ofs + (1<<sc);
     A64Ins aip;
-    if (prev == (ai | A64F_N(rn) | A64F_U12(ofsm>>sc)) ||
-        prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsm&0x1ff))) {
+    if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsm, sc)) {
       aip = (A64F_A(rd) | A64F_D(*as->mcp & 31));
-    } else if (prev == (ai | A64F_N(rn) | A64F_U12(ofsp>>sc)) ||
-               prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp&0x1ff))) {
+    } else if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsp, sc)) {
       aip = (A64F_D(rd) | A64F_A(*as->mcp & 31));
       ofsm = ofs;
     } else {
diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
new file mode 100644
index 00000000..c84c3b23
--- /dev/null
+++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
@@ -0,0 +1,129 @@
+local tap = require('tap')
+local ffi = require('ffi')
+
+-- This test demonstrates LuaJIT's incorrect emitting of LDP/STP
+-- instruction fused from LDR/STR with negative offset and
+-- positive offset with the same lower bits on arm64.
+-- See also https://github.com/LuaJIT/LuaJIT/pull/1075.
+local test = tap.test('lj-1075-arm64-incorrect-ldp-stp-fusion'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(6)
+
+-- Amount of iterations to compile and run the invariant part of
+-- the trace.
+local N_ITERATIONS = 4
+
+local EXPECTED = 42
+
+-- 4 slots of redzone for int64_t load/store.
+local REDZONE = 4
+local MASK_IMM7 = 0x7f
+local BUFLEN = (MASK_IMM7 + REDZONE) * 4
+local buf = ffi.new('unsigned char [' .. BUFLEN .. ']', 0)
+
+local function clear_buf()
+  ffi.fill(buf, ffi.sizeof(buf), 0)
+end
+
+-- Initialize the buffer with simple values.
+local function init_buf()
+  -- Limit to fill the buffer. 0 in the top part helps
+  -- to detect the issue.
+  local LIMIT = BUFLEN - 12
+  for i = 0, LIMIT - 1 do
+    buf[i] = i
+  end
+  for i = LIMIT, BUFLEN - 1 do
+    buf[i] = 0
+  end
+end
+
+jit.opt.start('hotloop=2')
+
+-- Assume we have stores/loads from the pointer with offset
+-- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
+-- 1111100. These bits are the same as for the offset (488 + 8).
+-- Thus, before the patch, these two instructions:
+-- | str x20, [x21, #488]
+-- | stur x20, [x21, #-16]
+-- are incorrectly fused to the:
+-- | stp x20, x20, [x21, #488]
+
+-- Test stores.
+
+local start = ffi.cast('unsigned char *', buf)
+-- Use constants to allow optimization to take place.
+local base_ptr = start + 16
+for _ = 1, N_ITERATIONS do
+  -- Save the result only for the last iteration.
+  clear_buf()
+  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
+  -- on the trace before the patch.
+  ffi.cast('uint64_t *', base_ptr + 488)[0] = EXPECTED
+  ffi.cast('uint64_t *', base_ptr - 16)[0] = EXPECTED
+end
+
+test:is(buf[488 + 16], EXPECTED, 'correct store top value')
+test:is(buf[0], EXPECTED, 'correct store bottom value')
+
+-- Test loads.
+
+init_buf()
+
+local top, bottom
+for _ = 1, N_ITERATIONS do
+  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
+  -- on the trace before the patch.
+  top = ffi.cast('uint64_t *', base_ptr + 488)[0]
+  bottom = ffi.cast('uint64_t *', base_ptr - 16)[0]
+end
+
+test:is(top, 0xfffefdfcfbfaf9f8ULL, 'correct load top value')
+test:is(bottom, 0x706050403020100ULL, 'correct load bottom value')
+
+-- Another reproducer that is based on the snapshot restoring.
+-- Its advantage is avoiding FFI usage.
+
+-- Snapshot slots are restored in the reversed order.
+-- The recording order is the following (from the bottom of the
+-- trace to the top):
+-- - 0th (ofs == -16) -- `f64()` replaced the `tail64()` on the
+--   stack,
+-- - 63rd (ofs == 488) -- 1,
+-- - 64th (ofs == 496) -- 2.
+-- At recording, the instructions for the 0th and 63rd slots are
+-- merged like the following:
+-- | str x3, [x19, #496]
+-- | stp x2, x1, [x19, #488]
+-- The first store is dominated by the stp, so the restored value
+-- is incorrect.
+
+-- Function with 63 slots on the stack.
+local function f63()
+  -- 61 unused slots to avoid extra stores in between.
+  -- luacheck: no unused
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _
+  return 1, 2
+end
+
+local function tail63()
+  return f63()
+end
+
+-- Record the trace.
+tail63()
+tail63()
+-- Run the trace.
+local one, two = tail63()
+test:is(one, 1, 'correct 1st value on stack')
+test:is(two, 2, 'correct 2nd value on stack')
+
+test:done(true)
-- 
2.51.0
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
@ 2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  1 sibling, 2 replies; 7+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08  8:54 UTC
To: Sergey Kaplun; +Cc: tarantool-patches

Hi, Sergey,

The test added with initial fix
(test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
segfaults with proposed patch.

CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
-DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON

Arch: ARM64.

Sergey

On 8/27/25 12:17, Sergey Kaplun wrote:

<snipped>
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
@ 2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  1 sibling, 0 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-09-08  9:18 UTC
To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey,
Thanks for the comment, please consider my answer below.

On 08.09.25, Sergey Bronnikov wrote:
> Hi, Sergey,
>
> The test added with initial fix
> (test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
>
> segfaults with proposed patch.
>
> CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
> -DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON
>
> Arch: ARM64.

The lj-1057-arm64-stp-fusing-across-tbar.test.lua test is fixed via the
corresponding patchset. That patchset should be applied as well to avoid
the test failure. With both patchsets applied, I see no regressions.

>
> Sergey
>

<snipped>

-- 
Best regards,
Sergey Kaplun
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
@ 2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
  1 sibling, 1 reply; 7+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08  9:26 UTC
To: Sergey Kaplun; +Cc: tarantool-patches

Hi, Sergey,

thanks for the patch! LGTM with two minor comments

Sergey

On 9/8/25 11:54, Sergey Bronnikov wrote:
>
> Hi, Sergey,
>
> The test added with initial fix
> (test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
>
> segfaults with proposed patch.
>
Please disregard, seems there was a misconfiguration or "dirty" build on
the machine.
>
> CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
> -DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON
>
> Arch: ARM64.
>
> Sergey
>
> On 8/27/25 12:17, Sergey Kaplun wrote:

<snipped>

>> +
>> +jit.opt.start('hotloop=2')

Why 2? It deserves a comment, because usually we use 1 hotloop.

>> +

<snipped>

>> +-- Function with 63 slots on the stack.
>> +local function f63()

Minor: Hardcode a number of slots to the function name looks odd.

The same for tail63. Bumping a number of slots will
require renaming of two functions.

Feel free to ignore.

>> +  -- 61 unused slots to avoid extra stores in between.
>> +  -- luacheck: no unused
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _
>> +  return 1, 2
>> +end
>> +
>> +local function tail63()
>> +  return f63()
>> +end
>> +

<snipped>

>> +test:done(true)
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
@ 2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
  2025-09-08 10:40       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-09-08  9:48 UTC
To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey!
Thanks for the review!
Fixed your comment and force-pushed the branch.

On 08.09.25, Sergey Bronnikov wrote:
> Hi, Sergey,
>
> thanks for the patch! LGTM with two minor comments
>
> Sergey
>

<snipped>

> > On 8/27/25 12:17, Sergey Kaplun wrote:
> >> From: Mike Pall <mike>
> >>
> >> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.
> >>
> >> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)
> >>
> >> Assume we have stores/loads from the pointer with offset +488 and -16.
> >> The lower bits of the offset are the same as for the offset (488 + 8).
> >> This leads to the incorrect fusion of these instructions:
> >> | str x20, [x21, 488]
> >> | stur x20, [x21, -16]
> >> to the following instruction:
> >> | stp x20, x20, [x21, 488]
> >>
> >> This patch prevents this fusion by more accurate offset comparison.
> >>
> >> Sergey Kaplun:
> >> * added the description and the test for the problem
> >>
> >> Part of tarantool/tarantool#11691
> >> ---
> >>
> >> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
> >> Related issues:
> >> * https://github.com/tarantool/tarantool/issues/11691
> >> * https://github.com/LuaJIT/LuaJIT/issues/1075
> >>
> >>  src/lj_emit_arm64.h                           |  17 ++-
> >>  ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
> >>  2 files changed, 142 insertions(+), 4 deletions(-)
> >>  create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >>
> >> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
> >> index 5c1bc372..9dd92c40 100644
> >> --- a/src/lj_emit_arm64.h
> >> +++ b/src/lj_emit_arm64.h

<snipped>

> >> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >> new file mode 100644
> >> index 00000000..c84c3b23
> >> --- /dev/null
> >> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >> @@ -0,0 +1,129 @@

<snipped>

> >> +
> >> +jit.opt.start('hotloop=2')
> > Why 2? It deserves a comment, because usually we use 1 hotloop.

It's a copy-pasting mistake from the aarch64 machine, fixed to
`hotloop=1`, thanks:

===================================================================
diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
index c84c3b23..393a1aa7 100644
--- a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
+++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
@@ -40,7 +40,7 @@ local function init_buf()
   end
 end
 
-jit.opt.start('hotloop=2')
+jit.opt.start('hotloop=1')
 
 -- Assume we have stores/loads from the pointer with offset
 -- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
===================================================================

>

<snipped>

> >> +
> >> +-- Another reproducer that is based on the snapshot restoring.
> >> +-- Its advantage is avoiding FFI usage.
> >> +
> >> +-- Snapshot slots are restored in the reversed order.
> >> +-- The recording order is the following (from the bottom of the
> >> +-- trace to the top):
> >> +-- - 0th (ofs == -16) -- `f64()` replaced the `tail64()` on the
> >> +--   stack,
> >> +-- - 63rd (ofs == 488) -- 1,
> >> +-- - 64th (ofs == 496) -- 2.
> >> +-- At recording, the instructions for the 0th and 63rd slots are
> >> +-- merged like the following:
> >> +-- | str x3, [x19, #496]
> >> +-- | stp x2, x1, [x19, #488]
> >> +-- The first store is dominated by the stp, so the restored value
> >> +-- is incorrect.
> >> +
> >> +-- Function with 63 slots on the stack.
> >> +local function f63()
> > Minor: Hardcode a number of slots to the function name looks odd.

It is mentioned above why exactly this amount of slots is required.
It shouldn't be touched.

> > The same for tail63. Bumping a number of slots will
> > require renaming of two functions.
> > Feel free to ignore.

Ignoring.

> >
> >> +  -- 61 unused slots to avoid extra stores in between.
> >> +  -- luacheck: no unused
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _
> >> +  return 1, 2
> >> +end
> >> +

<snipped>

> >> +test:done(true)

-- 
Best regards,
Sergey Kaplun
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again). 2025-09-08 9:48 ` Sergey Kaplun via Tarantool-patches @ 2025-09-08 10:40 ` Sergey Bronnikov via Tarantool-patches 0 siblings, 0 replies; 7+ messages in thread From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08 10:40 UTC (permalink / raw) To: Sergey Kaplun; +Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 4813 bytes --] LGTM On 9/8/25 12:48, Sergey Kaplun wrote: > Hi, Sergey! > Thanks for the review! > Fixed your comment and force-pushed the branch. > > On 08.09.25, Sergey Bronnikov wrote: >> Hi, Sergey, >> >> thanks for the patch! LGTM with two minor comments >> >> Sergey >> > <snipped> > >>> On 8/27/25 12:17, Sergey Kaplun wrote: >>>> From: Mike Pall <mike> >>>> >>>> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley. >>>> >>>> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b) >>>> >>>> Assume we have stores/loads from the pointer with offset +488 and -16. >>>> The lower bits of the offset are the same as for the offset (488 + 8). >>>> This leads to the incorrect fusion of these instructions: >>>> | str x20, [x21, 488] >>>> | stur x20, [x21, -16] >>>> to the following instruction: >>>> | stp x20, x20, [x21, 488] >>>> >>>> This patch prevents this fusion by more accurate offset comparison. 
>>>>
>>>> Sergey Kaplun:
>>>> * added the description and the test for the problem
>>>>
>>>> Part of tarantool/tarantool#11691
>>>> ---
>>>>
>>>> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
>>>> Related issues:
>>>> * https://github.com/tarantool/tarantool/issues/11691
>>>> * https://github.com/LuaJIT/LuaJIT/issues/1075
>>>>
>>>> src/lj_emit_arm64.h | 17 ++-
>>>> ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
>>>> 2 files changed, 142 insertions(+), 4 deletions(-)
>>>> create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>>
>>>> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
>>>> index 5c1bc372..9dd92c40 100644
>>>> --- a/src/lj_emit_arm64.h
>>>> +++ b/src/lj_emit_arm64.h
> <snipped>
>
>>>> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>> new file mode 100644
>>>> index 00000000..c84c3b23
>>>> --- /dev/null
>>>> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>> @@ -0,0 +1,129 @@
> <snipped>
>
>>>> +
>>>> +jit.opt.start('hotloop=2')
>> Why 2? It deserves a comment, because usually we use 1 hotloop.
> It's a copy-pasting mistake from the aarch64 machine, fixed to
> `hotloop=1`, thanks:

Thanks!

>
> ===================================================================
> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> index c84c3b23..393a1aa7 100644
> --- a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> @@ -40,7 +40,7 @@ local function init_buf()
> end
> end
>
> -jit.opt.start('hotloop=2')
> +jit.opt.start('hotloop=1')
>
> -- Assume we have stores/loads from the pointer with offset
> -- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
> ===================================================================
>
> <snipped>
>
>>>> +
>>>> +-- Another reproducer that is based on the snapshot restoring.
>>>> +-- Its advantage is avoiding FFI usage.
>>>> +
>>>> +-- Snapshot slots are restored in the reversed order.
>>>> +-- The recording order is the following (from the bottom of the
>>>> +-- trace to the top):
>>>> +-- - 0th (ofs == -16) -- `f64()` replaced the `tail64()` on the
>>>> +-- stack,
>>>> +-- - 63rd (ofs == 488) -- 1,
>>>> +-- - 64th (ofs == 496) -- 2.
>>>> +-- At recording, the instructions for the 0th and 63rd slots are
>>>> +-- merged like the following:
>>>> +-- | str x3, [x19, #496]
>>>> +-- | stp x2, x1, [x19, #488]
>>>> +-- The first store is dominated by the stp, so the restored value
>>>> +-- is incorrect.
>>>> +
>>>> +-- Function with 63 slots on the stack.
>>>> +local function f63()
>> Minor: Hardcode a number of slots to the function name looks odd.
> It is mentioned above why exactly this amount of slots is required.
> It shouldn't be touched.

The question was about hard-coding a number in a function name,
not about using exactly this number of slots. Ok, I'll not insist,
as I said in a question.

>> The same for tail63. Bumping a number of slots will
>>
>> require renaming of two functions.
>>
>> Feel free to ignore.
> Ignoring.
>
>>>> + -- 61 unused slots to avoid extra stores in between.
>>>> + -- luacheck: no unused
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _, _, _, _, _, _, _, _, _, _
>>>> + local _
>>>> + return 1, 2
>>>> +end
>>>> +
> <snipped>
>
>>>> +test:done(true)

[-- Attachment #2: Type: text/html, Size: 7680 bytes --]
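Editorial note: the offset collision described in the quoted commit message can be checked with plain integer arithmetic. The sketch below is a simplified model, not LuaJIT code: the `Enc` type and the helpers `enc_u12`, `enc_s9`, `old_pair_match`, `pair_candidate` and `new_pair_match` are hypothetical names introduced here, and the model assumes (as one reading of the patch) that the old check compared the 9-bit field of an unscaled store without looking at the sign of the candidate offset. Only the branch structure of `pair_candidate` follows `emit_lso_pair_candidate` from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of one load/store: which immediate form it uses and
** the raw immediate field. Hypothetical types for this sketch only; this
** is not the real A64Ins bit layout. */
typedef enum { FORM_U12, FORM_S9, FORM_NONE } Form;
typedef struct { Form form; uint32_t imm; } Enc;

/* Scaled unsigned form (str/ldr). A negative ofs yields an out-of-range
** field here, so it cannot equal any valid scaled immediate. */
static Enc enc_u12(int ofs, int sc)
{
  Enc e = { FORM_U12, (uint32_t)ofs >> sc };
  return e;
}

/* Unscaled signed 9-bit form (stur/ldur): only the low 9 bits are kept. */
static Enc enc_s9(int ofs)
{
  Enc e = { FORM_S9, (uint32_t)ofs & 0x1ffu };
  return e;
}

static int enc_eq(Enc a, Enc b)
{
  return a.form == b.form && a.imm == b.imm;
}

/* Model of the old check: try both forms regardless of the sign of ofs. */
static int old_pair_match(Enc prev, int ofs, int sc)
{
  return enc_eq(prev, enc_u12(ofs, sc)) || enc_eq(prev, enc_s9(ofs));
}

/* Model of the patched check: pick exactly one form based on ofs. */
static Enc pair_candidate(int ofs, int sc)
{
  if (ofs >= 0) {
    return enc_u12(ofs, sc);
  } else if (ofs >= -256) {
    return enc_s9(ofs);
  } else {
    Enc e = { FORM_NONE, 0 };  /* Never equal to a real instruction. */
    return e;
  }
}

static int new_pair_match(Enc prev, int ofs, int sc)
{
  return enc_eq(prev, pair_candidate(ofs, sc));
}

#ifdef LSO_DEMO_MAIN
int main(void)
{
  Enc stur_m16 = enc_s9(-16);  /* Models: stur x20, [x21, -16] */
  /* Neighbour candidate for a store at +488 with 8-byte slots (sc = 3). */
  int ofsp = 488 + 8;
  printf("old check matches: %d\n", old_pair_match(stur_m16, ofsp, 3));
  printf("new check matches: %d\n", new_pair_match(stur_m16, ofsp, 3));
  return 0;
}
#endif
```

Compiling with `-DLSO_DEMO_MAIN` builds the small driver. In this model the old check accepts the pair because `-16 & 0x1ff` and `496 & 0x1ff` are both `0x1f0`, while the patched selection compares a non-negative candidate only against the scaled form and therefore rejects it.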
* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
@ 2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  1 sibling, 0 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-12-13 16:12 UTC
To: Sergey Bronnikov; +Cc: tarantool-patches

I've applied the patch into all long-term branches in tarantool/luajit
and bumped a new version in Tarantool's master [1], release/3.5 [2],
release/3.3 [3], release/3.2 [4] and release/2.11 [5].

[1]: https://github.com/tarantool/tarantool/pull/12129
[2]: https://github.com/tarantool/tarantool/pull/12130
[3]: https://github.com/tarantool/tarantool/pull/12131
[4]: https://github.com/tarantool/tarantool/pull/12132
[5]: https://github.com/tarantool/tarantool/pull/12133

--
Best regards,
Sergey Kaplun
Thread overview: 7+ messages
2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
2025-09-08  9:18 ` Sergey Kaplun via Tarantool-patches
2025-09-08  9:26 ` Sergey Bronnikov via Tarantool-patches
2025-09-08  9:48 ` Sergey Kaplun via Tarantool-patches
2025-09-08 10:40 ` Sergey Bronnikov via Tarantool-patches
2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches