[Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).

Tarantool development patches archive

* [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
@ 2025-08-27  9:17 Sergey Kaplun via Tarantool-patches
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  0 siblings, 2 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-08-27  9:17 UTC
  To: Sergey Bronnikov; +Cc: tarantool-patches

From: Mike Pall <mike>

Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.

(cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)

Assume we have stores/loads from the pointer with offsets +488 and -16.
The lower 9 bits of the offset -16 are the same as for the offset
(488 + 8). This leads to the incorrect fusion of these instructions:
| str   x20, [x21, 488]
| stur  x20, [x21, -16]
to the following instruction:
| stp   x20, x20, [x21, 488]

This patch prevents the fusion by comparing the offsets more accurately.

Sergey Kaplun:
* added the description and the test for the problem

Part of tarantool/tarantool#11691
---

Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
Related issues:
* https://github.com/tarantool/tarantool/issues/11691
* https://github.com/LuaJIT/LuaJIT/issues/1075
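
To illustrate the aliasing described in the commit message, here is a
minimal standalone C sketch. It is not LuaJIT code: the ls_form/ls_imm
types and all helper names are made up for illustration and model only
the addressing form and the raw immediate field. The sketch assumes
sc == 3 (64-bit accesses) and leaves out the range and alignment checks
that the real emitter performs.

```c
#include <stdint.h>

/* Hypothetical model of a load/store immediate: which addressing form
 * is used and what ends up in the immediate field. */
enum ls_form { FORM_U12_SCALED, FORM_S9_UNSCALED, FORM_NONE };

struct ls_imm {
  enum ls_form form;
  uint32_t field;
};

/* The 9-bit field that an unscaled (stur/ldur-style) access stores. */
static uint32_t imm9_field(int ofs)
{
  return (uint32_t)ofs & 0x1ff;
}

/* Pre-patch style check: only the masked 9-bit field is compared, so
 * the sign of the neighbour offset is never taken into account. */
static int old_s9_match(struct ls_imm prev, int neighbour_ofs)
{
  return prev.form == FORM_S9_UNSCALED &&
         prev.field == imm9_field(neighbour_ofs);
}

/* Post-patch style rule: each neighbour offset has exactly one valid
 * form, chosen by its sign and range. */
static struct ls_imm pair_candidate(int ofs, int sc)
{
  struct ls_imm c;
  if (ofs >= 0) {
    c.form = FORM_U12_SCALED;
    c.field = (uint32_t)ofs >> sc;
  } else if (ofs >= -256) {
    c.form = FORM_S9_UNSCALED;
    c.field = imm9_field(ofs);
  } else {
    c.form = FORM_NONE;  /* Never matches a previous instruction. */
    c.field = 0;
  }
  return c;
}

static int new_match(struct ls_imm prev, int neighbour_ofs, int sc)
{
  struct ls_imm c = pair_candidate(neighbour_ofs, sc);
  return c.form != FORM_NONE && c.form == prev.form &&
         c.field == prev.field;
}
```

With prev modelled as the stur at -16, i.e. {FORM_S9_UNSCALED, 0x1f0},
old_s9_match(prev, 488 + 8) reports a match, while
new_match(prev, 488 + 8, 3) does not: a non-negative neighbour offset
can only pair with the scaled unsigned form.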

 src/lj_emit_arm64.h                           |  17 ++-
 ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
 2 files changed, 142 insertions(+), 4 deletions(-)
 create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua

diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
index 5c1bc372..9dd92c40 100644
--- a/src/lj_emit_arm64.h
+++ b/src/lj_emit_arm64.h
@@ -121,6 +121,17 @@ static int emit_checkofs(A64Ins ai, int64_t ofs)
   }
 }
 
+static LJ_AINLINE uint32_t emit_lso_pair_candidate(A64Ins ai, int ofs, int sc)
+{
+  if (ofs >= 0) {
+    return ai | A64F_U12(ofs>>sc);  /* Subsequent lj_ror checks ofs. */
+  } else if (ofs >= -256) {
+    return (ai^A64I_LS_U) | A64F_S9(ofs & 0x1ff);
+  } else {
+    return A64F_D(31);  /* Will mismatch prev. */
+  }
+}
+
 static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
 {
   int ot = emit_checkofs(ai, ofs), sc = (ai >> 30) & 3;
@@ -132,11 +143,9 @@ static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
     uint32_t prev = *as->mcp & ~A64F_D(31);
     int ofsm = ofs - (1<<sc), ofsp = ofs + (1<<sc);
     A64Ins aip;
-    if (prev == (ai | A64F_N(rn) | A64F_U12(ofsm>>sc)) ||
-	prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsm&0x1ff))) {
+    if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsm, sc)) {
       aip = (A64F_A(rd) | A64F_D(*as->mcp & 31));
-    } else if (prev == (ai | A64F_N(rn) | A64F_U12(ofsp>>sc)) ||
-	       prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp&0x1ff))) {
+    } else if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsp, sc)) {
       aip = (A64F_D(rd) | A64F_A(*as->mcp & 31));
       ofsm = ofs;
     } else {
diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
new file mode 100644
index 00000000..c84c3b23
--- /dev/null
+++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
@@ -0,0 +1,129 @@
+local tap = require('tap')
+local ffi = require('ffi')
+
+-- This test demonstrates LuaJIT's incorrect emitting of LDP/STP
+-- instruction fused from LDR/STR with negative offset and
+-- positive offset with the same lower bits on arm64.
+-- See also https://github.com/LuaJIT/LuaJIT/pull/1075.
+local test = tap.test('lj-1075-arm64-incorrect-ldp-stp-fusion'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(6)
+
+-- Amount of iterations to compile and run the invariant part of
+-- the trace.
+local N_ITERATIONS = 4
+
+local EXPECTED = 42
+
+-- 4 slots of redzone for int64_t load/store.
+local REDZONE = 4
+local MASK_IMM7 = 0x7f
+local BUFLEN = (MASK_IMM7 + REDZONE) * 4
+local buf = ffi.new('unsigned char [' .. BUFLEN .. ']', 0)
+
+local function clear_buf()
+  ffi.fill(buf, ffi.sizeof(buf), 0)
+end
+
+-- Initialize the buffer with simple values.
+local function init_buf()
+  -- Limit to fill the buffer. 0 in the top part helps
+  -- to detect the issue.
+  local LIMIT = BUFLEN - 12
+  for i = 0, LIMIT - 1  do
+    buf[i] = i
+  end
+  for i = LIMIT, BUFLEN - 1  do
+    buf[i] = 0
+  end
+end
+
+jit.opt.start('hotloop=2')
+
+-- Assume we have stores/loads from the pointer with offset
+-- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
+-- 1111100. These bits are the same as for the offset (488 + 8).
+-- Thus, before the patch, these two instructions:
+-- | str   x20, [x21, #488]
+-- | stur  x20, [x21, #-16]
+-- are incorrectly fused to the:
+-- | stp   x20, x20, [x21, #488]
+
+-- Test stores.
+
+local start = ffi.cast('unsigned char *', buf)
+-- Use constants to allow optimization to take place.
+local base_ptr = start + 16
+for _ = 1, N_ITERATIONS do
+  -- Save the result only for the last iteration.
+  clear_buf()
+  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
+  -- on the trace before the patch.
+  ffi.cast('uint64_t *', base_ptr + 488)[0] = EXPECTED
+  ffi.cast('uint64_t *', base_ptr - 16)[0] = EXPECTED
+end
+
+test:is(buf[488 + 16], EXPECTED, 'correct store top value')
+test:is(buf[0], EXPECTED, 'correct store bottom value')
+
+-- Test loads.
+
+init_buf()
+
+local top, bottom
+for _ = 1, N_ITERATIONS do
+  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
+  -- on the trace before the patch.
+  top = ffi.cast('uint64_t *', base_ptr + 488)[0]
+  bottom = ffi.cast('uint64_t *', base_ptr - 16)[0]
+end
+
+test:is(top, 0xfffefdfcfbfaf9f8ULL, 'correct load top value')
+test:is(bottom, 0x706050403020100ULL, 'correct load bottom value')
+
+-- Another reproducer that is based on the snapshot restoring.
+-- Its advantage is avoiding FFI usage.
+
+-- Snapshot slots are restored in the reversed order.
+-- The recording order is the following (from the bottom of the
+-- trace to the top):
+-- - 0th  (ofs == -16) -- `f63()` replaced the `tail63()` on the
+--                         stack,
+-- - 63rd (ofs == 488) -- 1,
+-- - 64th (ofs == 496) -- 2.
+-- At recording, the instructions for the 0th and 63rd slots are
+-- merged like the following:
+-- | str   x3, [x19, #496]
+-- | stp   x2, x1, [x19, #488]
+-- The first store is dominated by the stp, so the restored value
+-- is incorrect.
+
+-- Function with 63 slots on the stack.
+local function f63()
+  -- 61 unused slots to avoid extra stores in between.
+  -- luacheck: no unused
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _, _, _, _, _, _, _, _, _, _
+  local _
+  return 1, 2
+end
+
+local function tail63()
+  return f63()
+end
+
+-- Record the trace.
+tail63()
+tail63()
+-- Run the trace.
+local one, two = tail63()
+test:is(one, 1, 'correct 1st value on stack')
+test:is(two, 2, 'correct 2nd value on stack')
+
+test:done(true)
-- 
2.51.0
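
For readers checking the expected values in the load test, the buffer
layout can be replayed in plain C. This is only a sketch: the helper
names are hypothetical, the constants are copied from the test, and the
byte order is assumed to be little-endian (as on the arm64 targets the
test is written for).

```c
#include <stdint.h>

/* Constants mirroring the test: (MASK_IMM7 + REDZONE) * 4 bytes, the
 * last 12 bytes zeroed, and base_ptr = start + 16. */
enum { BUFLEN = (0x7f + 4) * 4, LIMIT = BUFLEN - 12, BASE = 16 };

/* Same fill pattern as init_buf(): buf[i] = i (mod 256) below LIMIT,
 * zeros from LIMIT up to the end. */
static void init_buf(unsigned char *buf)
{
  int i;
  for (i = 0; i < LIMIT; i++)
    buf[i] = (unsigned char)i;
  for (i = LIMIT; i < BUFLEN; i++)
    buf[i] = 0;
}

/* Read 8 bytes at base_ptr + ofs as a little-endian 64-bit value. */
static uint64_t load_le64(const unsigned char *buf, int ofs)
{
  uint64_t v = 0;
  int i;
  for (i = 7; i >= 0; i--)
    v = (v << 8) | buf[BASE + ofs + i];
  return v;
}
```

After init_buf(), load_le64(buf, 488) and load_le64(buf, -16) give the
two constants the test expects, while load_le64(buf, 496) lands in the
zeroed tail. That is why a load wrongly redirected to base_ptr + 496
is easy to detect.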



* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
@ 2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  1 sibling, 2 replies; 7+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08  8:54 UTC
  To: Sergey Kaplun; +Cc: tarantool-patches


Hi, Sergey,

The test added with the initial fix
(test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
segfaults with the proposed patch.

CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug 
-DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON

Arch: ARM64.

Sergey

On 8/27/25 12:17, Sergey Kaplun wrote:
> From: Mike Pall <mike>
>
> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.
>
> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)
>
> Assume we have stores/loads from the pointer with offset +488 and -16.
> The lower bits of the offset are the same as for the offset (488 + 8).
> This leads to the incorrect fusion of these instructions:
> | str   x20, [x21, 488]
> | stur  x20, [x21, -16]
> to the following instruction:
> | stp   x20, x20, [x21, 488]
>
> This patch prevents this fusion by more accurate offset comparison.
>
> Sergey Kaplun:
> * added the description and the test for the problem
>
> Part of tarantool/tarantool#11691
> ---
>
> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
> Related issues:
> * https://github.com/tarantool/tarantool/issues/11691
> * https://github.com/LuaJIT/LuaJIT/issues/1075
>
>   src/lj_emit_arm64.h                           |  17 ++-
>   ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
>   2 files changed, 142 insertions(+), 4 deletions(-)
>   create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>
> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
> index 5c1bc372..9dd92c40 100644
> --- a/src/lj_emit_arm64.h
> +++ b/src/lj_emit_arm64.h
> @@ -121,6 +121,17 @@ static int emit_checkofs(A64Ins ai, int64_t ofs)
>     }
>   }
>   
> +static LJ_AINLINE uint32_t emit_lso_pair_candidate(A64Ins ai, int ofs, int sc)
> +{
> +  if (ofs >= 0) {
> +    return ai | A64F_U12(ofs>>sc);  /* Subsequent lj_ror checks ofs. */
> +  } else if (ofs >= -256) {
> +    return (ai^A64I_LS_U) | A64F_S9(ofs & 0x1ff);
> +  } else {
> +    return A64F_D(31);  /* Will mismatch prev. */
> +  }
> +}
> +
>   static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
>   {
>     int ot = emit_checkofs(ai, ofs), sc = (ai >> 30) & 3;
> @@ -132,11 +143,9 @@ static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
>       uint32_t prev = *as->mcp & ~A64F_D(31);
>       int ofsm = ofs - (1<<sc), ofsp = ofs + (1<<sc);
>       A64Ins aip;
> -    if (prev == (ai | A64F_N(rn) | A64F_U12(ofsm>>sc)) ||
> -	prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsm&0x1ff))) {
> +    if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsm, sc)) {
>         aip = (A64F_A(rd) | A64F_D(*as->mcp & 31));
> -    } else if (prev == (ai | A64F_N(rn) | A64F_U12(ofsp>>sc)) ||
> -	       prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp&0x1ff))) {
> +    } else if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsp, sc)) {
>         aip = (A64F_D(rd) | A64F_A(*as->mcp & 31));
>         ofsm = ofs;
>       } else {
> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> new file mode 100644
> index 00000000..c84c3b23
> --- /dev/null
> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> @@ -0,0 +1,129 @@
> +local tap = require('tap')
> +local ffi = require('ffi')
> +
> +-- This test demonstrates LuaJIT's incorrect emitting of LDP/STP
> +-- instruction fused from LDR/STR with negative offset and
> +-- positive offset with the same lower bits on arm64.
> +-- See also https://github.com/LuaJIT/LuaJIT/pull/1075.
> +local test = tap.test('lj-1075-arm64-incorrect-ldp-stp-fusion'):skipcond({
> +  ['Test requires JIT enabled'] = not jit.status(),
> +})
> +
> +test:plan(6)
> +
> +-- Amount of iterations to compile and run the invariant part of
> +-- the trace.
> +local N_ITERATIONS = 4
> +
> +local EXPECTED = 42
> +
> +-- 4 slots of redzone for int64_t load/store.
> +local REDZONE = 4
> +local MASK_IMM7 = 0x7f
> +local BUFLEN = (MASK_IMM7 + REDZONE) * 4
> +local buf = ffi.new('unsigned char [' .. BUFLEN .. ']', 0)
> +
> +local function clear_buf()
> +  ffi.fill(buf, ffi.sizeof(buf), 0)
> +end
> +
> +-- Initialize the buffer with simple values.
> +local function init_buf()
> +  -- Limit to fill the buffer. 0 in the top part helps
> +  -- to detect the issue.
> +  local LIMIT = BUFLEN - 12
> +  for i = 0, LIMIT - 1  do
> +    buf[i] = i
> +  end
> +  for i = LIMIT, BUFLEN - 1  do
> +    buf[i] = 0
> +  end
> +end
> +
> +jit.opt.start('hotloop=2')
> +
> +-- Assume we have stores/loads from the pointer with offset
> +-- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
> +-- 1111100. These bits are the same as for the offset (488 + 8).
> +-- Thus, before the patch, these two instructions:
> +-- | str   x20, [x21, #488]
> +-- | stur  x20, [x21, #-16]
> +-- are incorrectly fused to the:
> +-- | stp   x20, x20, [x21, #488]
> +
> +-- Test stores.
> +
> +local start = ffi.cast('unsigned char *', buf)
> +-- Use constants to allow optimization to take place.
> +local base_ptr = start + 16
> +for _ = 1, N_ITERATIONS do
> +  -- Save the result only for the last iteration.
> +  clear_buf()
> +  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
> +  -- on the trace before the patch.
> +  ffi.cast('uint64_t *', base_ptr + 488)[0] = EXPECTED
> +  ffi.cast('uint64_t *', base_ptr - 16)[0] = EXPECTED
> +end
> +
> +test:is(buf[488 + 16], EXPECTED, 'correct store top value')
> +test:is(buf[0], EXPECTED, 'correct store bottom value')
> +
> +-- Test loads.
> +
> +init_buf()
> +
> +local top, bottom
> +for _ = 1, N_ITERATIONS do
> +  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
> +  -- on the trace before the patch.
> +  top = ffi.cast('uint64_t *', base_ptr + 488)[0]
> +  bottom = ffi.cast('uint64_t *', base_ptr - 16)[0]
> +end
> +
> +test:is(top, 0xfffefdfcfbfaf9f8ULL, 'correct load top value')
> +test:is(bottom, 0x706050403020100ULL, 'correct load bottom value')
> +
> +-- Another reproducer that is based on the snapshot restoring.
> +-- Its advantage is avoiding FFI usage.
> +
> +-- Snapshot slots are restored in the reversed order.
> +-- The recording order is the following (from the bottom of the
> +-- trace to the top):
> +-- - 0th  (ofs == -16) -- `f64()` replaced the `tail64()` on the
> +--                         stack,
> +-- - 63rd (ofs == 488) -- 1,
> +-- - 64th (ofs == 496) -- 2.
> +-- At recording, the instructions for the 0th and 63rd slots are
> +-- merged like the following:
> +-- | str   x3, [x19, #496]
> +-- | stp   x2, x1, [x19, #488]
> +-- The first store is dominated by the stp, so the restored value
> +-- is incorrect.
> +
> +-- Function with 63 slots on the stack.
> +local function f63()
> +  -- 61 unused slots to avoid extra stores in between.
> +  -- luacheck: no unused
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _, _, _, _, _, _, _, _, _, _
> +  local _
> +  return 1, 2
> +end
> +
> +local function tail63()
> +  return f63()
> +end
> +
> +-- Record the trace.
> +tail63()
> +tail63()
> +-- Run the trace.
> +local one, two = tail63()
> +test:is(one, 1, 'correct 1st value on stack')
> +test:is(two, 2, 'correct 2nd value on stack')
> +
> +test:done(true)


* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
@ 2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  1 sibling, 0 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-09-08  9:18 UTC
  To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey,
Thanks for the comment, please consider my answer below.

On 08.09.25, Sergey Bronnikov wrote:
> Hi, Sergey,
> 
> The test added with initial fix 
> (test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
> 
> segfaults with proposed patch.
> 
> CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug 
> -DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON
> 
> Arch: ARM64.

The lj-1057-arm64-stp-fusing-across-tbar.test.lua test is fixed via the
corresponding patchset, which has to be applied to avoid these test
failures. With both patchsets applied, I see no regressions.

> 
> Sergey
> 

<snipped>

-- 
Best regards,
Sergey Kaplun


* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
@ 2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
  2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
  1 sibling, 1 reply; 7+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08  9:26 UTC
  To: Sergey Kaplun; +Cc: tarantool-patches


Hi, Sergey,

Thanks for the patch! LGTM with two minor comments.

Sergey

On 9/8/25 11:54, Sergey Bronnikov wrote:
>
> Hi, Sergey,
>
> The test added with initial fix 
> (test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua)
>
> segfaults with proposed patch.
>
Please disregard this: it seems there was a misconfiguration or a
"dirty" build on the machine.
>
> CMake configuration: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug 
> -DLUA_USE_ASSERT=ON -DLUA_USE_APICHECK=ON
>
> Arch: ARM64.
>
> Sergey
>
> On 8/27/25 12:17, Sergey Kaplun wrote:
>> From: Mike Pall <mike>
>>
>> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.
>>
>> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)
>>
>> Assume we have stores/loads from the pointer with offset +488 and -16.
>> The lower bits of the offset are the same as for the offset (488 + 8).
>> This leads to the incorrect fusion of these instructions:
>> | str   x20, [x21, 488]
>> | stur  x20, [x21, -16]
>> to the following instruction:
>> | stp   x20, x20, [x21, 488]
>>
>> This patch prevents this fusion by more accurate offset comparison.
>>
>> Sergey Kaplun:
>> * added the description and the test for the problem
>>
>> Part of tarantool/tarantool#11691
>> ---
>>
>> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
>> Related issues:
>> * https://github.com/tarantool/tarantool/issues/11691
>> * https://github.com/LuaJIT/LuaJIT/issues/1075
>>
>>   src/lj_emit_arm64.h                           |  17 ++-
>>   ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
>>   2 files changed, 142 insertions(+), 4 deletions(-)
>>   create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>
>> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
>> index 5c1bc372..9dd92c40 100644
>> --- a/src/lj_emit_arm64.h
>> +++ b/src/lj_emit_arm64.h
>> @@ -121,6 +121,17 @@ static int emit_checkofs(A64Ins ai, int64_t ofs)
>>     }
>>   }
>>   
>> +static LJ_AINLINE uint32_t emit_lso_pair_candidate(A64Ins ai, int ofs, int sc)
>> +{
>> +  if (ofs >= 0) {
>> +    return ai | A64F_U12(ofs>>sc);  /* Subsequent lj_ror checks ofs. */
>> +  } else if (ofs >= -256) {
>> +    return (ai^A64I_LS_U) | A64F_S9(ofs & 0x1ff);
>> +  } else {
>> +    return A64F_D(31);  /* Will mismatch prev. */
>> +  }
>> +}
>> +
>>   static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
>>   {
>>     int ot = emit_checkofs(ai, ofs), sc = (ai >> 30) & 3;
>> @@ -132,11 +143,9 @@ static void emit_lso(ASMState *as, A64Ins ai, Reg rd, Reg rn, int64_t ofs)
>>       uint32_t prev = *as->mcp & ~A64F_D(31);
>>       int ofsm = ofs - (1<<sc), ofsp = ofs + (1<<sc);
>>       A64Ins aip;
>> -    if (prev == (ai | A64F_N(rn) | A64F_U12(ofsm>>sc)) ||
>> -	prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsm&0x1ff))) {
>> +    if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsm, sc)) {
>>         aip = (A64F_A(rd) | A64F_D(*as->mcp & 31));
>> -    } else if (prev == (ai | A64F_N(rn) | A64F_U12(ofsp>>sc)) ||
>> -	       prev == ((ai^A64I_LS_U) | A64F_N(rn) | A64F_S9(ofsp&0x1ff))) {
>> +    } else if (prev == emit_lso_pair_candidate(ai | A64F_N(rn), ofsp, sc)) {
>>         aip = (A64F_D(rd) | A64F_A(*as->mcp & 31));
>>         ofsm = ofs;
>>       } else {
>> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>> new file mode 100644
>> index 00000000..c84c3b23
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>> @@ -0,0 +1,129 @@
>> +local tap = require('tap')
>> +local ffi = require('ffi')
>> +
>> +-- This test demonstrates LuaJIT's incorrect emitting of LDP/STP
>> +-- instruction fused from LDR/STR with negative offset and
>> +-- positive offset with the same lower bits on arm64.
>> +-- See also https://github.com/LuaJIT/LuaJIT/pull/1075.
>> +local test = tap.test('lj-1075-arm64-incorrect-ldp-stp-fusion'):skipcond({
>> +  ['Test requires JIT enabled'] = not jit.status(),
>> +})
>> +
>> +test:plan(6)
>> +
>> +-- Amount of iterations to compile and run the invariant part of
>> +-- the trace.
>> +local N_ITERATIONS = 4
>> +
>> +local EXPECTED = 42
>> +
>> +-- 4 slots of redzone for int64_t load/store.
>> +local REDZONE = 4
>> +local MASK_IMM7 = 0x7f
>> +local BUFLEN = (MASK_IMM7 + REDZONE) * 4
>> +local buf = ffi.new('unsigned char [' .. BUFLEN .. ']', 0)
>> +
>> +local function clear_buf()
>> +  ffi.fill(buf, ffi.sizeof(buf), 0)
>> +end
>> +
>> +-- Initialize the buffer with simple values.
>> +local function init_buf()
>> +  -- Limit to fill the buffer. 0 in the top part helps
>> +  -- to detect the issue.
>> +  local LIMIT = BUFLEN - 12
>> +  for i = 0, LIMIT - 1  do
>> +    buf[i] = i
>> +  end
>> +  for i = LIMIT, BUFLEN - 1  do
>> +    buf[i] = 0
>> +  end
>> +end
>> +
>> +jit.opt.start('hotloop=2')

Why 2? It deserves a comment, because we usually use `hotloop=1`.


>> +
>> +-- Assume we have stores/loads from the pointer with offset
>> +-- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
>> +-- 1111100. These bits are the same as for the offset (488 + 8).
>> +-- Thus, before the patch, these two instructions:
>> +-- | str   x20, [x21, #488]
>> +-- | stur  x20, [x21, #-16]
>> +-- are incorrectly fused to the:
>> +-- | stp   x20, x20, [x21, #488]
>> +
>> +-- Test stores.
>> +
>> +local start = ffi.cast('unsigned char *', buf)
>> +-- Use constants to allow optimization to take place.
>> +local base_ptr = start + 16
>> +for _ = 1, N_ITERATIONS do
>> +  -- Save the result only for the last iteration.
>> +  clear_buf()
>> +  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
>> +  -- on the trace before the patch.
>> +  ffi.cast('uint64_t *', base_ptr + 488)[0] = EXPECTED
>> +  ffi.cast('uint64_t *', base_ptr - 16)[0] = EXPECTED
>> +end
>> +
>> +test:is(buf[488 + 16], EXPECTED, 'correct store top value')
>> +test:is(buf[0], EXPECTED, 'correct store bottom value')
>> +
>> +-- Test loads.
>> +
>> +init_buf()
>> +
>> +local top, bottom
>> +for _ = 1, N_ITERATIONS do
>> +  -- These 2 accesses become `base_ptr + 488` and `base_ptr + 496`
>> +  -- on the trace before the patch.
>> +  top = ffi.cast('uint64_t *', base_ptr + 488)[0]
>> +  bottom = ffi.cast('uint64_t *', base_ptr - 16)[0]
>> +end
>> +
>> +test:is(top, 0xfffefdfcfbfaf9f8ULL, 'correct load top value')
>> +test:is(bottom, 0x706050403020100ULL, 'correct load bottom value')
>> +
>> +-- Another reproducer that is based on the snapshot restoring.
>> +-- Its advantage is avoiding FFI usage.
>> +
>> +-- Snapshot slots are restored in the reversed order.
>> +-- The recording order is the following (from the bottom of the
>> +-- trace to the top):
>> +-- - 0th  (ofs == -16) -- `f64()` replaced the `tail64()` on the
>> +--                         stack,
>> +-- - 63rd (ofs == 488) -- 1,
>> +-- - 64th (ofs == 496) -- 2.
>> +-- At recording, the instructions for the 0th and 63rd slots are
>> +-- merged like the following:
>> +-- | str   x3, [x19, #496]
>> +-- | stp   x2, x1, [x19, #488]
>> +-- The first store is dominated by the stp, so the restored value
>> +-- is incorrect.
>> +
>> +-- Function with 63 slots on the stack.
>> +local function f63()

Minor: hardcoding the number of slots in the function name looks odd.
The same goes for tail63. Bumping the number of slots will
require renaming both functions.

Feel free to ignore.

>> +  -- 61 unused slots to avoid extra stores in between.
>> +  -- luacheck: no unused
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _, _, _, _, _, _, _, _, _, _
>> +  local _
>> +  return 1, 2
>> +end
>> +
>> +local function tail63()
>> +  return f63()
>> +end
>> +
>> +-- Record the trace.
>> +tail63()
>> +tail63()
>> +-- Run the trace.
>> +local one, two = tail63()
>> +test:is(one, 1, 'correct 1st value on stack')
>> +test:is(two, 2, 'correct 2nd value on stack')
>> +
>> +test:done(true)


* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
@ 2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
  2025-09-08 10:40       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-09-08  9:48 UTC
  To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey!
Thanks for the review!
Fixed your comment and force-pushed the branch.

On 08.09.25, Sergey Bronnikov wrote:
> Hi, Sergey,
> 
> thanks for the patch! LGTM with two minor comments
> 
> Sergey
> 

<snipped>

> > On 8/27/25 12:17, Sergey Kaplun wrote:
> >> From: Mike Pall <mike>
> >>
> >> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.
> >>
> >> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)
> >>
> >> Assume we have stores/loads from the pointer with offset +488 and -16.
> >> The lower bits of the offset are the same as for the offset (488 + 8).
> >> This leads to the incorrect fusion of these instructions:
> >> | str   x20, [x21, 488]
> >> | stur  x20, [x21, -16]
> >> to the following instruction:
> >> | stp   x20, x20, [x21, 488]
> >>
> >> This patch prevents this fusion by more accurate offset comparison.
> >>
> >> Sergey Kaplun:
> >> * added the description and the test for the problem
> >>
> >> Part of tarantool/tarantool#11691
> >> ---
> >>
> >> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
> >> Related issues:
> >> * https://github.com/tarantool/tarantool/issues/11691
> >> * https://github.com/LuaJIT/LuaJIT/issues/1075
> >>
> >>   src/lj_emit_arm64.h                           |  17 ++-
> >>   ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
> >>   2 files changed, 142 insertions(+), 4 deletions(-)
> >>   create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >>
> >> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
> >> index 5c1bc372..9dd92c40 100644
> >> --- a/src/lj_emit_arm64.h
> >> +++ b/src/lj_emit_arm64.h

<snipped>

> >> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >> new file mode 100644
> >> index 00000000..c84c3b23
> >> --- /dev/null
> >> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> >> @@ -0,0 +1,129 @@

<snipped>

> >> +
> >> +jit.opt.start('hotloop=2')
> 
> Why 2? It deserves a comment, because usually we use 1 hotloop.

It's a copy-paste mistake from the aarch64 machine. Fixed to
`hotloop=1`, thanks:

===================================================================
diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
index c84c3b23..393a1aa7 100644
--- a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
+++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
@@ -40,7 +40,7 @@ local function init_buf()
   end
 end
 
-jit.opt.start('hotloop=2')
+jit.opt.start('hotloop=1')
 
 -- Assume we have stores/loads from the pointer with offset
 -- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
===================================================================

> 

<snipped>

> >> +
> >> +-- Another reproducer that is based on the snapshot restoring.
> >> +-- Its advantage is avoiding FFI usage.
> >> +
> >> +-- Snapshot slots are restored in the reversed order.
> >> +-- The recording order is the following (from the bottom of the
> >> +-- trace to the top):
> >> +-- - 0th  (ofs == -16) -- `f64()` replaced the `tail64()` on the
> >> +--                         stack,
> >> +-- - 63rd (ofs == 488) -- 1,
> >> +-- - 64th (ofs == 496) -- 2.
> >> +-- At recording, the instructions for the 0th and 63rd slots are
> >> +-- merged like the following:
> >> +-- | str   x3, [x19, #496]
> >> +-- | stp   x2, x1, [x19, #488]
> >> +-- The first store is dominated by the stp, so the restored value
> >> +-- is incorrect.
> >> +
> >> +-- Function with 63 slots on the stack.
> >> +local function f63()
> 
> Minor: Hardcode a number of slots to the function name looks odd.

The comment above explains why exactly this number of slots is
required, so it shouldn't be changed.
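
As a quick sanity check of that comment, the three slot offsets it
lists fit a simple linear rule. The C sketch below is only an
illustration: slot_ofs() and imm9() are hypothetical helpers, and the
formula is inferred from the three (slot, ofs) pairs in the test
comment, not taken from the LuaJIT sources.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of a stack slot relative to the
 * base register, fitted to the pairs (0, -16), (63, 488), (64, 496). */
static int slot_ofs(int slot)
{
  return slot * 8 - 16;
}

/* Lower 9 bits of an offset, as kept by an unscaled store. */
static uint32_t imm9(int ofs)
{
  return (uint32_t)ofs & 0x1ff;
}
```

Here imm9(slot_ofs(0)) equals imm9(slot_ofs(63) + 8), so the 0th slot
looks like the +8 neighbour of the 63rd one when only the masked bits
are compared. With a different number of slots the offsets would not
alias in this way.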

> 
> The same for tail63. Bumping a number of slots will
> 
> require renaming of two functions.
> 
> Feel free to ignore.

Ignoring.

> 
> >> +  -- 61 unused slots to avoid extra stores in between.
> >> +  -- luacheck: no unused
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _, _, _, _, _, _, _, _, _, _
> >> +  local _
> >> +  return 1, 2
> >> +end
> >> +

<snipped>

> >> +test:done(true)

-- 
Best regards,
Sergey Kaplun


* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
@ 2025-09-08 10:40       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 0 replies; 7+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-09-08 10:40 UTC
  To: Sergey Kaplun; +Cc: tarantool-patches


LGTM

On 9/8/25 12:48, Sergey Kaplun wrote:
> Hi, Sergey!
> Thanks for the review!
> Fixed your comment and force-pushed the branch.
>
> On 08.09.25, Sergey Bronnikov wrote:
>> Hi, Sergey,
>>
>> thanks for the patch! LGTM with two minor comments
>>
>> Sergey
>>
> <snipped>
>
>>> On 8/27/25 12:17, Sergey Kaplun wrote:
>>>> From: Mike Pall <mike>
>>>>
>>>> Reported and analyzed by Zhongwei Yao. Fix by Peter Cawley.
>>>>
>>>> (cherry picked from commit b8c6ccd50c61b7a2df5123ddc5a85ac7d089542b)
>>>>
>>>> Assume we have stores/loads from the pointer with offset +488 and -16.
>>>> The lower bits of the offset are the same as for the offset (488 + 8).
>>>> This leads to the incorrect fusion of these instructions:
>>>> | str   x20, [x21, 488]
>>>> | stur  x20, [x21, -16]
>>>> to the following instruction:
>>>> | stp   x20, x20, [x21, 488]
>>>>
>>>> This patch prevents this fusion by more accurate offset comparison.
>>>>
>>>> Sergey Kaplun:
>>>> * added the description and the test for the problem
>>>>
>>>> Part of tarantool/tarantool#11691
>>>> ---
>>>>
>>>> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-1075-arm64-incorrect-ldp-stp-fusion
>>>> Related issues:
>>>> * https://github.com/tarantool/tarantool/issues/11691
>>>> * https://github.com/LuaJIT/LuaJIT/issues/1075
>>>>
>>>>    src/lj_emit_arm64.h                           |  17 ++-
>>>>    ...75-arm64-incorrect-ldp-stp-fusion.test.lua | 129 ++++++++++++++++++
>>>>    2 files changed, 142 insertions(+), 4 deletions(-)
>>>>    create mode 100644 test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>>
>>>> diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
>>>> index 5c1bc372..9dd92c40 100644
>>>> --- a/src/lj_emit_arm64.h
>>>> +++ b/src/lj_emit_arm64.h
> <snipped>
>
>>>> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>> new file mode 100644
>>>> index 00000000..c84c3b23
>>>> --- /dev/null
>>>> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
>>>> @@ -0,0 +1,129 @@
> <snipped>
>
>>>> +
>>>> +jit.opt.start('hotloop=2')
>> Why 2? It deserves a comment, because we usually use hotloop=1.
> It's a copy-paste mistake carried over from the aarch64 machine; fixed
> to `hotloop=1`, thanks:
Thanks!
>
> ===================================================================
> diff --git a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> index c84c3b23..393a1aa7 100644
> --- a/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> +++ b/test/tarantool-tests/lj-1075-arm64-incorrect-ldp-stp-fusion.test.lua
> @@ -40,7 +40,7 @@ local function init_buf()
>     end
>   end
>   
> -jit.opt.start('hotloop=2')
> +jit.opt.start('hotloop=1')
>   
>   -- Assume we have stores/loads from the pointer with offset
>   -- +488 and -16. The lower 7 bits of the offset (-16) >> 2 are
> ===================================================================
>
> <snipped>
>
>>>> +
>>>> +-- Another reproducer that is based on the snapshot restoring.
>>>> +-- Its advantage is avoiding FFI usage.
>>>> +
>>>> +-- Snapshot slots are restored in the reversed order.
>>>> +-- The recording order is the following (from the bottom of the
>>>> +-- trace to the top):
>>>> +-- - 0th  (ofs == -16) -- `f64()` replaced the `tail64()` on the
>>>> +--                         stack,
>>>> +-- - 63rd (ofs == 488) -- 1,
>>>> +-- - 64th (ofs == 496) -- 2.
>>>> +-- At recording, the instructions for the 0th and 63rd slots are
>>>> +-- merged like the following:
>>>> +-- | str   x3, [x19, #496]
>>>> +-- | stp   x2, x1, [x19, #488]
>>>> +-- The first store is dominated by the stp, so the restored value
>>>> +-- is incorrect.
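A small sketch of the slot-to-offset arithmetic listed in the quoted comment, assuming 8-byte stack slots and a -16 bias; both assumptions are inferred only from the three values given there (slot 0 at -16, slot 63 at 488, slot 64 at 496). The names `slot_ofs` and `slots_adjacent` are hypothetical and used for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of a stack slot relative to the base
** register, assuming 8-byte slots and the -16 bias implied by the values
** in the comment (slot 0 -> -16). */
static int32_t slot_ofs(int32_t slot)
{
  return slot * 8 - 16;
}

/* Returns 1 if the stores for two slots touch neighbouring 8-byte cells,
** which is the only case where a single store pair can cover both. */
static int slots_adjacent(int32_t a, int32_t b)
{
  int32_t d = slot_ofs(a) - slot_ofs(b);
  return d == 8 || d == -8;
}
```

Under these assumptions slots 63 and 64 are adjacent, while slots 0 and 63 are 504 bytes apart, so a pair store that merges the 0th and 63rd slots writes to the wrong cell.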
>>>> +
>>>> +-- Function with 63 slots on the stack.
>>>> +local function f63()
>> Minor: Hardcoding the number of slots in the function name looks odd.
> It is mentioned above why exactly this number of slots is required.
> It shouldn't be touched.

The question was about hard-coding a number in a function name, not about
using exactly this number of slots. OK, I won't insist, as I said in my
comment.

>> The same for tail63. Bumping the number of slots will
>> require renaming two functions.
>>
>> Feel free to ignore.
> Ignoring.
>
>>>> +  -- 61 unused slots to avoid extra stores in between.
>>>> +  -- luacheck: no unused
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _, _, _, _, _, _, _, _, _, _
>>>> +  local _
>>>> +  return 1, 2
>>>> +end
>>>> +
> <snipped>
>
>>>> +test:done(true)



* Re: [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again).
  2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
  2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
@ 2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
  1 sibling, 0 replies; 7+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-12-13 16:12 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

I've applied the patch to all long-term branches in tarantool/luajit
and bumped LuaJIT to the new version in Tarantool's master [1],
release/3.5 [2], release/3.3 [3], release/3.2 [4] and release/2.11 [5].

[1]: https://github.com/tarantool/tarantool/pull/12129
[2]: https://github.com/tarantool/tarantool/pull/12130
[3]: https://github.com/tarantool/tarantool/pull/12131
[4]: https://github.com/tarantool/tarantool/pull/12132
[5]: https://github.com/tarantool/tarantool/pull/12133

-- 
Best regards,
Sergey Kaplun


Thread overview: 7+ messages
2025-08-27  9:17 [Tarantool-patches] [PATCH luajit] ARM64: Fix LDP/STP fusion (again) Sergey Kaplun via Tarantool-patches
2025-09-08  8:54 ` Sergey Bronnikov via Tarantool-patches
2025-09-08  9:18   ` Sergey Kaplun via Tarantool-patches
2025-09-08  9:26   ` Sergey Bronnikov via Tarantool-patches
2025-09-08  9:48     ` Sergey Kaplun via Tarantool-patches
2025-09-08 10:40       ` Sergey Bronnikov via Tarantool-patches
2025-12-13 16:12 ` Sergey Kaplun via Tarantool-patches
