* [Tarantool-patches] [PATCH luajit 0/3] Generation of immediates and TBAR fusion
@ 2025-07-24 9:03 Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 1/3] ARM64: Improve generation of immediates Sergey Kaplun via Tarantool-patches
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-07-24 9:03 UTC (permalink / raw)
To: Sergey Bronnikov; +Cc: tarantool-patches
The first 2 patches are required to avoid conflicts with the last one
patch. The last patch fixes incorrect fusion of instructions across IR,
which leads to incorrect label target.
Related issues:
* https://github.com/tarantool/tarantool/issues/11691
* https://github.com/LuaJIT/LuaJIT/issues/1057
Mike Pall (3):
ARM64: Improve generation of immediates.
ARM64: More improvements to the generation of immediates.
ARM64: Prevent STP fusion for conditional code emitted by TBAR.
src/lj_asm.c | 3 +
src/lj_asm_arm64.h | 26 +++---
src/lj_emit_arm64.h | 66 ++++++++--------
...1057-arm64-stp-fusing-across-tbar.test.lua | 79 +++++++++++++++++++
4 files changed, 126 insertions(+), 48 deletions(-)
create mode 100644 test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua
--
2.50.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Tarantool-patches] [PATCH luajit 1/3] ARM64: Improve generation of immediates.
2025-07-24 9:03 [Tarantool-patches] [PATCH luajit 0/3] Generation of immediates and TBAR fusion Sergey Kaplun via Tarantool-patches
@ 2025-07-24 9:03 ` Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 2/3] ARM64: More improvements to the " Sergey Kaplun via Tarantool-patches
2025-07-24 9:04 ` [Tarantool-patches] [PATCH luajit 3/3] ARM64: Prevent STP fusion for conditional code emitted by TBAR Sergey Kaplun via Tarantool-patches
2 siblings, 0 replies; 4+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-07-24 9:03 UTC (permalink / raw)
To: Sergey Bronnikov; +Cc: tarantool-patches
From: Mike Pall <mike>
(cherry picked from commit ebc4919affbc0f9e8adfb5dede378017c7dd1fdd)
This patch improves the generation of immediates, which are used in the
cdata alignment or arguments to the function calls. Before this patch,
these immediates are rematerialized to the registers like the following
(if they may be encoded in K13 format for logical data processing
instructions):
| orr w1, wzr, 0x3
Instead of recommended [1][2]:
| mov x1, 0x3
Since for the case when a constant may be placed in the register with
the single `mov` [3] instruction, it is preferable to encode the `mov`
instruction instead (which is an alias to `orr`, in most cases).
For the cases when the constant needs at least `mov` and `mov[kn]`
instructions, it is still preferable to try short emitting via `orr` if
possible.
Sergey Kaplun:
* added the description for the patch
[1]: https://developer.arm.com/documentation/ddi0602/2025-06/Shared-Pseudocode/aarch64-functions-movwpreferred
[2]: https://developer.arm.com/documentation/ddi0602/2025-06/Base-Instructions/ORR--immediate---Bitwise-OR--immediate--
[3]: https://developer.arm.com/documentation/ddi0602/2025-06/Base-Instructions/MOV--register---Move-register-value--an-alias-of-ORR--shifted-register--
Part of tarantool/tarantool#11691
---
src/lj_emit_arm64.h | 64 +++++++++++++++++++++++----------------------
1 file changed, 33 insertions(+), 31 deletions(-)
diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
index 5c1bc372..2bb93dd9 100644
--- a/src/lj_emit_arm64.h
+++ b/src/lj_emit_arm64.h
@@ -194,39 +194,41 @@ static int emit_kdelta(ASMState *as, Reg rd, uint64_t k, int lim)
static void emit_loadk(ASMState *as, Reg rd, uint64_t u64, int is64)
{
- uint32_t k13 = emit_isk13(u64, is64);
- if (k13) { /* Can the constant be represented as a bitmask immediate? */
- emit_dn(as, (is64|A64I_ORRw)^k13, rd, RID_ZERO);
- } else {
- int i, zeros = 0, ones = 0, neg;
- if (!is64) u64 = (int64_t)(int32_t)u64; /* Sign-extend. */
- /* Count homogeneous 16 bit fragments. */
- for (i = 0; i < 4; i++) {
- uint64_t frag = (u64 >> i*16) & 0xffff;
- zeros += (frag == 0);
- ones += (frag == 0xffff);
+ int i, zeros = 0, ones = 0, neg;
+ if (!is64) u64 = (int64_t)(int32_t)u64; /* Sign-extend. */
+ /* Count homogeneous 16 bit fragments. */
+ for (i = 0; i < 4; i++) {
+ uint64_t frag = (u64 >> i*16) & 0xffff;
+ zeros += (frag == 0);
+ ones += (frag == 0xffff);
+ }
+ neg = ones > zeros; /* Use MOVN if it pays off. */
+ if ((neg ? ones : zeros) < 3) { /* Need 2+ ins. Try shorter K13 encoding. */
+ uint32_t k13 = emit_isk13(u64, is64);
+ if (k13) {
+ emit_dn(as, (is64|A64I_ORRw)^k13, rd, RID_ZERO);
+ return;
}
- neg = ones > zeros; /* Use MOVN if it pays off. */
- if (!emit_kdelta(as, rd, u64, 4 - (neg ? ones : zeros))) {
- int shift = 0, lshift = 0;
- uint64_t n64 = neg ? ~u64 : u64;
- if (n64 != 0) {
- /* Find first/last fragment to be filled. */
- shift = (63-emit_clz64(n64)) & ~15;
- lshift = emit_ctz64(n64) & ~15;
- }
- /* MOVK requires the original value (u64). */
- while (shift > lshift) {
- uint32_t u16 = (u64 >> shift) & 0xffff;
- /* Skip fragments that are correctly filled by MOVN/MOVZ. */
- if (u16 != (neg ? 0xffff : 0))
- emit_d(as, is64 | A64I_MOVKw | A64F_U16(u16) | A64F_LSL16(shift), rd);
- shift -= 16;
- }
- /* But MOVN needs an inverted value (n64). */
- emit_d(as, (neg ? A64I_MOVNx : A64I_MOVZx) |
- A64F_U16((n64 >> lshift) & 0xffff) | A64F_LSL16(lshift), rd);
+ }
+ if (!emit_kdelta(as, rd, u64, 4 - (neg ? ones : zeros))) {
+ int shift = 0, lshift = 0;
+ uint64_t n64 = neg ? ~u64 : u64;
+ if (n64 != 0) {
+ /* Find first/last fragment to be filled. */
+ shift = (63-emit_clz64(n64)) & ~15;
+ lshift = emit_ctz64(n64) & ~15;
+ }
+ /* MOVK requires the original value (u64). */
+ while (shift > lshift) {
+ uint32_t u16 = (u64 >> shift) & 0xffff;
+ /* Skip fragments that are correctly filled by MOVN/MOVZ. */
+ if (u16 != (neg ? 0xffff : 0))
+ emit_d(as, is64 | A64I_MOVKw | A64F_U16(u16) | A64F_LSL16(shift), rd);
+ shift -= 16;
}
+ /* But MOVN needs an inverted value (n64). */
+ emit_d(as, (neg ? A64I_MOVNx : A64I_MOVZx) |
+ A64F_U16((n64 >> lshift) & 0xffff) | A64F_LSL16(lshift), rd);
}
}
--
2.50.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Tarantool-patches] [PATCH luajit 2/3] ARM64: More improvements to the generation of immediates.
2025-07-24 9:03 [Tarantool-patches] [PATCH luajit 0/3] Generation of immediates and TBAR fusion Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 1/3] ARM64: Improve generation of immediates Sergey Kaplun via Tarantool-patches
@ 2025-07-24 9:03 ` Sergey Kaplun via Tarantool-patches
2025-07-24 9:04 ` [Tarantool-patches] [PATCH luajit 3/3] ARM64: Prevent STP fusion for conditional code emitted by TBAR Sergey Kaplun via Tarantool-patches
2 siblings, 0 replies; 4+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-07-24 9:03 UTC (permalink / raw)
To: Sergey Bronnikov; +Cc: tarantool-patches
From: Mike Pall <mike>
(cherry picked from commit 69138082a3166105faa8cbb25fadb1e4298686c0)
This patch refactors the emitting of immediates for the arm64
architecture. The main changes are the following:
* Use `emit_getgl()`, `emit_setgl()` instead of `emit_lso()`, where it
is possible, since it makes the code cleaner.
* The `RID_GL` is allocated for `g` at the start of the trace emitting.
Also, this register is considered as a candidate to be used as a base
for the N-step offset in `emit_kdelta()`.
* The address of `tmptv` is not rematerialized to the register from the
constant not. It is calculated via the adding the corresponding
offset to `RID_GL`.
Sergey Kaplun:
* added the description for the patch
Part of tarantool/tarantool#11691
---
src/lj_asm.c | 3 +++
src/lj_asm_arm64.h | 23 ++++++++---------------
src/lj_emit_arm64.h | 2 +-
3 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/src/lj_asm.c b/src/lj_asm.c
index 9e81dbc9..f163b2e3 100644
--- a/src/lj_asm.c
+++ b/src/lj_asm.c
@@ -2113,6 +2113,9 @@ static void asm_setup_regsp(ASMState *as)
#endif
ra_setup(as);
+#if LJ_TARGET_ARM64
+ ra_setkref(as, RID_GL, (intptr_t)J2G(as->J));
+#endif
/* Clear reg/sp for constants. */
for (ir = IR(T->nk), lastir = IR(REF_BASE); ir < lastir; ir++) {
diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
index a7f059a2..5a6c60b7 100644
--- a/src/lj_asm_arm64.h
+++ b/src/lj_asm_arm64.h
@@ -690,7 +690,7 @@ static void asm_tvptr(ASMState *as, Reg dest, IRRef ref)
} else {
/* Otherwise use g->tmptv to hold the TValue. */
asm_tvstore64(as, dest, 0, ref);
- ra_allockreg(as, i64ptr(&J2G(as->J)->tmptv), dest);
+ emit_dn(as, A64I_ADDx^emit_isk12(glofs(as, &J2G(as->J)->tmptv)), dest, RID_GL);
}
}
@@ -1269,17 +1269,13 @@ static void asm_tbar(ASMState *as, IRIns *ir)
{
Reg tab = ra_alloc1(as, ir->op1, RSET_GPR);
Reg link = ra_scratch(as, rset_exclude(RSET_GPR, tab));
- Reg gr = ra_allock(as, i64ptr(J2G(as->J)),
- rset_exclude(rset_exclude(RSET_GPR, tab), link));
Reg mark = RID_TMP;
MCLabel l_end = emit_label(as);
emit_lso(as, A64I_STRx, link, tab, (int32_t)offsetof(GCtab, gclist));
emit_lso(as, A64I_STRB, mark, tab, (int32_t)offsetof(GCtab, marked));
- emit_lso(as, A64I_STRx, tab, gr,
- (int32_t)offsetof(global_State, gc.grayagain));
+ emit_setgl(as, tab, gc.grayagain);
emit_dn(as, A64I_ANDw^emit_isk13(~LJ_GC_BLACK, 0), mark, mark);
- emit_lso(as, A64I_LDRx, link, gr,
- (int32_t)offsetof(global_State, gc.grayagain));
+ emit_getgl(as, link, gc.grayagain);
emit_cond_branch(as, CC_EQ, l_end);
emit_n(as, A64I_TSTw^emit_isk13(LJ_GC_BLACK, 0), mark);
emit_lso(as, A64I_LDRB, mark, tab, (int32_t)offsetof(GCtab, marked));
@@ -1299,7 +1295,7 @@ static void asm_obar(ASMState *as, IRIns *ir)
args[0] = ASMREF_TMP1; /* global_State *g */
args[1] = ir->op1; /* TValue *tv */
asm_gencall(as, ci, args);
- ra_allockreg(as, i64ptr(J2G(as->J)), ra_releasetmp(as, ASMREF_TMP1) );
+ emit_dm(as, A64I_MOVx, ra_releasetmp(as, ASMREF_TMP1), RID_GL);
obj = IR(ir->op1)->r;
tmp = ra_scratch(as, rset_exclude(allow, obj));
emit_cond_branch(as, CC_EQ, l_end);
@@ -1808,7 +1804,7 @@ static void asm_gc_check(ASMState *as)
const CCallInfo *ci = &lj_ir_callinfo[IRCALL_lj_gc_step_jit];
IRRef args[2];
MCLabel l_end;
- Reg tmp1, tmp2;
+ Reg tmp2;
ra_evictset(as, RSET_SCRATCH);
l_end = emit_label(as);
/* Exit trace if in GCSatomic or GCSfinalize. Avoids syncing GC objects. */
@@ -1816,17 +1812,14 @@ static void asm_gc_check(ASMState *as)
args[0] = ASMREF_TMP1; /* global_State *g */
args[1] = ASMREF_TMP2; /* MSize steps */
asm_gencall(as, ci, args);
- tmp1 = ra_releasetmp(as, ASMREF_TMP1);
+ emit_dm(as, A64I_MOVx, ra_releasetmp(as, ASMREF_TMP1), RID_GL);
tmp2 = ra_releasetmp(as, ASMREF_TMP2);
emit_loadi(as, tmp2, as->gcsteps);
/* Jump around GC step if GC total < GC threshold. */
emit_cond_branch(as, CC_LS, l_end);
emit_nm(as, A64I_CMPx, RID_TMP, tmp2);
- emit_lso(as, A64I_LDRx, tmp2, tmp1,
- (int32_t)offsetof(global_State, gc.threshold));
- emit_lso(as, A64I_LDRx, RID_TMP, tmp1,
- (int32_t)offsetof(global_State, gc.total));
- ra_allockreg(as, i64ptr(J2G(as->J)), tmp1);
+ emit_getgl(as, tmp2, gc.threshold);
+ emit_getgl(as, RID_TMP, gc.total);
as->gcsteps = 0;
checkmclim(as);
}
diff --git a/src/lj_emit_arm64.h b/src/lj_emit_arm64.h
index 2bb93dd9..184a05ca 100644
--- a/src/lj_emit_arm64.h
+++ b/src/lj_emit_arm64.h
@@ -163,7 +163,7 @@ nopair:
/* Try to find an N-step delta relative to other consts with N < lim. */
static int emit_kdelta(ASMState *as, Reg rd, uint64_t k, int lim)
{
- RegSet work = ~as->freeset & RSET_GPR;
+ RegSet work = (~as->freeset & RSET_GPR) | RID2RSET(RID_GL);
if (lim <= 1) return 0; /* Can't beat that. */
while (work) {
Reg r = rset_picktop(work);
--
2.50.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Tarantool-patches] [PATCH luajit 3/3] ARM64: Prevent STP fusion for conditional code emitted by TBAR.
2025-07-24 9:03 [Tarantool-patches] [PATCH luajit 0/3] Generation of immediates and TBAR fusion Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 1/3] ARM64: Improve generation of immediates Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 2/3] ARM64: More improvements to the " Sergey Kaplun via Tarantool-patches
@ 2025-07-24 9:04 ` Sergey Kaplun via Tarantool-patches
2 siblings, 0 replies; 4+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-07-24 9:04 UTC (permalink / raw)
To: Sergey Bronnikov; +Cc: tarantool-patches
From: Mike Pall <mike>
Thanks to Peter Cawley.
(cherry picked from commit 7cc53f0b85f834dfba1516ea79d59db463e856fa)
Assume we have a trace for the several `setmetatable()` calls to the
same table. This trace contains the following IR:
| 0011 p64 FREF 0003 tab.meta
| ...
| 0018 x0 > tab TNEW 0 0
| 0019 tab TBAR 0003
| 0020 tab FSTORE 0011 0018
The expected mcode to be emitted for the last two IRs is the following:
| 55626cffb0 ldrb w30, [x19, 8] ; tab->marked
| 55626cffb4 tst w30, 0x4 ; Is black?
| 55626cffb8 beq 0x626cffd0 ; Skip marking.
| 55626cffbc ldr x27, [x20, 128]
| 55626cffc0 and w30, w30, 0xfffffffb
| 55626cffc4 str x19, [x20, 128]
| 55626cffcc strb w30, [x19, 8] ; tab->marked
| 55626cffc8 str x27, [x19, 24] ; tab->gclist
| 55626cffd0 str x0, [x19, 32] ; tab->metatable
But the last 2 instructions are fused into the following `stp`:
| 55581dffd0 stp x27, x0, [x19, 48]
Hence, the GC propagation frontier back is done partially, since
`str x27, [x19, 24]` is not skipped despite TBAR semantics. This leads
to the incorrect value in the `gclist` and the segmentation fault during
its traversal on GC step.
This patch prevents this fusion via switching instruction for
`tab->gclist` and `tab->marked` storing.
Sergey Kaplun:
* added the description and the test for the problem
Part of tarantool/tarantool#11691
---
src/lj_asm_arm64.h | 3 +-
...1057-arm64-stp-fusing-across-tbar.test.lua | 79 +++++++++++++++++++
2 files changed, 81 insertions(+), 1 deletion(-)
create mode 100644 test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua
diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
index 5a6c60b7..9b3c0467 100644
--- a/src/lj_asm_arm64.h
+++ b/src/lj_asm_arm64.h
@@ -1271,8 +1271,9 @@ static void asm_tbar(ASMState *as, IRIns *ir)
Reg link = ra_scratch(as, rset_exclude(RSET_GPR, tab));
Reg mark = RID_TMP;
MCLabel l_end = emit_label(as);
- emit_lso(as, A64I_STRx, link, tab, (int32_t)offsetof(GCtab, gclist));
emit_lso(as, A64I_STRB, mark, tab, (int32_t)offsetof(GCtab, marked));
+ /* Keep STRx in the middle to avoid LDP/STP fusion with surrounding code. */
+ emit_lso(as, A64I_STRx, link, tab, (int32_t)offsetof(GCtab, gclist));
emit_setgl(as, tab, gc.grayagain);
emit_dn(as, A64I_ANDw^emit_isk13(~LJ_GC_BLACK, 0), mark, mark);
emit_getgl(as, link, gc.grayagain);
diff --git a/test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua b/test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua
new file mode 100644
index 00000000..27d18916
--- /dev/null
+++ b/test/tarantool-tests/lj-1057-arm64-stp-fusing-across-tbar.test.lua
@@ -0,0 +1,79 @@
+local tap = require('tap')
+
+-- This test demonstrates LuaJIT's incorrect fusing of store
+-- instructions separated by the conditional branch on arm64.
+-- See also https://github.com/LuaJIT/LuaJIT/issues/1057.
+local test = tap.test('lj-1057-arm64-stp-fusing-across-tbar'):skipcond({
+ ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(2)
+
+-- XXX: Simplify the `jit.dump()` output.
+local setmetatable = setmetatable
+
+-- The function below generates the following IR:
+-- | 0011 p64 FREF 0003 tab.meta
+-- | ...
+-- | 0018 x0 > tab TNEW #0 #0
+-- | 0019 tab TBAR 0003
+-- | 0020 tab FSTORE 0011 0018
+-- The expected mcode to be emitted for the last two IRs is the
+-- following:
+-- | 55626cffb0 ldrb w30, [x19, #8] ; tab->marked
+-- | 55626cffb4 tst w30, #0x4 ; Is black?
+-- | 55626cffb8 beq 0x626cffd0 ; Skip marking.
+-- | 55626cffbc ldr x27, [x20, #128]
+-- | 55626cffc0 and w30, w30, #0xfffffffb
+-- | 55626cffc4 str x19, [x20, #128]
+-- | 55626cffcc strb w30, [x19, #8] ; tab->marked
+-- | 55626cffc8 str x27, [x19, #24] ; tab->gclist
+-- | 55626cffd0 str x0, [x19, #32] ; tab->metatable
+--
+-- But the last 2 instructions are fused into the following `stp`:
+-- | 55581dffd0 stp x27, x0, [x19, #48]
+-- Hence, the GC propagation frontier back is done partially,
+-- since `str x27, [x19, #24]` is not skipped despite TBAR
+-- semantics. This leads to the incorrect value in the `gclist`
+-- and the segmentation fault during its traversal on GC step.
+local function trace(target_t)
+ -- Precreate a table for the FLOAD to avoid TNEW in between.
+ local stack_t = {}
+ -- Generate FSTORE TBAR pair. The FSTORE will be dropped due to
+ -- the FSTORE below by DSE.
+ setmetatable(target_t, {})
+ -- Generate FSTORE. TBAR will be dropped by CSE.
+ setmetatable(target_t, stack_t)
+end
+
+jit.opt.start('hotloop=1')
+
+-- XXX: Need to trigger the GC on trace to introspect that the
+-- GC chain is broken. Use empirical 10000 iterations.
+local tab = {}
+for _ = 1, 1e4 do
+ trace(tab)
+end
+
+test:ok(true, 'no assertion failure in the simple loop')
+
+-- The similar test, but be sure that we finish the whole GC
+-- cycle, plus using upvalue instead of stack slot for the target
+-- table.
+
+local target_t = {}
+local function trace2()
+ local stack_t = {}
+ setmetatable(target_t, {})
+ setmetatable(target_t, stack_t)
+end
+
+collectgarbage('collect')
+collectgarbage('setstepmul', 1)
+while not collectgarbage('step') do
+ trace2()
+end
+
+test:ok(true, 'no assertion failure in the whole GC cycle')
+
+test:done(true)
--
2.50.0
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-07-24 9:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-24 9:03 [Tarantool-patches] [PATCH luajit 0/3] Generation of immediates and TBAR fusion Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 1/3] ARM64: Improve generation of immediates Sergey Kaplun via Tarantool-patches
2025-07-24 9:03 ` [Tarantool-patches] [PATCH luajit 2/3] ARM64: More improvements to the " Sergey Kaplun via Tarantool-patches
2025-07-24 9:04 ` [Tarantool-patches] [PATCH luajit 3/3] ARM64: Prevent STP fusion for conditional code emitted by TBAR Sergey Kaplun via Tarantool-patches
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox