[Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations

Tarantool development patches archive
 help / color / mirror / Atom feed

* [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations
@ 2025-01-14 11:06 Sergey Kaplun via Tarantool-patches
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs Sergey Kaplun via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-14 11:06 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

The first commit in the patchset helps to avoid conflicts during the
backporting of the second patch, which fixes the ARM64 behaviour for FMA
optimization. Since the first patch is just a refactoring, it may be
easily applied too.

Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-918-fma-optimization
Related issues:
* https://github.com/LuaJIT/LuaJIT/issues/24
* https://github.com/LuaJIT/LuaJIT/issues/918
* https://github.com/tarantool/tarantool/issues/10709

Mike Pall (2):
  Cleanup CPU detection and tuning for old CPUs.
  Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.

 doc/running.html                              |  8 ++
 src/Makefile.original                         |  1 -
 src/lib_jit.c                                 | 65 +++++-------
 src/lj_arch.h                                 |  6 +-
 src/lj_asm_arm.h                              |  6 +-
 src/lj_asm_arm64.h                            |  3 +-
 src/lj_asm_ppc.h                              |  3 +-
 src/lj_asm_x86.h                              | 33 ++-----
 src/lj_dispatch.c                             |  7 --
 src/lj_emit_x86.h                             |  5 +-
 src/lj_errmsg.h                               |  4 -
 src/lj_jit.h                                  | 98 ++++++++++---------
 src/lj_vmmath.c                               | 13 ++-
 src/ljamalg.c                                 | 10 --
 src/vm_arm64.dasc                             |  4 +-
 ...lj-918-fma-numerical-accuracy-jit.test.lua | 36 +++++++
 .../lj-918-fma-numerical-accuracy.test.lua    | 31 ++++++
 .../lj-918-fma-optimization.test.lua          | 25 +++++
 18 files changed, 214 insertions(+), 144 deletions(-)
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-optimization.test.lua

-- 
2.47.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs.
  2025-01-14 11:06 [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
@ 2025-01-14 11:06 ` Sergey Kaplun via Tarantool-patches
  2025-01-14 11:25   ` Sergey Bronnikov via Tarantool-patches
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable Sergey Kaplun via Tarantool-patches
  2025-01-31  9:28 ` [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
  2 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-14 11:06 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

From: Mike Pall <mike>

(cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)

This patch does the following refactoring:
1) Drops optimizations for the Intel Atom CPU [1]: removes the
   `JIT_F_LEA_AGU` flag and related optimizations. The considerations
   for the use of LEA are complex and very CPU-specific, mostly
   dependent on the number of operands. Mostly, it isn't worth it due to
   the extra register pressure and/or extra instructions.
2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
   `JIT_F_PREFER_IMUL` flag and related optimizations.
3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
   JIT flags are defined as the left shift of `JIT_F_CPU` instead of
   hardcoded constants, similar for the optimization flags.
4) Adds detection of the ARM8 CPU.
5) Drops the check for SSE2 since the VM already presumes CPU supports
   it.
6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
7) Drops outdated comment in the amalgamation file about memory
   requirements.

Sergey Kaplun:
* added the description for the patch

[1]: https://en.wikipedia.org/wiki/Intel_Atom
[2]: https://en.wikipedia.org/wiki/AMD_K8
[3]: https://en.wikipedia.org/wiki/AMD_K10
[4]: https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros

Part of tarantool/tarantool#10709
---
 src/Makefile.original |  1 -
 src/lib_jit.c         | 65 +++++++++++-------------------
 src/lj_arch.h         |  6 +--
 src/lj_asm_x86.h      | 33 +++++----------
 src/lj_dispatch.c     |  7 ----
 src/lj_emit_x86.h     |  5 +--
 src/lj_errmsg.h       |  4 --
 src/lj_jit.h          | 94 +++++++++++++++++++++++--------------------
 src/ljamalg.c         | 10 -----
 9 files changed, 87 insertions(+), 138 deletions(-)

diff --git a/src/Makefile.original b/src/Makefile.original
index 9f55fa32..8d925e3a 100644
--- a/src/Makefile.original
+++ b/src/Makefile.original
@@ -621,7 +621,6 @@ E= @echo
 default all:	$(TARGET_T)
 
 amalg:
-	@grep "^[+|]" ljamalg.c
 	$(MAKE) -f Makefile.original all "LJCORE_O=ljamalg.o"
 
 clean:
diff --git a/src/lib_jit.c b/src/lib_jit.c
index f705f334..9f870f68 100644
--- a/src/lib_jit.c
+++ b/src/lib_jit.c
@@ -104,8 +104,8 @@ LJLIB_CF(jit_status)
   jit_State *J = L2J(L);
   L->top = L->base;
   setboolV(L->top++, (J->flags & JIT_F_ON) ? 1 : 0);
-  flagbits_to_strings(L, J->flags, JIT_F_CPU_FIRST, JIT_F_CPUSTRING);
-  flagbits_to_strings(L, J->flags, JIT_F_OPT_FIRST, JIT_F_OPTSTRING);
+  flagbits_to_strings(L, J->flags, JIT_F_CPU, JIT_F_CPUSTRING);
+  flagbits_to_strings(L, J->flags, JIT_F_OPT, JIT_F_OPTSTRING);
   return (int)(L->top - L->base);
 #else
   setboolV(L->top++, 0);
@@ -467,7 +467,7 @@ static int jitopt_flag(jit_State *J, const char *str)
     str += str[2] == '-' ? 3 : 2;
     set = 0;
   }
-  for (opt = JIT_F_OPT_FIRST; ; opt <<= 1) {
+  for (opt = JIT_F_OPT; ; opt <<= 1) {
     size_t len = *(const uint8_t *)lst;
     if (len == 0)
       break;
@@ -636,59 +636,41 @@ JIT_PARAMDEF(JIT_PARAMINIT)
 #undef JIT_PARAMINIT
   0
 };
-#endif
 
 #if LJ_TARGET_ARM && LJ_TARGET_LINUX
 #include <sys/utsname.h>
 #endif
 
-/* Arch-dependent CPU detection. */
-static uint32_t jit_cpudetect(lua_State *L)
+/* Arch-dependent CPU feature detection. */
+static uint32_t jit_cpudetect(void)
 {
   uint32_t flags = 0;
 #if LJ_TARGET_X86ORX64
+
   uint32_t vendor[4];
   uint32_t features[4];
   if (lj_vm_cpuid(0, vendor) && lj_vm_cpuid(1, features)) {
-#if !LJ_HASJIT
-#define JIT_F_SSE2	2
-#endif
-    flags |= ((features[3] >> 26)&1) * JIT_F_SSE2;
-#if LJ_HASJIT
     flags |= ((features[2] >> 0)&1) * JIT_F_SSE3;
     flags |= ((features[2] >> 19)&1) * JIT_F_SSE4_1;
-    if (vendor[2] == 0x6c65746e) {  /* Intel. */
-      if ((features[0] & 0x0fff0ff0) == 0x000106c0)  /* Atom. */
-	flags |= JIT_F_LEA_AGU;
-    } else if (vendor[2] == 0x444d4163) {  /* AMD. */
-      uint32_t fam = (features[0] & 0x0ff00f00);
-      if (fam >= 0x00000f00)  /* K8, K10. */
-	flags |= JIT_F_PREFER_IMUL;
-    }
     if (vendor[0] >= 7) {
       uint32_t xfeatures[4];
       lj_vm_cpuid(7, xfeatures);
       flags |= ((xfeatures[1] >> 8)&1) * JIT_F_BMI2;
     }
-#endif
   }
-  /* Check for required instruction set support on x86 (unnecessary on x64). */
-#if LJ_TARGET_X86
-  if (!(flags & JIT_F_SSE2))
-    luaL_error(L, "CPU with SSE2 required");
-#endif
+  /* Don't bother checking for SSE2 -- the VM will crash before getting here. */
+
 #elif LJ_TARGET_ARM
-#if LJ_HASJIT
+
   int ver = LJ_ARCH_VERSION;  /* Compile-time ARM CPU detection. */
 #if LJ_TARGET_LINUX
   if (ver < 70) {  /* Runtime ARM CPU detection. */
     struct utsname ut;
     uname(&ut);
     if (strncmp(ut.machine, "armv", 4) == 0) {
-      if (ut.machine[4] >= '7')
-	ver = 70;
-      else if (ut.machine[4] == '6')
-	ver = 60;
+      if (ut.machine[4] >= '8') ver = 80;
+      else if (ut.machine[4] == '7') ver = 70;
+      else if (ut.machine[4] == '6') ver = 60;
     }
   }
 #endif
@@ -696,20 +678,22 @@ static uint32_t jit_cpudetect(lua_State *L)
 	   ver >= 61 ? JIT_F_ARMV6T2_ :
 	   ver >= 60 ? JIT_F_ARMV6_ : 0;
   flags |= LJ_ARCH_HASFPU == 0 ? 0 : ver >= 70 ? JIT_F_VFPV3 : JIT_F_VFPV2;
-#endif
+
 #elif LJ_TARGET_ARM64
+
   /* No optional CPU features to detect (for now). */
+
 #elif LJ_TARGET_PPC
-#if LJ_HASJIT
+
 #if LJ_ARCH_SQRT
   flags |= JIT_F_SQRT;
 #endif
 #if LJ_ARCH_ROUND
   flags |= JIT_F_ROUND;
 #endif
-#endif
+
 #elif LJ_TARGET_MIPS
-#if LJ_HASJIT
+
   /* Compile-time MIPS CPU detection. */
 #if LJ_ARCH_VERSION >= 20
   flags |= JIT_F_MIPSXXR2;
@@ -727,31 +711,28 @@ static uint32_t jit_cpudetect(lua_State *L)
     if (x) flags |= JIT_F_MIPSXXR2;  /* Either 0x80000000 (R2) or 0 (R1). */
   }
 #endif
-#endif
+
 #else
 #error "Missing CPU detection for this architecture"
 #endif
-  UNUSED(L);
   return flags;
 }
 
 /* Initialize JIT compiler. */
 static void jit_init(lua_State *L)
 {
-  uint32_t flags = jit_cpudetect(L);
-#if LJ_HASJIT
   jit_State *J = L2J(L);
-  J->flags = flags | JIT_F_ON | JIT_F_OPT_DEFAULT;
+  J->flags = jit_cpudetect() | JIT_F_ON | JIT_F_OPT_DEFAULT;
   memcpy(J->param, jit_param_default, sizeof(J->param));
   lj_dispatch_update(G(L));
-#else
-  UNUSED(flags);
-#endif
 }
+#endif
 
 LUALIB_API int luaopen_jit(lua_State *L)
 {
+#if LJ_HASJIT
   jit_init(L);
+#endif
   lua_pushliteral(L, LJ_OS_NAME);
   lua_pushliteral(L, LJ_ARCH_NAME);
   lua_pushinteger(L, LUAJIT_VERSION_NUM);
diff --git a/src/lj_arch.h b/src/lj_arch.h
index 3bdbe84e..e853c4a4 100644
--- a/src/lj_arch.h
+++ b/src/lj_arch.h
@@ -209,13 +209,13 @@
 #define LJ_TARGET_UNIFYROT	2	/* Want only IR_BROR. */
 #define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL
 
-#if __ARM_ARCH____ARM_ARCH_8__ || __ARM_ARCH_8A__
+#if __ARM_ARCH == 8 || __ARM_ARCH_8__ || __ARM_ARCH_8A__
 #define LJ_ARCH_VERSION		80
-#elif __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
+#elif __ARM_ARCH == 7 || __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
 #define LJ_ARCH_VERSION		70
 #elif __ARM_ARCH_6T2__
 #define LJ_ARCH_VERSION		61
-#elif __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
+#elif __ARM_ARCH == 6 || __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
 #define LJ_ARCH_VERSION		60
 #else
 #define LJ_ARCH_VERSION		50
diff --git a/src/lj_asm_x86.h b/src/lj_asm_x86.h
index 86ce3937..5819fa7a 100644
--- a/src/lj_asm_x86.h
+++ b/src/lj_asm_x86.h
@@ -1222,13 +1222,8 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
     emit_rmro(as, XO_MOV, dest|REX_GC64, tab, offsetof(GCtab, node));
   } else {
     emit_rmro(as, XO_ARITH(XOg_ADD), dest|REX_GC64, tab, offsetof(GCtab,node));
-    if ((as->flags & JIT_F_PREFER_IMUL)) {
-      emit_i8(as, sizeof(Node));
-      emit_rr(as, XO_IMULi8, dest, dest);
-    } else {
-      emit_shifti(as, XOg_SHL, dest, 3);
-      emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
-    }
+    emit_shifti(as, XOg_SHL, dest, 3);
+    emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
     if (isk) {
       emit_gri(as, XG_ARITHi(XOg_AND), dest, (int32_t)khash);
       emit_rmro(as, XO_MOV, dest, tab, offsetof(GCtab, hmask));
@@ -1287,7 +1282,7 @@ static void asm_hrefk(ASMState *as, IRIns *ir)
   lj_assertA(ofs % sizeof(Node) == 0, "unaligned HREFK slot");
   if (ra_hasreg(dest)) {
     if (ofs != 0) {
-      if (dest == node && !(as->flags & JIT_F_LEA_AGU))
+      if (dest == node)
 	emit_gri(as, XG_ARITHi(XOg_ADD), dest|REX_GC64, ofs);
       else
 	emit_rmro(as, XO_LEA, dest|REX_GC64, node, ofs);
@@ -2181,8 +2176,7 @@ static void asm_add(ASMState *as, IRIns *ir)
 {
   if (irt_isnum(ir->t))
     asm_fparith(as, ir, XO_ADDSD);
-  else if ((as->flags & JIT_F_LEA_AGU) || as->flagmcp == as->mcp ||
-	   irt_is64(ir->t) || !asm_lea(as, ir))
+  else if (as->flagmcp == as->mcp || irt_is64(ir->t) || !asm_lea(as, ir))
     asm_intarith(as, ir, XOg_ADD);
 }
 
@@ -2887,7 +2881,7 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
   MCode *target, *q;
   int32_t spadj = as->T->spadjust;
   if (spadj == 0) {
-    p -= ((as->flags & JIT_F_LEA_AGU) ? 7 : 6) + (LJ_64 ? 1 : 0);
+    p -= LJ_64 ? 7 : 6;
   } else {
     MCode *p1;
     /* Patch stack adjustment. */
@@ -2899,20 +2893,11 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
       p1 = p-9;
       *(int32_t *)p1 = spadj;
     }
-    if ((as->flags & JIT_F_LEA_AGU)) {
-#if LJ_64
-      p1[-4] = 0x48;
-#endif
-      p1[-3] = (MCode)XI_LEA;
-      p1[-2] = MODRM(checki8(spadj) ? XM_OFS8 : XM_OFS32, RID_ESP, RID_ESP);
-      p1[-1] = MODRM(XM_SCALE1, RID_ESP, RID_ESP);
-    } else {
 #if LJ_64
-      p1[-3] = 0x48;
+    p1[-3] = 0x48;
 #endif
-      p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
-      p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
-    }
+    p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
+    p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
   }
   /* Patch exit branch. */
   target = lnk ? traceref(as->J, lnk)->mcode : (MCode *)lj_vm_exit_interp;
@@ -2943,7 +2928,7 @@ static void asm_tail_prep(ASMState *as)
     as->invmcp = as->mcp = p;
   } else {
     /* Leave room for ESP adjustment: add esp, imm or lea esp, [esp+imm] */
-    as->mcp = p - (((as->flags & JIT_F_LEA_AGU) ? 7 : 6)  + (LJ_64 ? 1 : 0));
+    as->mcp = p - (LJ_64 ? 7 : 6);
     as->invmcp = NULL;
   }
 }
diff --git a/src/lj_dispatch.c b/src/lj_dispatch.c
index ddee68de..a44a5adf 100644
--- a/src/lj_dispatch.c
+++ b/src/lj_dispatch.c
@@ -258,15 +258,8 @@ int luaJIT_setmode(lua_State *L, int idx, int mode)
     } else {
       if (!(mode & LUAJIT_MODE_ON))
 	G2J(g)->flags &= ~(uint32_t)JIT_F_ON;
-#if LJ_TARGET_X86ORX64
-      else if ((G2J(g)->flags & JIT_F_SSE2))
-	G2J(g)->flags |= (uint32_t)JIT_F_ON;
-      else
-	return 0;  /* Don't turn on JIT compiler without SSE2 support. */
-#else
       else
 	G2J(g)->flags |= (uint32_t)JIT_F_ON;
-#endif
       lj_dispatch_update(g);
     }
     break;
diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
index f4990151..85978027 100644
--- a/src/lj_emit_x86.h
+++ b/src/lj_emit_x86.h
@@ -561,10 +561,7 @@ static void emit_storeofs(ASMState *as, IRIns *ir, Reg r, Reg base, int32_t ofs)
 static void emit_addptr(ASMState *as, Reg r, int32_t ofs)
 {
   if (ofs) {
-    if ((as->flags & JIT_F_LEA_AGU))
-      emit_rmro(as, XO_LEA, r|REX_GC64, r, ofs);
-    else
-      emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
+    emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
   }
 }
 
diff --git a/src/lj_errmsg.h b/src/lj_errmsg.h
index 77a08cb0..19c41f0b 100644
--- a/src/lj_errmsg.h
+++ b/src/lj_errmsg.h
@@ -101,11 +101,7 @@ ERRDEF(STRGSRV,	"invalid replacement value (a %s)")
 ERRDEF(BADMODN,	"name conflict for module " LUA_QS)
 #if LJ_HASJIT
 ERRDEF(JITPROT,	"runtime code generation failed, restricted kernel?")
-#if LJ_TARGET_X86ORX64
-ERRDEF(NOJIT,	"JIT compiler disabled, CPU does not support SSE2")
-#else
 ERRDEF(NOJIT,	"JIT compiler disabled")
-#endif
 #elif defined(LJ_ARCH_NOJIT)
 ERRDEF(NOJIT,	"no JIT compiler for this architecture (yet)")
 #else
diff --git a/src/lj_jit.h b/src/lj_jit.h
index 361570a0..47df85c6 100644
--- a/src/lj_jit.h
+++ b/src/lj_jit.h
@@ -9,47 +9,49 @@
 #include "lj_obj.h"
 #include "lj_ir.h"
 
-/* JIT engine flags. */
+/* -- JIT engine flags ---------------------------------------------------- */
+
+/* General JIT engine flags. 4 bits. */
 #define JIT_F_ON		0x00000001
 
-/* CPU-specific JIT engine flags. */
+/* CPU-specific JIT engine flags. 12 bits. Flags and strings must match. */
+#define JIT_F_CPU		0x00000010
+
 #if LJ_TARGET_X86ORX64
-#define JIT_F_SSE2		0x00000010
-#define JIT_F_SSE3		0x00000020
-#define JIT_F_SSE4_1		0x00000040
-#define JIT_F_PREFER_IMUL	0x00000080
-#define JIT_F_LEA_AGU		0x00000100
-#define JIT_F_BMI2		0x00000200
-
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_SSE2
-#define JIT_F_CPUSTRING		"\4SSE2\4SSE3\6SSE4.1\3AMD\4ATOM\4BMI2"
+
+#define JIT_F_SSE3		(JIT_F_CPU << 0)
+#define JIT_F_SSE4_1		(JIT_F_CPU << 1)
+#define JIT_F_BMI2		(JIT_F_CPU << 2)
+
+
+#define JIT_F_CPUSTRING		"\4SSE3\6SSE4.1\4BMI2"
+
 #elif LJ_TARGET_ARM
-#define JIT_F_ARMV6_		0x00000010
-#define JIT_F_ARMV6T2_		0x00000020
-#define JIT_F_ARMV7		0x00000040
-#define JIT_F_VFPV2		0x00000080
-#define JIT_F_VFPV3		0x00000100
-
-#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7)
-#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7)
+
+#define JIT_F_ARMV6_		(JIT_F_CPU << 0)
+#define JIT_F_ARMV6T2_		(JIT_F_CPU << 1)
+#define JIT_F_ARMV7		(JIT_F_CPU << 2)
+#define JIT_F_ARMV8		(JIT_F_CPU << 3)
+#define JIT_F_VFPV2		(JIT_F_CPU << 4)
+#define JIT_F_VFPV3		(JIT_F_CPU << 5)
+
+#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
+#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
 #define JIT_F_VFP		(JIT_F_VFPV2|JIT_F_VFPV3)
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_ARMV6_
-#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5VFPv2\5VFPv3"
+#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5ARMv8\5VFPv2\5VFPv3"
+
 #elif LJ_TARGET_PPC
-#define JIT_F_SQRT		0x00000010
-#define JIT_F_ROUND		0x00000020
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_SQRT
+#define JIT_F_SQRT		(JIT_F_CPU << 0)
+#define JIT_F_ROUND		(JIT_F_CPU << 1)
+
 #define JIT_F_CPUSTRING		"\4SQRT\5ROUND"
+
 #elif LJ_TARGET_MIPS
-#define JIT_F_MIPSXXR2		0x00000010
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_MIPSXXR2
+#define JIT_F_MIPSXXR2		(JIT_F_CPU << 0)
+
 #if LJ_TARGET_MIPS32
 #if LJ_TARGET_MIPSR6
 #define JIT_F_CPUSTRING		"\010MIPS32R6"
@@ -63,27 +65,29 @@
 #define JIT_F_CPUSTRING		"\010MIPS64R2"
 #endif
 #endif
+
 #else
-#define JIT_F_CPU_FIRST		0
+
 #define JIT_F_CPUSTRING		""
+
 #endif
 
-/* Optimization flags. */
+/* Optimization flags. 12 bits. */
+#define JIT_F_OPT		0x00010000
 #define JIT_F_OPT_MASK		0x0fff0000
 
-#define JIT_F_OPT_FOLD		0x00010000
-#define JIT_F_OPT_CSE		0x00020000
-#define JIT_F_OPT_DCE		0x00040000
-#define JIT_F_OPT_FWD		0x00080000
-#define JIT_F_OPT_DSE		0x00100000
-#define JIT_F_OPT_NARROW	0x00200000
-#define JIT_F_OPT_LOOP		0x00400000
-#define JIT_F_OPT_ABC		0x00800000
-#define JIT_F_OPT_SINK		0x01000000
-#define JIT_F_OPT_FUSE		0x02000000
+#define JIT_F_OPT_FOLD		(JIT_F_OPT << 0)
+#define JIT_F_OPT_CSE		(JIT_F_OPT << 1)
+#define JIT_F_OPT_DCE		(JIT_F_OPT << 2)
+#define JIT_F_OPT_FWD		(JIT_F_OPT << 3)
+#define JIT_F_OPT_DSE		(JIT_F_OPT << 4)
+#define JIT_F_OPT_NARROW	(JIT_F_OPT << 5)
+#define JIT_F_OPT_LOOP		(JIT_F_OPT << 6)
+#define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
+#define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
+#define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
 
 /* Optimizations names for -O. Must match the order above. */
-#define JIT_F_OPT_FIRST		JIT_F_OPT_FOLD
 #define JIT_F_OPTSTRING	\
   "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
 
@@ -95,6 +99,8 @@
   JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
 #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
 
+/* -- JIT engine parameters ----------------------------------------------- */
+
 #if LJ_TARGET_WINDOWS || LJ_64
 /* See: http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx */
 #define JIT_P_sizemcode_DEFAULT		64
@@ -137,6 +143,8 @@ JIT_PARAMDEF(JIT_PARAMENUM)
 #define JIT_PARAMSTR(len, name, value)	#len #name
 #define JIT_P_STRING	JIT_PARAMDEF(JIT_PARAMSTR)
 
+/* -- JIT engine data structures ------------------------------------------ */
+
 /* Trace compiler state. */
 typedef enum {
   LJ_TRACE_IDLE,	/* Trace compiler idle. */
diff --git a/src/ljamalg.c b/src/ljamalg.c
index 0ffc7e81..63b4ec87 100644
--- a/src/ljamalg.c
+++ b/src/ljamalg.c
@@ -3,16 +3,6 @@
 ** Copyright (C) 2005-2017 Mike Pall. See Copyright Notice in luajit.h
 */
 
-/*
-+--------------------------------------------------------------------------+
-| WARNING: Compiling the amalgamation needs a lot of virtual memory        |
-| (around 300 MB with GCC 4.x)! If you don't have enough physical memory   |
-| your machine will start swapping to disk and the compile will not finish |
-| within a reasonable amount of time.                                      |
-| So either compile on a bigger machine or use the non-amalgamated build.  |
-+--------------------------------------------------------------------------+
-*/
-
 #define ljamalg_c
 #define LUA_CORE
 
-- 
2.47.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.
  2025-01-14 11:06 [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs Sergey Kaplun via Tarantool-patches
@ 2025-01-14 11:06 ` Sergey Kaplun via Tarantool-patches
  2025-01-14 12:45   ` Sergey Bronnikov via Tarantool-patches
  2025-01-31  9:28 ` [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
  2 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-14 11:06 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

From: Mike Pall <mike>

See the discussion in the corresponding ticket for the rationale.

(cherry picked from commit de2e1ca9d3d87e74c0c20c1e4ad3c32b31a5875b)

For the modulo operation, the arm64 VM uses `fmsub` [1] instruction,
which is the fused multiply-add (FMA [2]) operation (more precisely,
multiply-sub). Hence, it may produce different results compared to the
unfused one. This patch fixes the behaviour by using the unfused
instructions by default. However, the new JIT optimization flag (fma) is
introduced to make it possible to take advantage of the FMA
optimizations.

Sergey Kaplun:
* added the description and the test for the problem

[1]: https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
[2]: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation

Part of tarantool/tarantool#10709
---

I intentionally avoid mentioning the ticket in the commit message to
avoid excess mentioning in the LuaJIT issue tracker. You can see the
LuaJIT/LuaJIT#918 link in the cover letter.

 doc/running.html                              |  8 +++++
 src/lj_asm_arm.h                              |  6 +++-
 src/lj_asm_arm64.h                            |  3 +-
 src/lj_asm_ppc.h                              |  3 +-
 src/lj_jit.h                                  |  4 ++-
 src/lj_vmmath.c                               | 13 ++++++-
 src/vm_arm64.dasc                             |  4 ++-
 ...lj-918-fma-numerical-accuracy-jit.test.lua | 36 +++++++++++++++++++
 .../lj-918-fma-numerical-accuracy.test.lua    | 31 ++++++++++++++++
 .../lj-918-fma-optimization.test.lua          | 25 +++++++++++++
 10 files changed, 127 insertions(+), 6 deletions(-)
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-optimization.test.lua

diff --git a/doc/running.html b/doc/running.html
index 7868efab..1cf41f1b 100644
--- a/doc/running.html
+++ b/doc/running.html
@@ -226,6 +226,12 @@ mix the three forms, but note that setting an optimization level
 overrides all earlier flags.
 </p>
 <p>
+Note that <tt>-Ofma</tt> is not enabled by default at any level,
+because it affects floating-point result accuracy. Only enable this,
+if you fully understand the trade-offs of FMA for performance (higher),
+determinism (lower) and numerical accuracy (higher).
+</p>
+<p>
 Here are the available flags and at what optimization levels they
 are enabled:
 </p>
@@ -257,6 +263,8 @@ are enabled:
 <td class="flag_name">sink</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Allocation/Store Sinking</td></tr>
 <tr class="even">
 <td class="flag_name">fuse</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Fusion of operands into instructions</td></tr>
+<tr class="odd">
+<td class="flag_name">fma </td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_desc">Fused multiply-add</td></tr>
 </table>
 <p>
 Here are the parameters and their default settings:
diff --git a/src/lj_asm_arm.h b/src/lj_asm_arm.h
index 5a0f925f..041cd794 100644
--- a/src/lj_asm_arm.h
+++ b/src/lj_asm_arm.h
@@ -310,7 +310,11 @@ static void asm_fusexref(ASMState *as, ARMIns ai, Reg rd, IRRef ref,
 }
 
 #if !LJ_SOFTFP
-/* Fuse to multiply-add/sub instruction. */
+/*
+** Fuse to multiply-add/sub instruction.
+** VMLA rounds twice (UMA, not FMA) -- no need to check for JIT_F_OPT_FMA.
+** VFMA needs VFPv4, which is uncommon on the remaining ARM32 targets.
+*/
 static int asm_fusemadd(ASMState *as, IRIns *ir, ARMIns ai, ARMIns air)
 {
   IRRef lref = ir->op1, rref = ir->op2;
diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
index 88b47ceb..554bb60a 100644
--- a/src/lj_asm_arm64.h
+++ b/src/lj_asm_arm64.h
@@ -334,7 +334,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, A64Ins ai, A64Ins air)
 {
   IRRef lref = ir->op1, rref = ir->op2;
   IRIns *irm;
-  if (lref != rref &&
+  if ((as->flags & JIT_F_OPT_FMA) &&
+      lref != rref &&
       ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
        ra_noreg(irm->r)) ||
        (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
diff --git a/src/lj_asm_ppc.h b/src/lj_asm_ppc.h
index 7bba71b3..52db2926 100644
--- a/src/lj_asm_ppc.h
+++ b/src/lj_asm_ppc.h
@@ -232,7 +232,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, PPCIns pi, PPCIns pir)
 {
   IRRef lref = ir->op1, rref = ir->op2;
   IRIns *irm;
-  if (lref != rref &&
+  if ((as->flags & JIT_F_OPT_FMA) &&
+      lref != rref &&
       ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
 	ra_noreg(irm->r)) ||
        (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
diff --git a/src/lj_jit.h b/src/lj_jit.h
index 47df85c6..73c355b9 100644
--- a/src/lj_jit.h
+++ b/src/lj_jit.h
@@ -86,10 +86,11 @@
 #define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
 #define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
 #define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
+#define JIT_F_OPT_FMA		(JIT_F_OPT << 10)
 
 /* Optimizations names for -O. Must match the order above. */
 #define JIT_F_OPTSTRING	\
-  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
+  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse\3fma"
 
 /* Optimization levels set a fixed combination of flags. */
 #define JIT_F_OPT_0	0
@@ -98,6 +99,7 @@
 #define JIT_F_OPT_3	(JIT_F_OPT_2|\
   JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
 #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
+/* Note: FMA is not set by default. */
 
 /* -- JIT engine parameters ----------------------------------------------- */
 
diff --git a/src/lj_vmmath.c b/src/lj_vmmath.c
index faebe719..29b72e0c 100644
--- a/src/lj_vmmath.c
+++ b/src/lj_vmmath.c
@@ -36,6 +36,17 @@ LJ_FUNCA double lj_wrap_fmod(double x, double y) { return fmod(x, y); }
 
 /* -- Helper functions ---------------------------------------------------- */
 
+/* Required to prevent the C compiler from applying FMA optimizations.
+**
+** Yes, there's -ffp-contract and the FP_CONTRACT pragma ... in theory.
+** But the current state of C compilers is a mess in this regard.
+** Also, this function is not performance sensitive at all.
+*/
+LJ_NOINLINE static double lj_vm_floormul(double x, double y)
+{
+  return lj_vm_floor(x / y) * y;
+}
+
 double lj_vm_foldarith(double x, double y, int op)
 {
   switch (op) {
@@ -43,7 +54,7 @@ double lj_vm_foldarith(double x, double y, int op)
   case IR_SUB - IR_ADD: return x-y; break;
   case IR_MUL - IR_ADD: return x*y; break;
   case IR_DIV - IR_ADD: return x/y; break;
-  case IR_MOD - IR_ADD: return x-lj_vm_floor(x/y)*y; break;
+  case IR_MOD - IR_ADD: return x-lj_vm_floormul(x, y); break;
   case IR_POW - IR_ADD: return pow(x, y); break;
   case IR_NEG - IR_ADD: return -x; break;
   case IR_ABS - IR_ADD: return fabs(x); break;
diff --git a/src/vm_arm64.dasc b/src/vm_arm64.dasc
index 1cf1ea51..c5f0a7a7 100644
--- a/src/vm_arm64.dasc
+++ b/src/vm_arm64.dasc
@@ -2581,7 +2581,9 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
     |.macro ins_arithmod, res, reg1, reg2
     |  fdiv d2, reg1, reg2
     |  frintm d2, d2
-    |  fmsub res, d2, reg2, reg1
+    |  // Cannot use fmsub, because FMA is not enabled by default.
+    |  fmul d2, d2, reg2
+    |  fsub res, reg1, d2
     |.endmacro
     |
     |.macro ins_arithdn, intins, fpins
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
new file mode 100644
index 00000000..55ec7b98
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
@@ -0,0 +1,36 @@
+local tap = require('tap')
+
+-- Test file to demonstrate consistent behaviour for JIT and the
+-- VM regarding FMA optimization (disabled by default).
+-- XXX: The VM behaviour is checked in the
+-- <lj-918-fma-numerical-accuracy.test.lua>.
+-- See also: https://github.com/LuaJIT/LuaJIT/issues/918.
+local test = tap.test('lj-918-fma-numerical-accuracy-jit'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(1)
+
+local _2pow52 = 2 ^ 52
+
+-- IEEE754 components to double:
+-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
+local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
+assert(a == 2197541395358679800)
+
+local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
+assert(b == -1005065126.3690554)
+
+local results = {}
+
+jit.opt.start('hotloop=1')
+for i = 1, 4 do
+  results[i] = a % b
+end
+
+-- XXX: The test doesn't fail before the commit. But it is
+-- required to be sure that there are no inconsistencies after the
+-- commit.
+test:samevalues(results, 'consistent behaviour between the JIT and the VM')
+
+test:done(true)
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
new file mode 100644
index 00000000..a3775d6d
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
@@ -0,0 +1,31 @@
+local tap = require('tap')
+
+-- Test file to demonstrate possible numerical inaccuracy if FMA
+-- optimization takes place.
+-- XXX: The JIT consistency is checked in the
+-- <lj-918-fma-numerical-accuracy-jit.test.lua>.
+-- See also: https://github.com/LuaJIT/LuaJIT/issues/918.
+local test = tap.test('lj-918-fma-numerical-accuracy')
+
+test:plan(2)
+
+local _2pow52 = 2 ^ 52
+
+-- IEEE754 components to double:
+-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
+local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
+assert(a == 2197541395358679800)
+
+local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
+assert(b == -1005065126.3690554)
+
+-- These tests fail on ARM64 before the patch or with FMA
+-- optimization enabled.
+-- The first test may not fail if the compiler doesn't generate
+-- an ARM64 FMA operation in `lj_vm_foldarith()`.
+test:is(2197541395358679800 % -1005065126.3690554, -606337536,
+        'FMA in the lj_vm_foldarith() during parsing')
+
+test:is(a % b, -606337536, 'FMA in the VM')
+
+test:done(true)
diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
new file mode 100644
index 00000000..af749eb5
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
@@ -0,0 +1,25 @@
+local tap = require('tap')
+local test = tap.test('lj-918-fma-optimization'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(3)
+
+local function jit_opt_is_on(needed)
+  for _, opt in ipairs({jit.status()}) do
+    if opt == needed then
+      return true
+    end
+  end
+  return false
+end
+
+test:ok(not jit_opt_is_on('fma'), 'FMA is disabled by default')
+
+local ok, _ = pcall(jit.opt.start, '+fma')
+
+test:ok(ok, 'fma flag is recognized')
+
+test:ok(jit_opt_is_on('fma'), 'FMA is enabled after jit.opt.start()')
+
+test:done(true)
-- 
2.47.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs.
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs Sergey Kaplun via Tarantool-patches
@ 2025-01-14 11:25   ` Sergey Bronnikov via Tarantool-patches
  2025-01-15 13:10     ` Sergey Kaplun via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-01-14 11:25 UTC (permalink / raw)
  To: Sergey Kaplun, tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 19315 bytes --]

Hi, Sergey,

thanks for the patch!

LGTM with a minor comment

Sergey

On 13.01.2025 18:17, Sergey Kaplun wrote:
> From: Mike Pall <mike>
>
> (cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)
>
> This patch does the following refactoring:
> 1) Drops optimizations for the Intel Atom CPU [1]: removes the
>     `JIT_F_LEA_AGU` flag and related optimizations. The considerations
>     for the use of LEA are complex and very CPU-specific, mostly
>     dependent on the number of operands. Mostly, it isn't worth it due to
>     the extra register pressure and/or extra instructions.

I would say explicitly that `JIT_F_LEA_AGU` is used in "Well, yes, that 
applies to the original and obsolete Atom architecture. Today "Intel 
Atom" is just a trade name for reduced-performance implementations of 
the current Intel architecture."

as Mike explained in LUAJIT#24. So there are no any risks for tarantool 
users

regarding performance degradation.

> 2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
>     `JIT_F_PREFER_IMUL` flag and related optimizations.
> 3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
>     JIT flags are defined as the left shift of `JIT_F_CPU` instead of
>     hardcoded constants, similar for the optimization flags.
> 4) Adds detection of the ARM8 CPU.
> 5) Drops the check for SSE2 since the VM already presumes CPU supports
>     it.
> 6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
> 7) Drops outdated comment in the amalgamation file about memory
>     requirements.
>
> Sergey Kaplun:
> * added the description for the patch
>
> [1]:https://en.wikipedia.org/wiki/Intel_Atom
> [2]:https://en.wikipedia.org/wiki/AMD_K8
> [3]:https://en.wikipedia.org/wiki/AMD_K10
> [4]:https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros
>
> Part of tarantool/tarantool#10709
> ---
>   src/Makefile.original |  1 -
>   src/lib_jit.c         | 65 +++++++++++-------------------
>   src/lj_arch.h         |  6 +--
>   src/lj_asm_x86.h      | 33 +++++----------
>   src/lj_dispatch.c     |  7 ----
>   src/lj_emit_x86.h     |  5 +--
>   src/lj_errmsg.h       |  4 --
>   src/lj_jit.h          | 94 +++++++++++++++++++++++--------------------
>   src/ljamalg.c         | 10 -----
>   9 files changed, 87 insertions(+), 138 deletions(-)
>
> diff --git a/src/Makefile.original b/src/Makefile.original
> index 9f55fa32..8d925e3a 100644
> --- a/src/Makefile.original
> +++ b/src/Makefile.original
> @@ -621,7 +621,6 @@ E= @echo
>   default all:	$(TARGET_T)
>   
>   amalg:
> -	@grep "^[+|]" ljamalg.c
>   	$(MAKE) -f Makefile.original all "LJCORE_O=ljamalg.o"
>   
>   clean:
> diff --git a/src/lib_jit.c b/src/lib_jit.c
> index f705f334..9f870f68 100644
> --- a/src/lib_jit.c
> +++ b/src/lib_jit.c
> @@ -104,8 +104,8 @@ LJLIB_CF(jit_status)
>     jit_State *J = L2J(L);
>     L->top = L->base;
>     setboolV(L->top++, (J->flags & JIT_F_ON) ? 1 : 0);
> -  flagbits_to_strings(L, J->flags, JIT_F_CPU_FIRST, JIT_F_CPUSTRING);
> -  flagbits_to_strings(L, J->flags, JIT_F_OPT_FIRST, JIT_F_OPTSTRING);
> +  flagbits_to_strings(L, J->flags, JIT_F_CPU, JIT_F_CPUSTRING);
> +  flagbits_to_strings(L, J->flags, JIT_F_OPT, JIT_F_OPTSTRING);
>     return (int)(L->top - L->base);
>   #else
>     setboolV(L->top++, 0);
> @@ -467,7 +467,7 @@ static int jitopt_flag(jit_State *J, const char *str)
>       str += str[2] == '-' ? 3 : 2;
>       set = 0;
>     }
> -  for (opt = JIT_F_OPT_FIRST; ; opt <<= 1) {
> +  for (opt = JIT_F_OPT; ; opt <<= 1) {
>       size_t len = *(const uint8_t *)lst;
>       if (len == 0)
>         break;
> @@ -636,59 +636,41 @@ JIT_PARAMDEF(JIT_PARAMINIT)
>   #undef JIT_PARAMINIT
>     0
>   };
> -#endif
>   
>   #if LJ_TARGET_ARM && LJ_TARGET_LINUX
>   #include <sys/utsname.h>
>   #endif
>   
> -/* Arch-dependent CPU detection. */
> -static uint32_t jit_cpudetect(lua_State *L)
> +/* Arch-dependent CPU feature detection. */
> +static uint32_t jit_cpudetect(void)
>   {
>     uint32_t flags = 0;
>   #if LJ_TARGET_X86ORX64
> +
>     uint32_t vendor[4];
>     uint32_t features[4];
>     if (lj_vm_cpuid(0, vendor) && lj_vm_cpuid(1, features)) {
> -#if !LJ_HASJIT
> -#define JIT_F_SSE2	2
> -#endif
> -    flags |= ((features[3] >> 26)&1) * JIT_F_SSE2;
> -#if LJ_HASJIT
>       flags |= ((features[2] >> 0)&1) * JIT_F_SSE3;
>       flags |= ((features[2] >> 19)&1) * JIT_F_SSE4_1;
> -    if (vendor[2] == 0x6c65746e) {  /* Intel. */
> -      if ((features[0] & 0x0fff0ff0) == 0x000106c0)  /* Atom. */
> -	flags |= JIT_F_LEA_AGU;
> -    } else if (vendor[2] == 0x444d4163) {  /* AMD. */
> -      uint32_t fam = (features[0] & 0x0ff00f00);
> -      if (fam >= 0x00000f00)  /* K8, K10. */
> -	flags |= JIT_F_PREFER_IMUL;
> -    }
>       if (vendor[0] >= 7) {
>         uint32_t xfeatures[4];
>         lj_vm_cpuid(7, xfeatures);
>         flags |= ((xfeatures[1] >> 8)&1) * JIT_F_BMI2;
>       }
> -#endif
>     }
> -  /* Check for required instruction set support on x86 (unnecessary on x64). */
> -#if LJ_TARGET_X86
> -  if (!(flags & JIT_F_SSE2))
> -    luaL_error(L, "CPU with SSE2 required");
> -#endif
> +  /* Don't bother checking for SSE2 -- the VM will crash before getting here. */
> +
>   #elif LJ_TARGET_ARM
> -#if LJ_HASJIT
> +
>     int ver = LJ_ARCH_VERSION;  /* Compile-time ARM CPU detection. */
>   #if LJ_TARGET_LINUX
>     if (ver < 70) {  /* Runtime ARM CPU detection. */
>       struct utsname ut;
>       uname(&ut);
>       if (strncmp(ut.machine, "armv", 4) == 0) {
> -      if (ut.machine[4] >= '7')
> -	ver = 70;
> -      else if (ut.machine[4] == '6')
> -	ver = 60;
> +      if (ut.machine[4] >= '8') ver = 80;
> +      else if (ut.machine[4] == '7') ver = 70;
> +      else if (ut.machine[4] == '6') ver = 60;
>       }
>     }
>   #endif
> @@ -696,20 +678,22 @@ static uint32_t jit_cpudetect(lua_State *L)
>   	   ver >= 61 ? JIT_F_ARMV6T2_ :
>   	   ver >= 60 ? JIT_F_ARMV6_ : 0;
>     flags |= LJ_ARCH_HASFPU == 0 ? 0 : ver >= 70 ? JIT_F_VFPV3 : JIT_F_VFPV2;
> -#endif
> +
>   #elif LJ_TARGET_ARM64
> +
>     /* No optional CPU features to detect (for now). */
> +
>   #elif LJ_TARGET_PPC
> -#if LJ_HASJIT
> +
>   #if LJ_ARCH_SQRT
>     flags |= JIT_F_SQRT;
>   #endif
>   #if LJ_ARCH_ROUND
>     flags |= JIT_F_ROUND;
>   #endif
> -#endif
> +
>   #elif LJ_TARGET_MIPS
> -#if LJ_HASJIT
> +
>     /* Compile-time MIPS CPU detection. */
>   #if LJ_ARCH_VERSION >= 20
>     flags |= JIT_F_MIPSXXR2;
> @@ -727,31 +711,28 @@ static uint32_t jit_cpudetect(lua_State *L)
>       if (x) flags |= JIT_F_MIPSXXR2;  /* Either 0x80000000 (R2) or 0 (R1). */
>     }
>   #endif
> -#endif
> +
>   #else
>   #error "Missing CPU detection for this architecture"
>   #endif
> -  UNUSED(L);
>     return flags;
>   }
>   
>   /* Initialize JIT compiler. */
>   static void jit_init(lua_State *L)
>   {
> -  uint32_t flags = jit_cpudetect(L);
> -#if LJ_HASJIT
>     jit_State *J = L2J(L);
> -  J->flags = flags | JIT_F_ON | JIT_F_OPT_DEFAULT;
> +  J->flags = jit_cpudetect() | JIT_F_ON | JIT_F_OPT_DEFAULT;
>     memcpy(J->param, jit_param_default, sizeof(J->param));
>     lj_dispatch_update(G(L));
> -#else
> -  UNUSED(flags);
> -#endif
>   }
> +#endif
>   
>   LUALIB_API int luaopen_jit(lua_State *L)
>   {
> +#if LJ_HASJIT
>     jit_init(L);
> +#endif
>     lua_pushliteral(L, LJ_OS_NAME);
>     lua_pushliteral(L, LJ_ARCH_NAME);
>     lua_pushinteger(L, LUAJIT_VERSION_NUM);
> diff --git a/src/lj_arch.h b/src/lj_arch.h
> index 3bdbe84e..e853c4a4 100644
> --- a/src/lj_arch.h
> +++ b/src/lj_arch.h
> @@ -209,13 +209,13 @@
>   #define LJ_TARGET_UNIFYROT	2	/* Want only IR_BROR. */
>   #define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL
>   
> -#if __ARM_ARCH____ARM_ARCH_8__ || __ARM_ARCH_8A__
> +#if __ARM_ARCH == 8 || __ARM_ARCH_8__ || __ARM_ARCH_8A__
>   #define LJ_ARCH_VERSION		80
> -#elif __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
> +#elif __ARM_ARCH == 7 || __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
>   #define LJ_ARCH_VERSION		70
>   #elif __ARM_ARCH_6T2__
>   #define LJ_ARCH_VERSION		61
> -#elif __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
> +#elif __ARM_ARCH == 6 || __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
>   #define LJ_ARCH_VERSION		60
>   #else
>   #define LJ_ARCH_VERSION		50
> diff --git a/src/lj_asm_x86.h b/src/lj_asm_x86.h
> index 86ce3937..5819fa7a 100644
> --- a/src/lj_asm_x86.h
> +++ b/src/lj_asm_x86.h
> @@ -1222,13 +1222,8 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
>       emit_rmro(as, XO_MOV, dest|REX_GC64, tab, offsetof(GCtab, node));
>     } else {
>       emit_rmro(as, XO_ARITH(XOg_ADD), dest|REX_GC64, tab, offsetof(GCtab,node));
> -    if ((as->flags & JIT_F_PREFER_IMUL)) {
> -      emit_i8(as, sizeof(Node));
> -      emit_rr(as, XO_IMULi8, dest, dest);
> -    } else {
> -      emit_shifti(as, XOg_SHL, dest, 3);
> -      emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
> -    }
> +    emit_shifti(as, XOg_SHL, dest, 3);
> +    emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
>       if (isk) {
>         emit_gri(as, XG_ARITHi(XOg_AND), dest, (int32_t)khash);
>         emit_rmro(as, XO_MOV, dest, tab, offsetof(GCtab, hmask));
> @@ -1287,7 +1282,7 @@ static void asm_hrefk(ASMState *as, IRIns *ir)
>     lj_assertA(ofs % sizeof(Node) == 0, "unaligned HREFK slot");
>     if (ra_hasreg(dest)) {
>       if (ofs != 0) {
> -      if (dest == node && !(as->flags & JIT_F_LEA_AGU))
> +      if (dest == node)
>   	emit_gri(as, XG_ARITHi(XOg_ADD), dest|REX_GC64, ofs);
>         else
>   	emit_rmro(as, XO_LEA, dest|REX_GC64, node, ofs);
> @@ -2181,8 +2176,7 @@ static void asm_add(ASMState *as, IRIns *ir)
>   {
>     if (irt_isnum(ir->t))
>       asm_fparith(as, ir, XO_ADDSD);
> -  else if ((as->flags & JIT_F_LEA_AGU) || as->flagmcp == as->mcp ||
> -	   irt_is64(ir->t) || !asm_lea(as, ir))
> +  else if (as->flagmcp == as->mcp || irt_is64(ir->t) || !asm_lea(as, ir))
>       asm_intarith(as, ir, XOg_ADD);
>   }
>   
> @@ -2887,7 +2881,7 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
>     MCode *target, *q;
>     int32_t spadj = as->T->spadjust;
>     if (spadj == 0) {
> -    p -= ((as->flags & JIT_F_LEA_AGU) ? 7 : 6) + (LJ_64 ? 1 : 0);
> +    p -= LJ_64 ? 7 : 6;
>     } else {
>       MCode *p1;
>       /* Patch stack adjustment. */
> @@ -2899,20 +2893,11 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
>         p1 = p-9;
>         *(int32_t *)p1 = spadj;
>       }
> -    if ((as->flags & JIT_F_LEA_AGU)) {
> -#if LJ_64
> -      p1[-4] = 0x48;
> -#endif
> -      p1[-3] = (MCode)XI_LEA;
> -      p1[-2] = MODRM(checki8(spadj) ? XM_OFS8 : XM_OFS32, RID_ESP, RID_ESP);
> -      p1[-1] = MODRM(XM_SCALE1, RID_ESP, RID_ESP);
> -    } else {
>   #if LJ_64
> -      p1[-3] = 0x48;
> +    p1[-3] = 0x48;
>   #endif
> -      p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
> -      p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
> -    }
> +    p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
> +    p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
>     }
>     /* Patch exit branch. */
>     target = lnk ? traceref(as->J, lnk)->mcode : (MCode *)lj_vm_exit_interp;
> @@ -2943,7 +2928,7 @@ static void asm_tail_prep(ASMState *as)
>       as->invmcp = as->mcp = p;
>     } else {
>       /* Leave room for ESP adjustment: add esp, imm or lea esp, [esp+imm] */
> -    as->mcp = p - (((as->flags & JIT_F_LEA_AGU) ? 7 : 6)  + (LJ_64 ? 1 : 0));
> +    as->mcp = p - (LJ_64 ? 7 : 6);
>       as->invmcp = NULL;
>     }
>   }
> diff --git a/src/lj_dispatch.c b/src/lj_dispatch.c
> index ddee68de..a44a5adf 100644
> --- a/src/lj_dispatch.c
> +++ b/src/lj_dispatch.c
> @@ -258,15 +258,8 @@ int luaJIT_setmode(lua_State *L, int idx, int mode)
>       } else {
>         if (!(mode & LUAJIT_MODE_ON))
>   	G2J(g)->flags &= ~(uint32_t)JIT_F_ON;
> -#if LJ_TARGET_X86ORX64
> -      else if ((G2J(g)->flags & JIT_F_SSE2))
> -	G2J(g)->flags |= (uint32_t)JIT_F_ON;
> -      else
> -	return 0;  /* Don't turn on JIT compiler without SSE2 support. */
> -#else
>         else
>   	G2J(g)->flags |= (uint32_t)JIT_F_ON;
> -#endif
>         lj_dispatch_update(g);
>       }
>       break;
> diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
> index f4990151..85978027 100644
> --- a/src/lj_emit_x86.h
> +++ b/src/lj_emit_x86.h
> @@ -561,10 +561,7 @@ static void emit_storeofs(ASMState *as, IRIns *ir, Reg r, Reg base, int32_t ofs)
>   static void emit_addptr(ASMState *as, Reg r, int32_t ofs)
>   {
>     if (ofs) {
> -    if ((as->flags & JIT_F_LEA_AGU))
> -      emit_rmro(as, XO_LEA, r|REX_GC64, r, ofs);
> -    else
> -      emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
> +    emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
>     }
>   }
>   
> diff --git a/src/lj_errmsg.h b/src/lj_errmsg.h
> index 77a08cb0..19c41f0b 100644
> --- a/src/lj_errmsg.h
> +++ b/src/lj_errmsg.h
> @@ -101,11 +101,7 @@ ERRDEF(STRGSRV,	"invalid replacement value (a %s)")
>   ERRDEF(BADMODN,	"name conflict for module " LUA_QS)
>   #if LJ_HASJIT
>   ERRDEF(JITPROT,	"runtime code generation failed, restricted kernel?")
> -#if LJ_TARGET_X86ORX64
> -ERRDEF(NOJIT,	"JIT compiler disabled, CPU does not support SSE2")
> -#else
>   ERRDEF(NOJIT,	"JIT compiler disabled")
> -#endif
>   #elif defined(LJ_ARCH_NOJIT)
>   ERRDEF(NOJIT,	"no JIT compiler for this architecture (yet)")
>   #else
> diff --git a/src/lj_jit.h b/src/lj_jit.h
> index 361570a0..47df85c6 100644
> --- a/src/lj_jit.h
> +++ b/src/lj_jit.h
> @@ -9,47 +9,49 @@
>   #include "lj_obj.h"
>   #include "lj_ir.h"
>   
> -/* JIT engine flags. */
> +/* -- JIT engine flags ---------------------------------------------------- */
> +
> +/* General JIT engine flags. 4 bits. */
>   #define JIT_F_ON		0x00000001
>   
> -/* CPU-specific JIT engine flags. */
> +/* CPU-specific JIT engine flags. 12 bits. Flags and strings must match. */
> +#define JIT_F_CPU		0x00000010
> +
>   #if LJ_TARGET_X86ORX64
> -#define JIT_F_SSE2		0x00000010
> -#define JIT_F_SSE3		0x00000020
> -#define JIT_F_SSE4_1		0x00000040
> -#define JIT_F_PREFER_IMUL	0x00000080
> -#define JIT_F_LEA_AGU		0x00000100
> -#define JIT_F_BMI2		0x00000200
> -
> -/* Names for the CPU-specific flags. Must match the order above. */
> -#define JIT_F_CPU_FIRST		JIT_F_SSE2
> -#define JIT_F_CPUSTRING		"\4SSE2\4SSE3\6SSE4.1\3AMD\4ATOM\4BMI2"
> +
> +#define JIT_F_SSE3		(JIT_F_CPU << 0)
> +#define JIT_F_SSE4_1		(JIT_F_CPU << 1)
> +#define JIT_F_BMI2		(JIT_F_CPU << 2)
> +
> +
> +#define JIT_F_CPUSTRING		"\4SSE3\6SSE4.1\4BMI2"
> +
>   #elif LJ_TARGET_ARM
> -#define JIT_F_ARMV6_		0x00000010
> -#define JIT_F_ARMV6T2_		0x00000020
> -#define JIT_F_ARMV7		0x00000040
> -#define JIT_F_VFPV2		0x00000080
> -#define JIT_F_VFPV3		0x00000100
> -
> -#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7)
> -#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7)
> +
> +#define JIT_F_ARMV6_		(JIT_F_CPU << 0)
> +#define JIT_F_ARMV6T2_		(JIT_F_CPU << 1)
> +#define JIT_F_ARMV7		(JIT_F_CPU << 2)
> +#define JIT_F_ARMV8		(JIT_F_CPU << 3)
> +#define JIT_F_VFPV2		(JIT_F_CPU << 4)
> +#define JIT_F_VFPV3		(JIT_F_CPU << 5)
> +
> +#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
> +#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
>   #define JIT_F_VFP		(JIT_F_VFPV2|JIT_F_VFPV3)
>   
> -/* Names for the CPU-specific flags. Must match the order above. */
> -#define JIT_F_CPU_FIRST		JIT_F_ARMV6_
> -#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5VFPv2\5VFPv3"
> +#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5ARMv8\5VFPv2\5VFPv3"
> +
>   #elif LJ_TARGET_PPC
> -#define JIT_F_SQRT		0x00000010
> -#define JIT_F_ROUND		0x00000020
>   
> -/* Names for the CPU-specific flags. Must match the order above. */
> -#define JIT_F_CPU_FIRST		JIT_F_SQRT
> +#define JIT_F_SQRT		(JIT_F_CPU << 0)
> +#define JIT_F_ROUND		(JIT_F_CPU << 1)
> +
>   #define JIT_F_CPUSTRING		"\4SQRT\5ROUND"
> +
>   #elif LJ_TARGET_MIPS
> -#define JIT_F_MIPSXXR2		0x00000010
>   
> -/* Names for the CPU-specific flags. Must match the order above. */
> -#define JIT_F_CPU_FIRST		JIT_F_MIPSXXR2
> +#define JIT_F_MIPSXXR2		(JIT_F_CPU << 0)
> +
>   #if LJ_TARGET_MIPS32
>   #if LJ_TARGET_MIPSR6
>   #define JIT_F_CPUSTRING		"\010MIPS32R6"
> @@ -63,27 +65,29 @@
>   #define JIT_F_CPUSTRING		"\010MIPS64R2"
>   #endif
>   #endif
> +
>   #else
> -#define JIT_F_CPU_FIRST		0
> +
>   #define JIT_F_CPUSTRING		""
> +
>   #endif
>   
> -/* Optimization flags. */
> +/* Optimization flags. 12 bits. */
> +#define JIT_F_OPT		0x00010000
>   #define JIT_F_OPT_MASK		0x0fff0000
>   
> -#define JIT_F_OPT_FOLD		0x00010000
> -#define JIT_F_OPT_CSE		0x00020000
> -#define JIT_F_OPT_DCE		0x00040000
> -#define JIT_F_OPT_FWD		0x00080000
> -#define JIT_F_OPT_DSE		0x00100000
> -#define JIT_F_OPT_NARROW	0x00200000
> -#define JIT_F_OPT_LOOP		0x00400000
> -#define JIT_F_OPT_ABC		0x00800000
> -#define JIT_F_OPT_SINK		0x01000000
> -#define JIT_F_OPT_FUSE		0x02000000
> +#define JIT_F_OPT_FOLD		(JIT_F_OPT << 0)
> +#define JIT_F_OPT_CSE		(JIT_F_OPT << 1)
> +#define JIT_F_OPT_DCE		(JIT_F_OPT << 2)
> +#define JIT_F_OPT_FWD		(JIT_F_OPT << 3)
> +#define JIT_F_OPT_DSE		(JIT_F_OPT << 4)
> +#define JIT_F_OPT_NARROW	(JIT_F_OPT << 5)
> +#define JIT_F_OPT_LOOP		(JIT_F_OPT << 6)
> +#define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
> +#define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
> +#define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
>   
>   /* Optimizations names for -O. Must match the order above. */
> -#define JIT_F_OPT_FIRST		JIT_F_OPT_FOLD
>   #define JIT_F_OPTSTRING	\
>     "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
>   
> @@ -95,6 +99,8 @@
>     JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
>   #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
>   
> +/* -- JIT engine parameters ----------------------------------------------- */
> +
>   #if LJ_TARGET_WINDOWS || LJ_64
>   /* See:http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx  */
>   #define JIT_P_sizemcode_DEFAULT		64
> @@ -137,6 +143,8 @@ JIT_PARAMDEF(JIT_PARAMENUM)
>   #define JIT_PARAMSTR(len, name, value)	#len #name
>   #define JIT_P_STRING	JIT_PARAMDEF(JIT_PARAMSTR)
>   
> +/* -- JIT engine data structures ------------------------------------------ */
> +
>   /* Trace compiler state. */
>   typedef enum {
>     LJ_TRACE_IDLE,	/* Trace compiler idle. */
> diff --git a/src/ljamalg.c b/src/ljamalg.c
> index 0ffc7e81..63b4ec87 100644
> --- a/src/ljamalg.c
> +++ b/src/ljamalg.c
> @@ -3,16 +3,6 @@
>   ** Copyright (C) 2005-2017 Mike Pall. See Copyright Notice in luajit.h
>   */
>   
> -/*
> -+--------------------------------------------------------------------------+
> -| WARNING: Compiling the amalgamation needs a lot of virtual memory        |
> -| (around 300 MB with GCC 4.x)! If you don't have enough physical memory   |
> -| your machine will start swapping to disk and the compile will not finish |
> -| within a reasonable amount of time.                                      |
> -| So either compile on a bigger machine or use the non-amalgamated build.  |
> -+--------------------------------------------------------------------------+
> -*/
> -
>   #define ljamalg_c
>   #define LUA_CORE
>   

[-- Attachment #2: Type: text/html, Size: 19750 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable Sergey Kaplun via Tarantool-patches
@ 2025-01-14 12:45   ` Sergey Bronnikov via Tarantool-patches
  2025-01-15 13:06     ` Sergey Kaplun via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-01-14 12:45 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 12685 bytes --]

Hi, Sergey!

Thanks for the patch!


On 14.01.2025 14:06, Sergey Kaplun wrote:
> From: Mike Pall <mike>
>
> See the discussion in the corresponding ticket for the rationale.
>
> (cherry picked from commit de2e1ca9d3d87e74c0c20c1e4ad3c32b31a5875b)
>
> For the modulo operation, the arm64 VM uses `fmsub` [1] instruction,
> which is the fused multiply-add (FMA [2]) operation (more precisely,
> multiply-sub). Hence, it may produce different results compared to the
> unfused one. This patch fixes the behaviour by using the unfused
> instructions by default. However, the new JIT optimization flag (fma) is
> introduced to make it possible to take advantage of the FMA
> optimizations.
>
> Sergey Kaplun:
> * added the description and the test for the problem
>
> [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
> [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
>
> Part of tarantool/tarantool#10709
> ---
>
> I intentionally avoid mentioning the ticket in the commit message to
> avoid excess mentioning in the LuaJIT issue tracker. You can see the
> LuaJIT/LuaJIT#918 link in the cover letter.
>
>   doc/running.html                              |  8 +++++
>   src/lj_asm_arm.h                              |  6 +++-
>   src/lj_asm_arm64.h                            |  3 +-
>   src/lj_asm_ppc.h                              |  3 +-
>   src/lj_jit.h                                  |  4 ++-
>   src/lj_vmmath.c                               | 13 ++++++-
>   src/vm_arm64.dasc                             |  4 ++-
>   ...lj-918-fma-numerical-accuracy-jit.test.lua | 36 +++++++++++++++++++
>   .../lj-918-fma-numerical-accuracy.test.lua    | 31 ++++++++++++++++
>   .../lj-918-fma-optimization.test.lua          | 25 +++++++++++++
>   10 files changed, 127 insertions(+), 6 deletions(-)
>   create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
>   create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
>   create mode 100644 test/tarantool-tests/lj-918-fma-optimization.test.lua
>
> diff --git a/doc/running.html b/doc/running.html
> index 7868efab..1cf41f1b 100644
> --- a/doc/running.html
> +++ b/doc/running.html
> @@ -226,6 +226,12 @@ mix the three forms, but note that setting an optimization level
>   overrides all earlier flags.
>   </p>
>   <p>
> +Note that <tt>-Ofma</tt> is not enabled by default at any level,
> +because it affects floating-point result accuracy. Only enable this,
> +if you fully understand the trade-offs of FMA for performance (higher),
> +determinism (lower) and numerical accuracy (higher).
> +</p>
> +<p>
>   Here are the available flags and at what optimization levels they
>   are enabled:
>   </p>
> @@ -257,6 +263,8 @@ are enabled:
>   <td class="flag_name">sink</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Allocation/Store Sinking</td></tr>
>   <tr class="even">
>   <td class="flag_name">fuse</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Fusion of operands into instructions</td></tr>
> +<tr class="odd">
> +<td class="flag_name">fma </td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_desc">Fused multiply-add</td></tr>
>   </table>
>   <p>
>   Here are the parameters and their default settings:
> diff --git a/src/lj_asm_arm.h b/src/lj_asm_arm.h
> index 5a0f925f..041cd794 100644
> --- a/src/lj_asm_arm.h
> +++ b/src/lj_asm_arm.h
> @@ -310,7 +310,11 @@ static void asm_fusexref(ASMState *as, ARMIns ai, Reg rd, IRRef ref,
>   }
>   
>   #if !LJ_SOFTFP
> -/* Fuse to multiply-add/sub instruction. */
> +/*
> +** Fuse to multiply-add/sub instruction.
> +** VMLA rounds twice (UMA, not FMA) -- no need to check for JIT_F_OPT_FMA.
> +** VFMA needs VFPv4, which is uncommon on the remaining ARM32 targets.
> +*/
>   static int asm_fusemadd(ASMState *as, IRIns *ir, ARMIns ai, ARMIns air)
>   {
>     IRRef lref = ir->op1, rref = ir->op2;
> diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
> index 88b47ceb..554bb60a 100644
> --- a/src/lj_asm_arm64.h
> +++ b/src/lj_asm_arm64.h
> @@ -334,7 +334,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, A64Ins ai, A64Ins air)
>   {
>     IRRef lref = ir->op1, rref = ir->op2;
>     IRIns *irm;
> -  if (lref != rref &&
> +  if ((as->flags & JIT_F_OPT_FMA) &&
> +      lref != rref &&
>         ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
>          ra_noreg(irm->r)) ||
>          (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
> diff --git a/src/lj_asm_ppc.h b/src/lj_asm_ppc.h
> index 7bba71b3..52db2926 100644
> --- a/src/lj_asm_ppc.h
> +++ b/src/lj_asm_ppc.h
> @@ -232,7 +232,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, PPCIns pi, PPCIns pir)
>   {
>     IRRef lref = ir->op1, rref = ir->op2;
>     IRIns *irm;
> -  if (lref != rref &&
> +  if ((as->flags & JIT_F_OPT_FMA) &&
> +      lref != rref &&
>         ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
>   	ra_noreg(irm->r)) ||
>          (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
> diff --git a/src/lj_jit.h b/src/lj_jit.h
> index 47df85c6..73c355b9 100644
> --- a/src/lj_jit.h
> +++ b/src/lj_jit.h
> @@ -86,10 +86,11 @@
>   #define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
>   #define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
>   #define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
> +#define JIT_F_OPT_FMA		(JIT_F_OPT << 10)
>   
>   /* Optimizations names for -O. Must match the order above. */
>   #define JIT_F_OPTSTRING	\
> -  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
> +  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse\3fma"
>   
>   /* Optimization levels set a fixed combination of flags. */
>   #define JIT_F_OPT_0	0
> @@ -98,6 +99,7 @@
>   #define JIT_F_OPT_3	(JIT_F_OPT_2|\
>     JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
>   #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
> +/* Note: FMA is not set by default. */
>   
>   /* -- JIT engine parameters ----------------------------------------------- */
>   
> diff --git a/src/lj_vmmath.c b/src/lj_vmmath.c
> index faebe719..29b72e0c 100644
> --- a/src/lj_vmmath.c
> +++ b/src/lj_vmmath.c
> @@ -36,6 +36,17 @@ LJ_FUNCA double lj_wrap_fmod(double x, double y) { return fmod(x, y); }
>   
>   /* -- Helper functions ---------------------------------------------------- */
>   
> +/* Required to prevent the C compiler from applying FMA optimizations.
> +**
> +** Yes, there's -ffp-contract and the FP_CONTRACT pragma ... in theory.
> +** But the current state of C compilers is a mess in this regard.
> +** Also, this function is not performance sensitive at all.
> +*/
> +LJ_NOINLINE static double lj_vm_floormul(double x, double y)
> +{
> +  return lj_vm_floor(x / y) * y;
> +}
> +
>   double lj_vm_foldarith(double x, double y, int op)
>   {
>     switch (op) {
> @@ -43,7 +54,7 @@ double lj_vm_foldarith(double x, double y, int op)
>     case IR_SUB - IR_ADD: return x-y; break;
>     case IR_MUL - IR_ADD: return x*y; break;
>     case IR_DIV - IR_ADD: return x/y; break;
> -  case IR_MOD - IR_ADD: return x-lj_vm_floor(x/y)*y; break;
> +  case IR_MOD - IR_ADD: return x-lj_vm_floormul(x, y); break;
>     case IR_POW - IR_ADD: return pow(x, y); break;
>     case IR_NEG - IR_ADD: return -x; break;
>     case IR_ABS - IR_ADD: return fabs(x); break;
> diff --git a/src/vm_arm64.dasc b/src/vm_arm64.dasc
> index 1cf1ea51..c5f0a7a7 100644
> --- a/src/vm_arm64.dasc
> +++ b/src/vm_arm64.dasc
> @@ -2581,7 +2581,9 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |.macro ins_arithmod, res, reg1, reg2
>       |  fdiv d2, reg1, reg2
>       |  frintm d2, d2
> -    |  fmsub res, d2, reg2, reg1
> +    |  // Cannot use fmsub, because FMA is not enabled by default.
> +    |  fmul d2, d2, reg2
> +    |  fsub res, reg1, d2
>       |.endmacro
>       |
>       |.macro ins_arithdn, intins, fpins
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> new file mode 100644
> index 00000000..55ec7b98
> --- /dev/null
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> @@ -0,0 +1,36 @@
> +local tap = require('tap')
> +
> +-- Test file to demonstrate consistent behaviour for JIT and the
> +-- VM regarding FMA optimization (disabled by default).
> +-- XXX: The VM behaviour is checked in the
> +-- <lj-918-fma-numerical-accuracy.test.lua>.
> +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
> +local test = tap.test('lj-918-fma-numerical-accuracy-jit'):skipcond({
> +  ['Test requires JIT enabled'] = not jit.status(),
> +})
> +
> +test:plan(1)
> +
> +local _2pow52 = 2 ^ 52
> +
> +-- IEEE754 components to double:
> +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
> +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> +assert(a == 2197541395358679800)
> +
> +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
> +assert(b == -1005065126.3690554)
> +

Please add a comment with explanation why exactly these testcases

are used.

As I got it right, the idea is to calculate negative and positive 
number, right?

Why do you think two examples are enough for testing that behavior for 
JIT and the VM

is consistent?

Should we check more corner cases?

  * Standard/Normal arithmetic
  * Subnormal arithmetic
  * Infinite arithmetic
  * NaN arithmetic
  * Zero arithmetic

> +local results = {}
> +
> +jit.opt.start('hotloop=1')
> +for i = 1, 4 do
> +  results[i] = a % b
> +end
> +
> +-- XXX: The test doesn't fail before the commit. But it is
Please add a commit hash and it's short description.
> +-- required to be sure that there are no inconsistencies after the
> +-- commit.
> +test:samevalues(results, 'consistent behaviour between the JIT and the VM')
> +
> +test:done(true)
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> new file mode 100644
> index 00000000..a3775d6d
> --- /dev/null
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> @@ -0,0 +1,31 @@
> +local tap = require('tap')
> +
> +-- Test file to demonstrate possible numerical inaccuracy if FMA
> +-- optimization takes place.

I suppose we don't need to test FMA itself, but we should

check that FMA is actually enabled when it's option

is enabled. Right? if yes I would merge test 
lj-918-fma-numerical-accuracy.test.lua

and test lj-918-fma-optimization.test.lua.


> +-- XXX: The JIT consistency is checked in the
> +-- <lj-918-fma-numerical-accuracy-jit.test.lua>.
> +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
> +local test = tap.test('lj-918-fma-numerical-accuracy')
> +
> +test:plan(2)
> +
> +local _2pow52 = 2 ^ 52
> +
> +-- IEEE754 components to double:
> +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
> +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> +assert(a == 2197541395358679800)
> +
> +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
> +assert(b == -1005065126.3690554)
The same questions as above.
> +
> +-- These tests fail on ARM64 before the patch or with FMA
> +-- optimization enabled.
> +-- The first test may not fail if the compiler doesn't generate
> +-- an ARM64 FMA operation in `lj_vm_foldarith()`.
> +test:is(2197541395358679800 % -1005065126.3690554, -606337536,
> +        'FMA in the lj_vm_foldarith() during parsing')
> +
> +test:is(a % b, -606337536, 'FMA in the VM')
> +
> +test:done(true)
> diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> new file mode 100644
> index 00000000..af749eb5
> --- /dev/null
> +++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> @@ -0,0 +1,25 @@
> +local tap = require('tap')
> +local test = tap.test('lj-918-fma-optimization'):skipcond({
> +  ['Test requires JIT enabled'] = not jit.status(),
> +})
> +
> +test:plan(3)
> +
> +local function jit_opt_is_on(needed)
why `needed` and not something like "flag"?
> +  for _, opt in ipairs({jit.status()}) do
> +    if opt == needed then
> +      return true
> +    end
> +  end
> +  return false
> +end
> +
> +test:ok(not jit_opt_is_on('fma'), 'FMA is disabled by default')
> +
> +local ok, _ = pcall(jit.opt.start, '+fma')
> +
> +test:ok(ok, 'fma flag is recognized')
> +
> +test:ok(jit_opt_is_on('fma'), 'FMA is enabled after jit.opt.start()')
> +
> +test:done(true)

[-- Attachment #2: Type: text/html, Size: 14632 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.
  2025-01-14 12:45   ` Sergey Bronnikov via Tarantool-patches
@ 2025-01-15 13:06     ` Sergey Kaplun via Tarantool-patches
  2025-01-16 13:19       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-15 13:06 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey!
Thanks for the review!
Please consider my answers below.

On 14.01.25, Sergey Bronnikov wrote:
> Hi, Sergey!
> 
> Thanks for the patch!
> 
> 
> On 14.01.2025 14:06, Sergey Kaplun wrote:
> > From: Mike Pall <mike>
> >
> > See the discussion in the corresponding ticket for the rationale.
> >
> > (cherry picked from commit de2e1ca9d3d87e74c0c20c1e4ad3c32b31a5875b)
> >
> > For the modulo operation, the arm64 VM uses `fmsub` [1] instruction,
> > which is the fused multiply-add (FMA [2]) operation (more precisely,
> > multiply-sub). Hence, it may produce different results compared to the
> > unfused one. This patch fixes the behaviour by using the unfused
> > instructions by default. However, the new JIT optimization flag (fma) is
> > introduced to make it possible to take advantage of the FMA
> > optimizations.
> >
> > Sergey Kaplun:
> > * added the description and the test for the problem
> >
> > [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
> > [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
> >
> > Part of tarantool/tarantool#10709
> > ---

<snipped>

> > diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> > new file mode 100644
> > index 00000000..55ec7b98
> > --- /dev/null
> > +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> > @@ -0,0 +1,36 @@
> > +local tap = require('tap')
> > +
> > +-- Test file to demonstrate consistent behaviour for JIT and the
> > +-- VM regarding FMA optimization (disabled by default).
> > +-- XXX: The VM behaviour is checked in the
> > +-- <lj-918-fma-numerical-accuracy.test.lua>.
> > +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
> > +local test = tap.test('lj-918-fma-numerical-accuracy-jit'):skipcond({
> > +  ['Test requires JIT enabled'] = not jit.status(),
> > +})
> > +
> > +test:plan(1)
> > +
> > +local _2pow52 = 2 ^ 52
> > +
> > +-- IEEE754 components to double:
> > +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
> > +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> > +assert(a == 2197541395358679800)
> > +
> > +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
> > +assert(b == -1005065126.3690554)
> > +
> 
> Please add a comment with explanation why exactly these testcases
> 
> are used.
> 
> As I got it right, the idea is to calculate negative and positive 
> number, right?

I've added the corresponding comment to avoid confusion. Branch is
force-pushed.
===================================================================
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
index 55ec7b98..8b16d4c3 100644
--- a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
@@ -13,6 +13,18 @@ test:plan(1)
 
 local _2pow52 = 2 ^ 52
 
+-- XXX: Before this commit the LuaJIT arm64 VM uses `fmsub` [1]
+-- instruction for the modulo operation, which is the fused
+-- multiply-add (FMA [2]) operation (more precisely,
+-- multiply-sub). Hence, it may produce different results compared
+-- to the unfused one. For the test, let's just use 2 numbers in
+-- modulo for which the single rounding is different from the
+-- double rounding. The numbers from the original issue are good
+-- enough.
+--
+-- [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
+-- [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
+--
 -- IEEE754 components to double:
 -- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
 local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
index a3775d6d..25b59707 100644
--- a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
@@ -11,6 +11,18 @@ test:plan(2)
 
 local _2pow52 = 2 ^ 52
 
+-- XXX: Before this commit the LuaJIT arm64 VM uses `fmsub` [1]
+-- instruction for the modulo operation, which is the fused
+-- multiply-add (FMA [2]) operation (more precisely,
+-- multiply-sub). Hence, it may produce different results compared
+-- to the unfused one. For the test, let's just use 2 numbers in
+-- modulo for which the single rounding is different from the
+-- double rounding. The numbers from the original issue are good
+-- enough.
+--
+-- [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
+-- [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
+--
 -- IEEE754 components to double:
 -- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
 local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
===================================================================

> 
> Why do you think two examples are enough for testing that behavior for 
> JIT and the VM
> 
> is consistent?

Since we test for modulo operation consistency, I suppose it is enough
to check similar cases for the JIT and the VM. I distinguished them in
the separate files to avoid skipping the test for the VM when JIT is
disabled.

> 
> Should we check more corner cases?
> 
>   * Standard/Normal arithmetic
>   * Subnormal arithmetic
>   * Infinite arithmetic
>   * NaN arithmetic
>   * Zero arithmetic

All these checks are good but not really relevant to this particular
issue. I suppose we may continue this activity as a part of the
corresponding issue (please create one if it isn't created already), as
we discussed offline, with test vectors for floating point values.

> 
> > +local results = {}
> > +
> > +jit.opt.start('hotloop=1')
> > +for i = 1, 4 do
> > +  results[i] = a % b
> > +end
> > +
> > +-- XXX: The test doesn't fail before the commit. But it is
> Please add a commit hash and it's short description.

We usually meand this particular commit in the tests (commit when test
is introduced). I rephrase it like the following to avoid confusion:

===================================================================
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
index 55ec7b98..8b16d4c3 100644
--- a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
@@ -28,7 +40,7 @@ for i = 1, 4 do
   results[i] = a % b
 end
 
--- XXX: The test doesn't fail before the commit. But it is
+-- XXX: The test doesn't fail before this commit. But it is
 -- required to be sure that there are no inconsistencies after the
 -- commit.
 test:samevalues(results, 'consistent behaviour between the JIT and the VM')
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
index a3775d6d..25b59707 100644
--- a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
@@ -19,7 +31,7 @@ assert(a == 2197541395358679800)
 local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
 assert(b == -1005065126.3690554)
 
--- These tests fail on ARM64 before the patch or with FMA
+-- These tests fail on ARM64 before this patch or with FMA
 -- optimization enabled.
 -- The first test may not fail if the compiler doesn't generate
 -- an ARM64 FMA operation in `lj_vm_foldarith()`.
===================================================================

> > +-- required to be sure that there are no inconsistencies after the
> > +-- commit.
> > +test:samevalues(results, 'consistent behaviour between the JIT and the VM')
> > +
> > +test:done(true)
> > diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> > new file mode 100644
> > index 00000000..a3775d6d
> > --- /dev/null
> > +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> > @@ -0,0 +1,31 @@
> > +local tap = require('tap')
> > +
> > +-- Test file to demonstrate possible numerical inaccuracy if FMA
> > +-- optimization takes place.
> 
> I suppose we don't need to test FMA itself, but we should
> 
> check that FMA is actually enabled when it's option
> 
> is enabled. Right? if yes I would merge test 
> lj-918-fma-numerical-accuracy.test.lua
> 
> and test lj-918-fma-optimization.test.lua.

I would rather avoid this since the FMA is more like the -mfma compiler
option and affects only the JIT behaviour and can be enabled for the
performance reason. I used this canary test to check that this option
exists.

> 
> 
> > +-- XXX: The JIT consistency is checked in the
> > +-- <lj-918-fma-numerical-accuracy-jit.test.lua>.
> > +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
> > +local test = tap.test('lj-918-fma-numerical-accuracy')
> > +
> > +test:plan(2)
> > +
> > +local _2pow52 = 2 ^ 52
> > +
> > +-- IEEE754 components to double:
> > +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
> > +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> > +assert(a == 2197541395358679800)
> > +
> > +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
> > +assert(b == -1005065126.3690554)
> The same questions as above.

Added the comment.

> > +
> > +-- These tests fail on ARM64 before the patch or with FMA
> > +-- optimization enabled.
> > +-- The first test may not fail if the compiler doesn't generate
> > +-- an ARM64 FMA operation in `lj_vm_foldarith()`.
> > +test:is(2197541395358679800 % -1005065126.3690554, -606337536,
> > +        'FMA in the lj_vm_foldarith() during parsing')
> > +
> > +test:is(a % b, -606337536, 'FMA in the VM')
> > +
> > +test:done(true)
> > diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> > new file mode 100644
> > index 00000000..af749eb5
> > --- /dev/null
> > +++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> > @@ -0,0 +1,25 @@
> > +local tap = require('tap')
> > +local test = tap.test('lj-918-fma-optimization'):skipcond({
> > +  ['Test requires JIT enabled'] = not jit.status(),
> > +})
> > +
> > +test:plan(3)
> > +
> > +local function jit_opt_is_on(needed)
> why `needed` and not something like "flag"?

Replaced with `flag`:

===================================================================
diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
index af749eb5..9396e558 100644
--- a/test/tarantool-tests/lj-918-fma-optimization.test.lua
+++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
@@ -5,9 +5,9 @@ local test = tap.test('lj-918-fma-optimization'):skipcond({
 
 test:plan(3)
 
-local function jit_opt_is_on(needed)
+local function jit_opt_is_on(flag)
   for _, opt in ipairs({jit.status()}) do
-    if opt == needed then
+    if opt == flag then
       return true
     end
   end
===================================================================

> > +  for _, opt in ipairs({jit.status()}) do
> > +    if opt == needed then
> > +      return true
> > +    end
> > +  end

<snipped>

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs.
  2025-01-14 11:25   ` Sergey Bronnikov via Tarantool-patches
@ 2025-01-15 13:10     ` Sergey Kaplun via Tarantool-patches
  2025-01-16 12:47       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-15 13:10 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

Hi, Sergey!
Thanks for the review!
Updated the commit message and force-pushed the branch.

On 14.01.25, Sergey Bronnikov wrote:
> Hi, Sergey,
> 
> thanks for the patch!
> 
> LGTM with a minor comment
> 
> Sergey
> 
> On 13.01.2025 18:17, Sergey Kaplun wrote:
> > From: Mike Pall <mike>
> >
> > (cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)
> >
> > This patch does the following refactoring:
> > 1) Drops optimizations for the Intel Atom CPU [1]: removes the
> >     `JIT_F_LEA_AGU` flag and related optimizations. The considerations
> >     for the use of LEA are complex and very CPU-specific, mostly
> >     dependent on the number of operands. Mostly, it isn't worth it due to
> >     the extra register pressure and/or extra instructions.
> 
> I would say explicitly that `JIT_F_LEA_AGU` is used in "Well, yes, that 
> applies to the original and obsolete Atom architecture. Today "Intel 
> Atom" is just a trade name for reduced-performance implementations of 
> the current Intel architecture."
> 
> as Mike explained in LUAJIT#24. So there are no any risks for tarantool 
> users
> 
> regarding performance degradation.

Added, as you suggested. The new commit message is the following:

| Cleanup CPU detection and tuning for old CPUs.
|
| (cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)
|
| This patch does the following refactoring:
| 1) Drops optimizations for the Intel Atom CPU [1]: removes the
|    `JIT_F_LEA_AGU` flag and related optimizations. The considerations
|    for the use of LEA are complex and very CPU-specific, mostly
|    dependent on the number of operands. Mostly, it isn't worth it due to
|    the extra register pressure and/or extra instructions.
|    Be aware that it applies to the original and obsolete Atom
|    architecture. Today "Intel Atom" is just a trade name for
|    reduced-performance implementations of the current Intel
|    architecture.
| 2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
|    `JIT_F_PREFER_IMUL` flag and related optimizations.
| 3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
|    JIT flags are defined as the left shift of `JIT_F_CPU` instead of
|    hardcoded constants, similar for the optimization flags.
| 4) Adds detection of the ARM8 CPU.
| 5) Drops the check for SSE2 since the VM already presumes CPU supports
|    it.
| 6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
| 7) Drops outdated comment in the amalgamation file about memory
|    requirements.
|
| Sergey Kaplun:
| * added the description for the patch
|
| [1]: https://en.wikipedia.org/wiki/Intel_Atom
| [2]: https://en.wikipedia.org/wiki/AMD_K8
| [3]: https://en.wikipedia.org/wiki/AMD_K10
| [4]: https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros
|
| Part of tarantool/tarantool#10709

> 
> > 2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
> >     `JIT_F_PREFER_IMUL` flag and related optimizations.
> > 3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
> >     JIT flags are defined as the left shift of `JIT_F_CPU` instead of
> >     hardcoded constants, similar for the optimization flags.
> > 4) Adds detection of the ARM8 CPU.
> > 5) Drops the check for SSE2 since the VM already presumes CPU supports
> >     it.
> > 6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
> > 7) Drops outdated comment in the amalgamation file about memory
> >     requirements.
> >
> > Sergey Kaplun:
> > * added the description for the patch
> >
> > [1]:https://en.wikipedia.org/wiki/Intel_Atom
> > [2]:https://en.wikipedia.org/wiki/AMD_K8
> > [3]:https://en.wikipedia.org/wiki/AMD_K10
> > [4]:https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros
> >
> > Part of tarantool/tarantool#10709
> > ---

<snipped>

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs.
  2025-01-15 13:10     ` Sergey Kaplun via Tarantool-patches
@ 2025-01-16 12:47       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-01-16 12:47 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 4052 bytes --]

Hi, Sergey,

Thanks for the fixes! LGTM

On 15.01.2025 16:10, Sergey Kaplun wrote:
> Hi, Sergey!
> Thanks for the review!
> Updated the commit message and force-pushed the branch.
>
> On 14.01.25, Sergey Bronnikov wrote:
>> Hi, Sergey,
>>
>> thanks for the patch!
>>
>> LGTM with a minor comment
>>
>> Sergey
>>
>> On 13.01.2025 18:17, Sergey Kaplun wrote:
>>> From: Mike Pall <mike>
>>>
>>> (cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)
>>>
>>> This patch does the following refactoring:
>>> 1) Drops optimizations for the Intel Atom CPU [1]: removes the
>>>      `JIT_F_LEA_AGU` flag and related optimizations. The considerations
>>>      for the use of LEA are complex and very CPU-specific, mostly
>>>      dependent on the number of operands. Mostly, it isn't worth it due to
>>>      the extra register pressure and/or extra instructions.
>> I would say explicitly that `JIT_F_LEA_AGU` is used in "Well, yes, that
>> applies to the original and obsolete Atom architecture. Today "Intel
>> Atom" is just a trade name for reduced-performance implementations of
>> the current Intel architecture."
>>
>> as Mike explained in LUAJIT#24. So there are no any risks for tarantool
>> users
>>
>> regarding performance degradation.
> Added, as you suggested. The new commit message is the following:
>
> | Cleanup CPU detection and tuning for old CPUs.
> |
> | (cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)
> |
> | This patch does the following refactoring:
> | 1) Drops optimizations for the Intel Atom CPU [1]: removes the
> |    `JIT_F_LEA_AGU` flag and related optimizations. The considerations
> |    for the use of LEA are complex and very CPU-specific, mostly
> |    dependent on the number of operands. Mostly, it isn't worth it due to
> |    the extra register pressure and/or extra instructions.
> |    Be aware that it applies to the original and obsolete Atom
> |    architecture. Today "Intel Atom" is just a trade name for
> |    reduced-performance implementations of the current Intel
> |    architecture.
> | 2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
> |    `JIT_F_PREFER_IMUL` flag and related optimizations.
> | 3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
> |    JIT flags are defined as the left shift of `JIT_F_CPU` instead of
> |    hardcoded constants, similar for the optimization flags.
> | 4) Adds detection of the ARM8 CPU.
> | 5) Drops the check for SSE2 since the VM already presumes CPU supports
> |    it.
> | 6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
> | 7) Drops outdated comment in the amalgamation file about memory
> |    requirements.
> |
> | Sergey Kaplun:
> | * added the description for the patch
> |
> | [1]:https://en.wikipedia.org/wiki/Intel_Atom
> | [2]:https://en.wikipedia.org/wiki/AMD_K8
> | [3]:https://en.wikipedia.org/wiki/AMD_K10
> | [4]:https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros
> |
> | Part of tarantool/tarantool#10709
>
>>> 2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
>>>      `JIT_F_PREFER_IMUL` flag and related optimizations.
>>> 3) Refactors JIT flags defined in the <lj_jit.h>. Now all CPU-specific
>>>      JIT flags are defined as the left shift of `JIT_F_CPU` instead of
>>>      hardcoded constants, similar for the optimization flags.
>>> 4) Adds detection of the ARM8 CPU.
>>> 5) Drops the check for SSE2 since the VM already presumes CPU supports
>>>      it.
>>> 6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
>>> 7) Drops outdated comment in the amalgamation file about memory
>>>      requirements.
>>>
>>> Sergey Kaplun:
>>> * added the description for the patch
>>>
>>> [1]:https://en.wikipedia.org/wiki/Intel_Atom
>>> [2]:https://en.wikipedia.org/wiki/AMD_K8
>>> [3]:https://en.wikipedia.org/wiki/AMD_K10
>>> [4]:https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros
>>>
>>> Part of tarantool/tarantool#10709
>>> ---
> <snipped>
>

[-- Attachment #2: Type: text/html, Size: 5621 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.
  2025-01-15 13:06     ` Sergey Kaplun via Tarantool-patches
@ 2025-01-16 13:19       ` Sergey Bronnikov via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Sergey Bronnikov via Tarantool-patches @ 2025-01-16 13:19 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 11788 bytes --]

Hi, Sergey,

  thanks for the fixes! LGTM

On 15.01.2025 16:06, Sergey Kaplun wrote:
> Hi, Sergey!
> Thanks for the review!
> Please consider my answers below.
>
> On 14.01.25, Sergey Bronnikov wrote:
>> Hi, Sergey!
>>
>> Thanks for the patch!
>>
>>
>> On 14.01.2025 14:06, Sergey Kaplun wrote:
>>> From: Mike Pall <mike>
>>>
>>> See the discussion in the corresponding ticket for the rationale.
>>>
>>> (cherry picked from commit de2e1ca9d3d87e74c0c20c1e4ad3c32b31a5875b)
>>>
>>> For the modulo operation, the arm64 VM uses `fmsub` [1] instruction,
>>> which is the fused multiply-add (FMA [2]) operation (more precisely,
>>> multiply-sub). Hence, it may produce different results compared to the
>>> unfused one. This patch fixes the behaviour by using the unfused
>>> instructions by default. However, the new JIT optimization flag (fma) is
>>> introduced to make it possible to take advantage of the FMA
>>> optimizations.
>>>
>>> Sergey Kaplun:
>>> * added the description and the test for the problem
>>>
>>> [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
>>> [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
>>>
>>> Part of tarantool/tarantool#10709
>>> ---
> <snipped>
>
>>> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
>>> new file mode 100644
>>> index 00000000..55ec7b98
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
>>> @@ -0,0 +1,36 @@
>>> +local tap = require('tap')
>>> +
>>> +-- Test file to demonstrate consistent behaviour for JIT and the
>>> +-- VM regarding FMA optimization (disabled by default).
>>> +-- XXX: The VM behaviour is checked in the
>>> +-- <lj-918-fma-numerical-accuracy.test.lua>.
>>> +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
>>> +local test = tap.test('lj-918-fma-numerical-accuracy-jit'):skipcond({
>>> +  ['Test requires JIT enabled'] = not jit.status(),
>>> +})
>>> +
>>> +test:plan(1)
>>> +
>>> +local _2pow52 = 2 ^ 52
>>> +
>>> +-- IEEE754 components to double:
>>> +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
>>> +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
>>> +assert(a == 2197541395358679800)
>>> +
>>> +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
>>> +assert(b == -1005065126.3690554)
>>> +
>> Please add a comment with explanation why exactly these testcases
>>
>> are used.
>>
>> As I got it right, the idea is to calculate negative and positive
>> number, right?
> I've added the corresponding comment to avoid confusion. Branch is
> force-pushed.
> ===================================================================
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> index 55ec7b98..8b16d4c3 100644
> --- a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> @@ -13,6 +13,18 @@ test:plan(1)
>   
>   local _2pow52 = 2 ^ 52
>   
> +-- XXX: Before this commit the LuaJIT arm64 VM uses `fmsub` [1]
> +-- instruction for the modulo operation, which is the fused
> +-- multiply-add (FMA [2]) operation (more precisely,
> +-- multiply-sub). Hence, it may produce different results compared
> +-- to the unfused one. For the test, let's just use 2 numbers in
> +-- modulo for which the single rounding is different from the
> +-- double rounding. The numbers from the original issue are good
> +-- enough.
> +--
> +-- [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
> +-- [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
> +--
>   -- IEEE754 components to double:
>   -- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
>   local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> index a3775d6d..25b59707 100644
> --- a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> @@ -11,6 +11,18 @@ test:plan(2)
>   
>   local _2pow52 = 2 ^ 52
>   
> +-- XXX: Before this commit the LuaJIT arm64 VM uses `fmsub` [1]
> +-- instruction for the modulo operation, which is the fused
> +-- multiply-add (FMA [2]) operation (more precisely,
> +-- multiply-sub). Hence, it may produce different results compared
> +-- to the unfused one. For the test, let's just use 2 numbers in
> +-- modulo for which the single rounding is different from the
> +-- double rounding. The numbers from the original issue are good
> +-- enough.
> +--
> +-- [1]:https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
> +-- [2]:https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
> +--
>   -- IEEE754 components to double:
>   -- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
>   local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
> ===================================================================
>
>> Why do you think two examples are enough for testing that behavior for
>> JIT and the VM
>>
>> is consistent?
> Since we test for modulo operation consistency, I suppose it is enough
> to check similar cases for the JIT and the VM. I distinguished them in
> the separate files to avoid skipping the test for the VM when JIT is
> disabled.
>
>> Should we check more corner cases?
>>
>>    * Standard/Normal arithmetic
>>    * Subnormal arithmetic
>>    * Infinite arithmetic
>>    * NaN arithmetic
>>    * Zero arithmetic
> All these checks are good but not really relevant to this particular
> issue. I suppose we may continue this activity as a part of the
> corresponding issue (please create one if it isn't created already), as
> we discussed offline, with test vectors for floating point values.
>
>>> +local results = {}
>>> +
>>> +jit.opt.start('hotloop=1')
>>> +for i = 1, 4 do
>>> +  results[i] = a % b
>>> +end
>>> +
>>> +-- XXX: The test doesn't fail before the commit. But it is
>> Please add a commit hash and it's short description.
> We usually meand this particular commit in the tests (commit when test
> is introduced). I rephrase it like the following to avoid confusion:
>
> ===================================================================
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> index 55ec7b98..8b16d4c3 100644
> --- a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
> @@ -28,7 +40,7 @@ for i = 1, 4 do
>     results[i] = a % b
>   end
>   
> --- XXX: The test doesn't fail before the commit. But it is
> +-- XXX: The test doesn't fail before this commit. But it is
>   -- required to be sure that there are no inconsistencies after the
>   -- commit.
>   test:samevalues(results, 'consistent behaviour between the JIT and the VM')
> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> index a3775d6d..25b59707 100644
> --- a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
> @@ -19,7 +31,7 @@ assert(a == 2197541395358679800)
>   local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
>   assert(b == -1005065126.3690554)
>   
> --- These tests fail on ARM64 before the patch or with FMA
> +-- These tests fail on ARM64 before this patch or with FMA
>   -- optimization enabled.
>   -- The first test may not fail if the compiler doesn't generate
>   -- an ARM64 FMA operation in `lj_vm_foldarith()`.
> ===================================================================
>
>>> +-- required to be sure that there are no inconsistencies after the
>>> +-- commit.
>>> +test:samevalues(results, 'consistent behaviour between the JIT and the VM')
>>> +
>>> +test:done(true)
>>> diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
>>> new file mode 100644
>>> index 00000000..a3775d6d
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
>>> @@ -0,0 +1,31 @@
>>> +local tap = require('tap')
>>> +
>>> +-- Test file to demonstrate possible numerical inaccuracy if FMA
>>> +-- optimization takes place.
>> I suppose we don't need to test FMA itself, but we should
>>
>> check that FMA is actually enabled when it's option
>>
>> is enabled. Right? if yes I would merge test
>> lj-918-fma-numerical-accuracy.test.lua
>>
>> and test lj-918-fma-optimization.test.lua.
> I would rather avoid this since the FMA is more like the -mfma compiler
> option and affects only the JIT behaviour and can be enabled for the
> performance reason. I used this canary test to check that this option
> exists.
>
>>
>>> +-- XXX: The JIT consistency is checked in the
>>> +-- <lj-918-fma-numerical-accuracy-jit.test.lua>.
>>> +-- See also:https://github.com/LuaJIT/LuaJIT/issues/918.
>>> +local test = tap.test('lj-918-fma-numerical-accuracy')
>>> +
>>> +test:plan(2)
>>> +
>>> +local _2pow52 = 2 ^ 52
>>> +
>>> +-- IEEE754 components to double:
>>> +-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
>>> +local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
>>> +assert(a == 2197541395358679800)
>>> +
>>> +local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
>>> +assert(b == -1005065126.3690554)
>> The same questions as above.
> Added the comment.
>
>>> +
>>> +-- These tests fail on ARM64 before the patch or with FMA
>>> +-- optimization enabled.
>>> +-- The first test may not fail if the compiler doesn't generate
>>> +-- an ARM64 FMA operation in `lj_vm_foldarith()`.
>>> +test:is(2197541395358679800 % -1005065126.3690554, -606337536,
>>> +        'FMA in the lj_vm_foldarith() during parsing')
>>> +
>>> +test:is(a % b, -606337536, 'FMA in the VM')
>>> +
>>> +test:done(true)
>>> diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
>>> new file mode 100644
>>> index 00000000..af749eb5
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
>>> @@ -0,0 +1,25 @@
>>> +local tap = require('tap')
>>> +local test = tap.test('lj-918-fma-optimization'):skipcond({
>>> +  ['Test requires JIT enabled'] = not jit.status(),
>>> +})
>>> +
>>> +test:plan(3)
>>> +
>>> +local function jit_opt_is_on(needed)
>> why `needed` and not something like "flag"?
> Replaced with `flag`:
>
> ===================================================================
> diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> index af749eb5..9396e558 100644
> --- a/test/tarantool-tests/lj-918-fma-optimization.test.lua
> +++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
> @@ -5,9 +5,9 @@ local test = tap.test('lj-918-fma-optimization'):skipcond({
>   
>   test:plan(3)
>   
> -local function jit_opt_is_on(needed)
> +local function jit_opt_is_on(flag)
>     for _, opt in ipairs({jit.status()}) do
> -    if opt == needed then
> +    if opt == flag then
>         return true
>       end
>     end
> ===================================================================
>
>>> +  for _, opt in ipairs({jit.status()}) do
>>> +    if opt == needed then
>>> +      return true
>>> +    end
>>> +  end
> <snipped>
>

[-- Attachment #2: Type: text/html, Size: 14470 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations
  2025-01-14 11:06 [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs Sergey Kaplun via Tarantool-patches
  2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable Sergey Kaplun via Tarantool-patches
@ 2025-01-31  9:28 ` Sergey Kaplun via Tarantool-patches
  2 siblings, 0 replies; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2025-01-31  9:28 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

I've applied the patch into all long-term branches in tarantool/luajit
and bumped a new version in master [1], release/3.3 [2], release/3.2 [3]
and release/2.11 [4].

[1]: https://github.com/tarantool/tarantool/pull/11063
[2]: https://github.com/tarantool/tarantool/pull/11064
[3]: https://github.com/tarantool/tarantool/pull/11065
[4]: https://github.com/tarantool/tarantool/pull/11066

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-01-31  9:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-14 11:06 [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches
2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs Sergey Kaplun via Tarantool-patches
2025-01-14 11:25   ` Sergey Bronnikov via Tarantool-patches
2025-01-15 13:10     ` Sergey Kaplun via Tarantool-patches
2025-01-16 12:47       ` Sergey Bronnikov via Tarantool-patches
2025-01-14 11:06 ` [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable Sergey Kaplun via Tarantool-patches
2025-01-14 12:45   ` Sergey Bronnikov via Tarantool-patches
2025-01-15 13:06     ` Sergey Kaplun via Tarantool-patches
2025-01-16 13:19       ` Sergey Bronnikov via Tarantool-patches
2025-01-31  9:28 ` [Tarantool-patches] [PATCH luajit 0/2] Refactoring and FMA optimizations Sergey Kaplun via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox