From: Sergey Bronnikov via Tarantool-patches
Reply-To: Sergey Bronnikov
To: Sergey Kaplun, tarantool-patches
Date: Tue, 14 Jan 2025 14:25:59 +0300
Subject: Re: [Tarantool-patches] [PATCH luajit 1/2] Cleanup CPU detection and tuning for old CPUs.

Hi, Sergey,

thanks for the patch!

LGTM with a minor comment below.

Sergey

On 13.01.2025 18:17, Sergey Kaplun wrote:
From: Mike Pall <mike>

(cherry picked from commit 0eddcbead2d67c16dcd4039a6765b9d2fc8ea631)

This patch does the following refactoring:
1) Drops optimizations for the Intel Atom CPU [1]: removes the
   `JIT_F_LEA_AGU` flag and related optimizations. The considerations
   for the use of LEA are complex and very CPU-specific, mostly
   dependent on the number of operands. Mostly, it isn't worth it due to
   the extra register pressure and/or extra instructions.

I would state explicitly that `JIT_F_LEA_AGU` only matters for the original Atom architecture. As Mike explained in LUAJIT#24:

"Well, yes, that applies to the original and obsolete Atom architecture. Today "Intel Atom" is just a trade name for reduced-performance implementations of the current Intel architecture."

So there is no risk of performance degradation for Tarantool users.

2) Drops optimizations for the AMD K8, K10 CPU [2][3]: removes the
   `JIT_F_PREFER_IMUL` flag and related optimizations.
3) Refactors the JIT flags defined in <lj_jit.h>. Now all CPU-specific
   JIT flags are defined as left shifts of `JIT_F_CPU` instead of
   hardcoded constants; the optimization flags get the same treatment,
   as left shifts of `JIT_F_OPT`.
4) Adds detection of ARMv8 CPUs.
5) Drops the check for SSE2 since the VM already presumes that the CPU
   supports it.
6) Adds checks for `__ARM_ARCH`[4] macro in <lj_arch.h>.
7) Drops outdated comment in the amalgamation file about memory
   requirements.
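
For readers following item 3, here is a minimal sketch of the flag layout: each flag is a left shift of one base constant, and the names live in a single length-prefixed string that is walked in step with the shifts. All `DEMO_*` and `demo_*` names are hypothetical and exist only for illustration; the real definitions are in <lj_jit.h>, and the real walker is `flagbits_to_strings()` in lib_jit.c, whose body is not shown in this patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical demo names; only the layout follows the patch. */
#define DEMO_F_CPU       0x00000010u
#define DEMO_F_SSE3      (DEMO_F_CPU << 0)
#define DEMO_F_SSE4_1    (DEMO_F_CPU << 1)
#define DEMO_F_BMI2      (DEMO_F_CPU << 2)

/* Each name is preceded by its length, written as an octal escape. */
#define DEMO_F_CPUSTRING "\4SSE3\6SSE4.1\4BMI2"

/* Write the names of all set flags into out, separated by spaces.
** Returns the number of names written. */
static int demo_flag_names(uint32_t flags, uint32_t base, const char *str,
                           char *out, size_t outsz)
{
  int n = 0;
  if (outsz == 0) return 0;
  out[0] = '\0';
  while (*str != '\0') {
    size_t len = (uint8_t)*str;  /* Length prefix of the current name. */
    if (flags & base) {
      size_t used = strlen(out);
      snprintf(out + used, outsz - used, "%s%.*s",
               n ? " " : "", (int)len, str + 1);
      n++;
    }
    base <<= 1;       /* Next flag bit ... */
    str += 1 + len;   /* ... and the next name, kept in lockstep. */
  }
  return n;
}
```

Because bit positions and names advance together, removing a flag (as this patch does for the AMD and Atom entries) means removing both the define and the matching name, and the remaining shifts renumber themselves.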

Sergey Kaplun:
* added the description for the patch

[1]: https://en.wikipedia.org/wiki/Intel_Atom
[2]: https://en.wikipedia.org/wiki/AMD_K8
[3]: https://en.wikipedia.org/wiki/AMD_K10
[4]: https://developer.arm.com/documentation/dui0774/l/Other-Compiler-specific-Features/Predefined-macros

Part of tarantool/tarantool#10709
---
 src/Makefile.original |  1 -
 src/lib_jit.c         | 65 +++++++++++-------------------
 src/lj_arch.h         |  6 +--
 src/lj_asm_x86.h      | 33 +++++----------
 src/lj_dispatch.c     |  7 ----
 src/lj_emit_x86.h     |  5 +--
 src/lj_errmsg.h       |  4 --
 src/lj_jit.h          | 94 +++++++++++++++++++++++--------------------
 src/ljamalg.c         | 10 -----
 9 files changed, 87 insertions(+), 138 deletions(-)

diff --git a/src/Makefile.original b/src/Makefile.original
index 9f55fa32..8d925e3a 100644
--- a/src/Makefile.original
+++ b/src/Makefile.original
@@ -621,7 +621,6 @@ E= @echo
 default all:	$(TARGET_T)
 
 amalg:
-	@grep "^[+|]" ljamalg.c
 	$(MAKE) -f Makefile.original all "LJCORE_O=ljamalg.o"
 
 clean:
diff --git a/src/lib_jit.c b/src/lib_jit.c
index f705f334..9f870f68 100644
--- a/src/lib_jit.c
+++ b/src/lib_jit.c
@@ -104,8 +104,8 @@ LJLIB_CF(jit_status)
   jit_State *J = L2J(L);
   L->top = L->base;
   setboolV(L->top++, (J->flags & JIT_F_ON) ? 1 : 0);
-  flagbits_to_strings(L, J->flags, JIT_F_CPU_FIRST, JIT_F_CPUSTRING);
-  flagbits_to_strings(L, J->flags, JIT_F_OPT_FIRST, JIT_F_OPTSTRING);
+  flagbits_to_strings(L, J->flags, JIT_F_CPU, JIT_F_CPUSTRING);
+  flagbits_to_strings(L, J->flags, JIT_F_OPT, JIT_F_OPTSTRING);
   return (int)(L->top - L->base);
 #else
   setboolV(L->top++, 0);
@@ -467,7 +467,7 @@ static int jitopt_flag(jit_State *J, const char *str)
     str += str[2] == '-' ? 3 : 2;
     set = 0;
   }
-  for (opt = JIT_F_OPT_FIRST; ; opt <<= 1) {
+  for (opt = JIT_F_OPT; ; opt <<= 1) {
     size_t len = *(const uint8_t *)lst;
     if (len == 0)
       break;
@@ -636,59 +636,41 @@ JIT_PARAMDEF(JIT_PARAMINIT)
 #undef JIT_PARAMINIT
   0
 };
-#endif
 
 #if LJ_TARGET_ARM && LJ_TARGET_LINUX
 #include <sys/utsname.h>
 #endif
 
-/* Arch-dependent CPU detection. */
-static uint32_t jit_cpudetect(lua_State *L)
+/* Arch-dependent CPU feature detection. */
+static uint32_t jit_cpudetect(void)
 {
   uint32_t flags = 0;
 #if LJ_TARGET_X86ORX64
+
   uint32_t vendor[4];
   uint32_t features[4];
   if (lj_vm_cpuid(0, vendor) && lj_vm_cpuid(1, features)) {
-#if !LJ_HASJIT
-#define JIT_F_SSE2	2
-#endif
-    flags |= ((features[3] >> 26)&1) * JIT_F_SSE2;
-#if LJ_HASJIT
     flags |= ((features[2] >> 0)&1) * JIT_F_SSE3;
     flags |= ((features[2] >> 19)&1) * JIT_F_SSE4_1;
-    if (vendor[2] == 0x6c65746e) {  /* Intel. */
-      if ((features[0] & 0x0fff0ff0) == 0x000106c0)  /* Atom. */
-	flags |= JIT_F_LEA_AGU;
-    } else if (vendor[2] == 0x444d4163) {  /* AMD. */
-      uint32_t fam = (features[0] & 0x0ff00f00);
-      if (fam >= 0x00000f00)  /* K8, K10. */
-	flags |= JIT_F_PREFER_IMUL;
-    }
     if (vendor[0] >= 7) {
       uint32_t xfeatures[4];
       lj_vm_cpuid(7, xfeatures);
       flags |= ((xfeatures[1] >> 8)&1) * JIT_F_BMI2;
     }
-#endif
   }
-  /* Check for required instruction set support on x86 (unnecessary on x64). */
-#if LJ_TARGET_X86
-  if (!(flags & JIT_F_SSE2))
-    luaL_error(L, "CPU with SSE2 required");
-#endif
+  /* Don't bother checking for SSE2 -- the VM will crash before getting here. */
+
 #elif LJ_TARGET_ARM
-#if LJ_HASJIT
+
   int ver = LJ_ARCH_VERSION;  /* Compile-time ARM CPU detection. */
 #if LJ_TARGET_LINUX
   if (ver < 70) {  /* Runtime ARM CPU detection. */
     struct utsname ut;
     uname(&ut);
     if (strncmp(ut.machine, "armv", 4) == 0) {
-      if (ut.machine[4] >= '7')
-	ver = 70;
-      else if (ut.machine[4] == '6')
-	ver = 60;
+      if (ut.machine[4] >= '8') ver = 80;
+      else if (ut.machine[4] == '7') ver = 70;
+      else if (ut.machine[4] == '6') ver = 60;
     }
   }
 #endif
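
The runtime ARM detection in the hunk above can be sketched in isolation as follows. The helper name `demo_arm_version` and the `fallback` parameter are hypothetical; the real code writes into the local `ver` and reads `ut.machine` from `uname()`. The sketch takes the machine string as an argument so that it can be exercised on any host.

```c
#include <string.h>

/* Map a uname() machine string such as "armv7l" to a version number
** in the style of LJ_ARCH_VERSION. Returns fallback when the string
** does not identify ARMv6 or newer. */
static int demo_arm_version(const char *machine, int fallback)
{
  if (strncmp(machine, "armv", 4) == 0) {
    if (machine[4] >= '8') return 80;  /* New in this patch. */
    if (machine[4] == '7') return 70;
    if (machine[4] == '6') return 60;
  }
  return fallback;
}
```

Before the patch the first test was `>= '7'`, so a machine string starting with "armv8" was reported as version 70; with the three-way split it now yields 80.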
@@ -696,20 +678,22 @@ static uint32_t jit_cpudetect(lua_State *L)
 	   ver >= 61 ? JIT_F_ARMV6T2_ :
 	   ver >= 60 ? JIT_F_ARMV6_ : 0;
   flags |= LJ_ARCH_HASFPU == 0 ? 0 : ver >= 70 ? JIT_F_VFPV3 : JIT_F_VFPV2;
-#endif
+
 #elif LJ_TARGET_ARM64
+
   /* No optional CPU features to detect (for now). */
+
 #elif LJ_TARGET_PPC
-#if LJ_HASJIT
+
 #if LJ_ARCH_SQRT
   flags |= JIT_F_SQRT;
 #endif
 #if LJ_ARCH_ROUND
   flags |= JIT_F_ROUND;
 #endif
-#endif
+
 #elif LJ_TARGET_MIPS
-#if LJ_HASJIT
+
   /* Compile-time MIPS CPU detection. */
 #if LJ_ARCH_VERSION >= 20
   flags |= JIT_F_MIPSXXR2;
@@ -727,31 +711,28 @@ static uint32_t jit_cpudetect(lua_State *L)
     if (x) flags |= JIT_F_MIPSXXR2;  /* Either 0x80000000 (R2) or 0 (R1). */
   }
 #endif
-#endif
+
 #else
 #error "Missing CPU detection for this architecture"
 #endif
-  UNUSED(L);
   return flags;
 }
 
 /* Initialize JIT compiler. */
 static void jit_init(lua_State *L)
 {
-  uint32_t flags = jit_cpudetect(L);
-#if LJ_HASJIT
   jit_State *J = L2J(L);
-  J->flags = flags | JIT_F_ON | JIT_F_OPT_DEFAULT;
+  J->flags = jit_cpudetect() | JIT_F_ON | JIT_F_OPT_DEFAULT;
   memcpy(J->param, jit_param_default, sizeof(J->param));
   lj_dispatch_update(G(L));
-#else
-  UNUSED(flags);
-#endif
 }
+#endif
 
 LUALIB_API int luaopen_jit(lua_State *L)
 {
+#if LJ_HASJIT
   jit_init(L);
+#endif
   lua_pushliteral(L, LJ_OS_NAME);
   lua_pushliteral(L, LJ_ARCH_NAME);
   lua_pushinteger(L, LUAJIT_VERSION_NUM);
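
A note on the x86 branch of `jit_cpudetect()` in the lib_jit.c diff above: each feature bit is extracted with a shift and a mask, then multiplied by the flag constant, so the flag is set without a branch. The sketch below reproduces that pattern on plain arrays instead of calling `lj_vm_cpuid()`. The `DEMO_*` and `demo_*` names are hypothetical, and the register order (EAX, EBX, ECX, EDX) is my reading of how the patch indexes the arrays.

```c
#include <stdint.h>

#define DEMO_F_CPU     0x00000010u
#define DEMO_F_SSE3    (DEMO_F_CPU << 0)
#define DEMO_F_SSE4_1  (DEMO_F_CPU << 1)
#define DEMO_F_BMI2    (DEMO_F_CPU << 2)

/* Register order within each leaf: [0]=EAX, [1]=EBX, [2]=ECX, [3]=EDX. */
static uint32_t demo_x86_flags(const uint32_t leaf0[4],
                               const uint32_t leaf1[4],
                               const uint32_t leaf7[4])
{
  uint32_t flags = 0;
  flags |= ((leaf1[2] >> 0) & 1) * DEMO_F_SSE3;     /* CPUID.1:ECX bit 0. */
  flags |= ((leaf1[2] >> 19) & 1) * DEMO_F_SSE4_1;  /* CPUID.1:ECX bit 19. */
  if (leaf0[0] >= 7)  /* EAX of leaf 0 holds the highest basic leaf. */
    flags |= ((leaf7[1] >> 8) & 1) * DEMO_F_BMI2;   /* CPUID.7:EBX bit 8. */
  return flags;
}
```

With the vendor-specific Atom and AMD checks gone, the result depends only on feature bits, not on the vendor string or the family/model fields.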
diff --git a/src/lj_arch.h b/src/lj_arch.h
index 3bdbe84e..e853c4a4 100644
--- a/src/lj_arch.h
+++ b/src/lj_arch.h
@@ -209,13 +209,13 @@
 #define LJ_TARGET_UNIFYROT	2	/* Want only IR_BROR. */
 #define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL
 
-#if __ARM_ARCH____ARM_ARCH_8__ || __ARM_ARCH_8A__
+#if __ARM_ARCH == 8 || __ARM_ARCH_8__ || __ARM_ARCH_8A__
 #define LJ_ARCH_VERSION		80
-#elif __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
+#elif __ARM_ARCH == 7 || __ARM_ARCH_7__ || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH_7S__ || __ARM_ARCH_7VE__
 #define LJ_ARCH_VERSION		70
 #elif __ARM_ARCH_6T2__
 #define LJ_ARCH_VERSION		61
-#elif __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
+#elif __ARM_ARCH == 6 || __ARM_ARCH_6__ || __ARM_ARCH_6J__ || __ARM_ARCH_6K__ || __ARM_ARCH_6Z__ || __ARM_ARCH_6ZK__
 #define LJ_ARCH_VERSION		60
 #else
 #define LJ_ARCH_VERSION		50
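
The lj_arch.h hunk above relies on the rule that identifiers which are not defined as macros evaluate to 0 inside `#if`, so a check like `__ARM_ARCH == 8` is harmless on compilers that do not define `__ARM_ARCH`. A shortened sketch of the same chain, with a hypothetical `DEMO_ARCH_VERSION` macro and fewer legacy macro names than the real header:

```c
/* Undefined macros count as 0 in #if, so every test below is well-formed
** even when the compiler defines none of these names. */
#if __ARM_ARCH == 8 || __ARM_ARCH_8__ || __ARM_ARCH_8A__
#define DEMO_ARCH_VERSION 80
#elif __ARM_ARCH == 7 || __ARM_ARCH_7__ || __ARM_ARCH_7A__
#define DEMO_ARCH_VERSION 70
#elif __ARM_ARCH_6T2__
#define DEMO_ARCH_VERSION 61  /* Tested before the generic v6 check. */
#elif __ARM_ARCH == 6 || __ARM_ARCH_6__
#define DEMO_ARCH_VERSION 60
#else
#define DEMO_ARCH_VERSION 50
#endif

/* Expose the compile-time result to ordinary code. */
static int demo_arch_version(void)
{
  return DEMO_ARCH_VERSION;
}
```

The order of the branches matters: the ARMv6T2 test comes before `__ARM_ARCH == 6`, so a compiler that defines both still ends up with 61. On a non-ARM host the sketch falls through to 50; in the real header this block is only reached for ARM targets.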
diff --git a/src/lj_asm_x86.h b/src/lj_asm_x86.h
index 86ce3937..5819fa7a 100644
--- a/src/lj_asm_x86.h
+++ b/src/lj_asm_x86.h
@@ -1222,13 +1222,8 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
     emit_rmro(as, XO_MOV, dest|REX_GC64, tab, offsetof(GCtab, node));
   } else {
     emit_rmro(as, XO_ARITH(XOg_ADD), dest|REX_GC64, tab, offsetof(GCtab,node));
-    if ((as->flags & JIT_F_PREFER_IMUL)) {
-      emit_i8(as, sizeof(Node));
-      emit_rr(as, XO_IMULi8, dest, dest);
-    } else {
-      emit_shifti(as, XOg_SHL, dest, 3);
-      emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
-    }
+    emit_shifti(as, XOg_SHL, dest, 3);
+    emit_rmrxo(as, XO_LEA, dest, dest, dest, XM_SCALE2, 0);
     if (isk) {
       emit_gri(as, XG_ARITHi(XOg_AND), dest, (int32_t)khash);
       emit_rmro(as, XO_MOV, dest, tab, offsetof(GCtab, hmask));
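
On the `asm_href` hunk above: the removed IMUL branch multiplied by `sizeof(Node)`, while the surviving pair scales by 3 (LEA with `dest+dest*2`) and by 8 (SHL by 3). The two paths agree only if `sizeof(Node)` is 24, which is my inference from the pair and not something the patch states. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Scale a hash index the way the LEA/SHL pair does. */
static uint32_t demo_scale_lea_shl(uint32_t idx)
{
  uint32_t x = idx + idx * 2;  /* lea dest, [dest+dest*2]  -> idx * 3 */
  x <<= 3;                     /* shl dest, 3              -> idx * 24 */
  return x;
}

/* Scale the same index with a single multiply, as the removed branch did
** (24 is the assumed node size). */
static uint32_t demo_scale_imul(uint32_t idx)
{
  return idx * 24u;
}
```

Both functions compute the same value modulo 2^32 for every input, so dropping `JIT_F_PREFER_IMUL` changes only which instructions are emitted, not the address that is computed.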
@@ -1287,7 +1282,7 @@ static void asm_hrefk(ASMState *as, IRIns *ir)
   lj_assertA(ofs % sizeof(Node) == 0, "unaligned HREFK slot");
   if (ra_hasreg(dest)) {
     if (ofs != 0) {
-      if (dest == node && !(as->flags & JIT_F_LEA_AGU))
+      if (dest == node)
 	emit_gri(as, XG_ARITHi(XOg_ADD), dest|REX_GC64, ofs);
       else
 	emit_rmro(as, XO_LEA, dest|REX_GC64, node, ofs);
@@ -2181,8 +2176,7 @@ static void asm_add(ASMState *as, IRIns *ir)
 {
   if (irt_isnum(ir->t))
     asm_fparith(as, ir, XO_ADDSD);
-  else if ((as->flags & JIT_F_LEA_AGU) || as->flagmcp == as->mcp ||
-	   irt_is64(ir->t) || !asm_lea(as, ir))
+  else if (as->flagmcp == as->mcp || irt_is64(ir->t) || !asm_lea(as, ir))
     asm_intarith(as, ir, XOg_ADD);
 }
 
@@ -2887,7 +2881,7 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
   MCode *target, *q;
   int32_t spadj = as->T->spadjust;
   if (spadj == 0) {
-    p -= ((as->flags & JIT_F_LEA_AGU) ? 7 : 6) + (LJ_64 ? 1 : 0);
+    p -= LJ_64 ? 7 : 6;
   } else {
     MCode *p1;
     /* Patch stack adjustment. */
@@ -2899,20 +2893,11 @@ static void asm_tail_fixup(ASMState *as, TraceNo lnk)
       p1 = p-9;
       *(int32_t *)p1 = spadj;
     }
-    if ((as->flags & JIT_F_LEA_AGU)) {
-#if LJ_64
-      p1[-4] = 0x48;
-#endif
-      p1[-3] = (MCode)XI_LEA;
-      p1[-2] = MODRM(checki8(spadj) ? XM_OFS8 : XM_OFS32, RID_ESP, RID_ESP);
-      p1[-1] = MODRM(XM_SCALE1, RID_ESP, RID_ESP);
-    } else {
 #if LJ_64
-      p1[-3] = 0x48;
+    p1[-3] = 0x48;
 #endif
-      p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
-      p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
-    }
+    p1[-2] = (MCode)(checki8(spadj) ? XI_ARITHi8 : XI_ARITHi);
+    p1[-1] = MODRM(XM_REG, XOg_ADD, RID_ESP);
   }
   /* Patch exit branch. */
   target = lnk ? traceref(as->J, lnk)->mcode : (MCode *)lj_vm_exit_interp;
@@ -2943,7 +2928,7 @@ static void asm_tail_prep(ASMState *as)
     as->invmcp = as->mcp = p;
   } else {
     /* Leave room for ESP adjustment: add esp, imm or lea esp, [esp+imm] */
-    as->mcp = p - (((as->flags & JIT_F_LEA_AGU) ? 7 : 6)  + (LJ_64 ? 1 : 0));
+    as->mcp = p - (LJ_64 ? 7 : 6);
     as->invmcp = NULL;
   }
 }
diff --git a/src/lj_dispatch.c b/src/lj_dispatch.c
index ddee68de..a44a5adf 100644
--- a/src/lj_dispatch.c
+++ b/src/lj_dispatch.c
@@ -258,15 +258,8 @@ int luaJIT_setmode(lua_State *L, int idx, int mode)
     } else {
       if (!(mode & LUAJIT_MODE_ON))
 	G2J(g)->flags &= ~(uint32_t)JIT_F_ON;
-#if LJ_TARGET_X86ORX64
-      else if ((G2J(g)->flags & JIT_F_SSE2))
-	G2J(g)->flags |= (uint32_t)JIT_F_ON;
-      else
-	return 0;  /* Don't turn on JIT compiler without SSE2 support. */
-#else
       else
 	G2J(g)->flags |= (uint32_t)JIT_F_ON;
-#endif
       lj_dispatch_update(g);
     }
     break;
diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
index f4990151..85978027 100644
--- a/src/lj_emit_x86.h
+++ b/src/lj_emit_x86.h
@@ -561,10 +561,7 @@ static void emit_storeofs(ASMState *as, IRIns *ir, Reg r, Reg base, int32_t ofs)
 static void emit_addptr(ASMState *as, Reg r, int32_t ofs)
 {
   if (ofs) {
-    if ((as->flags & JIT_F_LEA_AGU))
-      emit_rmro(as, XO_LEA, r|REX_GC64, r, ofs);
-    else
-      emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
+    emit_gri(as, XG_ARITHi(XOg_ADD), r|REX_GC64, ofs);
   }
 }
 
diff --git a/src/lj_errmsg.h b/src/lj_errmsg.h
index 77a08cb0..19c41f0b 100644
--- a/src/lj_errmsg.h
+++ b/src/lj_errmsg.h
@@ -101,11 +101,7 @@ ERRDEF(STRGSRV,	"invalid replacement value (a %s)")
 ERRDEF(BADMODN,	"name conflict for module " LUA_QS)
 #if LJ_HASJIT
 ERRDEF(JITPROT,	"runtime code generation failed, restricted kernel?")
-#if LJ_TARGET_X86ORX64
-ERRDEF(NOJIT,	"JIT compiler disabled, CPU does not support SSE2")
-#else
 ERRDEF(NOJIT,	"JIT compiler disabled")
-#endif
 #elif defined(LJ_ARCH_NOJIT)
 ERRDEF(NOJIT,	"no JIT compiler for this architecture (yet)")
 #else
diff --git a/src/lj_jit.h b/src/lj_jit.h
index 361570a0..47df85c6 100644
--- a/src/lj_jit.h
+++ b/src/lj_jit.h
@@ -9,47 +9,49 @@
 #include "lj_obj.h"
 #include "lj_ir.h"
 
-/* JIT engine flags. */
+/* -- JIT engine flags ---------------------------------------------------- */
+
+/* General JIT engine flags. 4 bits. */
 #define JIT_F_ON		0x00000001
 
-/* CPU-specific JIT engine flags. */
+/* CPU-specific JIT engine flags. 12 bits. Flags and strings must match. */
+#define JIT_F_CPU		0x00000010
+
 #if LJ_TARGET_X86ORX64
-#define JIT_F_SSE2		0x00000010
-#define JIT_F_SSE3		0x00000020
-#define JIT_F_SSE4_1		0x00000040
-#define JIT_F_PREFER_IMUL	0x00000080
-#define JIT_F_LEA_AGU		0x00000100
-#define JIT_F_BMI2		0x00000200
-
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_SSE2
-#define JIT_F_CPUSTRING		"\4SSE2\4SSE3\6SSE4.1\3AMD\4ATOM\4BMI2"
+
+#define JIT_F_SSE3		(JIT_F_CPU << 0)
+#define JIT_F_SSE4_1		(JIT_F_CPU << 1)
+#define JIT_F_BMI2		(JIT_F_CPU << 2)
+
+
+#define JIT_F_CPUSTRING		"\4SSE3\6SSE4.1\4BMI2"
+
 #elif LJ_TARGET_ARM
-#define JIT_F_ARMV6_		0x00000010
-#define JIT_F_ARMV6T2_		0x00000020
-#define JIT_F_ARMV7		0x00000040
-#define JIT_F_VFPV2		0x00000080
-#define JIT_F_VFPV3		0x00000100
-
-#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7)
-#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7)
+
+#define JIT_F_ARMV6_		(JIT_F_CPU << 0)
+#define JIT_F_ARMV6T2_		(JIT_F_CPU << 1)
+#define JIT_F_ARMV7		(JIT_F_CPU << 2)
+#define JIT_F_ARMV8		(JIT_F_CPU << 3)
+#define JIT_F_VFPV2		(JIT_F_CPU << 4)
+#define JIT_F_VFPV3		(JIT_F_CPU << 5)
+
+#define JIT_F_ARMV6		(JIT_F_ARMV6_|JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
+#define JIT_F_ARMV6T2		(JIT_F_ARMV6T2_|JIT_F_ARMV7|JIT_F_ARMV8)
 #define JIT_F_VFP		(JIT_F_VFPV2|JIT_F_VFPV3)
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_ARMV6_
-#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5VFPv2\5VFPv3"
+#define JIT_F_CPUSTRING		"\5ARMv6\7ARMv6T2\5ARMv7\5ARMv8\5VFPv2\5VFPv3"
+
 #elif LJ_TARGET_PPC
-#define JIT_F_SQRT		0x00000010
-#define JIT_F_ROUND		0x00000020
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_SQRT
+#define JIT_F_SQRT		(JIT_F_CPU << 0)
+#define JIT_F_ROUND		(JIT_F_CPU << 1)
+
 #define JIT_F_CPUSTRING		"\4SQRT\5ROUND"
+
 #elif LJ_TARGET_MIPS
-#define JIT_F_MIPSXXR2		0x00000010
 
-/* Names for the CPU-specific flags. Must match the order above. */
-#define JIT_F_CPU_FIRST		JIT_F_MIPSXXR2
+#define JIT_F_MIPSXXR2		(JIT_F_CPU << 0)
+
 #if LJ_TARGET_MIPS32
 #if LJ_TARGET_MIPSR6
 #define JIT_F_CPUSTRING		"\010MIPS32R6"
@@ -63,27 +65,29 @@
 #define JIT_F_CPUSTRING		"\010MIPS64R2"
 #endif
 #endif
+
 #else
-#define JIT_F_CPU_FIRST		0
+
 #define JIT_F_CPUSTRING		""
+
 #endif
 
-/* Optimization flags. */
+/* Optimization flags. 12 bits. */
+#define JIT_F_OPT		0x00010000
 #define JIT_F_OPT_MASK		0x0fff0000
 
-#define JIT_F_OPT_FOLD		0x00010000
-#define JIT_F_OPT_CSE		0x00020000
-#define JIT_F_OPT_DCE		0x00040000
-#define JIT_F_OPT_FWD		0x00080000
-#define JIT_F_OPT_DSE		0x00100000
-#define JIT_F_OPT_NARROW	0x00200000
-#define JIT_F_OPT_LOOP		0x00400000
-#define JIT_F_OPT_ABC		0x00800000
-#define JIT_F_OPT_SINK		0x01000000
-#define JIT_F_OPT_FUSE		0x02000000
+#define JIT_F_OPT_FOLD		(JIT_F_OPT << 0)
+#define JIT_F_OPT_CSE		(JIT_F_OPT << 1)
+#define JIT_F_OPT_DCE		(JIT_F_OPT << 2)
+#define JIT_F_OPT_FWD		(JIT_F_OPT << 3)
+#define JIT_F_OPT_DSE		(JIT_F_OPT << 4)
+#define JIT_F_OPT_NARROW	(JIT_F_OPT << 5)
+#define JIT_F_OPT_LOOP		(JIT_F_OPT << 6)
+#define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
+#define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
+#define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
 
 /* Optimizations names for -O. Must match the order above. */
-#define JIT_F_OPT_FIRST		JIT_F_OPT_FOLD
 #define JIT_F_OPTSTRING	\
   "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
 
@@ -95,6 +99,8 @@
   JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
 #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
 
+/* -- JIT engine parameters ----------------------------------------------- */
+
 #if LJ_TARGET_WINDOWS || LJ_64
 /* See: http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx */
 #define JIT_P_sizemcode_DEFAULT		64
@@ -137,6 +143,8 @@ JIT_PARAMDEF(JIT_PARAMENUM)
 #define JIT_PARAMSTR(len, name, value)	#len #name
 #define JIT_P_STRING	JIT_PARAMDEF(JIT_PARAMSTR)
 
+/* -- JIT engine data structures ------------------------------------------ */
+
 /* Trace compiler state. */
 typedef enum {
   LJ_TRACE_IDLE,	/* Trace compiler idle. */
diff --git a/src/ljamalg.c b/src/ljamalg.c
index 0ffc7e81..63b4ec87 100644
--- a/src/ljamalg.c
+++ b/src/ljamalg.c
@@ -3,16 +3,6 @@
 ** Copyright (C) 2005-2017 Mike Pall. See Copyright Notice in luajit.h
 */
 
-/*
-+--------------------------------------------------------------------------+
-| WARNING: Compiling the amalgamation needs a lot of virtual memory        |
-| (around 300 MB with GCC 4.x)! If you don't have enough physical memory   |
-| your machine will start swapping to disk and the compile will not finish |
-| within a reasonable amount of time.                                      |
-| So either compile on a bigger machine or use the non-amalgamated build.  |
-+--------------------------------------------------------------------------+
-*/
-
 #define ljamalg_c
 #define LUA_CORE
 
--------------jn3J346b9oXAv0fHQm1gHvo5--