From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <75dbc9ca-5332-42b4-93b8-471d27370feb@tarantool.org>
Date: Tue, 14 Jan 2025 15:45:18 +0300
From: Sergey Bronnikov via Tarantool-patches
Reply-To: Sergey Bronnikov
To: Sergey Kaplun
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH luajit 2/2] Disable FMA by default. Use -Ofma or jit.opt.start("+fma") to enable.
References: <4cdba52a1ba1a1f2a8ccb4624f00fe156c3088c6.1736779534.git.skaplun@tarantool.org>
In-Reply-To: <4cdba52a1ba1a1f2a8ccb4624f00fe156c3088c6.1736779534.git.skaplun@tarantool.org>
List-Id: Tarantool development patches
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="------------3nlrLCZWueX7UfFqFeBQv455"

This is a multi-part message in MIME format.
--------------3nlrLCZWueX7UfFqFeBQv455
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

Hi, Sergey!

Thanks for the patch!


On 14.01.2025 14:06, Sergey Kaplun wrote:
From: Mike Pall <mike>

See the discussion in the corresponding ticket for the rationale.

(cherry picked from commit de2e1ca9d3d87e74c0c20c1e4ad3c32b31a5875b)

For the modulo operation, the arm64 VM uses `fmsub` [1] instruction,
which is the fused multiply-add (FMA [2]) operation (more precisely,
multiply-sub). Hence, it may produce different results compared to the
unfused one. This patch fixes the behaviour by using the unfused
instructions by default. However, the new JIT optimization flag (fma) is
introduced to make it possible to take advantage of the FMA
optimizations.

Sergey Kaplun:
* added the description and the test for the problem

[1]: https://developer.arm.com/documentation/dui0801/g/A64-Floating-point-Instructions/FMSUB
[2]: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation

Part of tarantool/tarantool#10709
---

I intentionally avoid mentioning the ticket in the commit message to
avoid excess mentioning in the LuaJIT issue tracker. You can see the
LuaJIT/LuaJIT#918 link in the cover letter.

 doc/running.html                              |  8 +++++
 src/lj_asm_arm.h                              |  6 +++-
 src/lj_asm_arm64.h                            |  3 +-
 src/lj_asm_ppc.h                              |  3 +-
 src/lj_jit.h                                  |  4 ++-
 src/lj_vmmath.c                               | 13 ++++++-
 src/vm_arm64.dasc                             |  4 ++-
 ...lj-918-fma-numerical-accuracy-jit.test.lua | 36 +++++++++++++++++++
 .../lj-918-fma-numerical-accuracy.test.lua    | 31 ++++++++++++++++
 .../lj-918-fma-optimization.test.lua          | 25 +++++++++++++
 10 files changed, 127 insertions(+), 6 deletions(-)
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
 create mode 100644 test/tarantool-tests/lj-918-fma-optimization.test.lua

diff --git a/doc/running.html b/doc/running.html
index 7868efab..1cf41f1b 100644
--- a/doc/running.html
+++ b/doc/running.html
@@ -226,6 +226,12 @@ mix the three forms, but note that setting an optimization level
 overrides all earlier flags.
 </p>
 <p>
+Note that <tt>-Ofma</tt> is not enabled by default at any level,
+because it affects floating-point result accuracy. Only enable this,
+if you fully understand the trade-offs of FMA for performance (higher),
+determinism (lower) and numerical accuracy (higher).
+</p>
+<p>
 Here are the available flags and at what optimization levels they
 are enabled:
 </p>
@@ -257,6 +263,8 @@ are enabled:
 <td class="flag_name">sink</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Allocation/Store Sinking</td></tr>
 <tr class="even">
 <td class="flag_name">fuse</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&bull;</td><td class="flag_desc">Fusion of operands into instructions</td></tr>
+<tr class="odd">
+<td class="flag_name">fma </td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_level">&nbsp;</td><td class="flag_desc">Fused multiply-add</td></tr>
 </table>
 <p>
 Here are the parameters and their default settings:
diff --git a/src/lj_asm_arm.h b/src/lj_asm_arm.h
index 5a0f925f..041cd794 100644
--- a/src/lj_asm_arm.h
+++ b/src/lj_asm_arm.h
@@ -310,7 +310,11 @@ static void asm_fusexref(ASMState *as, ARMIns ai, Reg rd, IRRef ref,
 }
 
 #if !LJ_SOFTFP
-/* Fuse to multiply-add/sub instruction. */
+/*
+** Fuse to multiply-add/sub instruction.
+** VMLA rounds twice (UMA, not FMA) -- no need to check for JIT_F_OPT_FMA.
+** VFMA needs VFPv4, which is uncommon on the remaining ARM32 targets.
+*/
 static int asm_fusemadd(ASMState *as, IRIns *ir, ARMIns ai, ARMIns air)
 {
   IRRef lref = ir->op1, rref = ir->op2;
diff --git a/src/lj_asm_arm64.h b/src/lj_asm_arm64.h
index 88b47ceb..554bb60a 100644
--- a/src/lj_asm_arm64.h
+++ b/src/lj_asm_arm64.h
@@ -334,7 +334,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, A64Ins ai, A64Ins air)
 {
   IRRef lref = ir->op1, rref = ir->op2;
   IRIns *irm;
-  if (lref != rref &&
+  if ((as->flags & JIT_F_OPT_FMA) &&
+      lref != rref &&
       ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
        ra_noreg(irm->r)) ||
        (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
diff --git a/src/lj_asm_ppc.h b/src/lj_asm_ppc.h
index 7bba71b3..52db2926 100644
--- a/src/lj_asm_ppc.h
+++ b/src/lj_asm_ppc.h
@@ -232,7 +232,8 @@ static int asm_fusemadd(ASMState *as, IRIns *ir, PPCIns pi, PPCIns pir)
 {
   IRRef lref = ir->op1, rref = ir->op2;
   IRIns *irm;
-  if (lref != rref &&
+  if ((as->flags & JIT_F_OPT_FMA) &&
+      lref != rref &&
       ((mayfuse(as, lref) && (irm = IR(lref), irm->o == IR_MUL) &&
 	ra_noreg(irm->r)) ||
        (mayfuse(as, rref) && (irm = IR(rref), irm->o == IR_MUL) &&
diff --git a/src/lj_jit.h b/src/lj_jit.h
index 47df85c6..73c355b9 100644
--- a/src/lj_jit.h
+++ b/src/lj_jit.h
@@ -86,10 +86,11 @@
 #define JIT_F_OPT_ABC		(JIT_F_OPT << 7)
 #define JIT_F_OPT_SINK		(JIT_F_OPT << 8)
 #define JIT_F_OPT_FUSE		(JIT_F_OPT << 9)
+#define JIT_F_OPT_FMA		(JIT_F_OPT << 10)
 
 /* Optimizations names for -O. Must match the order above. */
 #define JIT_F_OPTSTRING	\
-  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse"
+  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse\3fma"
 
 /* Optimization levels set a fixed combination of flags. */
 #define JIT_F_OPT_0	0
@@ -98,6 +99,7 @@
 #define JIT_F_OPT_3	(JIT_F_OPT_2|\
   JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
 #define JIT_F_OPT_DEFAULT	JIT_F_OPT_3
+/* Note: FMA is not set by default. */
 
 /* -- JIT engine parameters ----------------------------------------------- */
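
The option-name table changed in the hunk above is a sequence of length-prefixed names whose order has to match the order of the `JIT_F_OPT_*` bits. The following C sketch illustrates that layout only. `opt_index` and `opt_index_check` are hypothetical helpers written for this note and are not the parser LuaJIT itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy of the patched JIT_F_OPTSTRING: each name is prefixed by its length. */
#define OPTSTRING \
  "\4fold\3cse\3dce\3fwd\3dse\6narrow\4loop\3abc\4sink\4fuse\3fma"

/* Hypothetical helper: return the position of `name` in the table, or -1. */
int opt_index(const char *optstring, const char *name)
{
  size_t nlen = strlen(name);
  const char *p = optstring;
  int idx = 0;
  while (*p) {
    size_t len = (size_t)(unsigned char)*p++;  /* Length prefix. */
    if (len == nlen && memcmp(p, name, len) == 0)
      return idx;
    p += len;  /* Skip to the next length prefix. */
    idx++;
  }
  return -1;
}

/* The position of "fma" matches the shift in JIT_F_OPT_FMA (JIT_F_OPT << 10). */
void opt_index_check(void)
{
  assert(opt_index(OPTSTRING, "fma") == 10);
}
```

Since `fma` is appended at the end of the table, the positions of the existing flags stay unchanged.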
 
diff --git a/src/lj_vmmath.c b/src/lj_vmmath.c
index faebe719..29b72e0c 100644
--- a/src/lj_vmmath.c
+++ b/src/lj_vmmath.c
@@ -36,6 +36,17 @@ LJ_FUNCA double lj_wrap_fmod(double x, double y) { return fmod(x, y); }
 
 /* -- Helper functions ---------------------------------------------------- */
 
+/* Required to prevent the C compiler from applying FMA optimizations.
+**
+** Yes, there's -ffp-contract and the FP_CONTRACT pragma ... in theory.
+** But the current state of C compilers is a mess in this regard.
+** Also, this function is not performance sensitive at all.
+*/
+LJ_NOINLINE static double lj_vm_floormul(double x, double y)
+{
+  return lj_vm_floor(x / y) * y;
+}
+
 double lj_vm_foldarith(double x, double y, int op)
 {
   switch (op) {
@@ -43,7 +54,7 @@ double lj_vm_foldarith(double x, double y, int op)
   case IR_SUB - IR_ADD: return x-y; break;
   case IR_MUL - IR_ADD: return x*y; break;
   case IR_DIV - IR_ADD: return x/y; break;
-  case IR_MOD - IR_ADD: return x-lj_vm_floor(x/y)*y; break;
+  case IR_MOD - IR_ADD: return x-lj_vm_floormul(x, y); break;
   case IR_POW - IR_ADD: return pow(x, y); break;
   case IR_NEG - IR_ADD: return -x; break;
   case IR_ABS - IR_ADD: return fabs(x); break;
diff --git a/src/vm_arm64.dasc b/src/vm_arm64.dasc
index 1cf1ea51..c5f0a7a7 100644
--- a/src/vm_arm64.dasc
+++ b/src/vm_arm64.dasc
@@ -2581,7 +2581,9 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
     |.macro ins_arithmod, res, reg1, reg2
     |  fdiv d2, reg1, reg2
     |  frintm d2, d2
-    |  fmsub res, d2, reg2, reg1
+    |  // Cannot use fmsub, because FMA is not enabled by default.
+    |  fmul d2, d2, reg2
+    |  fsub res, reg1, d2
     |.endmacro
     |
     |.macro ins_arithdn, intins, fpins
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
new file mode 100644
index 00000000..55ec7b98
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy-jit.test.lua
@@ -0,0 +1,36 @@
+local tap = require('tap')
+
+-- Test file to demonstrate consistent behaviour for JIT and the
+-- VM regarding FMA optimization (disabled by default).
+-- XXX: The VM behaviour is checked in the
+-- <lj-918-fma-numerical-accuracy.test.lua>.
+-- See also: https://github.com/LuaJIT/LuaJIT/issues/918.
+local test = tap.test('lj-918-fma-numerical-accuracy-jit'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(1)
+
+local _2pow52 = 2 ^ 52
+
+-- IEEE754 components to double:
+-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
+local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
+assert(a == 2197541395358679800)
+
+local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
+assert(b == -1005065126.3690554)
+

Please add a comment explaining why exactly these test cases are used.

If I understand correctly, the idea is to compute one positive and one negative number, right?

Why do you think two examples are enough to test that the behavior of the JIT and the VM is consistent?

Should we check more corner cases?

  • Standard/Normal arithmetic
  • Subnormal arithmetic
  • Infinite arithmetic
  • NaN arithmetic
  • Zero arithmetic
+local results = {}
+
+jit.opt.start('hotloop=1')
+for i = 1, 4 do
+  results[i] = a % b
+end
+
+-- XXX: The test doesn't fail before the commit. But it is
Please add the commit hash and its short description.
+-- required to be sure that there are no inconsistencies after the
+-- commit.
+test:samevalues(results, 'consistent behaviour between the JIT and the VM')
+
+test:done(true)
diff --git a/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
new file mode 100644
index 00000000..a3775d6d
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-numerical-accuracy.test.lua
@@ -0,0 +1,31 @@
+local tap = require('tap')
+
+-- Test file to demonstrate possible numerical inaccuracy if FMA
+-- optimization takes place.

I suppose we don't need to test FMA itself, but we should check that FMA is actually enabled when its option is enabled. Right?

If so, I would merge the tests lj-918-fma-numerical-accuracy.test.lua and lj-918-fma-optimization.test.lua.


+-- XXX: The JIT consistency is checked in the
+-- <lj-918-fma-numerical-accuracy-jit.test.lua>.
+-- See also: https://github.com/LuaJIT/LuaJIT/issues/918.
+local test = tap.test('lj-918-fma-numerical-accuracy')
+
+test:plan(2)
+
+local _2pow52 = 2 ^ 52
+
+-- IEEE754 components to double:
+-- sign * (2 ^ (exp - 1023)) * (mantissa / _2pow52 + normal).
+local a = 1 * (2 ^ (1083 - 1023)) * (4080546448249347 / _2pow52 + 1)
+assert(a == 2197541395358679800)
+
+local b = -1 * (2 ^ (1052 - 1023)) * (3927497732209973 / _2pow52 + 1)
+assert(b == -1005065126.3690554)
The same questions as above.
+
+-- These tests fail on ARM64 before the patch or with FMA
+-- optimization enabled.
+-- The first test may not fail if the compiler doesn't generate
+-- an ARM64 FMA operation in `lj_vm_foldarith()`.
+test:is(2197541395358679800 % -1005065126.3690554, -606337536,
+        'FMA in the lj_vm_foldarith() during parsing')
+
+test:is(a % b, -606337536, 'FMA in the VM')
+
+test:done(true)
diff --git a/test/tarantool-tests/lj-918-fma-optimization.test.lua b/test/tarantool-tests/lj-918-fma-optimization.test.lua
new file mode 100644
index 00000000..af749eb5
--- /dev/null
+++ b/test/tarantool-tests/lj-918-fma-optimization.test.lua
@@ -0,0 +1,25 @@
+local tap = require('tap')
+local test = tap.test('lj-918-fma-optimization'):skipcond({
+  ['Test requires JIT enabled'] = not jit.status(),
+})
+
+test:plan(3)
+
+local function jit_opt_is_on(needed)
Why `needed` and not something like `flag`?
+  for _, opt in ipairs({jit.status()}) do
+    if opt == needed then
+      return true
+    end
+  end
+  return false
+end
+
+test:ok(not jit_opt_is_on('fma'), 'FMA is disabled by default')
+
+local ok, _ = pcall(jit.opt.start, '+fma')
+
+test:ok(ok, 'fma flag is recognized')
+
+test:ok(jit_opt_is_on('fma'), 'FMA is enabled after jit.opt.start()')
+
+test:done(true)
--------------3nlrLCZWueX7UfFqFeBQv455--