Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
@ 2022-07-04  9:33 Sergey Kaplun via Tarantool-patches
  2022-07-05 15:10 ` Igor Munkin via Tarantool-patches
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2022-07-04  9:33 UTC (permalink / raw)
  To: Sergey Ostanevich, Igor Munkin; +Cc: tarantool-patches

From: Mike Pall <mike>

Thanks to Peter Cawley.

(cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)

To reproduce this issue, we need:
1) a register which contains the constant zero value
2) a floating point comparison operation
3) the comparison operation to perform a fused load, which in
   turn needs to allocate a register, and for there to be no
   free registers at that moment, and for the register chosen
   for sacrifice to be holding the constant zero.

This leads to assembly code like the following:
| ucomisd xmm7, [r14+0x18]
| xor r14d, r14d
| jnb 0x12a0e001c ->3

That xor is a big problem, as it modifies flags between the
ucomisd and the jnb, thereby causing the jnb to do the wrong
thing.

This patch forbids emitting xor in `emit_loadi()` for jcc operations.

Sergey Kaplun:
* added the description and the test for the problem

Part of tarantool/tarantool#7230
---

Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-416-xor-before-jcc-full-ci
Issues:
* https://github.com/LuaJIT/LuaJIT/issues/416
* https://github.com/tarantool/tarantool/issues/7230

Changelog entry (I suggest to update this entry with the each
corresponding bump):
===================================================================
## bugfix/luajit

Backported patches from vanilla LuaJIT trunk (gh-7230). In the scope of this
activity, the following issues have been resolved:
* Fixed `emit_loadi()` on x86/x64 emitting xor between condition check
  and jump instructions.
===================================================================

 src/lj_emit_x86.h                             |  6 +-
 test/tarantool-tests/CMakeLists.txt           |  1 +
 .../lj-416-xor-before-jcc.test.lua            | 70 +++++++++++++++++++
 .../lj-416-xor-before-jcc/CMakeLists.txt      |  1 +
 .../lj-416-xor-before-jcc/testxor.c           | 14 ++++
 5 files changed, 90 insertions(+), 2 deletions(-)
 create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc.test.lua
 create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
 create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/testxor.c

diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
index 5207f9da..6b58306b 100644
--- a/src/lj_emit_x86.h
+++ b/src/lj_emit_x86.h
@@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
 /* mov r, i / xor r, r */
 static void emit_loadi(ASMState *as, Reg r, int32_t i)
 {
-  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
+  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
   if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
 			    (as->curins+1 < as->T->nins &&
-			     IR(as->curins+1)->o == IR_HIOP)))) {
+			     IR(as->curins+1)->o == IR_HIOP))) &&
+		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
+		  (*as->mcp & 0xf0) == XI_JCCs)) {
     emit_rr(as, XO_ARITH(XOg_XOR), r, r);
   } else {
     MCode *p = as->mcp;
diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
index 5708822e..ad69e2cc 100644
--- a/test/tarantool-tests/CMakeLists.txt
+++ b/test/tarantool-tests/CMakeLists.txt
@@ -65,6 +65,7 @@ add_subdirectory(gh-5813-resolving-of-c-symbols/stripped)
 add_subdirectory(gh-6098-fix-side-exit-patching-on-arm64)
 add_subdirectory(gh-6189-cur_L)
 add_subdirectory(lj-49-bad-lightuserdata)
+add_subdirectory(lj-416-xor-before-jcc)
 add_subdirectory(lj-601-fix-gc-finderrfunc)
 add_subdirectory(lj-727-lightuserdata-itern)
 add_subdirectory(lj-flush-on-trace)
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
new file mode 100644
index 00000000..7c6ab2b9
--- /dev/null
+++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
@@ -0,0 +1,70 @@
+local ffi = require('ffi')
+local tap = require('tap')
+
+local test = tap.test('lj-416-xor-before-jcc')
+test:plan(1)
+
+-- To reproduce this issue, we need:
+-- 1) a register which contains the constant zero value
+-- 2) a floating point comparison operation
+-- 3) the comparison operation to perform a fused load, which in
+--    turn needs to allocate a register, and for there to be no
+--    free registers at that moment, and for the register chosen
+--    for sacrifice to be holding the constant zero.
+--
+-- This leads to assembly code like the following:
+--   ucomisd xmm7, [r14+0x18]
+--   xor r14d, r14d
+--   jnb 0x12a0e001c ->3
+--
+-- That xor is a big problem, as it modifies flags between the
+-- ucomisd and the jnb, thereby causing the jnb to do the wrong
+-- thing.
+
+ffi.cdef[[
+  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
+]]
+local testxor = ffi.load('libtestxor')
+
+local handler = setmetatable({}, {
+  __newindex = function ()
+    -- 0 and nil are suggested as differnt constant-zero values
+    -- for the call and occupied different registers.
+    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)
+  end
+})
+
+local mconf = {
+  { use = false, value = 100 },
+  { use = true,  value = 100 },
+}
+
+local function testf()
+  -- Generate register pressure.
+  local value = 50
+  for _, rule in ipairs(mconf) do
+    if rule.use then
+      value = rule.value
+      break
+    end
+  end
+
+  -- This branch shouldn't be taken.
+  if value <= 42 then
+    return true
+  end
+
+  -- Nothing to do, just call testxor with many arguments.
+  handler[4] = 4
+end
+
+-- We need to create long side trace to generate register
+-- pressure.
+jit.opt.start('hotloop=1', 'hotexit=1')
+for _ = 1, 3 do
+  -- Don't use any `test` functions here to freeze the trace.
+  assert (not testf())
+end
+test:ok(true, 'imposible branch is not taken')
+
+os.exit(test:check() and 0 or 1)
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
new file mode 100644
index 00000000..17aa9f9b
--- /dev/null
+++ b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
@@ -0,0 +1 @@
+BuildTestCLib(libtestxor testxor.c)
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
new file mode 100644
index 00000000..32436d42
--- /dev/null
+++ b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
@@ -0,0 +1,14 @@
+#define UNUSED(x) ((void)x)
+
+int test_xor_func(int a, int b, int c, int d, int e, int f, void *g, int h)
+{
+	UNUSED(a);
+	UNUSED(b);
+	UNUSED(c);
+	UNUSED(d);
+	UNUSED(e);
+	UNUSED(f);
+	UNUSED(g);
+	UNUSED(h);
+	return 0;
+}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-07-04  9:33 [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi() Sergey Kaplun via Tarantool-patches
@ 2022-07-05 15:10 ` Igor Munkin via Tarantool-patches
  2022-07-06 10:30 ` sergos via Tarantool-patches
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Igor Munkin via Tarantool-patches @ 2022-07-05 15:10 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

Sergey,

Thanks for the patch! Please consider my comments below.

On 04.07.22, Sergey Kaplun wrote:
> From: Mike Pall <mike>
> 
> Thanks to Peter Cawley.
> 
> (cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)
> 
> To reproduce this issue, we need:
> 1) a register which contains the constant zero value
> 2) a floating point comparison operation
> 3) the comparison operation to perform a fused load, which in
>    turn needs to allocate a register, and for there to be no
>    free registers at that moment, and for the register chosen
>    for sacrifice to be holding the constant zero.
> 
> This leads to assembly code like the following:
> | ucomisd xmm7, [r14+0x18]
> | xor r14d, r14d
> | jnb 0x12a0e001c ->3
> 
> That xor is a big problem, as it modifies flags between the
> ucomisd and the jnb, thereby causing the jnb to do the wrong
> thing.
> 
> This patch forbids emitting xor in `emit_loadi()` for jcc operations.
> 
> Sergey Kaplun:
> * added the description and the test for the problem
> 
> Part of tarantool/tarantool#7230
> ---
> 
> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-416-xor-before-jcc-full-ci
> Issues:
> * https://github.com/LuaJIT/LuaJIT/issues/416
> * https://github.com/tarantool/tarantool/issues/7230
> 
> Changelog entry (I suggest to update this entry with the each
> corresponding bump):
> ===================================================================
> ## bugfix/luajit
> 
> Backported patches from vanilla LuaJIT trunk (gh-7230). In the scope of this
> activity, the following issues have been resolved:
> * Fixed `emit_loadi()` on x86/x64 emitting xor between condition check
>   and jump instructions.
> ===================================================================
> 
>  src/lj_emit_x86.h                             |  6 +-
>  test/tarantool-tests/CMakeLists.txt           |  1 +
>  .../lj-416-xor-before-jcc.test.lua            | 70 +++++++++++++++++++
>  .../lj-416-xor-before-jcc/CMakeLists.txt      |  1 +
>  .../lj-416-xor-before-jcc/testxor.c           | 14 ++++
>  5 files changed, 90 insertions(+), 2 deletions(-)
>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
> 

<snipped>

> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> new file mode 100644
> index 00000000..7c6ab2b9
> --- /dev/null
> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> @@ -0,0 +1,70 @@
> +local ffi = require('ffi')
> +local tap = require('tap')
> +
> +local test = tap.test('lj-416-xor-before-jcc')

Should this test be run only on x86_64, considering its semantics?

> +test:plan(1)
> +
> +-- To reproduce this issue, we need:
> +-- 1) a register which contains the constant zero value
> +-- 2) a floating point comparison operation
> +-- 3) the comparison operation to perform a fused load, which in
> +--    turn needs to allocate a register, and for there to be no
> +--    free registers at that moment, and for the register chosen
> +--    for sacrifice to be holding the constant zero.
> +--
> +-- This leads to assembly code like the following:
> +--   ucomisd xmm7, [r14+0x18]
> +--   xor r14d, r14d
> +--   jnb 0x12a0e001c ->3
> +--
> +-- That xor is a big problem, as it modifies flags between the
> +-- ucomisd and the jnb, thereby causing the jnb to do the wrong
> +-- thing.
> +
> +ffi.cdef[[
> +  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
> +]]
> +local testxor = ffi.load('libtestxor')
> +
> +local handler = setmetatable({}, {
> +  __newindex = function ()
> +    -- 0 and nil are suggested as differnt constant-zero values
> +    -- for the call and occupied different registers.
> +    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)

Minor: Please, describe the purpose of this function.

> +  end
> +})
> +
> +local mconf = {
> +  { use = false, value = 100 },
> +  { use = true,  value = 100 },
> +}
> +
> +local function testf()
> +  -- Generate register pressure.
> +  local value = 50
> +  for _, rule in ipairs(mconf) do
> +    if rule.use then
> +      value = rule.value
> +      break
> +    end
> +  end

Minor: Could you please explain this block of code a bit?

> +
> +  -- This branch shouldn't be taken.

Minor: Why this branch should be taken without the patch?

> +  if value <= 42 then
> +    return true
> +  end
> +
> +  -- Nothing to do, just call testxor with many arguments.
> +  handler[4] = 4

Minor: What is 4 in both key and value senses?

> +end
> +
> +-- We need to create long side trace to generate register
> +-- pressure.

Minor: What makes the generated side trace long?

> +jit.opt.start('hotloop=1', 'hotexit=1')
> +for _ = 1, 3 do
> +  -- Don't use any `test` functions here to freeze the trace.
> +  assert (not testf())

Typo: s/assert (/assert(/.

> +end
> +test:ok(true, 'imposible branch is not taken')
> +
> +os.exit(test:check() and 0 or 1)

<snipped>

> -- 
> 2.34.1
> 

-- 
Best regards,
IM

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-07-04  9:33 [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi() Sergey Kaplun via Tarantool-patches
  2022-07-05 15:10 ` Igor Munkin via Tarantool-patches
@ 2022-07-06 10:30 ` sergos via Tarantool-patches
  2022-11-11 10:20   ` Igor Munkin via Tarantool-patches
  2022-08-31  8:48 ` Sergey Kaplun via Tarantool-patches
  2022-11-23  7:50 ` Igor Munkin via Tarantool-patches
  3 siblings, 1 reply; 10+ messages in thread
From: sergos via Tarantool-patches @ 2022-07-06 10:30 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

Hi!

Thanks for the patch!

I won’t veto on the patch, since it’s from upstream. Still I have got
a number of questions on it’s crutch nature. Perhaps we need to follow
up with some test cases to luajit.

Some description fixes are needed also.

Sergos


> On 4 Jul 2022, at 12:33, Sergey Kaplun <skaplun@tarantool.org> wrote:
> 
> From: Mike Pall <mike>
> 
> Thanks to Peter Cawley.
> 
> (cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)
> 
> To reproduce this issue, we need:
> 1) a register which contains the constant zero value

> 2) a floating point comparison operation

The integer one will set the same flags (CF for the case of JNB) and XOR will
clear it anyways. 

> 3) the comparison operation to perform a fused load, which in
>   turn needs to allocate a register, and for there to be no
>   free registers at that moment, and for the register chosen
>   for sacrifice to be holding the constant zero.
> 
Unfortunately, it’s not clear what this register (I suppose it’s r14d) is used for.
Is it an argument, needed after the fall-through in the same trace?
Why not it sank down below the branch? 
IIRC it is a dedicated register used for dispatch, so why is it used for sacrificing?

> This leads to assembly code like the following:
> | ucomisd xmm7, [r14+0x18]
> | xor r14d, r14d
> | jnb 0x12a0e001c ->3
> 
> That xor is a big problem, as it modifies flags between the
> ucomisd and the jnb, thereby causing the jnb to do the wrong
> thing.
> 

There’s no trace after the fix, so it’s hard to grasp the resolution used.
 
> This patch forbids emitting xor in `emit_loadi()` for jcc operations.
> 
> Sergey Kaplun:
> * added the description and the test for the problem
> 
> Part of tarantool/tarantool#7230
> ---
> 
> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-416-xor-before-jcc-full-ci
> Issues:
> * https://github.com/LuaJIT/LuaJIT/issues/416
> * https://github.com/tarantool/tarantool/issues/7230
> 
> Changelog entry (I suggest to update this entry with the each
> corresponding bump):
> ===================================================================
> ## bugfix/luajit
> 
> Backported patches from vanilla LuaJIT trunk (gh-7230). In the scope of this
> activity, the following issues have been resolved:
> * Fixed `emit_loadi()` on x86/x64 emitting xor between condition check
>  and jump instructions.
> ===================================================================
> 
> src/lj_emit_x86.h                             |  6 +-
> test/tarantool-tests/CMakeLists.txt           |  1 +
> .../lj-416-xor-before-jcc.test.lua            | 70 +++++++++++++++++++
> .../lj-416-xor-before-jcc/CMakeLists.txt      |  1 +
> .../lj-416-xor-before-jcc/testxor.c           | 14 ++++
> 5 files changed, 90 insertions(+), 2 deletions(-)
> create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
> create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
> 
> diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
> index 5207f9da..6b58306b 100644
> --- a/src/lj_emit_x86.h
> +++ b/src/lj_emit_x86.h
> @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
> /* mov r, i / xor r, r */
> static void emit_loadi(ASMState *as, Reg r, int32_t i)
> {
> -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
> +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
>   if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
> 			    (as->curins+1 < as->T->nins &&
> -			     IR(as->curins+1)->o == IR_HIOP)))) {
> +			     IR(as->curins+1)->o == IR_HIOP))) &&
> +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
> +		  (*as->mcp & 0xf0) == XI_JCCs)) {

I wonder if there can be a case with two instructions emitted between the comparison
and the jump. In such a case the xor still can sneak in? 
Another idea: there could be different instructions that relies on the flags, such as
x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?

>     emit_rr(as, XO_ARITH(XOg_XOR), r, r);
>   } else {
>     MCode *p = as->mcp;
> diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
> index 5708822e..ad69e2cc 100644
> --- a/test/tarantool-tests/CMakeLists.txt
> +++ b/test/tarantool-tests/CMakeLists.txt
> @@ -65,6 +65,7 @@ add_subdirectory(gh-5813-resolving-of-c-symbols/stripped)
> add_subdirectory(gh-6098-fix-side-exit-patching-on-arm64)
> add_subdirectory(gh-6189-cur_L)
> add_subdirectory(lj-49-bad-lightuserdata)
> +add_subdirectory(lj-416-xor-before-jcc)
> add_subdirectory(lj-601-fix-gc-finderrfunc)
> add_subdirectory(lj-727-lightuserdata-itern)
> add_subdirectory(lj-flush-on-trace)
> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> new file mode 100644
> index 00000000..7c6ab2b9
> --- /dev/null
> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> @@ -0,0 +1,70 @@
> +local ffi = require('ffi')
> +local tap = require('tap')
> +
> +local test = tap.test('lj-416-xor-before-jcc')
> +test:plan(1)
> +
> +-- To reproduce this issue, we need:
> +-- 1) a register which contains the constant zero value
> +-- 2) a floating point comparison operation
> +-- 3) the comparison operation to perform a fused load, which in
> +--    turn needs to allocate a register, and for there to be no
> +--    free registers at that moment, and for the register chosen
> +--    for sacrifice to be holding the constant zero.
> +--
> +-- This leads to assembly code like the following:
> +--   ucomisd xmm7, [r14+0x18]
> +--   xor r14d, r14d
> +--   jnb 0x12a0e001c ->3
> +--
> +-- That xor is a big problem, as it modifies flags between the
> +-- ucomisd and the jnb, thereby causing the jnb to do the wrong
> +-- thing.

Same problems with explanation, copy from <fixed> message above

> +
> +ffi.cdef[[
> +  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
> +]]
> +local testxor = ffi.load('libtestxor')

Should have a test for x86 platform, since changes are x86 only?

> +
> +local handler = setmetatable({}, {
> +  __newindex = function ()
> +    -- 0 and nil are suggested as differnt constant-zero values
> +    -- for the call and occupied different registers.
> +    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)
> +  end
> +})
> +
> +local mconf = {
> +  { use = false, value = 100 },
> +  { use = true,  value = 100 },
> +}
> +
> +local function testf()
> +  -- Generate register pressure.
> +  local value = 50
> +  for _, rule in ipairs(mconf) do
> +    if rule.use then
> +      value = rule.value
> +      break
> +    end
> +  end
> +
> +  -- This branch shouldn't be taken.
> +  if value <= 42 then
> +    return true
> +  end
> +
> +  -- Nothing to do, just call testxor with many arguments.
> +  handler[4] = 4

If it is needed to use the metatable here then why?

> +end
> +
> +-- We need to create long side trace to generate register
> +-- pressure.
> +jit.opt.start('hotloop=1', 'hotexit=1')
> +for _ = 1, 3 do
> +  -- Don't use any `test` functions here to freeze the trace.
> +  assert (not testf())
> +end
> +test:ok(true, 'imposible branch is not taken')
> +
> +os.exit(test:check() and 0 or 1)
> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
> new file mode 100644
> index 00000000..17aa9f9b
> --- /dev/null
> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
> @@ -0,0 +1 @@
> +BuildTestCLib(libtestxor testxor.c)
> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
> new file mode 100644
> index 00000000..32436d42
> --- /dev/null
> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
> @@ -0,0 +1,14 @@
> +#define UNUSED(x) ((void)x)
> +
> +int test_xor_func(int a, int b, int c, int d, int e, int f, void *g, int h)
> +{
> +	UNUSED(a);
> +	UNUSED(b);
> +	UNUSED(c);
> +	UNUSED(d);
> +	UNUSED(e);
> +	UNUSED(f);
> +	UNUSED(g);
> +	UNUSED(h);
> +	return 0;
> +}
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-07-04  9:33 [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi() Sergey Kaplun via Tarantool-patches
  2022-07-05 15:10 ` Igor Munkin via Tarantool-patches
  2022-07-06 10:30 ` sergos via Tarantool-patches
@ 2022-08-31  8:48 ` Sergey Kaplun via Tarantool-patches
  2022-08-31 16:04   ` sergos via Tarantool-patches
  2022-11-23  7:50 ` Igor Munkin via Tarantool-patches
  3 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2022-08-31  8:48 UTC (permalink / raw)
  To: Sergey Ostanevich, Igor Munkin; +Cc: tarantool-patches

Sergos, Igor!

Thanks for your review!

I've updated test with more comments considering your suggestions.
See the iterative patch below:

Branch is force-pushed.
===================================================================
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
index 7c6ab2b9..39551184 100644
--- a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
+++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
@@ -13,9 +13,9 @@ test:plan(1)
 --    for sacrifice to be holding the constant zero.
 --
 -- This leads to assembly code like the following:
---   ucomisd xmm7, [r14+0x18]
+--   ucomisd xmm7, [r14]
 --   xor r14d, r14d
---   jnb 0x12a0e001c ->3
+--   jnb 0x555d3d250014        ->1
 --
 -- That xor is a big problem, as it modifies flags between the
 -- ucomisd and the jnb, thereby causing the jnb to do the wrong
@@ -34,36 +34,53 @@ local handler = setmetatable({}, {
   end
 })
 
+local MAGIC = 42
 local mconf = {
-  { use = false, value = 100 },
-  { use = true,  value = 100 },
+  { use = false, value = MAGIC + 1 },
+  { use = true,  value = MAGIC + 1 },
 }
 
 local function testf()
-  -- Generate register pressure.
-  local value = 50
+  -- All code below is needed for generating register pressure.
+  local value
+  -- The first trace to compile is this for loop, with `rule.use`
+  -- values is `true`.
   for _, rule in ipairs(mconf) do
+    -- The side trace starts here, when `rule.use` value is
+    -- `false`, returns to the `for` loop, where the function was
+    -- called, starts another iteration, calls `testf()` again and
+    -- ends at JITERL bytecode for the loop in this function.
     if rule.use then
       value = rule.value
       break
     end
   end
 
-  -- This branch shouldn't be taken.
-  if value <= 42 then
+  -- The code below is recorded with the following IRs:
+  -- ....              SNAP   #1   [ lj-416-xor-before-jcc.test.lua:44|---- ]
+  -- 0012       >  num UGT    0009  +42
+  --
+  -- That leads to the following assembly:
+  -- ucomisd xmm7, [r14]
+  -- xor r14d, r14d
+  -- jnb 0x555d3d250014        ->1
+  --
+  -- As a result, this branch is taken due to the emitted `xor`
+  -- instruction until the issue is not resolved.
+  if value <= MAGIC then
     return true
   end
 
   -- Nothing to do, just call testxor with many arguments.
-  handler[4] = 4
+  handler.nothing = 'to do'
 end
 
 -- We need to create long side trace to generate register
--- pressure.
+-- pressure (see the comment in `testf()`).
 jit.opt.start('hotloop=1', 'hotexit=1')
 for _ = 1, 3 do
   -- Don't use any `test` functions here to freeze the trace.
-  assert (not testf())
+  assert(not testf())
 end
 test:ok(true, 'imposible branch is not taken')
 
===================================================================

On 04.07.22, Sergey Kaplun wrote:
>> From: Mike Pall <mike>
>> 
>> Thanks to Peter Cawley.
>> 
>> (cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)
>> 
>> To reproduce this issue, we need:
>> 1) a register which contains the constant zero value
>> 2) a floating point comparison operation

> The integer one will set the same flags (CF for the case of JNB) and XOR will
> clear it anyways.

I suppose that this is just not our case -- for integer comparasions we
should use cdata objects. I failed to create test case with the corresponding
behaviour.

>> 3) the comparison operation to perform a fused load, which in
>>    turn needs to allocate a register, and for there to be no
>>    free registers at that moment, and for the register chosen
>>    for sacrifice to be holding the constant zero.

> Unfortunately, it’s not clear what this register (I suppose it’s r14d) is used for.
> Is it an argument, needed after the fall-through in the same trace?

Yes, it is the argument for the future call.
|  mov [rsp+0x8], r14
|  mov [rsp], r15
|  xor r9d, r9d
|  xor r8d, r8d
|  xor ecx, ecx
|  xor edx, edx
|  xor esi, esi
|  xor edi, edi
|  call rbx

> Why not it sank down below the branch?

I supose that this is issue related to self-written assembly from bottom
of a trace to its start. First the branch is encoded, and only after
that, we assembly the comparision operation.

> IIRC it is a dedicated register used for dispatch, so why is it used for sacrificing?

r14d is DISPATCH register only for LuaJIT VM.

>> 
>> This leads to assembly code like the following:
>> | ucomisd xmm7, [r14]
>> | xor r14d, r14d
>> | jnb 0x55c95c330014 ->1
>> 
>> That xor is a big problem, as it modifies flags between the
>> ucomisd and the jnb, thereby causing the jnb to do the wrong
>> thing.

> There’s no trace after the fix, so it’s hard to grasp the resolution
> used.

The trace is still compiled, but mov is used instead of xor.
| ucomisd xmm7, [r14]
| mov r14d, 0x0
| jnb 0x55c95c330014 ->1

>> 
>> This patch forbids emitting xor in `emit_loadi()` for jcc operations.

>> 
>> Sergey Kaplun:
>> * added the description and the test for the problem
>> 
>> Part of tarantool/tarantool#7230
>> ---
>> 
>> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-416-xor-before-jcc-full-ci
>> Issues:
>> * https://github.com/LuaJIT/LuaJIT/issues/416
>> * https://github.com/tarantool/tarantool/issues/7230
>> 
>> Changelog entry (I suggest to update this entry with the each
>> corresponding bump):
>> ===================================================================
>> ## bugfix/luajit
>> 
>> Backported patches from vanilla LuaJIT trunk (gh-7230). In the scope of this
>> activity, the following issues have been resolved:
>> * Fixed `emit_loadi()` on x86/x64 emitting xor between condition check
>>   and jump instructions.
>> ===================================================================
>> 
>>  src/lj_emit_x86.h                             |  6 +-
>>  test/tarantool-tests/CMakeLists.txt           |  1 +
>>  .../lj-416-xor-before-jcc.test.lua            | 70 +++++++++++++++++++
>>  .../lj-416-xor-before-jcc/CMakeLists.txt      |  1 +
>>  .../lj-416-xor-before-jcc/testxor.c           | 14 ++++
>>  5 files changed, 90 insertions(+), 2 deletions(-)
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
>> 
>> diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
>> index 5207f9da..6b58306b 100644
>> --- a/src/lj_emit_x86.h
>> +++ b/src/lj_emit_x86.h
>> @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
>>  /* mov r, i / xor r, r */
>>  static void emit_loadi(ASMState *as, Reg r, int32_t i)
>>  {
>> -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
>> +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
>>    if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
>>  			    (as->curins+1 < as->T->nins &&
>> -			     IR(as->curins+1)->o == IR_HIOP)))) {
>> +			     IR(as->curins+1)->o == IR_HIOP))) &&
>> +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
>> +		  (*as->mcp & 0xf0) == XI_JCCs)) {

> I wonder if there can be a case with two instructions emitted between the comparison
> and the jump. In such a case the xor still can sneak in?

To be honest I'm nott sure that this is possible due to self-written
assembler. So we must keep it in mind when assembly the correspodning
IRs.

> Another idea: there could be different instructions that relies on the flags, such as
> x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?

AFAIK, it is imposible due to close attention to the assembly.

>>      emit_rr(as, XO_ARITH(XOg_XOR), r, r);
>>    } else {
>>      MCode *p = as->mcp;
>> diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
>> index 5708822e..ad69e2cc 100644
>> --- a/test/tarantool-tests/CMakeLists.txt
>> +++ b/test/tarantool-tests/CMakeLists.txt
>> @@ -65,6 +65,7 @@ add_subdirectory(gh-5813-resolving-of-c-symbols/stripped)
>>  add_subdirectory(gh-6098-fix-side-exit-patching-on-arm64)
>>  add_subdirectory(gh-6189-cur_L)
>>  add_subdirectory(lj-49-bad-lightuserdata)
>> +add_subdirectory(lj-416-xor-before-jcc)
>>  add_subdirectory(lj-601-fix-gc-finderrfunc)
>>  add_subdirectory(lj-727-lightuserdata-itern)
>>  add_subdirectory(lj-flush-on-trace)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>> new file mode 100644
>> index 00000000..7c6ab2b9
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>> @@ -0,0 +1,70 @@
>> +local ffi = require('ffi')
>> +local tap = require('tap')
>> +
>> +local test = tap.test('lj-416-xor-before-jcc')
>> +test:plan(1)
>> +
>> +-- To reproduce this issue, we need:
>> +-- 1) a register which contains the constant zero value
>> +-- 2) a floating point comparison operation
>> +-- 3) the comparison operation to perform a fused load, which in
>> +--    turn needs to allocate a register, and for there to be no
>> +--    free registers at that moment, and for the register chosen
>> +--    for sacrifice to be holding the constant zero.
>> +--
>> +-- This leads to assembly code like the following:
>> +--   ucomisd xmm7, [r14+0x18]
>> +--   xor r14d, r14d
>> +--   jnb 0x12a0e001c ->3
>> +--
>> +-- That xor is a big problem, as it modifies flags between the
>> +-- ucomisd and the jnb, thereby causing the jnb to do the wrong
>> +-- thing.
>> +
>> +ffi.cdef[[
>> +  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
>> +]]
>> +local testxor = ffi.load('libtestxor')

> Should have a test for x86 platform, since changes are x86 only?

We have too small amount of tests for LuaJIT anyway, so I prefer to keep
tests for all architectures for now.

>> +
>> +local handler = setmetatable({}, {
>> +  __newindex = function ()
>> +    -- 0 and nil are suggested as differnt constant-zero values
>> +    -- for the call and occupied different registers.
>> +    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)
>> +  end
>> +})
>> +
>> +local mconf = {
>> +  { use = false, value = 100 },
>> +  { use = true,  value = 100 },
>> +}
>> +
>> +local function testf()
>> +  -- Generate register pressure.
>> +  local value = 50
>> +  for _, rule in ipairs(mconf) do
>> +    if rule.use then
>> +      value = rule.value
>> +      break
>> +    end
>> +  end
>> +
>> +  -- This branch shouldn't be taken.
>> +  if value <= 42 then
>> +    return true
>> +  end
>> +
>> +  -- Nothing to do, just call testxor with many arguments.
>> +  handler[4] = 4

> If it is needed to use the metatable here then why?

It is needed for suitable registers layout. Unfortunately, I gave up to
create test without it.

>> +end
>> +
>> +-- We need to create long side trace to generate register
>> +-- pressure.
>> +jit.opt.start('hotloop=1', 'hotexit=1')
>> +for _ = 1, 3 do
>> +  -- Don't use any `test` functions here to freeze the trace.
>> +  assert (not testf())
>> +end
>> +test:ok(true, 'imposible branch is not taken')
>> +
>> +os.exit(test:check() and 0 or 1)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>> new file mode 100644
>> index 00000000..17aa9f9b
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>> @@ -0,0 +1 @@
>> +BuildTestCLib(libtestxor testxor.c)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
>> new file mode 100644
>> index 00000000..32436d42
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c

<snipped>

>> -- 
>> 2.34.1
>> 

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-08-31  8:48 ` Sergey Kaplun via Tarantool-patches
@ 2022-08-31 16:04   ` sergos via Tarantool-patches
  2022-09-01 10:32     ` Sergey Kaplun via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: sergos via Tarantool-patches @ 2022-08-31 16:04 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

Hi!

Thanks and with last two nits marked with #1 and #2 - LGTM.

Sergos

> On 31 Aug 2022, at 11:48, Sergey Kaplun <skaplun@tarantool.org> wrote:
> 
> Sergos, Igor!
> 
> Thanks for your review!
> 
> I've updated test with more comments considering your suggestions.
> See the iterative patch below:
> 
> Branch is force-pushed.
> ===================================================================
> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> index 7c6ab2b9..39551184 100644
> --- a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> @@ -13,9 +13,9 @@ test:plan(1)
> --    for sacrifice to be holding the constant zero.
> --
> -- This leads to assembly code like the following:
> ---   ucomisd xmm7, [r14+0x18]
> +--   ucomisd xmm7, [r14]
> --   xor r14d, r14d
> ---   jnb 0x12a0e001c ->3
> +--   jnb 0x555d3d250014        ->1
> --
> -- That xor is a big problem, as it modifies flags between the
> -- ucomisd and the jnb, thereby causing the jnb to do the wrong

#1.1 Still missing the correct assembly below in the message. 

> @@ -34,36 +34,53 @@ local handler = setmetatable({}, {
>   end
> })
> 
<skipped>

>> There’s no trace after the fix, so it’s hard to grasp the resolution
>> used.
> 
> The trace is still compiled, but mov is used instead of xor.
> | ucomisd xmm7, [r14]
> | mov r14d, 0x0
> | jnb 0x55c95c330014 ->1

#1.2 That what I expected to see in the message, right after ‘forbids emitting’

> 
>>> 
>>> This patch forbids emitting xor in `emit_loadi()` for jcc operations.
> 
>>> 
>>> Sergey Kaplun:
>>> * added the description and the test for the problem
>>> 
>>> Part of tarantool/tarantool#7230
>>> ---
>>> 

<skipped>

>>> --- a/src/lj_emit_x86.h
>>> +++ b/src/lj_emit_x86.h
>>> @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
>>> /* mov r, i / xor r, r */
>>> static void emit_loadi(ASMState *as, Reg r, int32_t i)
>>> {
>>> -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
>>> +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
>>>   if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
>>> 			    (as->curins+1 < as->T->nins &&
>>> -			     IR(as->curins+1)->o == IR_HIOP)))) {
>>> +			     IR(as->curins+1)->o == IR_HIOP))) &&
>>> +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
>>> +		  (*as->mcp & 0xf0) == XI_JCCs)) {
> 
>> I wonder if there can be a case with two instructions emitted between the comparison
>> and the jump. In such a case the xor still can sneak in?
> 
> To be honest I'm nott sure that this is possible due to self-written
> assembler. So we must keep it in mind when assembly the correspodning
> IRs.
> 
>> Another idea: there could be different instructions that relies on the flags, such as
>> x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?
> 
> AFAIK, it is imposible due to close attention to the assembly.

So, the answer for both Qs: ’not at the moment’. Which, in turn, implies ‘we will be
running around in search for all places our new optimization leads to a wrong code
generation’. Well, fine!

> 
>>>     emit_rr(as, XO_ARITH(XOg_XOR), r, r);
>>>   } else {
>>>     MCode *p = as->mcp;
>>> diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
>>> index 5708822e..ad69e2cc 100644
>>> --- a/test/tarantool-tests/CMakeLists.txt
>>> +++ b/test/tarantool-tests/CMakeLists.txt
>>> @@ -65,6 +65,7 @@ add_subdirectory(gh-5813-resolving-of-c-symbols/stripped)
>>> add_subdirectory(gh-6098-fix-side-exit-patching-on-arm64)
>>> add_subdirectory(gh-6189-cur_L)
>>> add_subdirectory(lj-49-bad-lightuserdata)
>>> +add_subdirectory(lj-416-xor-before-jcc)
>>> add_subdirectory(lj-601-fix-gc-finderrfunc)
>>> add_subdirectory(lj-727-lightuserdata-itern)
>>> add_subdirectory(lj-flush-on-trace)
>>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>>> new file mode 100644
>>> index 00000000..7c6ab2b9
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>>> @@ -0,0 +1,70 @@
>>> +local ffi = require('ffi')
>>> +local tap = require('tap')
>>> +
>>> +local test = tap.test('lj-416-xor-before-jcc')
>>> +test:plan(1)
>>> +
>>> +-- To reproduce this issue, we need:
>>> +-- 1) a register which contains the constant zero value
>>> +-- 2) a floating point comparison operation
>>> +-- 3) the comparison operation to perform a fused load, which in
>>> +--    turn needs to allocate a register, and for there to be no
>>> +--    free registers at that moment, and for the register chosen
>>> +--    for sacrifice to be holding the constant zero.
>>> +--
>>> +-- This leads to assembly code like the following:
>>> +--   ucomisd xmm7, [r14+0x18]
>>> +--   xor r14d, r14d
>>> +--   jnb 0x12a0e001c ->3
>>> +--
>>> +-- That xor is a big problem, as it modifies flags between the
>>> +-- ucomisd and the jnb, thereby causing the jnb to do the wrong
>>> +-- thing.
>>> +
>>> +ffi.cdef[[
>>> +  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
>>> +]]
>>> +local testxor = ffi.load('libtestxor')
> 
>> Should have a test for x86 platform, since changes are x86 only?
> 
> We have too small amount of tests for LuaJIT anyway, so I prefer to keep
> tests for all architectures for now.

But... I bet the aarch64 will require `a little bit` different register pressure
preparations, will it? IOW - this test is not applicable to other architectures
‘as is’.

> 
>>> +
>>> +local handler = setmetatable({}, {
>>> +  __newindex = function ()
>>> +    -- 0 and nil are suggested as differnt constant-zero values
>>> +    -- for the call and occupied different registers.
>>> +    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)
>>> +  end
>>> +})
>>> +
>>> +local mconf = {
>>> +  { use = false, value = 100 },
>>> +  { use = true,  value = 100 },
>>> +}
>>> +
>>> +local function testf()
>>> +  -- Generate register pressure.
>>> +  local value = 50
>>> +  for _, rule in ipairs(mconf) do
>>> +    if rule.use then
>>> +      value = rule.value
>>> +      break
>>> +    end
>>> +  end
>>> +
>>> +  -- This branch shouldn't be taken.
>>> +  if value <= 42 then
>>> +    return true
>>> +  end
>>> +
>>> +  -- Nothing to do, just call testxor with many arguments.
>>> +  handler[4] = 4
> 
>> If it is needed to use the metatable here then why?
> 
> It is needed for suitable registers layout. Unfortunately, I gave up to
> create test without it.
> 
>>> +end
>>> +
>>> +-- We need to create long side trace to generate register
>>> +-- pressure.
>>> +jit.opt.start('hotloop=1', 'hotexit=1')
>>> +for _ = 1, 3 do
>>> +  -- Don't use any `test` functions here to freeze the trace.
>>> +  assert (not testf())
>>> +end
>>> +test:ok(true, 'imposible branch is not taken')

#2 s/imposible/impossible/

>>> +
>>> +os.exit(test:check() and 0 or 1)
>>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>>> new file mode 100644
>>> index 00000000..17aa9f9b
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>>> @@ -0,0 +1 @@
>>> +BuildTestCLib(libtestxor testxor.c)
>>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
>>> new file mode 100644
>>> index 00000000..32436d42
>>> --- /dev/null
>>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
> 
> <snipped>
> 
>>> -- 
>>> 2.34.1
>>> 
> 
> -- 
> Best regards,
> Sergey Kaplun


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-08-31 16:04   ` sergos via Tarantool-patches
@ 2022-09-01 10:32     ` Sergey Kaplun via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2022-09-01 10:32 UTC (permalink / raw)
  To: sergos; +Cc: tarantool-patches

Hi!

Fixed your comments.

On 31.08.22, sergos wrote:
> Hi!
> 
> Thanks and with last two nits marked with #1 and #2 - LGTM.
> 
> Sergos
> 
> > On 31 Aug 2022, at 11:48, Sergey Kaplun <skaplun@tarantool.org> wrote:
> > 
> > Sergos, Igor!
> > 
> > Thanks for your review!
> > 
> > I've updated test with more comments considering your suggestions.
> > See the iterative patch below:
> > 
> > Branch is force-pushed.
> > ===================================================================
> > diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> > index 7c6ab2b9..39551184 100644
> > --- a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> > +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
> > @@ -13,9 +13,9 @@ test:plan(1)
> > --    for sacrifice to be holding the constant zero.
> > --
> > -- This leads to assembly code like the following:
> > ---   ucomisd xmm7, [r14+0x18]
> > +--   ucomisd xmm7, [r14]
> > --   xor r14d, r14d
> > ---   jnb 0x12a0e001c ->3
> > +--   jnb 0x555d3d250014        ->1
> > --
> > -- That xor is a big problem, as it modifies flags between the
> > -- ucomisd and the jnb, thereby causing the jnb to do the wrong
> 
> #1.1 Still missing the correct assembly below in the message. 
> 
> > @@ -34,36 +34,53 @@ local handler = setmetatable({}, {
> >   end
> > })
> > 
> <skipped>
> 
> >> There’s no trace after the fix, so it’s hard to grasp the resolution
> >> used.
> > 
> > The trace is still compiled, but mov is used instead of xor.
> > | ucomisd xmm7, [r14]
> > | mov r14d, 0x0
> > | jnb 0x55c95c330014 ->1
> 
> #1.2 That what I expected to see in the message, right after ‘forbids emitting’

Added the correct assembly to the commit message:

===================================================================
x86/x64: Check for jcc when using xor r,r in emit_loadi().

Thanks to Peter Cawley.

(cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)

To reproduce this issue, we need:
1) a register which contains the constant zero value
2) a floating point comparison operation
3) the comparison operation to perform a fused load, which in
   turn needs to allocate a register, and for there to be no
   free registers at that moment, and for the register chosen
   for sacrifice to be holding the constant zero.

This leads to assembly code like the following:
| ucomisd xmm7, [r14]
| xor r14d, r14d
| jnb 0x555d3d250014 ->1

That xor is a big problem, as it modifies flags between the
ucomisd and the jnb, thereby causing the jnb to do the wrong
thing.

This patch forbids emitting xor in `emit_loadi()` for jcc operations, so
mov is emmited instead. Now assembly code looks like the following:
| ucomisd xmm7, [r14]
| mov r14d, 0x0
| jnb 0x55c95c330014 ->1

Sergey Kaplun:
* added the description and the test for the problem

Part of tarantool/tarantool#7230
===================================================================

> 
> > 
> >>> 
> >>> This patch forbids emitting xor in `emit_loadi()` for jcc operations.
> > 
> >>> 
> >>> Sergey Kaplun:
> >>> * added the description and the test for the problem
> >>> 
> >>> Part of tarantool/tarantool#7230
> >>> ---
> >>> 
> 
> <skipped>
> 
> >>> --- a/src/lj_emit_x86.h
> >>> +++ b/src/lj_emit_x86.h
> >>> @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
> >>> /* mov r, i / xor r, r */
> >>> static void emit_loadi(ASMState *as, Reg r, int32_t i)
> >>> {
> >>> -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
> >>> +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
> >>>   if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
> >>> 			    (as->curins+1 < as->T->nins &&
> >>> -			     IR(as->curins+1)->o == IR_HIOP)))) {
> >>> +			     IR(as->curins+1)->o == IR_HIOP))) &&
> >>> +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
> >>> +		  (*as->mcp & 0xf0) == XI_JCCs)) {
> > 
> >> I wonder if there can be a case with two instructions emitted between the comparison
> >> and the jump. In such a case the xor still can sneak in?
> > 
> > To be honest I'm nott sure that this is possible due to self-written
> > assembler. So we must keep it in mind when assembly the correspodning
> > IRs.
> > 
> >> Another idea: there could be different instructions that relies on the flags, such as
> >> x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?
> > 
> > AFAIK, it is imposible due to close attention to the assembly.
> 
> So, the answer for both Qs: ’not at the moment’. Which, in turn, implies ‘we will be
> running around in search for all places our new optimization leads to a wrong code
> generation’. Well, fine!

Yes, you are right.

> 
> > 
> >>>     emit_rr(as, XO_ARITH(XOg_XOR), r, r);
> >>>   } else {
> >>>     MCode *p = as->mcp;
> >>> diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
> >>> index 5708822e..ad69e2cc 100644
> >>> --- a/test/tarantool-tests/CMakeLists.txt
> >>> +++ b/test/tarantool-tests/CMakeLists.txt

<snipped>

> >>> +local testxor = ffi.load('libtestxor')
> > 
> >> Should have a test for x86 platform, since changes are x86 only?
> > 
> > We have too small amount of tests for LuaJIT anyway, so I prefer to keep
> > tests for all architectures for now.
> 
> But... I bet the aarch64 will require `a little bit` different register pressure
> preparations, will it? IOW - this test is not applicable to other architectures
> ‘as is’.

Yes, of corse. It doesn't show litterally the same issue, but may show
another one.

> 

<snipped>

> >>> +test:ok(true, 'imposible branch is not taken')
> 
> #2 s/imposible/impossible/

Fixed. Branch is force-pushed.

===================================================================
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
index 39551184..9a1c96ef 100644
--- a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
+++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
@@ -82,6 +82,6 @@ for _ = 1, 3 do
   -- Don't use any `test` functions here to freeze the trace.
   assert(not testf())
 end
-test:ok(true, 'imposible branch is not taken')
+test:ok(true, 'impossible branch is not taken')
 
 os.exit(test:check() and 0 or 1)
===================================================================

> 
> >>> +

<snipped>

> > 
> >>> -- 
> >>> 2.34.1
> >>> 
> > 
> > -- 
> > Best regards,
> > Sergey Kaplun
> 

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-07-06 10:30 ` sergos via Tarantool-patches
@ 2022-11-11 10:20   ` Igor Munkin via Tarantool-patches
  2022-11-22  4:36     ` Sergey Kaplun via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Munkin via Tarantool-patches @ 2022-11-11 10:20 UTC (permalink / raw)
  To: sergos; +Cc: tarantool-patches

Sergos,

On 06.07.22, sergos wrote:
> Hi!
> 
> Thanks for the patch!
> 
> I won’t veto on the patch, since it’s from upstream. Still I have got
> a number of questions on it’s crutch nature. Perhaps we need to follow
> up with some test cases to luajit.

Thanks for your precise review. I would like to shed some light onto the
patch regarding LuaJIT semantics. Even if you're familiar with all
specifics, it might be useful for other guys *still* reading ML and some
parts can be used either in commit message or in test comments.

> 
> Some description fixes are needed also.
> 
> Sergos
> 
> 
> > On 4 Jul 2022, at 12:33, Sergey Kaplun <skaplun@tarantool.org> wrote:
> > 
> > From: Mike Pall <mike>
> > 
> > Thanks to Peter Cawley.
> > 
> > (cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)
> > 
> > To reproduce this issue, we need:
> > 1) a register which contains the constant zero value
> 
> > 2) a floating point comparison operation
> 
> The integer one will set the same flags (CF for the case of JNB) and XOR will
> clear it anyways. 

Strictly saying, yes, I see no reason for "float" requirement here. I
believe this can be changed just to "comparison operation occurs", but
Sergey can correct me if I'm wrong.

> 
> > 3) the comparison operation to perform a fused load, which in
> >   turn needs to allocate a register, and for there to be no
> >   free registers at that moment, and for the register chosen
> >   for sacrifice to be holding the constant zero.
> > 
> Unfortunately, it’s not clear what this register (I suppose it’s r14d) is used for.
> Is it an argument, needed after the fall-through in the same trace?
> Why not it sank down below the branch? 
> IIRC it is a dedicated register used for dispatch, so why is it used for sacrificing?

Like many things in LuaJIT register allocation is "state-of-the-art" (or
as I called this "ad-hoc") entity. Since it's implemented as a linear
scan RA working in a reverse direction of the trace being recorded,
register is "allocated" for IR slot being used as a source and "freed"
at the moment it becomes a destination.

To describe the process Sergey fit in a three-bullet list above, we have
the following.
0) IR for either "internal" (e.g. type check, hmask check) or "external"
   (e.g. branch or loop condition) guard is begin emitted to mcode.
1) JCC to side exit is emitted to the trace mcode at the beginning.
2) Condition (i.e. comparison) is going to be emitted.
3) Fuse optimization takes its place, that ought to allocate a register
   for the load base.
4) There is no free registers at this point.
5) The one storing the constant zero is chosen to be sacrificed and
   reallocated (consider allocation cost in ra_alloc for constant
   materialization).
6) Before (or in the sense of trace execution, after) register is
   being used in the aforementioned comparison, register (r14 in our
   case) is reset by XOR emitted right after (before) jump instruction.
7) The comparison with fused load within is emitted.

As a result flags set by comparison are reset by XOR emitted in between
of condition and jump instructions.

> 
> > This leads to assembly code like the following:
> > | ucomisd xmm7, [r14+0x18]
> > | xor r14d, r14d
> > | jnb 0x12a0e001c ->3
> > 
> > That xor is a big problem, as it modifies flags between the
> > ucomisd and the jnb, thereby causing the jnb to do the wrong
> > thing.
> > 

<snipped>

> > 
> > diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
> > index 5207f9da..6b58306b 100644
> > --- a/src/lj_emit_x86.h
> > +++ b/src/lj_emit_x86.h
> > @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
> > /* mov r, i / xor r, r */
> > static void emit_loadi(ASMState *as, Reg r, int32_t i)
> > {
> > -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
> > +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
> >   if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
> > 			    (as->curins+1 < as->T->nins &&
> > -			     IR(as->curins+1)->o == IR_HIOP)))) {
> > +			     IR(as->curins+1)->o == IR_HIOP))) &&
> > +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
> > +		  (*as->mcp & 0xf0) == XI_JCCs)) {
> 
> I wonder if there can be a case with two instructions emitted between the comparison
> and the jump. In such a case the xor still can sneak in? 
> Another idea: there could be different instructions that relies on the flags, such as
> x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?

Nice question indeed! I believe, we need to re-check this case and send
the patch to upstream if the issue occurs.

> 
> >     emit_rr(as, XO_ARITH(XOg_XOR), r, r);
> >   } else {
> >     MCode *p = as->mcp;

<snipped>

> > -- 
> > 2.34.1
> > 
> 

-- 
Best regards,
IM

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-11-11 10:20   ` Igor Munkin via Tarantool-patches
@ 2022-11-22  4:36     ` Sergey Kaplun via Tarantool-patches
  2022-11-22  5:45       ` Igor Munkin via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Kaplun via Tarantool-patches @ 2022-11-22  4:36 UTC (permalink / raw)
  To: Igor Munkin; +Cc: tarantool-patches

Hi, Igor!

Thanks for the review!

I've updated the commit message and the comment in the test considering
your proposal.

On 11.11.22, Igor Munkin wrote:
> Sergos,
> 
> On 06.07.22, sergos wrote:
> > Hi!
> > 
> > Thanks for the patch!
> > 

<snipped>

> > 
> > > 3) the comparison operation to perform a fused load, which in
> > >   turn needs to allocate a register, and for there to be no
> > >   free registers at that moment, and for the register chosen
> > >   for sacrifice to be holding the constant zero.
> > > 
> > Unfortunately, it’s not clear what this register (I suppose it’s r14d) is used for.
> > Is it an argument, needed after the fall-through in the same trace?
> > Why not it sank down below the branch? 
> > IIRC it is a dedicated register used for dispatch, so why is it used for sacrificing?
> 
> Like many things in LuaJIT register allocation is "state-of-the-art" (or
> as I called this "ad-hoc") entity. Since it's implemented as a linear
> scan RA working in a reverse direction of the trace being recorded,
> register is "allocated" for IR slot being used as a source and "freed"
> at the moment it becomes a destination.
> 
> To describe the process Sergey fit in a three-bullet list above, we have
> the following.
> 0) IR for either "internal" (e.g. type check, hmask check) or "external"
>    (e.g. branch or loop condition) guard is begin emitted to mcode.
> 1) JCC to side exit is emitted to the trace mcode at the beginning.
> 2) Condition (i.e. comparison) is going to be emitted.
> 3) Fuse optimization takes its place, that ought to allocate a register
>    for the load base.
> 4) There is no free registers at this point.
> 5) The one storing the constant zero is chosen to be sacrificed and
>    reallocated (consider allocation cost in ra_alloc for constant
>    materialization).
> 6) Before (or in the sense of trace execution, after) register is
>    being used in the aforementioned comparison, register (r14 in our
>    case) is reset by XOR emitted right after (before) jump instruction.
> 7) The comparison with fused load within is emitted.
> 
> As a result flags set by comparison are reset by XOR emitted in between
> of condition and jump instructions.
> 

Updated branch with force-push.

<snipped>

> 
> > > -- 
> > > 2.34.1
> > > 
> > 
> 
> -- 
> Best regards,
> IM

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-11-22  4:36     ` Sergey Kaplun via Tarantool-patches
@ 2022-11-22  5:45       ` Igor Munkin via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Igor Munkin via Tarantool-patches @ 2022-11-22  5:45 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

Sergey,

Thanks for the fixes! LGTM.

-- 
Best regards,
IM

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().
  2022-07-04  9:33 [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi() Sergey Kaplun via Tarantool-patches
                   ` (2 preceding siblings ...)
  2022-08-31  8:48 ` Sergey Kaplun via Tarantool-patches
@ 2022-11-23  7:50 ` Igor Munkin via Tarantool-patches
  3 siblings, 0 replies; 10+ messages in thread
From: Igor Munkin via Tarantool-patches @ 2022-11-23  7:50 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-patches

Sergey,

I've checked the patches into all long-term branches in tarantool/luajit
and bumped a new version in master, 2.10 and 1.10.

-- 
Best regards,
IM

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-11-23  8:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-04  9:33 [Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi() Sergey Kaplun via Tarantool-patches
2022-07-05 15:10 ` Igor Munkin via Tarantool-patches
2022-07-06 10:30 ` sergos via Tarantool-patches
2022-11-11 10:20   ` Igor Munkin via Tarantool-patches
2022-11-22  4:36     ` Sergey Kaplun via Tarantool-patches
2022-11-22  5:45       ` Igor Munkin via Tarantool-patches
2022-08-31  8:48 ` Sergey Kaplun via Tarantool-patches
2022-08-31 16:04   ` sergos via Tarantool-patches
2022-09-01 10:32     ` Sergey Kaplun via Tarantool-patches
2022-11-23  7:50 ` Igor Munkin via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox