[Tarantool-patches] [PATCH luajit] x86/x64: Check for jcc when using xor r, r in emit_loadi().

Sergey Kaplun skaplun at tarantool.org
Wed Aug 31 11:48:02 MSK 2022


Sergos, Igor!

Thanks for your review!

I've updated test with more comments considering your suggestions.
See the iterative patch below:

Branch is force-pushed.
===================================================================
diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
index 7c6ab2b9..39551184 100644
--- a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
+++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
@@ -13,9 +13,9 @@ test:plan(1)
 --    for sacrifice to be holding the constant zero.
 --
 -- This leads to assembly code like the following:
---   ucomisd xmm7, [r14+0x18]
+--   ucomisd xmm7, [r14]
 --   xor r14d, r14d
---   jnb 0x12a0e001c ->3
+--   jnb 0x555d3d250014        ->1
 --
 -- That xor is a big problem, as it modifies flags between the
 -- ucomisd and the jnb, thereby causing the jnb to do the wrong
@@ -34,36 +34,53 @@ local handler = setmetatable({}, {
   end
 })
 
+local MAGIC = 42
 local mconf = {
-  { use = false, value = 100 },
-  { use = true,  value = 100 },
+  { use = false, value = MAGIC + 1 },
+  { use = true,  value = MAGIC + 1 },
 }
 
 local function testf()
-  -- Generate register pressure.
-  local value = 50
+  -- All code below is needed for generating register pressure.
+  local value
+  -- The first trace to compile is this for loop, with `rule.use`
+  -- values is `true`.
   for _, rule in ipairs(mconf) do
+    -- The side trace starts here, when `rule.use` value is
+    -- `false`, returns to the `for` loop, where the function was
+    -- called, starts another iteration, calls `testf()` again and
+    -- ends at JITERL bytecode for the loop in this function.
     if rule.use then
       value = rule.value
       break
     end
   end
 
-  -- This branch shouldn't be taken.
-  if value <= 42 then
+  -- The code below is recorded with the following IRs:
+  -- ....              SNAP   #1   [ lj-416-xor-before-jcc.test.lua:44|---- ]
+  -- 0012       >  num UGT    0009  +42
+  --
+  -- That leads to the following assembly:
+  -- ucomisd xmm7, [r14]
+  -- xor r14d, r14d
+  -- jnb 0x555d3d250014        ->1
+  --
+  -- As a result, this branch is taken due to the emitted `xor`
+  -- instruction until the issue is not resolved.
+  if value <= MAGIC then
     return true
   end
 
   -- Nothing to do, just call testxor with many arguments.
-  handler[4] = 4
+  handler.nothing = 'to do'
 end
 
 -- We need to create long side trace to generate register
--- pressure.
+-- pressure (see the comment in `testf()`).
 jit.opt.start('hotloop=1', 'hotexit=1')
 for _ = 1, 3 do
   -- Don't use any `test` functions here to freeze the trace.
-  assert (not testf())
+  assert(not testf())
 end
 test:ok(true, 'imposible branch is not taken')
 
===================================================================

On 04.07.22, Sergey Kaplun wrote:
>> From: Mike Pall <mike>
>> 
>> Thanks to Peter Cawley.
>> 
>> (cherry picked from commit fb5e522fbc0750c838ef6a926b11c5d870826183)
>> 
>> To reproduce this issue, we need:
>> 1) a register which contains the constant zero value
>> 2) a floating point comparison operation

> The integer one will set the same flags (CF for the case of JNB) and XOR will
> clear it anyways.

I suppose that this is just not our case -- for integer comparasions we
should use cdata objects. I failed to create test case with the corresponding
behaviour.

>> 3) the comparison operation to perform a fused load, which in
>>    turn needs to allocate a register, and for there to be no
>>    free registers at that moment, and for the register chosen
>>    for sacrifice to be holding the constant zero.

> Unfortunately, it’s not clear what this register (I suppose it’s r14d) is used for.
> Is it an argument, needed after the fall-through in the same trace?

Yes, it is the argument for the future call.
|  mov [rsp+0x8], r14
|  mov [rsp], r15
|  xor r9d, r9d
|  xor r8d, r8d
|  xor ecx, ecx
|  xor edx, edx
|  xor esi, esi
|  xor edi, edi
|  call rbx

> Why not it sank down below the branch?

I supose that this is issue related to self-written assembly from bottom
of a trace to its start. First the branch is encoded, and only after
that, we assembly the comparision operation.

> IIRC it is a dedicated register used for dispatch, so why is it used for sacrificing?

r14d is DISPATCH register only for LuaJIT VM.

>> 
>> This leads to assembly code like the following:
>> | ucomisd xmm7, [r14]
>> | xor r14d, r14d
>> | jnb 0x55c95c330014 ->1
>> 
>> That xor is a big problem, as it modifies flags between the
>> ucomisd and the jnb, thereby causing the jnb to do the wrong
>> thing.

> There’s no trace after the fix, so it’s hard to grasp the resolution
> used.

The trace is still compiled, but mov is used instead of xor.
| ucomisd xmm7, [r14]
| mov r14d, 0x0
| jnb 0x55c95c330014 ->1

>> 
>> This patch forbids emitting xor in `emit_loadi()` for jcc operations.

>> 
>> Sergey Kaplun:
>> * added the description and the test for the problem
>> 
>> Part of tarantool/tarantool#7230
>> ---
>> 
>> Branch: https://github.com/tarantool/luajit/tree/skaplun/lj-416-xor-before-jcc-full-ci
>> Issues:
>> * https://github.com/LuaJIT/LuaJIT/issues/416
>> * https://github.com/tarantool/tarantool/issues/7230
>> 
>> Changelog entry (I suggest to update this entry with the each
>> corresponding bump):
>> ===================================================================
>> ## bugfix/luajit
>> 
>> Backported patches from vanilla LuaJIT trunk (gh-7230). In the scope of this
>> activity, the following issues have been resolved:
>> * Fixed `emit_loadi()` on x86/x64 emitting xor between condition check
>>   and jump instructions.
>> ===================================================================
>> 
>>  src/lj_emit_x86.h                             |  6 +-
>>  test/tarantool-tests/CMakeLists.txt           |  1 +
>>  .../lj-416-xor-before-jcc.test.lua            | 70 +++++++++++++++++++
>>  .../lj-416-xor-before-jcc/CMakeLists.txt      |  1 +
>>  .../lj-416-xor-before-jcc/testxor.c           | 14 ++++
>>  5 files changed, 90 insertions(+), 2 deletions(-)
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>>  create mode 100644 test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
>> 
>> diff --git a/src/lj_emit_x86.h b/src/lj_emit_x86.h
>> index 5207f9da..6b58306b 100644
>> --- a/src/lj_emit_x86.h
>> +++ b/src/lj_emit_x86.h
>> @@ -274,10 +274,12 @@ static void emit_movmroi(ASMState *as, Reg base, int32_t ofs, int32_t i)
>>  /* mov r, i / xor r, r */
>>  static void emit_loadi(ASMState *as, Reg r, int32_t i)
>>  {
>> -  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP. */
>> +  /* XOR r,r is shorter, but modifies the flags. This is bad for HIOP/jcc. */
>>    if (i == 0 && !(LJ_32 && (IR(as->curins)->o == IR_HIOP ||
>>  			    (as->curins+1 < as->T->nins &&
>> -			     IR(as->curins+1)->o == IR_HIOP)))) {
>> +			     IR(as->curins+1)->o == IR_HIOP))) &&
>> +		!((*as->mcp == 0x0f && (as->mcp[1] & 0xf0) == XI_JCCn) ||
>> +		  (*as->mcp & 0xf0) == XI_JCCs)) {

> I wonder if there can be a case with two instructions emitted between the comparison
> and the jump. In such a case the xor still can sneak in?

To be honest I'm nott sure that this is possible due to self-written
assembler. So we must keep it in mind when assembly the correspodning
IRs.

> Another idea: there could be different instructions that relies on the flags, such as
> x86Arith:XOg_ADC or x86Op:XO_CMOV, so the if condition above could be bigger?

AFAIK, it is imposible due to close attention to the assembly.

>>      emit_rr(as, XO_ARITH(XOg_XOR), r, r);
>>    } else {
>>      MCode *p = as->mcp;
>> diff --git a/test/tarantool-tests/CMakeLists.txt b/test/tarantool-tests/CMakeLists.txt
>> index 5708822e..ad69e2cc 100644
>> --- a/test/tarantool-tests/CMakeLists.txt
>> +++ b/test/tarantool-tests/CMakeLists.txt
>> @@ -65,6 +65,7 @@ add_subdirectory(gh-5813-resolving-of-c-symbols/stripped)
>>  add_subdirectory(gh-6098-fix-side-exit-patching-on-arm64)
>>  add_subdirectory(gh-6189-cur_L)
>>  add_subdirectory(lj-49-bad-lightuserdata)
>> +add_subdirectory(lj-416-xor-before-jcc)
>>  add_subdirectory(lj-601-fix-gc-finderrfunc)
>>  add_subdirectory(lj-727-lightuserdata-itern)
>>  add_subdirectory(lj-flush-on-trace)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc.test.lua b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>> new file mode 100644
>> index 00000000..7c6ab2b9
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc.test.lua
>> @@ -0,0 +1,70 @@
>> +local ffi = require('ffi')
>> +local tap = require('tap')
>> +
>> +local test = tap.test('lj-416-xor-before-jcc')
>> +test:plan(1)
>> +
>> +-- To reproduce this issue, we need:
>> +-- 1) a register which contains the constant zero value
>> +-- 2) a floating point comparison operation
>> +-- 3) the comparison operation to perform a fused load, which in
>> +--    turn needs to allocate a register, and for there to be no
>> +--    free registers at that moment, and for the register chosen
>> +--    for sacrifice to be holding the constant zero.
>> +--
>> +-- This leads to assembly code like the following:
>> +--   ucomisd xmm7, [r14+0x18]
>> +--   xor r14d, r14d
>> +--   jnb 0x12a0e001c ->3
>> +--
>> +-- That xor is a big problem, as it modifies flags between the
>> +-- ucomisd and the jnb, thereby causing the jnb to do the wrong
>> +-- thing.
>> +
>> +ffi.cdef[[
>> +  int test_xor_func(int a, int b, int c, int d, int e, int f, void * g, int h);
>> +]]
>> +local testxor = ffi.load('libtestxor')

> Should have a test for x86 platform, since changes are x86 only?

We have too small amount of tests for LuaJIT anyway, so I prefer to keep
tests for all architectures for now.

>> +
>> +local handler = setmetatable({}, {
>> +  __newindex = function ()
>> +    -- 0 and nil are suggested as differnt constant-zero values
>> +    -- for the call and occupied different registers.
>> +    testxor.test_xor_func(0, 0, 0, 0, 0, 0, nil, 0)
>> +  end
>> +})
>> +
>> +local mconf = {
>> +  { use = false, value = 100 },
>> +  { use = true,  value = 100 },
>> +}
>> +
>> +local function testf()
>> +  -- Generate register pressure.
>> +  local value = 50
>> +  for _, rule in ipairs(mconf) do
>> +    if rule.use then
>> +      value = rule.value
>> +      break
>> +    end
>> +  end
>> +
>> +  -- This branch shouldn't be taken.
>> +  if value <= 42 then
>> +    return true
>> +  end
>> +
>> +  -- Nothing to do, just call testxor with many arguments.
>> +  handler[4] = 4

> If it is needed to use the metatable here then why?

It is needed for suitable registers layout. Unfortunately, I gave up to
create test without it.

>> +end
>> +
>> +-- We need to create long side trace to generate register
>> +-- pressure.
>> +jit.opt.start('hotloop=1', 'hotexit=1')
>> +for _ = 1, 3 do
>> +  -- Don't use any `test` functions here to freeze the trace.
>> +  assert (not testf())
>> +end
>> +test:ok(true, 'imposible branch is not taken')
>> +
>> +os.exit(test:check() and 0 or 1)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>> new file mode 100644
>> index 00000000..17aa9f9b
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/CMakeLists.txt
>> @@ -0,0 +1 @@
>> +BuildTestCLib(libtestxor testxor.c)
>> diff --git a/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c
>> new file mode 100644
>> index 00000000..32436d42
>> --- /dev/null
>> +++ b/test/tarantool-tests/lj-416-xor-before-jcc/testxor.c

<snipped>

>> -- 
>> 2.34.1
>> 

-- 
Best regards,
Sergey Kaplun


More information about the Tarantool-patches mailing list