[Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write

Serge Petrenko sergepetrenko at tarantool.org
Fri Apr 9 11:26:23 MSK 2021



09.04.2021 01:56, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>> Thanks for the patch!
>>
>> I'm a bit worried about two different synchro rows coming from two
>> appliers. Is everything ok in this case?
>> Or even normal rows coming from other appliers. Say some other replica
>> has already applied this synchro row and even has written some rows on
>> top of it. Its applier won't block on the replica_id latch, and may fail to apply
>> some txs following this synchro row, because it's not yet written to WAL
>> and thus not applied (the limbo is still not empty or belongs to another instance).
> Should not happen, if I understood you correctly. See my response in the
> other email + a test for test-run + a console test.

Yes, it's fine, thanks for the explanation & the test!

>
>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 971b2e64c..f7c82548c 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>>>    }
>>>      struct synchro_entry {
>>> -    /** Encoded form of a synchro record. */
>>> -    struct synchro_body_bin    body_bin;
>>> -
>>> -    /** xrow to write, used by the journal engine. */
>>> -    struct xrow_header row;
>>> -
>>> +    /** Request to process when WAL write is done. */
>>> +    struct synchro_request *req;
>>> +    /** The fiber that created the entry; to wake up when the WAL write is done. */
>>> +    struct fiber *owner;
>>>        /**
>>> -     * The journal entry itself. Note since
>>> -     * it has unsized array it must be the
>>> -     * last entry in the structure.
>>> +     * The base journal entry. It has an unsized array and thus must be
>>> +     * the last member in the structure. This is worked around via a
>>> +     * union adding the needed tail as char[].
>>>         */
>>> -    struct journal_entry journal_entry;
>>> +    union {
>>> +        struct journal_entry base;
>>> +        char base_buf[sizeof(base) + sizeof(base.rows[0])];
>>> +    };
>>>    };
>> I don't understand this union stuff.
>> The journal_entry is the last entry in synchro_entry anyway.
>> Is this a hack for allowing to allocate synchro_entry on the stack?
> Yes. Otherwise I would need a more ugly hack:
>
> 	struct synchro_entry *e;
> 	alignas(alignof(*e)) char e_buf[sizeof(*e) + sizeof(e->base.rows[0])];
> 	e = (struct synchro_entry *)e_buf;
>
> Then I managed to create the union hack, which I find
> less ugly, and it seems to allow declaring it on the stack.
> The interesting thing though is that I still can't make that
> union not a last member.
>
> I also see now there is a problem in one of the CI jobs:
> https://github.com/tarantool/tarantool/runs/2291961909
>
> It complains about an out-of-bounds array access, which has nothing to
> do with the union, just a C++ issue I suppose. I fixed it in the new
> version, see the other email.

Ok, I see now. Thanks!

>
>>> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
>>> new file mode 100644
>>> index 000000000..3a32626f2
>>> --- /dev/null
>>> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
>>> @@ -0,0 +1,129 @@
>>> +--
>>> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
>>> +-- As a result it could happen that the transactions became visible on CONFIRM,
>>> +-- then somehow weren't written to WAL, and on restart the data might not be
>>> +-- visible again. That would mean a rollback of confirmed data, which is not
>>> +-- acceptable (as opposed to commit after rollback, which is fine).
>>> +--
>>> +test_run = require('test_run').new()
>>> +fiber = require('fiber')
>>> +old_synchro_quorum = box.cfg.replication_synchro_quorum
>>> +old_synchro_timeout = box.cfg.replication_synchro_timeout
>>> +
>>> +box.schema.user.grant('guest', 'super')
>>> +
>>> +s = box.schema.space.create('test', {is_sync = true})
>>> +_ = s:create_index('pk')
>>> +
>>> +test_run:cmd('create server replica with rpl_master=default,\
>>> +              script="replication/gh-5213-replica.lua"')
>>> +test_run:cmd('start server replica')
>>> +
>>> +test_run:switch('replica')
>>> +assert(box.info.id == 2)
>>> +lsn = box.info.vclock[1]
>>> +
>>> +test_run:switch('default')
>>> +fiber = require('fiber')
>>> +box.cfg{                                                                        \
>>> +    replication_synchro_quorum = 3,                                             \
>>> +    replication_synchro_timeout = 1000,                                         \
>>> +}
>>> +f = fiber.new(function() s:replace{1} end)
>>> +
>>> +test_run:switch('replica')
>>> +-- Wait when the transaction is written to WAL.
>>> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
>> This shouldn't go wrong, but I find
>> test_run:wait_lsn('replica', 'default') more durable.
>> And you wouldn't need to save lsn above then.
>>
>> Same for other usages of lsn = ... and wait_cond(box.info.vclock[1] == lsn + ...)
>>
>> Up to you though.
> I totally forgot about wait_lsn(), thanks for noticing.
> I applied it. See the full diff in the second email.

Thanks for the fixes, LGTM.

-- 
Serge Petrenko


