[Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Fri Apr 9 01:56:59 MSK 2021


Hi! Thanks for the review!

> Thanks for the patch!
> 
> I'm a bit worried about two different synchro rows coming from two
> appliers. Is everything ok in this case?
> Or even normal rows coming from other appliers. Say some other replica
> has already applied this synchro row and even has written some rows on
> top of it. Its applier won't block on replica_id latch, and may fail to apply
> some txs following this synchro row, because it's not yet written to WAL
> and thus not applied (limbo is still not empty or belongs to other instance).

Should not happen, if I understood you correctly. See my response in the
other email + a test for test-run + a console test.

>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 971b2e64c..f7c82548c 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>>   }
>>     struct synchro_entry {
>> -    /** Encoded form of a synchro record. */
>> -    struct synchro_body_bin    body_bin;
>> -
>> -    /** xrow to write, used by the journal engine. */
>> -    struct xrow_header row;
>> -
>> +    /** Request to process when WAL write is done. */
>> +    struct synchro_request *req;
>> +    /** Fiber created the entry. To wakeup when WAL write is done. */
>> +    struct fiber *owner;
>>       /**
>> -     * The journal entry itself. Note since
>> -     * it has unsized array it must be the
>> -     * last entry in the structure.
>> +     * The base journal entry. It has unsized array and then must be the
>> +     * last entry in the structure. But can workaround it via a union
>> +     * adding the needed tail as char[].
>>        */
>> -    struct journal_entry journal_entry;
>> +    union {
>> +        struct journal_entry base;
>> +        char base_buf[sizeof(base) + sizeof(base.rows[0])];
>> +    };
>>   };
> 
> I don't understand this union stuff.
> The journal_entry is the last entry in synchro_entry anyway.
> Is this a hack for allowing to allocate synchro_entry on the stack?

Yes. Otherwise I would need an uglier hack:

	struct synchro_entry *e;
	alignas(alignof(*e)) char e_buf[sizeof(*e) + sizeof(e->base.rows[0])];
	e = (struct synchro_entry *)e_buf;

Then I managed to create the union hack, which I find less ugly, and it
does allow declaring the entry on the stack. The interesting thing,
though, is that I still can't make that union anything but the last
member.
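
For the record, here is a minimal standalone sketch of the trick with
made-up names (not the real journal_entry/synchro_entry). It relies on
the same GCC/clang extensions as the real code: a flexible array member
in C++ and embedding such a struct into another aggregate.

	#include <stdio.h>

	struct row {
		int data;
	};

	struct entry {
		int n_rows;
		/* Unsized trailing array, must be the last member. */
		struct row *rows[];
	};

	struct wrapper {
		/* Any other members go before the union. */
		int some_state;
		/*
		 * The union still has to be the last member, but the
		 * char buffer reserves room for 'base' plus exactly
		 * one row pointer, so a plain local variable has
		 * enough storage.
		 */
		union {
			struct entry base;
			char base_buf[sizeof(base) + sizeof(base.rows[0])];
		};
	};

	int
	main(void)
	{
		struct row r = {42};
		struct wrapper w;
		w.some_state = 0;
		w.base.n_rows = 1;
		w.base.rows[0] = &r;
		printf("%d\n", w.base.rows[0]->data);
		return 0;
	}

Compiled with g++ this runs fine: 'w' is a plain local, and the
rows[0] access stays inside the storage reserved by base_buf.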

I also see now there is a problem in one of the CI jobs:
https://github.com/tarantool/tarantool/runs/2291961909

It complains about an out-of-bounds array access, which has nothing to
do with the union; just a C++ issue, I suppose. I fixed it in the new
version, see the other email.

>> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
>> new file mode 100644
>> index 000000000..3a32626f2
>> --- /dev/null
>> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
>> @@ -0,0 +1,129 @@
>> +--
>> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
>> +-- As a result it could happen that the transactions became visible on CONFIRM,
>> +-- then somehow weren't written to WAL, and on restart the data might not be
>> +-- visible again. Which means rollback of confirmed data and is not acceptable
>> +-- (on the contrary with commit after rollback).
>> +--
>> +test_run = require('test_run').new()
>> +fiber = require('fiber')
>> +old_synchro_quorum = box.cfg.replication_synchro_quorum
>> +old_synchro_timeout = box.cfg.replication_synchro_timeout
>> +
>> +box.schema.user.grant('guest', 'super')
>> +
>> +s = box.schema.space.create('test', {is_sync = true})
>> +_ = s:create_index('pk')
>> +
>> +test_run:cmd('create server replica with rpl_master=default,\
>> +              script="replication/gh-5213-replica.lua"')
>> +test_run:cmd('start server replica')
>> +
>> +test_run:switch('replica')
>> +assert(box.info.id == 2)
>> +lsn = box.info.vclock[1]
>> +
>> +test_run:switch('default')
>> +fiber = require('fiber')
>> +box.cfg{                                                                        \
>> +    replication_synchro_quorum = 3,                                             \
>> +    replication_synchro_timeout = 1000,                                         \
>> +}
>> +f = fiber.new(function() s:replace{1} end)
>> +
>> +test_run:switch('replica')
>> +-- Wait when the transaction is written to WAL.
>> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
> 
> This shouldn't go wrong, but I find
> test_run:wait_lsn('replica', 'default') more robust.
> And you wouldn't need to save lsn above then.
> 
> Same for other usages of lsn = ... and wait_cond(box.info.vclock[1] == lsn + ...)
> 
> Up to you though.

I totally forgot about wait_lsn(), thanks for noticing.
I applied it. See the full diff in the second email.

