[tarantool-patches] Re: [PATCH v2 2/4] wal: wal memory buffer

Sat Sep 21 01:58:45 MSK 2019

A common question on the whole patch: why won't lsregion here
work better than a ring of obufs/ibufs? lsregion works like a ring
but of a dynamic size, and perhaps implements everything you need.

>> 2. You should not bind to obuf parameters here. Xrow_buf != obuf.
>> Direct assumption that xrow_buf_write always writes <= SMALL_OBUF_IOV_MAX
>> breaks the encapsulation.
> As it was specified in comments a xrow_buf writes a whole transaction in one 
> chunk which is obuf based.

Which is obuf based *now*. And not necessarily will be in future. Even in a
next version of this patch.

> Without this assumption it could be possible that 
> the last xrow_buf transaction could not fit into a SMALL_OBUF_IOV_MAX number of 
> iovs.

I am not against having a fixed size array of iovec structures. I am
against direct usage of obuf constants here. It is a very non-trivial link
which a next programmer of this code would need to find - what to pass
to xrow_buf_write()?, why does it take iovec array without its max size?,
why do you allocate an array of iovecs with fixed size using an obuf
constant while this code has nothing to do with obuf?

If you need a constant - please, declare a new one, starting with XROW_BUF_
prefix, with a static assertion that it is <= SMALL_OBUF_IOV_MAX, and why.

> So we could base on this assumption or implement a cursor which allows us to 
> iterate over the last transaction and fetch iovs or iovecs one by one. What 
> would you like to choose?

Iterators? No, I think we just need to improve the version with a constant
size iovec array. To make it more obvious what and why is used.

>>
>>> +			xrow_buf_tx_begin(&writer->xrow_buf, &writer->vclock);
>>
>> 3. Why don't you have 'commit' function? There is xrow_buf_tx_begin,
>> xrow_buf_tx_rollback, but I don't see a commit. Seems like the API
>> is inconsistent.
> Because there is nothing to do. But I understood your point and will add a 
> commit function in order to have a symmetric API
>>
>> 4. If this was the last iteration, you don't call rollback() nor write()
>> after begin(). I don't think it is ok.
> I could not understand you pretty clear what did you mean?

I mean the same as above - the API is inconsistent. It is very unnatural,
when code does 'begin()' but nether does 'end()' or 'commit()'.

> After last iteration all data is in a xrow_buffer and already written to a 
> xlog. The only think we should do is to call a xlog_flush to flush data from an 
> xlog internals to a disk.
>>
>>>  		}
>>>  		/* rc == 0: the write is buffered in xlog_tx */
>>>  	
>>>  	}
>>>
>>> +
>>>
>>>  	rc = xlog_flush(l);
>>>
>>> -	if (rc < 0)
>>> +	if (rc < 0) {
>>> +		xrow_buf_tx_rollback(&writer->xrow_buf);
>>>
>>>  		goto done;
>>>
>>> -
>>> +	}
>>>
>>>  	writer->checkpoint_wal_size += rc;
>>>  	last_committed = stailq_last(&wal_msg->commit);
>>>  	vclock_merge(&writer->vclock, &vclock_diff);
>>>
>>> diff --git a/src/box/xlog.c b/src/box/xlog.c
>>> index 8254cce20..838b3e56d 100644
>>> --- a/src/box/xlog.c
>>> +++ b/src/box/xlog.c
>>> @@ -1325,10 +1319,33 @@ xlog_write_row(struct xlog *log, const struct
>>> xrow_header *packet)> 
>>>  			return -1;
>>>  		
>>>  		}
>>>  	
>>>  	}
>>>
>>> +	return obuf_size(&log->obuf) - old_size;
>>> +}
>>> +
>>> +/*
>>> + * Add a row to a log and possibly flush the log.
>>> + *
>>> + * @retval  -1 error, check diag.
>>> + * @retval >=0 the number of bytes written to buffer.
>>> + */

I just noticed, that you have 2 comments for this function -
in the .c and .h file. Please, keep only the header's one.

>>> +ssize_t
>>> +xlog_write_row(struct xlog *log, const struct xrow_header *packet)
>>> +{
>>> +	if (xlog_write_prepare(log) != 0)
>>> +		return -1;
>>> +
>>> +	/** encode row into iovec */
>>> +	struct iovec iov[XROW_IOVMAX];
>>> +	/** don't write sync to the disk */
>>> +	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
>>> +	if (iovcnt < 0)
>>> +		return -1;
>>
>> 5. Perhaps obuf leaks here. Before your patch in case of an encode
>> error the obuf was rolled back to the initial state. Now it is not.
>> The same below about xlog_writev error, and in xlog_write_iov.
> xlog_write_row and xlog_write_iov use xlog_writev to put data into an obuf.
> xlog_writev restores the obuf to svp in case of an error.

Exactly. It restores it to svp created inside xlog_writev. But you
allocate memory before xlog_writev() - xlog_write_prepare() does
obuf_alloc(&log->obuf, XLOG_FIXHEADER_SIZE). If xlog_writev() fails
afterwards, this memory is not freed.

> After that, if xlog 
> want to flush a transaction (disregarding the fact was it triggered by auto 
> flush logic or an explicit xlog_flush call) xlog_tx_write is gooing to be 
> called. And this function calls obuf_reset in any case whether the data was 
> written successfully or not.
> >> 15. I don't understand. How can you just silently truncate the
>> oldest chunk? I thought, that it contains sensible data, and can't
>> be reused until written to a disk.
> This described in the next patch which implement a xrow buffer cursor.
> But let me write some words about:
> A xrow_buf track two numbers - the first and the last used buffer generations. 
> So if there is an rotation both numbers going to be increased. Also a xrow 
> buffer cursor maintains its current buffer generation. So when cursor wants to 
> continue it checks before a data access if the cursor still in xrow buffer data 
> ranges. 

Sorry, but I still don't understand. Firstly, at rotation only the last
generation is increased. The first one may stay unchanged.

Secondly, look at the code:

> 	++xrow_buf->last_chunk_gen;
> 	chunk = xrow_buf->chunk + xrow_buf->last_chunk_gen %
> 				  XROW_BUF_CHUNK_COUNT;
> 	/* Check there is an unused chunk. */
> 	if (xrow_buf->last_chunk_gen - xrow_buf->first_chunk_gen >=
> 	    XROW_BUF_CHUNK_COUNT) {
> 		/* Discard the chunk data and adjust first chunk generation. */
> 		ibuf_reset(&chunk->rows);
> 		obuf_reset(&chunk->data);
> 		++xrow_buf->first_chunk_gen;
> 	}

Why 'last - first >= CHUNK_COUNT' means, that there are free chunks?
How is it possible at all, that the last pointer has gone from the first
pointer on more than the ring size steps? It is a ring. If last goes
from first on more that 'CHUNK_COUNT', it already overrides the first.

As far as I understand a ring buffer idea, it should be vice versa -
if last - first < chunk count, then there are free chunks.

Thirdly, why in case of having a free chunk you discard the last buffer,
but move first chunk gen? You do 'ibuf/obuf_reset(&chunk...' instead of
'ibuf/obuf_reset(&(xrow_buf->chunk[xrow_buf->first_chunk_gen % XROW_BUF_CHUNK_COUNT]))'.

>>
>>> + */
>>> +static struct xrow_buf_chunk *
>>> +xrow_buf_rotate(struct xrow_buf *xrow_buf, struct vclock *vclock)
>>
>> 16. Why 'rotate' operation needs a vclock? It does not write
>> anything. If you need it for a trigger, why can't you use
>> vclock from the rotated chunk? Also I don't understand, what is
>> this vclock - is it a new vclock which is going to be written
>> after the rotation? Or is it a vclock of the rotated buffer?
>> Or what? In the on_rotate trigger I don't see any explanation
>> of what arguments it takes.
> In a chunk the vclock means the vclock of the first row in the chunk.

Of the first row? But in the comment in xrow_buf.h I see
"/* vclock just before the first row. */". So before the first
row, or of the first row?

> It allows 
> us to determine which is the oldest row in the buffer (aas vclock of the first 
> used buffer)

It is worth adding to a comment in xrow_buf.h somewhere.

>>> +int
>>> +xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
>>> +	       struct xrow_header **end,
>>
>> 20. Why not to pass count instead of end pointer? Count is known in the
>> caller function wal_write_to_disk(), and you calculate count below
>> using 'end - begin' anyway.
> I inherited the API from wal_assing_lsn, and it does not matter, I will do 
> that if you wish.

Up to you. But now it looks strange, IMO. You calculate 'end' in the
caller function so as to just calculate 'count' again inside this
function. Talking of what Kostja said:

"
    It's the general convention across the rest of the code to pass
    arrays as a pair of pointers.
"

It is not true, we use start-and-end pointers only for char buffers,
and even for them we often switch back to start-and-size pair. Xrow.h
contains examples: xrow_encode_auth() uses start-and-size,
xrow_header_decode() uses start-and-end. Key_def.h contains examples:
key_def_encode_parts() uses start-and-size, tuple_extract_key_raw()
uses start-and-end. And many similar places. So I guess we use what is
easier for each case. For this one I think count is easier.

>>> +		for (i = 1; i < iov_cnt; ++i) {
>>> +			data = obuf_alloc(&chunk->data, iov[i].iov_len);
>>> +			memcpy(data, iov[i].iov_base, iov[i].iov_len);
>>> +			mem_info->xrow.body[i - 1].iov_base = data;
>>> +			mem_info->size += iov[i].iov_len;
>>> +		}
>>> +	}
>>> +
>>> +	/* Copy all obuf lines containing the written data. */
>>> +	int iov_cnt = 1 + obuf_iovcnt(&chunk->data) - data_svp.pos;
>>> +	memcpy(iovec, chunk->data.iov + data_svp.pos,
>>> +	       sizeof(struct iovec) * iov_cnt);
>>> +	/* Adjust first iover member starting pointer and length. */
>>
>> 25. What is 'iover'?
> iovec, I'll fix it
>>

Ok, now I understand the word, but still don't understand the
comment. Why do you adjust it? Why with starting pointer
and length? Why only the first iovec?
Do you do it because before xlog_write the chunk might has
already stored something, and you don't want to return it
again?
If so, then why can't the old data occupy more than one
first iovec which you adjust?

>>> +	/* Decoded xrow header. */
>>> +	struct xrow_header xrow;
>>> +	/* Pointer to the xrow encoded raw data. */
>>> +	void *data;
>>> +	/* xrow raw data size. */
>>> +	size_t size;
>>> +};
>>> +
>>> +/*
>>> + * Xrow memory data chunk info contains
>>> + *  a vclock just before the first stored row,
>>> + *  an ibuf with row descriptors
>>> + *  an obuf with encoded data
>>
>> 29. I don't think it really makes sense to just dupicate here
>> the comments of each attribute of the structure. Please,
>> better use that space to explain, why you need vclock, why
>> 'before first row' instead of just 'first row', or 'last row'.
>>
>> What is a 'row descriptor'?
>>
>> Why do you need both initial xrows and encoded? I thought, that
>> xrow_buf stores only encoded rows.
> Relay will need initial xrows to perform filtering (replication group, lsn and 
> e.t.c.) without xrow encoded body to decode.

Hm, it looks quite expensive just for filtering. But assume it is ok.
1) Why can't you store only a header instead of the whole xrow with a
body? As I understand, for any filtering the header is enough. Or not?

2) Suppose, that you found a row which should be filtered out. That
row is already encoded somewhere inside obuf of that chunk. How will
you find where is it encoded so as not to send this part to a replica?
Where is that mapping of index in ibuf rows <-> range in obuf data?

Actually, after I wrote the second question, I looked at the code again,
and found xrow_buf_row_info.data. Is it this pointer I was talking about?
If so, then it is super-ultra not obvious, that it is a pointer at some
place inside xrow_buf_chunk.data. Such a link should be documented in a
very detailed comment.