[tarantool-patches] Re: [PATCH v2 2/4] wal: wal memory buffer

Georgy Kirichenko georgy at tarantool.org
Fri Sep 20 11:10:55 MSK 2019


On Friday, September 20, 2019 1:02:13 AM MSK Vladislav Shpilevoy wrote:
> Thanks for the patch!
 
> Just to ensure that I correctly understand - it does not change WAL
> writing logic, right? You just added an intermediate buffer between
> xrow records and disk. Correct?
Exactly, it is an intermediate buffer which stores data some time after it was 
written  to disk and could be used for purposes of replication.
> 

> 'of rotated' -> 'consists of rotated', I think.
Ok
> 
> > Prerequisites: 3794
> 
> Please, drop ':' and add '#'. Without '#' it won't reference the issue on
> github.
Ok, thanks
> 
> > +	struct xrow_buf xrow_buf;
> 
> 1. Please, write a more descriptive comment.
> From xrow_buf struct name it is already obvious, that it
> is 'xrow buffer'. It would be better to use the comment to
> explain why is it needed. Why not to just write everything
> to WAL directly.
Accepted
> 
> >  };
> 
> 2. You should not bind to obuf parameters here. Xrow_buf != obuf.
> Direct assumption that xrow_buf_write always writes <= SMALL_OBUF_IOV_MAX
> breaks the encapsulation.
As it was specified in comments a xrow_buf writes a whole transaction in one 
chunk which is obuf based. Without this assumption it could be possible that 
the last xrow_buf transaction could not fit into a SMALL_OBUF_IOV_MAX number of 
iovs.
So we could base on this assumption or implement a cursor which allows us to 
iterate over the last transaction and fetch iovs or iovecs one by one. What 
would you like to choose?
> 
> > +			xrow_buf_tx_begin(&writer->xrow_buf, &writer->vclock);
> 
> 3. Why don't you have 'commit' function? There is xrow_buf_tx_begin,
> xrow_buf_tx_rollback, but I don't see a commit. Seems like the API
> is inconsistent.
Because there is nothing to do. But I understood your point and will add a 
commit function in order to have a symmetric API
> 
> 4. If this was the last iteration, you don't call rollback() nor write()
> after begin(). I don't think it is ok.
I could not understand you pretty clear what did you mean?
After last iteration all data is in a xrow_buffer and already written to a 
xlog. The only think we should do is to call a xlog_flush to flush data from an 
xlog internals to a disk.
> 
> >  		}
> >  		/* rc == 0: the write is buffered in xlog_tx */
> >  	
> >  	}
> > 
> > +
> > 
> >  	rc = xlog_flush(l);
> > 
> > -	if (rc < 0)
> > +	if (rc < 0) {
> > +		xrow_buf_tx_rollback(&writer->xrow_buf);
> > 
> >  		goto done;
> > 
> > -
> > +	}
> > 
> >  	writer->checkpoint_wal_size += rc;
> >  	last_committed = stailq_last(&wal_msg->commit);
> >  	vclock_merge(&writer->vclock, &vclock_diff);
> > 
> > diff --git a/src/box/xlog.c b/src/box/xlog.c
> > index 8254cce20..838b3e56d 100644
> > --- a/src/box/xlog.c
> > +++ b/src/box/xlog.c
> > @@ -1325,10 +1319,33 @@ xlog_write_row(struct xlog *log, const struct
> > xrow_header *packet)> 
> >  			return -1;
> >  		
> >  		}
> >  	
> >  	}
> > 
> > +	return obuf_size(&log->obuf) - old_size;
> > +}
> > +
> > +/*
> > + * Add a row to a log and possibly flush the log.
> > + *
> > + * @retval  -1 error, check diag.
> > + * @retval >=0 the number of bytes written to buffer.
> > + */
> > +ssize_t
> > +xlog_write_row(struct xlog *log, const struct xrow_header *packet)
> > +{
> > +	if (xlog_write_prepare(log) != 0)
> > +		return -1;
> > +
> > +	/** encode row into iovec */
> > +	struct iovec iov[XROW_IOVMAX];
> > +	/** don't write sync to the disk */
> > +	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
> > +	if (iovcnt < 0)
> > +		return -1;
> 
> 5. Perhaps obuf leaks here. Before your patch in case of an encode
> error the obuf was rolled back to the initial state. Now it is not.
> The same below about xlog_writev error, and in xlog_write_iov.
xlog_write_row and xlog_write_iov use xlog_writev to put data into an obuf.
xlog_writev restores the obuf to svp in case of an error. After that, if xlog 
want to flush a transaction (disregarding the fact was it triggered by auto 
flush logic or an explicit xlog_flush call) xlog_tx_write is gooing to be 
called. And this function calls obuf_reset in any case whether the data was 
written successfully or not.

> > +ssize_t
> > +xlog_write_iov(struct xlog *log, struct iovec *iov, int iovcnt, int rows)
> 
> 6. You have functions xlog_writev() and xlog_write_iov(), but how are they
> different? I don't understand. Perhaps you could find more accurate names?
xlog_write_v is an internal function which just push data to xlog internal 
buffer. And xlog_write_iov is a wrapper which implement xlog-flush logic around 
xlog internal data. However, I'm agreed that better names could be chosen.
> 
> > +{
> > +		return -1;
> 
> 7. You now have two functions which are > 50% exactly the same. Can it
> be simplified like this?
Accepted, will try
> 
> > +	XROW_BUF_CHUNK_DATA_LIMIT = 1 << 19,
> 
> 8. From where did you get these numbers? Is there any explanation,
> why they are they?
These numbers are empirical, however, I could add some explanation why this 
number were chosen. Also I think that the beast way is to implement in-memory 
replication and the perform some performance testing in order to have the best 
values.
> 
> 9. Why do you need initial/limit for row count? Why not to use
> a fixed buffer for rows? IMO, 4096 vs 8192 isn't a big difference.
> You could just always use 8192. Or 4096. Not sure there will be any
> difference except the code will be simpler.
Because a transaction should not be beaten into pieces when xrow buffer 
rotated. If we chosen a predefined row count then we could not guarantee that.
> 
> > +};
> > +
> > +void
> > +xrow_buf_create(struct xrow_buf *xrow_buf)
> > +{
> > +	int i;
> > +	for (i = 0; i < XROW_BUF_CHUNK_COUNT; ++i) {
> 
> 10. Please, put 'int i' into the 'for'.
Ok
> 
> > +		ibuf_create(&xrow_buf->chunk[i].rows, &cord()->slabc,
> > +			    XROW_BUF_CHUNK_INITIAL_ROW_COUNT *
> > +			    sizeof(struct xrow_header));
> 
> 11. Why do you store xrows by value? I thought, that 'rows' is an
> array of pointers. As far as I understand, the xrows are created
> in tx thread by a transaction, and from that moment can be always
> passed by a pointer, no? The transaction and its memory anyway are
> alive until the WAL write is finished.
Because xrows should continue their persistence in wal memory (a xrow buffer) 
after transaction finished and cleared its data (including gc where xrows were 
encoded)
> 
> > +		obuf_create(&xrow_buf->chunk[i].data, &cord()->slabc,
> > +			    XROW_BUF_CHUNK_INITIAL_DATA_SIZE);
> > +	int i;
> > +	for (i = 0; i < XROW_BUF_CHUNK_COUNT; ++i) {
> 
> 12. The same - please, avoid declaration of variables
> in the beginning of a function. It does not make the
> code easier to read.
Ok
> 
> > +		ibuf_destroy(&xrow_buf->chunk[i].rows);
> > +		obuf_destroy(&xrow_buf->chunk[i].data);

> > then + * the oldest one is going to be trucated and the reused as the
> > youngest one.
> 13. 'trucated' -> 'truncated'.
Ok
> 
> 14. 'the reused'? Articles can't be applied to verbs. Please, rephrase.
Ok
> 
> 15. I don't understand. How can you just silently truncate the
> oldest chunk? I thought, that it contains sensible data, and can't
> be reused until written to a disk.
This described in the next patch which implement a xrow buffer cursor.
But let me write some words about:
A xrow_buf track two numbers - the first and the last used buffer generations. 
So if there is an rotation both numbers going to be increased. Also a xrow 
buffer cursor maintains its current buffer generation. So when cursor wants to 
continue it checks before a data access if the cursor still in xrow buffer data 
ranges. 
> 
> > + */
> > +static struct xrow_buf_chunk *
> > +xrow_buf_rotate(struct xrow_buf *xrow_buf, struct vclock *vclock)
> 
> 16. Why 'rotate' operation needs a vclock? It does not write
> anything. If you need it for a trigger, why can't you use
> vclock from the rotated chunk? Also I don't understand, what is
> this vclock - is it a new vclock which is going to be written
> after the rotation? Or is it a vclock of the rotated buffer?
> Or what? In the on_rotate trigger I don't see any explanation
> of what arguments it takes.
In a chunk the vclock means the vclock of the first row in the chunk. It allows 
us to determine which is the oldest row in the buffer (aas vclock of the first 
used buffer)
> 
> > +{
> > +	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
> > +		xrow_buf->last_chunk_gen % XROW_BUF_CHUNK_COUNT;
> > +	/* Check the current chunk could accept new data. */
> > +	if (ibuf_used(&chunk->rows) < XROW_BUF_CHUNK_ROW_LIMIT *
> > +				      sizeof(struct xrow_header) &&
> > +	    obuf_size(&chunk->data) < XROW_BUF_CHUNK_DATA_LIMIT)
> 
> 17. So it is not 'rotate'. It is rather 'reserve', because you
> check if there is enough space.
Reserve is less possible because the function does not actually reserve space 
to encode rows.
> 
> 18. As 'new data' you mean one new row? Because what if I want to
> write 20 rows? Then the check above won't guarantee anything.
> xrow_buf_rotate() is used from xrow_buf_tx_begin(). After
> begin() you write many rows in wal.c:wal_write_to_disk().
'New data' means a transaction, and xrow buffure would not perform a rotation 
until _rotate was not called. I think I should rename LIMIT to THRESHOLD.
> 
> > +		return chunk;
> > +	/* Save the current xrow buffer state. */
> > +	xrow_buf->tx_first_row_index = ibuf_used(&chunk->rows) /
> > +				      sizeof(struct xrow_buf_row_info);
> 
> 19. In xrow_buf_create() you do
> 
>     ibuf_create(&rows, ..., ... * sizeof(struct xrow_header))
> 
> It means, that ibuf 'rows' stores 'xrow_header' objects.
> And here + other places you treat 'rows' as a container of
> 'xrow_buf_row_info' objects. Why? It does not look right.
You are absolutely right, I should use xrow_buf_row_info here, thank you.
> 
> > +int
> > +xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
> > +	       struct xrow_header **end,
> 
> 20. Why not to pass count instead of end pointer? Count is known in the
> caller function wal_write_to_disk(), and you calculate count below
> using 'end - begin' anyway.
I inherited the API from wal_assing_lsn, and it does not matter, I will do 
that if you wish.
> 
> > +	       struct iovec *iovec)
> > +{
> > +	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
> > +		xrow_buf->last_chunk_gen % XROW_BUF_CHUNK_COUNT;
> > +
> > +	/* Save current postion to restore the state in case of an error. */
> 
> 21. 'postion' -> 'position'.
Ok
> 
> > +	size_t old_rows_size = ibuf_used(&chunk->rows);

> > +
> > +	/* Write rows. */
> > +	struct xrow_header **row;
> 
> 22. Please, move it inside 'for' first line.
Ok
> 
> > +	for (row = begin; row < end; ++row, ++mem_info) {
> > +		struct errinj *inj = errinj(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT);

> > +		mem_info->size = iov[0].iov_len;
> > +		/* Copy bodies and patch it's location. */
> 
> 23. Whose location? Did you mean 'their location', of these
> bodies?
We copied xrow_header which includes two iov with xrow bodies. As we also 
encode bodies to a different place we should adjust copied data, I will change 
the comment above.
> 
> > +		int i;
> 
> 24. Please, move it inside 'for'.
Ok
> 
> > +		for (i = 1; i < iov_cnt; ++i) {
> > +			data = obuf_alloc(&chunk->data, iov[i].iov_len);
> > +			memcpy(data, iov[i].iov_base, iov[i].iov_len);
> > +			mem_info->xrow.body[i - 1].iov_base = data;
> > +			mem_info->size += iov[i].iov_len;
> > +		}
> > +	}
> > +
> > +	/* Copy all obuf lines containing the written data. */
> > +	int iov_cnt = 1 + obuf_iovcnt(&chunk->data) - data_svp.pos;
> > +	memcpy(iovec, chunk->data.iov + data_svp.pos,
> > +	       sizeof(struct iovec) * iov_cnt);
> > +	/* Adjust first iover member starting pointer and length. */
> 
> 25. What is 'iover'?
iovec, I'll fix it
> 
> > +	iovec[0].iov_base += data_svp.iov_len;
> > +	iovec[0].iov_len -= data_svp.iov_len;
> > +	return iov_cnt;
> > +
> > +error:
> > +	/* Restore buffer state. */
> 
> 26. Sorry, but such comments are quite useless. Comments
> should not narrate code. Otherwise they just pad the code
> out, and make it harder to follow. Please, drop such
> water-comments.
Ok
> 
> > +	chunk->rows.wpos = chunk->rows.rpos + old_rows_size;
> > +	obuf_rollback_to_svp(&chunk->data, &data_svp);
> > +	return -1;
> > +}
> > +
> > diff --git a/src/box/xrow_buf.h b/src/box/xrow_buf.h
> > new file mode 100644
> > index 000000000..76f9bc6f8
> > --- /dev/null
> > +++ b/src/box/xrow_buf.h
> > @@ -0,0 +1,138 @@
> > +#ifndef TARANTOOL_XROW_BUF_H_INCLUDED
> > +#define TARANTOOL_XROW_BUF_H_INCLUDED
> > +/*
> > + * Copyright 2010-2019, Tarantool AUTHORS, please see AUTHORS file.
> > + *
> > + * Redistribution and use in source and binary forms, with or
> > + * without modification, are permitted provided that the following
> > + * conditions are met:
> > + *
> > + * 1. Redistributions of source code must retain the above
> > + *    copyright notice, this list of conditions and the
> > + *    following disclaimer.
> > + *
> > + * 2. Redistributions in binary form must reproduce the above
> > + *    copyright notice, this list of conditions and the following
> > + *    disclaimer in the documentation and/or other materials
> > + *    provided with the distribution.
> > + *
> > + * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
> > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> > + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
> > + * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> > + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> > + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> > + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> > + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + */
> > +#include <stdint.h>
> > +
> > +#include "small/ibuf.h"
> > +#include "small/obuf.h"
> > +#include "small/rlist.h"
> > +#include "xrow.h"
> > +#include "vclock.h"
> > +
> > +enum {
> > +	/*
> > +	 * Xrow memory object contains some count of rotating data chunks.
> > +	 * Every rotation estimated decrease in amount of stored rows is
> > +	 * about 1/(COUNT OF CHUNKS). However the bigger value makes rotation
> > +	 * more frequent, the decrease would be smoother and size of
> > +	 * a wal memory more stable.
> 
> 27. Please, limit max length of a comment string to 66 symbols. It
> helps reading, no jokes. In the C file too.
Ok
> 
> > +	 */
> > +	XROW_BUF_CHUNK_COUNT = 8,
> > +};
> > +
> > +/*
> > + * Each stored xrow is described by structure which contains decoded xrow
> > + * header and encoded data pointer and size.
> > + */
> > +struct xrow_buf_row_info {
> 
> 28. This structure is not used outside xrow_buf.c. Why do you need it
> in the header file?
Sorry, it is an artifact from previous version.
> 
> > +	/* Decoded xrow header. */
> > +	struct xrow_header xrow;
> > +	/* Pointer to the xrow encoded raw data. */
> > +	void *data;
> > +	/* xrow raw data size. */
> > +	size_t size;
> > +};
> > +
> > +/*
> > + * Xrow memory data chunk info contains
> > + *  a vclock just before the first stored row,
> > + *  an ibuf with row descriptors
> > + *  an obuf with encoded data
> 
> 29. I don't think it really makes sense to just dupicate here
> the comments of each attribute of the structure. Please,
> better use that space to explain, why you need vclock, why
> 'before first row' instead of just 'first row', or 'last row'.
> 
> What is a 'row descriptor'?
> 
> Why do you need both initial xrows and encoded? I thought, that
> xrow_buf stores only encoded rows.
Relay will need initial xrows to perform filtering (replication group, lsn and 
e.t.c.) without xrow encoded body to decode.
> 
> > + */
> > +struct xrow_buf_chunk {
> > +	/* vclock just before the first row. */
> > +	struct vclock vclock;
> > +	/* A row descriptor array. */
> > +	struct ibuf rows;
> > +	/* Data storage for encoded row data. */
> > +	struct obuf data;
> > +};
> > +
> > +/*
> > + * Xrow memory contains XROW_BUF_CHUNK_COUNT chunks in a ring. Xrow
> > memory
> > + * tracks the first and the last used chunk generation. The generation is
> > a + * monotonic sequence increasing every time when a chunk rotated on
> > the top of + * a xrow memory. So each xrow memory rotation increases the
> > last used chunk
> 30. What does it mean 'rotated on top of memory'?
> 
> > + * generation and each chunk discard increases the first used chunk
> > generation.
> 31. How are these generations used? As I see from the code, they are just
> 'begin' and 'end' indexes in the ring. Then why so complex names?
> 
> > + * To evaluate an effective chunk index from the generation a modulo
> > operation + * (or mask) should be used.
> > + */
> > +struct xrow_buf {
> > +	/* A generation of the first used chunk. */
> > +	uint32_t first_chunk_gen;
> > +	/* A generation of the last used chunk. */
> > +	uint32_t last_chunk_gen;
> 
> 32. I am confused. Is 'generation' attribute of a chunk, or of the
> whole buffer?
I think, I'll try to improve my comments.
> 
> > +	/* A chunks array. */
> > +	struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
> > +	/* The first row index written in the current transaction. */
> > +	uint32_t tx_first_row_index;
> > +	/* The first row data svp written in the current transaction. */
> > +	struct obuf_svp tx_first_row_svp;
> > +	/* A trigger to fire on rotation event. */
> > +	struct rlist on_rotate;
> 
> 33. This attribute is unused. It is initialized, and trigger_run is
> called, but no one trigger is ever added to this list.
It is an artifact, I'll purge it.
> 
> > +};
> > +
> > +/* Create a wal memory. */
> > +void
> > +xrow_buf_create(struct xrow_buf *xrow_buf);
> > +
> > +/* Destroy wal memory structure. */
> > +void
> > +xrow_buf_destroy(struct xrow_buf *xrow_buf);
> > +
> > +/*
> > + * Start a xrow buffer transaction.
> > + * This function may rotate the xrow buffer and use the vclock as a top
> > chunk + * starting vclock.
> > + */
> > +void
> > +xrow_buf_tx_begin(struct xrow_buf *xrow_buf, struct vclock *vclock);
> 
> 34. Why is not it 'const struct vclock'? How can it be changed?
> What is more, this argument seems to be unused, because it is
> passed to xrow_buf_rotate, which passes it to on_rotate trigger
> list, which is always empty. Or I just didn't find where you put
> anything into that list.
I'll remove it to, thanks
> 
> > +
> > +/* Discard all the data written after the last transaction. */
> > +void
> > +xrow_buf_tx_rollback(struct xrow_buf *xrow_buf);
> > +
> > +/*
> > + * Append xrow array to a wal memory. The array is placed into one
> > + * wal memory buffer and each row takes a continuous space in a data
> > buffer. + * continuously. The start pointer and length of encoded data
> > are placed to
> 35. I gave up trying to explain, that sentenses should start with a capital
> letter even in small comments. But here it looks too bad. Please, either
> drop 'continuously.', because it just repeats the previous sentence, or use
> 'C' in the beginning.
It is just a typo.
> 
> > + * out.
> 
> 36. What is 'out'? I don't see out-parameters in this function.
> 
> > + * Return
> 
> 37. If you want to document returned values, then please, use @retval
> and @return.
Ok
> 
> > + *  count of written iovec in case of Ok
> > + *  -1 in case of error
> > + */
> > +int
> > +xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
> > +	       struct xrow_header **end,
> > +	       struct iovec *iovec);
> > +
> > +#endif /* TARANTOOL_XROW_BUF_H_INCLUDED */








More information about the Tarantool-patches mailing list