[tarantool-patches] Re: [PATCH v2 2/4] wal: wal memory buffer

Sat Sep 21 11:20:18 MSK 2019

On Saturday, September 21, 2019 1:58:45 AM MSK Vladislav Shpilevoy wrote:
> A common question on the whole patch: why won't lsregion here
> work better than a ring of obufs/ibufs? lsregion works like a ring
> but of a dynamic size, and perhaps implements everything you need.
> 
> >> 2. You should not bind to obuf parameters here. Xrow_buf != obuf.
> >> Direct assumption that xrow_buf_write always writes <= SMALL_OBUF_IOV_MAX
> >> breaks the encapsulation.
> > 
> > As it was specified in comments a xrow_buf writes a whole transaction in
> > one chunk which is obuf based.
> 
> Which is obuf based *now*. And not necessarily will be in future. Even in a
> next version of this patch.
> 
> > Without this assumption it could be possible that
> > the last xrow_buf transaction could not fit into a SMALL_OBUF_IOV_MAX
> > number of iovs.
> 
> I am not against having a fixed size array of iovec structures. I am
> against direct usage of obuf constants here. It is a very non-trivial link
> which a next programmer of this code would need to find - what to pass
> to xrow_buf_write()?, why does it take iovec array without its max size?,
> why do you allocate an array of iovecs with fixed size using an obuf
> constant while this code has nothing to do with obuf?
> 
> If you need a constant - please, declare a new one, starting with XROW_BUF_
> prefix, with a static assertion that it is <= SMALL_OBUF_IOV_MAX, and why.
> 
> > So we could base on this assumption or implement a cursor which allows us
> > to iterate over the last transaction and fetch iovs or iovecs one by one.
> > What would you like to choose?
> 
> Iterators? No, I think we just need to improve the version with a constant
> size iovec array. To make it more obvious what and why is used.
Ok, I'll just choose a constant and use it.
> 
> >>> +			xrow_buf_tx_begin(&writer->xrow_buf, &writer->vclock);
> >> 
> >> 3. Why don't you have 'commit' function? There is xrow_buf_tx_begin,
> >> xrow_buf_tx_rollback, but I don't see a commit. Seems like the API
> >> is inconsistent.
> > 
> > Because there is nothing to do. But I understood your point and will add a
> > commit function in order to have a symmetric API
> > 
> >> 4. If this was the last iteration, you don't call rollback() nor write()
> >> after begin(). I don't think it is ok.
> > 
> > I could not understand you pretty clear what did you mean?
> 
> I mean the same as above - the API is inconsistent. It is very unnatural,
> when code does 'begin()' but nether does 'end()' or 'commit()'.
Ok, I'll add a commit stub
> 
> > After last iteration all data is in a xrow_buffer and already written to a
> > xlog. The only think we should do is to call a xlog_flush to flush data
> > from an xlog internals to a disk.
> > 
> >>>  		}
> >>>  		/* rc == 0: the write is buffered in xlog_tx */
> >>>  	
> >>>  	}
> >>> 
> >>> +
> >>> 
> >>>  	rc = xlog_flush(l);
> >>> 
> >>> -	if (rc < 0)
> >>> +	if (rc < 0) {
> >>> +		xrow_buf_tx_rollback(&writer->xrow_buf);
> >>> 
> >>>  		goto done;
> >>> 
> >>> -
> >>> +	}
> >>> 
> >>>  	writer->checkpoint_wal_size += rc;
> >>>  	last_committed = stailq_last(&wal_msg->commit);
> >>>  	vclock_merge(&writer->vclock, &vclock_diff);
> >>> 
> >>> diff --git a/src/box/xlog.c b/src/box/xlog.c
> >>> index 8254cce20..838b3e56d 100644
> >>> --- a/src/box/xlog.c
> >>> +++ b/src/box/xlog.c
> >>> @@ -1325,10 +1319,33 @@ xlog_write_row(struct xlog *log, const struct
> >>> xrow_header *packet)>
> >>> 
> >>>  			return -1;
> >>>  		
> >>>  		}
> >>>  	
> >>>  	}
> >>> 
> >>> +	return obuf_size(&log->obuf) - old_size;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Add a row to a log and possibly flush the log.
> >>> + *
> >>> + * @retval  -1 error, check diag.
> >>> + * @retval >=0 the number of bytes written to buffer.
> >>> + */
> 
> I just noticed, that you have 2 comments for this function -
> in the .c and .h file. Please, keep only the header's one.
Ok
> 
> >>> +ssize_t
> >>> +xlog_write_row(struct xlog *log, const struct xrow_header *packet)
> >>> +{
> >>> +	if (xlog_write_prepare(log) != 0)
> >>> +		return -1;
> >>> +
> >>> +	/** encode row into iovec */
> >>> +	struct iovec iov[XROW_IOVMAX];
> >>> +	/** don't write sync to the disk */
> >>> +	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
> >>> +	if (iovcnt < 0)
> >>> +		return -1;
> >> 
> >> 5. Perhaps obuf leaks here. Before your patch in case of an encode
> >> error the obuf was rolled back to the initial state. Now it is not.
> >> The same below about xlog_writev error, and in xlog_write_iov.
> > 
> > xlog_write_row and xlog_write_iov use xlog_writev to put data into an
> > obuf.
> > xlog_writev restores the obuf to svp in case of an error.
> 
> Exactly. It restores it to svp created inside xlog_writev. But you
> allocate memory before xlog_writev() - xlog_write_prepare() does
> obuf_alloc(&log->obuf, XLOG_FIXHEADER_SIZE). If xlog_writev() fails
> afterwards, this memory is not freed.
I do not see any issue here: xlog allocates XLOG_FIXHEADER_SIZE of bytes that 
are going to be reused for the next write or truncated if xlog_tx_rollback 
would be fired.
> 
> > After that, if xlog
> > want to flush a transaction (disregarding the fact was it triggered by
> > auto
> > flush logic or an explicit xlog_flush call) xlog_tx_write is gooing to be
> > called. And this function calls obuf_reset in any case whether the data
> > was
> > written successfully or not.
> > 
> > >> 15. I don't understand. How can you just silently truncate the
> >> 
> >> oldest chunk? I thought, that it contains sensible data, and can't
> >> be reused until written to a disk.
> > 
> > This described in the next patch which implement a xrow buffer cursor.
> > But let me write some words about:
> > A xrow_buf track two numbers - the first and the last used buffer
> > generations. So if there is an rotation both numbers going to be
> > increased. Also a xrow buffer cursor maintains its current buffer
> > generation. So when cursor wants to continue it checks before a data
> > access if the cursor still in xrow buffer data ranges.
> 
> Sorry, but I still don't understand. Firstly, at rotation only the last
> generation is increased. The first one may stay unchanged.
The first one may stay unchanged - only in case when there is a free chunk, 
what happens only for the first XROW_BUF_CHUNK_COUNT rotations.
> 
> Secondly, look at the code:
> > 	++xrow_buf->last_chunk_gen;
> > 	chunk = xrow_buf->chunk + xrow_buf->last_chunk_gen %
> > 	
> > 				  XROW_BUF_CHUNK_COUNT;
> > 	
> > 	/* Check there is an unused chunk. */
> > 	if (xrow_buf->last_chunk_gen - xrow_buf->first_chunk_gen >=
> > 	
> > 	    XROW_BUF_CHUNK_COUNT) {
> > 		
> > 		/* Discard the chunk data and adjust first chunk generation. */
> > 		ibuf_reset(&chunk->rows);
> > 		obuf_reset(&chunk->data);
> > 		++xrow_buf->first_chunk_gen;
> > 	
> > 	}
> 
> Why 'last - first >= CHUNK_COUNT' means, that there are free chunks?
If last - first >= CHUNCK_COUNT (however, the '>' never happens) means that 
there is no more free chunks so I discard the first one and increase its 
generation.
> How is it possible at all, that the last pointer has gone from the first
> pointer on more than the ring size steps? It is a ring. If last goes
> from first on more that 'CHUNK_COUNT', it already overrides the first.
Ok, I'll just replace > condition with an assert.
> 
> As far as I understand a ring buffer idea, it should be vice versa -
> if last - first < chunk count, then there are free chunks.
Yes, last-first < chunk_count = !( last - first >= CHUNCK_COUNT)
> 
> Thirdly, why in case of having a free chunk you discard the last buffer,
> but move first chunk gen? You do 'ibuf/obuf_reset(&chunk...' instead of
> 'ibuf/obuf_reset(&(xrow_buf->chunk[xrow_buf->first_chunk_gen %
> XROW_BUF_CHUNK_COUNT]))'.
I think you misunderstood the code - I did exactly what you suggest.
last_chunk - is not the oldest but the newest - the last chunk which was 
filled.
> >>> + */
> >>> +static struct xrow_buf_chunk *
> >>> +xrow_buf_rotate(struct xrow_buf *xrow_buf, struct vclock *vclock)
> >> 
> >> 16. Why 'rotate' operation needs a vclock? It does not write
> >> anything. If you need it for a trigger, why can't you use
> >> vclock from the rotated chunk? Also I don't understand, what is
> >> this vclock - is it a new vclock which is going to be written
> >> after the rotation? Or is it a vclock of the rotated buffer?
> >> Or what? In the on_rotate trigger I don't see any explanation
> >> of what arguments it takes.
> > 
> > In a chunk the vclock means the vclock of the first row in the chunk.
> 
> Of the first row? But in the comment in xrow_buf.h I see
> "/* vclock just before the first row. */". So before the first
> row, or of the first row?
Thanks, will check.
> 
> > It allows
> > us to determine which is the oldest row in the buffer (aas vclock of the
> > first used buffer)
> 
> It is worth adding to a comment in xrow_buf.h somewhere.
> 
> >>> +int
> >>> +xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
> >>> +	       struct xrow_header **end,
> >> 
> >> 20. Why not to pass count instead of end pointer? Count is known in the
> >> caller function wal_write_to_disk(), and you calculate count below
> >> using 'end - begin' anyway.
> > 
> > I inherited the API from wal_assing_lsn, and it does not matter, I will do
> > that if you wish.
> 
> Up to you. But now it looks strange, IMO. You calculate 'end' in the
> caller function so as to just calculate 'count' again inside this
> function. Talking of what Kostja said:
> 
> "
>     It's the general convention across the rest of the code to pass
>     arrays as a pair of pointers.
> "
> 
> It is not true, we use start-and-end pointers only for char buffers,
> and even for them we often switch back to start-and-size pair. Xrow.h
> contains examples: xrow_encode_auth() uses start-and-size,
> xrow_header_decode() uses start-and-end. Key_def.h contains examples:
> key_def_encode_parts() uses start-and-size, tuple_extract_key_raw()
> uses start-and-end. And many similar places. So I guess we use what is
> easier for each case. For this one I think count is easier.
> 
> >>> +		for (i = 1; i < iov_cnt; ++i) {
> >>> +			data = obuf_alloc(&chunk->data, iov[i].iov_len);
> >>> +			memcpy(data, iov[i].iov_base, iov[i].iov_len);
> >>> +			mem_info->xrow.body[i - 1].iov_base = data;
> >>> +			mem_info->size += iov[i].iov_len;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	/* Copy all obuf lines containing the written data. */
> >>> +	int iov_cnt = 1 + obuf_iovcnt(&chunk->data) - data_svp.pos;
> >>> +	memcpy(iovec, chunk->data.iov + data_svp.pos,
> >>> +	       sizeof(struct iovec) * iov_cnt);
> >>> +	/* Adjust first iover member starting pointer and length. */
> >> 
> >> 25. What is 'iover'?
> > 
> > iovec, I'll fix it
> 
> Ok, now I understand the word, but still don't understand the
> comment. Why do you adjust it? Why with starting pointer
> and length? Why only the first iovec?
> Do you do it because before xlog_write the chunk might has
> already stored something, and you don't want to return it
> again?
Yes, iovec could already contain previous transaction data.
> If so, then why can't the old data occupy more than one
> first iovec which you adjust?
Because data is written sequentially:
vec0:<old>
vec1:<old>
vec2:<old><new>
vec3:<new>
So I start copying from vec2 and should adjust vec2 to <old> length:
vec0` = vec2<old>!rebase to here<new>
vec1` = vec3

> 
> >>> +	/* Decoded xrow header. */
> >>> +	struct xrow_header xrow;
> >>> +	/* Pointer to the xrow encoded raw data. */
> >>> +	void *data;
> >>> +	/* xrow raw data size. */
> >>> +	size_t size;
> >>> +};
> >>> +
> >>> +/*
> >>> + * Xrow memory data chunk info contains
> >>> + *  a vclock just before the first stored row,
> >>> + *  an ibuf with row descriptors
> >>> + *  an obuf with encoded data
> >> 
> >> 29. I don't think it really makes sense to just dupicate here
> >> the comments of each attribute of the structure. Please,
> >> better use that space to explain, why you need vclock, why
> >> 'before first row' instead of just 'first row', or 'last row'.
> >> 
> >> What is a 'row descriptor'?
> >> 
> >> Why do you need both initial xrows and encoded? I thought, that
> >> xrow_buf stores only encoded rows.
> > 
> > Relay will need initial xrows to perform filtering (replication group, lsn
> > and e.t.c.) without xrow encoded body to decode.
> 
> Hm, it looks quite expensive just for filtering. But assume it is ok.
> 1) Why can't you store only a header instead of the whole xrow with a
> body? As I understand, for any filtering the header is enough. Or not?
I should also store xrow bodies and then encode the xrow as many times as 
count of replicas I have. Because of this I store not only a xrow_header 
struct but also the xrow encoded data location and size.
> 
> 2) Suppose, that you found a row which should be filtered out. That
> row is already encoded somewhere inside obuf of that chunk. How will
> you find where is it encoded so as not to send this part to a replica?
> Where is that mapping of index in ibuf rows <-> range in obuf data?
You could see it in the next patch.
> 
> Actually, after I wrote the second question, I looked at the code again,
> and found xrow_buf_row_info.data. Is it this pointer I was talking about?
> If so, then it is super-ultra not obvious, that it is a pointer at some
> place inside xrow_buf_chunk.data. Such a link should be documented in a
> very detailed comment.
Ok, will add the comment