[Tarantool-patches] [PATCH v6 3/3] iproto: move greeting from tx thread to iproto

Wed Dec 16 02:04:41 MSK 2020

Hi! Thanks for the patch!

See 4 comments below.

On 13.12.2020 00:38, Ilya Kosarev wrote:
> On connection, an evio service callback is invoked to accept it. The
> next step after acceptance was to process connection to tx thread
> through cbus. This meant that any connection interaction involves tx
> thread even before we get to send the greeting to the client.
> Consequently, the client might reach the instance with enormous number
> of connections, leading to the file descriptors limit exhaustion in
> case of unresponsive tx thread (for example, when building secondary
> index) and extra tx_process_connects afterwords, even in case the
> instance doesn't fail with too many open files error.
> This patch allows iproto to accept connection and send greeting by
> itself. Thus the connection is being established and stays in
> fetch_schema state while tx thread is unresponsive.
> 
> Closes #3776
> ---
>  src/box/iproto.cc                             | 118 +++++++++++++-----
>  test/app/gh-4787-netbox-empty-errmsg.result   |  18 ---
>  test/app/gh-4787-netbox-empty-errmsg.test.lua |   8 --
>  3 files changed, 88 insertions(+), 56 deletions(-)
> 
> diff --git a/src/box/iproto.cc b/src/box/iproto.cc
> index f7330af21d..a7da115925 100644
> --- a/src/box/iproto.cc
> +++ b/src/box/iproto.cc
> @@ -1091,6 +1093,46 @@ iproto_connection_on_output(ev_loop *loop, struct ev_io *watcher,
>  	}
>  }
>  
> +static int
> +iproto_buf_flush(struct iproto_connection *con)

1. The function lacks a comment. It is not so trivial why do we
have 2 flush functions. Also maybe worth renaming it to
iproto_connection_greeting_flush. Because it is a method of
'iproto_connection' (thus the prefix), and works with the
greeting only.

> +{
> +	struct iovec greeting;
> +	greeting.iov_base = &con->greeting_buf;
> +	greeting.iov_len = IPROTO_GREETING_SIZE;
> +	ssize_t nwr = sio_writev(con->output.fd, &greeting, 1);
> +
> +	if (nwr > 0) {
> +		/* Count statistics */
> +		rmean_collect(rmean_net, IPROTO_SENT, nwr);
> +		return 1;
> +	} else if (nwr < 0 && !sio_wouldblock(errno)) {
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +iproto_connection_on_greeting(ev_loop *loop, struct ev_io *watcher,
> +			      int /* revents */)
> +{
> +	struct iproto_connection *con =
> +		(struct iproto_connection *)watcher->data;
> +	int rc = iproto_buf_flush(con);
> +	if (rc <= 0) {
> +		if (rc == 0) {
> +			ev_io_start(loop, &con->output);
> +		} else {
> +			diag_log();
> +			con->state = IPROTO_CONNECTION_CLOSED;

2. I don't understand how this part works. You didn't close the
socket, and didn't delete it from libev. But IPROTO_CONNECTION_CLOSED
clearly states

	/**
	 * Socket was closed, a notification is sent to the TX
	 * thread to close the session.
	 */

The way you change it here does not match the description. AFAIU,
the only legal function able to set this state is
iproto_connection_close(), and it should remain so.

Moreover, I see this should crash in an assertion in
iproto_connection_close(). Because this function will be called by
net_finish_connect() due to state being CLOSED (which is weird -
why is close() called if the state is already CLOSED?). But inside
it has "assert(con->state == IPROTO_CONNECTION_ALIVE)" in case the
ev_watcher still has a descriptor - it still has, so it will go there
any crash.

Did I miss something?

> +		}
> +		return;
> +	}
> +	ev_io_stop(con->loop, &con->output);
> +	ev_io_init(&con->output, iproto_connection_on_output,
> +		   con->output.fd, EV_WRITE);
> +}

3. What if sio_writev() has written only half of the bytes?
You didn't check and immediately proceeded to the normal
operation mode. But the greeting wasn't sent yet. Not all.

Also you don't propagate iov_base in case only a part of
the data was sent.

You can easily check it using a temporary error injection.
Before sio_writev() try to inject iov_len =
max(iov_len, IPROTO_GREETING_SIZE / 2) to force it to work
in 2 iterations, and you will see the issue.

Partial write is a rare thing, but it is totally real. It
is not some kind of an impossible thing only from the books.
Even if it is just 128 bytes.

> +
>  static struct iproto_connection *
>  iproto_connection_new(int fd)
>  {
> @@ -1938,50 +1980,64 @@ net_end_subscribe(struct cmsg *m)
>  }
>  
>  /**
> - * Handshake a connection: invoke the on-connect trigger
> - * and possibly authenticate. Try to send the client an error
> - * upon a failure.
> + * Handshake a connection: send greeting for it.
> + */
> +static void
> +iproto_process_connect(struct iproto_msg *msg)

4. Why do you take iproto_msg object if the only thing you do
with it is take the connection object? Wouldn't it be simpler
and more understandable, if it would be iproto_connection_start()
or something like that? With an iproto_connection * argument.

> +{
> +	struct iproto_connection *con = msg->connection;
> +	/*
> +	 * INSTANCE_UUID is guaranteed to be inited before this moment.
> +	 * We start listening either in local_recovery() or bootstrap().
> +	 * The INSTANCE_UUID is ensured to be inited in the beginning of
> +	 * both methods. In case of local_recovery() it is verified that
> +	 * INSTANCE_UUID was read from the snapshot in memtx_engine_new().
> +	 * In bootstrap() INSTANCE_UUID is either taken from the
> +	 * instance_uuid box.cfg{} param or created on the spot.
> +	 */
> +	random_bytes(con->salt, IPROTO_SALT_SIZE);
> +	greeting_encode(con->greeting_buf, tarantool_version_id(),
> +			&INSTANCE_UUID, con->salt, IPROTO_SALT_SIZE);
> +	assert(evio_has_fd(&con->output));
> +	ev_feed_event(con->loop, &con->output, EV_WRITE);
> +}