[Tarantool-patches] [PATCH v9 3/3] iproto: move greeting from tx thread to iproto

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Mon Dec 28 14:21:43 MSK 2020


Hi! Thanks for the patch!

We are getting closer and closer to the final solution!

> diff --git a/src/box/iproto.cc b/src/box/iproto.cc
> index f7330af21..b92a0433b 100644
> --- a/src/box/iproto.cc
> +++ b/src/box/iproto.cc
> @@ -963,7 +971,7 @@ iproto_connection_on_input(ev_loop *loop, struct ev_io *watcher,
>  	 * otherwise we might deplete the fiber pool in tx
>  	 * thread and deadlock.
>  	 */
> -	if (iproto_check_msg_max()) {
> +	if (iproto_check_msg_max() && con->session != NULL) {

Ok, I see why it didn't close anything before. It didn't even try to read,
because my tests used up the entire net_msg limit, and only then closed the
socket. On the server side the close was not detected, because sio_read() was
never called.
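
For illustration, a minimal plain-POSIX sketch (not the sio wrappers) of that
point: a peer's close() becomes visible only when the server actually attempts
a read and gets 0 back. A descriptor that is never read from looks perfectly
alive.

#include <unistd.h>
#include <errno.h>

static void
handle_readable(int fd)
{
	char buf[4096];
	ssize_t n = read(fd, buf, sizeof(buf));
	if (n == 0) {
		/* EOF - the peer closed the connection. */
		close(fd);
	} else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
		/* A real error - also drop the connection. */
		close(fd);
	}
	/* n > 0 - normal input, parse it as usual. */
}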

It is better now, but it leads to another issue - you allow allocating an
unbounded number of network messages until the connection is established. The
net_msg max was here for a reason. This will eventually lead to OOM if the
connection count is big and the connections send a lot of messages, or a few
big messages. Because you will allocate net_msg objects and also consume ibuf
space and keep it until TX responds. Even if the connection is already closed.
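
To spell the hole out, here is a paraphrase of what the patched check in the
on_input handler effectively does (a simplification, not the exact source):

	if (iproto_check_msg_max() && con->session != NULL) {
		/* Input is stopped until TX frees some messages. */
		return;
	}
	/*
	 * Reached with any number of in-flight messages while the
	 * session is not created yet: another net_msg is allocated and
	 * more input is buffered into the ibuf, with nothing to bound
	 * the total.
	 */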

I tried to find a way to resolve this. For example, don't read anything
until TX acks the connection. But it has 2 issues: there is no cross-platform
way to find out if a socket is closed without calling read() on it and
consuming all pending data. And this way is basically an ugly version of the
next step of what to do with IProto - make TX control its behaviour and
introduce a proper state machine for TX.

With that said, it looks like this patch won't fix the issue with the current
approach. Or rather, it fixes the issue, but simply introduces another one,
no less severe, about memory instead of descriptors.

It seems you need to implement TX control over IProto now. Make TX able to
tell IProto that it should stop accepting new connections and stop sending
messages to TX.
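
A hedged sketch of how such a control message could look, built on the cbus
API already used by iproto.cc; the function names are hypothetical, and
net_pipe stands for the existing TX -> IProto pipe:

#include "cbus.h"

static struct cpipe net_pipe; /* TX -> IProto pipe, set up at startup. */

static void
net_stop_input(struct cmsg *m)
{
	(void)m;
	/*
	 * Runs in the IProto thread: stop the accept watcher and stop
	 * feeding new requests into the TX pipe. Pending output can
	 * still be flushed.
	 */
}

static const struct cmsg_hop stop_input_route[] = {
	{ net_stop_input, NULL },
};

/* Called from the TX thread. */
static void
tx_tell_iproto_stop(void)
{
	/*
	 * In a real patch the message must stay alive until the route
	 * completes; a static one is enough for a one-shot sketch.
	 */
	static struct cmsg msg;
	cmsg_init(&msg, stop_input_route);
	cpipe_push(&net_pipe, &msg);
}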

It should not even hurt replication, because if TX does not dispatch IProto
messages, that includes the service messages as well, and therefore there is
no difference whether you don't send them to TX at all, or send them and they
are not dispatched.


After writing the text above, I had a discussion with Mons to sync on what
to do here and in similar places, such as applier fibers running the entire
process out of memory when they read their sockets too aggressively (which
should persuade you that OOM on network messages is real).

For communication between a single pair of threads this can be solved with a
simple queue limit.
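
Roughly like this (a generic sketch, nothing Tarantool-specific; net_msg_max
plays this role in iproto, with 768 as the default):

#include <stdbool.h>

enum { REQUESTS_IN_FLIGHT_MAX = 768 };
static int requests_in_flight;

static bool
may_read_more(void)
{
	return requests_in_flight < REQUESTS_IN_FLIGHT_MAX;
}

static void
on_request_read(void)
{
	if (++requests_in_flight >= REQUESTS_IN_FLIGHT_MAX) {
		/* Disable the read event watcher - backpressure. */
	}
}

static void
on_response_sent(void)
{
	if (requests_in_flight-- == REQUESTS_IN_FLIGHT_MAX) {
		/* Re-arm the read event watcher. */
	}
}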

For our case we already have a queue limit - net_msg_max. But apparently it
is not enough. We need a proper state machine for the entire thread. You need
to find all the states the TX thread can be in, create a diagram, and
describe what it does in each state, and what it can't do. For example,
obviously, during local recovery the TX thread does not yield, and therefore
does not handle messages from other threads (because it does not return
control to the scheduler fiber) (although you had better validate this
statement, we may have missed something).
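
As a purely hypothetical first cut of such an enumeration (finding the real
set of states is exactly the task):

enum tx_state {
	/* Process started, box.cfg{} not called yet. */
	TX_STATE_UNCONFIGURED,
	/*
	 * Reading the snapshot and xlogs; no yields, no cross-thread
	 * message processing.
	 */
	TX_STATE_LOCAL_RECOVERY,
	/* Bootstrap / final join / subscribe against remote peers. */
	TX_STATE_REMOTE_RECOVERY,
	/* Normal operation, everything is allowed. */
	TX_STATE_RUNNING,
	/* Graceful shutdown: drain queues, refuse new work. */
	TX_STATE_SHUTDOWN,
};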

After we have the states, we can decide what to do in each state with the
other threads. For example, before starting local recovery we sync with the
IProto thread: we make it stop generating new messages, handle all pending
messages, and only then start recovery. After local recovery we let IProto
work normally, like now in the master branch.
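
In pseudocode the sequence could look like this (all names hypothetical,
everything runs in the TX thread):

static void
tx_local_recovery(void)
{
	/*
	 * 1. Tell IProto to stop producing messages and wait until it
	 * confirms, e.g. via cbus_call(), which blocks until the peer
	 * thread executes the message.
	 */
	tx_sync_iproto_stop();
	/* 2. Handle whatever IProto pushed before it stopped. */
	tx_drain_pending_messages();
	/* 3. Recover: the queue is now guaranteed to stay empty. */
	do_local_recovery();
	/* 4. Back to normal, like on the master branch today. */
	tx_sync_iproto_resume();
}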

Also you need to find out why listen is even started before local recovery.
Because it seems that if we have data to recover locally, it happens
regardless of replication settings. And that makes me wonder why the hell we
even start listen so early then? Anyway, during local recovery we can't
accept replication messages either, AFAIK (but that can be false).

It is worth checking if this commit was really needed:
https://github.com/tarantool/tarantool/commit/601bb92a173943185b9050dab4572dc3e6f8e9a1

It could be just that Kostja said we need to do that, and Vova did it,
because it was easier than arguing, but there was no real necessity in it.

It would be good to make the state machine for the TX thread global. Because
now it is inside the memtx engine, if I remember correctly. But it does not
really belong to the memtx engine. From what I understood after talking to
Mons, the states are also going to be used to run triggers installed by
users.

