[tarantool-patches] Re: [PATCH] net.box: ignore absence of _vcollation

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu Jul 25 01:43:22 MSK 2019


Hi! Yeah, I remember that I should explain my vision of how that
task should be solved. See below a long-read.

Here are my explanations on how netbox works, and on how should this
issue be fixed (perhaps).

Netbox for each connection creates a worker fiber and a metadata object
keeping connect timeout, reconnect timeout, status, etc. The metadata
object is known as a netbox connection, and is the object with which a
user works. It is a result of 'netbox.connect()'. Worker fiber does real
work: establishes a tcp connection, sends/receives data.

But there appears a problem of cyclic references. The thing is that the
worker fiber needs some metadata from the netbox object. At least,
'reconnect_timeout'. On the other hand, the netbox object needs some
results from the worker fiber - responses on the requests, connection
status. In other words, both the worker fiber and the netbox object need
something from each other.

If the worker fiber and the netbox object would keep references to each
other, they would never be deleted - Lua GC is not that smart. So there is
a channel of communication between them, which references them both weakly.
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L834-L852

This function is called when a netbox object is created. It takes some
mandatory meta like host, port, and a so called 'callback'. It is a closure
defined here:
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L910-L954

The closure has access to the netbox object
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L909
and returns certain parameters of it. The worker fiber uses a weak reference
to this closure as a proxy to read metadata of the netbox object:
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L837-L840.

For example, when the worker fiber needs to obtain reconnect_timeout, it
calls the callback
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L443.
The callback here is a weak reference to the closure
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L948-L953.

And vice versa - when the worker fiber wants to return something back to
the netbox object, it calls the same callback, but with a different argument:
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L399.

How does that callback help to GC the connection? Lets assume, that there are
no references to a netbox object in user code. It is not referenced in the
worker fiber either, because the only reference is callback, and it is weak. So
it is deleted. Together with
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L844-L846.
It is a standard way of setting your own callbacks on GC of a table - you create a
dummy cdata object with a custom GC callback. On deletion of that gc-hook a
function
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L473-L481
is called. It cancels the worker fiber. Now both the netbox object and the worker
fiber are deleted.

Talking of this issue: your situation is simpler, as I see now, when I dived a
bit into the problem. You have nothing to do about the netbox object. Your target
is the worker fiber only. And in such a case I don't see any magic solution that
would shrink this patch https://github.com/tarantool/tarantool/commit/abc88ef8d38e21a6c7faf925a19105ce65d45467
to a couple of lines. The only obvious thing you can do here is to avoid passing
version id to iproto_auth_sm and iproto_schema_sm functions, and Alexander proposed.
Just make it a global value here
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L232
together with the others, if you still will need.

Your case is different from reconnect_timeout and other callback-stuff, because
these parameters are set by a user out of the worker fiber, and can be changed by
him. Callback allows the worker fiber to read actual values always. Version id
originates from the worker fiber, and so can be maintained and kept here.

Talking of how to remove branching - it has nothing to do with netbox specific code,
or GC, or anything application dependent. It is a pure technical matter. You have a
classic problem already solved in box/lua/upgrade.lua. The problem is that you have a
pipeline of versions, each adding, or removing, or changing something. And it is logical
to have a pipeline of functions building that schema step by step. IMO, you should have
several functions to fetch and to install schema.

Foe example:

function fetch_schema_2_2_1()
...
end

function fetch_schema_2_2_2()
   fetch_schema_2_2_1()
   -- + some v_collation things.
end

function fetch_schema_2_2_3()
   fetch_schema_2_2_3()
   --- + ...
end

And you have a global table where key is version id, value is a schema fetch
function. Here
https://github.com/tarantool/tarantool/commit/abc88ef8d38e21a6c7faf925a19105ce65d45467#diff-a9dd9be71f8af749f288abb6304a5185R733
you take a fetcher from the table like 'fetcher = fetchers[version_id]', and just
call it.

All these encode selects
https://github.com/tarantool/tarantool/commit/abc88ef8d38e21a6c7faf925a19105ce65d45467#diff-a9dd9be71f8af749f288abb6304a5185R754
should move to these schema fetchers. Of course, it wouldn't be that simple. At least,
because schema fetch requests are sent in parallel, as you can see here:
https://github.com/tarantool/tarantool/blob/0529e2f0ef7dd4c19283b9e8e6b0e7e43bf400a3/src/box/lua/net_box.lua#L743-L748
And still the idea is the same. Perhaps you will have not only fetcher for each
version, but two functions: start_fetch/end_fetch, or begin_fetch/end_fetch. A
first function sends a request (encode_select), a second one waits for a response.
Or let these fetchers return an asynchronous id, by which you will wait for
responses in one place. Don't know what to chose. If I were you, and I had time, I
would try to implement these step-wise fetchers. They would be quite easy to
support and to extend for other future schema changes.

Perhaps, it will remove the branching. Up to you what to do, to Alexander, and to
other managers. My only wish is to look at the final patch when it will be ready.




More information about the Tarantool-patches mailing list