[patches] [PATCH 1/1] iproto: get iproto obuf only right before usage

Vladimir Davydov vdavydov.dev at gmail.com
Fri Mar 16 16:51:03 MSK 2018


On Thu, Mar 15, 2018 at 12:31:08AM +0300, Vladislav Shpilevoy wrote:
> Branch: https://github.com/tarantool/tarantool/tree/fix-iproto-buffers
> There is no issue, since the fix is simple, and does not affect
> external behavior.

Please add the branch name as well as everything you want to say but
don't want to be committed after the triple dash (---).

> 
> It is possible to discard non-sent responses using a special
> sequence of requests and yields. In details: if DML requests
> yield on commit too long, there are fast read requests, and a
> network is saturated, then some non-sent DML responses are
> discarded.
> 
> Signed-off-by: Vladislav Shpilevoy <v.shpilevoy at tarantool.org>
> ---
>  src/box/iproto.cc        |  14 +++--
>  src/errinj.h             |   1 +
>  test/box/errinj.result   | 129 +++++++++++++++++++++++++++++++++++++++++++++++
>  test/box/errinj.test.lua |  60 ++++++++++++++++++++++
>  4 files changed, 200 insertions(+), 4 deletions(-)
> 

> diff --git a/test/box/errinj.test.lua b/test/box/errinj.test.lua
> index 9bcf53847..666d1e790 100644
> --- a/test/box/errinj.test.lua
> +++ b/test/box/errinj.test.lua
> @@ -348,3 +348,63 @@ box.error.injection.set('ERRINJ_BUILD_SECONDARY', sk.id)
>  sk:alter({parts = {2, 'number'}})
>  box.error.injection.set('ERRINJ_BUILD_SECONDARY', -1)
>  s:drop()
> +
> +--
> +-- Check that IProto commands get obuf right before sending it to
> +-- IProto thread. The test was failed with discarding an response
> +-- before its sending to a client. A plan is the following:
> +--
> +--        obuf[0]                           obuf[1]
> +-- 1) start DML, yield;
> +-- 2) do DQL, store in obuf,
> +--    block on socket write;
> +--                                      3) do DQL, store in obuf,
> +--                                         block on socket write;
> +--                                      4) start DML, yield;
> +--                                      5) do DQL, store in obuf,
> +--                                         block on socket write;
> +--              6) flush all DQL responses,
> +--                 DML still yields in both
> +--                 obufs;
> +--
> +--              7) both DML finishes, and
> +--                 are dumped into both
> +--                 obuf[0] and obuf[1], but
> +--                 block on socket write
> +--                 again;
> +--
> +-- 8) Now TX thread owns obuf[1] and thinks, that obuf[0] is fully
> +--    flushed, and on a next DQL it resets obuf[0], discarding all
> +--    its data, including non-sent DML response.
> +--
> +--
> +box.schema.user.grant('guest', 'read,write,execute', 'universe')
> +s = box.schema.create_space('test')
> +pk = s:create_index('pk')
> +s:replace{1}
> +c = net_box.connect(box.cfg.listen)
> +errinj.set("ERRINJ_WAL_DELAY", true)
> +errinj.set("ERRINJ_IPROTO_FLUSH_DELAY", true)
> +long_dml1_f = fiber.create(function() c.space.test:replace{100} end)
> +fiber.sleep(0.05)
> +long_get1_f = fiber.create(function() c.space.test:get{1} end)
> +fiber.sleep(0.05)
> +long_get2_f = fiber.create(function() c.space.test:get{1} end)
> +fiber.sleep(0.05)
> +long_dml2_f = fiber.create(function() c.space.test:replace{200} end)
> +fiber.sleep(0.05)
> +long_get3_f = fiber.create(function() c.space.test:get{1} end)
> +fiber.sleep(0.05)
> +errinj.set("ERRINJ_IPROTO_FLUSH_DELAY", false)
> +while long_get1_f:status() ~= 'dead' or long_get2_f:status() ~= 'dead' or long_get3_f:status() ~= 'dead' do fiber.sleep(0.1) end
> +errinj.set("ERRINJ_IPROTO_FLUSH_DELAY", true)
> +errinj.set("ERRINJ_WAL_DELAY", false)
> +fiber.sleep(0.05)
> +long_get1_f = fiber.create(function() c.space.test:get{1} end)
> +fiber.sleep(0.05)
> +errinj.set("ERRINJ_IPROTO_FLUSH_DELAY", false)
> +fiber.sleep(0.05)
> +long_dml1_f:status() == 'dead' and long_dml2_f:status() == 'dead'
> +c:close()
> +s:drop()
> +box.schema.user.revoke('guest', 'read,write,execute', 'universe')

The patch is fine, but I don't like the test: the problem is it too
heavily relies on iproto internals so should we change anything in
iproto (e.g. the way buffers are rotated), it would become meaningless.

What we want to check here is that a client doesn't get stuck if a
request (DQL or DML) takes longer than usual. We could try to inject
random delays in tx_process1() and tx_process_select() while processing
concurrent requests and check that no fiber stalls forever. Such a test
would be close to real use cases and hence would remain useful even if
we completely rewrote iproto.

I tried to do that and got a test (see below) that hangs or crashes
without your patch and passes with your patch and seems to be much
easier for understanding. Please rewrite your test in your similar
fashion.

diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index a75127a9..329e769b 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -1175,6 +1175,8 @@ tx_process1(struct cmsg *m)
 
 	struct tuple *tuple;
 	struct obuf_svp svp;
+	if (rand() % 100 < 10)
+		fiber_sleep(0.001);
 	if (box_process1(&msg->dml, &tuple) ||
 	    iproto_prepare_select(out, &svp))
 		goto error;
@@ -1204,6 +1206,8 @@ tx_process_select(struct cmsg *m)
 	if (tx_check_schema(msg->header.schema_version))
 		goto error;
 
+	if (rand() % 100 < 10)
+		fiber_sleep(0.001);
 	rc = box_select(req->space_id, req->index_id,
 			req->iterator, req->offset, req->limit,
 			req->key, req->key_end, &port);
diff --git a/test/box/errinj.test.lua b/test/box/errinj.test.lua
index 9bcf5384..12fded9b 100644
--- a/test/box/errinj.test.lua
+++ b/test/box/errinj.test.lua
@@ -348,3 +348,17 @@ box.error.injection.set('ERRINJ_BUILD_SECONDARY', sk.id)
 sk:alter({parts = {2, 'number'}})
 box.error.injection.set('ERRINJ_BUILD_SECONDARY', -1)
 s:drop()
+
+box.schema.user.grant('guest', 'read,write,execute', 'universe')
+s = box.schema.space.create('test')
+_ = s:create_index('pk')
+
+c = net_box.connect(box.cfg.listen)
+
+ch = fiber.channel(200)
+for i = 1, 100 do fiber.create(function() for j = 1, 10 do c.space.test:replace{1} end ch:put(true) end) end
+for i = 1, 100 do fiber.create(function() for j = 1, 10 do c.space.test:select() end ch:put(true) end) end
+for i = 1, 200 do ch:get() end
+
+s:drop()
+box.schema.user.revoke('guest', 'read,write,execute', 'universe')



More information about the Tarantool-patches mailing list