* [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv @ 2020-02-14 14:03 Cyrill Gorcunov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov ` (4 more replies) 0 siblings, 5 replies; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw) To: tml In v10 I tried to sanitize code a bit. Still there are few problems we will need (well, I'll need) to solve: - move error injection deeper into transaction code - keep former error - make sure that we really need separated replicaset diag instance (i must confess i still not follow why we need it) Anyway with the patch we stop causing nil dereference and write error log. Same time since I'll have to deal with inmemory wal task I won't leave these FIXME and continue working on them. Cyrill Gorcunov (4): box/request: add missing OutOfMemory diag_set box/applier: add missing diag_set on region_alloc failure box/applier: prevent nil dereference on applier rollback test: add replication/applier-rollback src/box/applier.cc | 43 +++++- src/box/request.c | 8 +- src/lib/core/errinj.h | 1 + test/box/errinj.result | 1 + test/replication/applier-rollback-slave.lua | 16 ++ test/replication/applier-rollback.result | 160 ++++++++++++++++++++ test/replication/applier-rollback.test.lua | 79 ++++++++++ test/replication/suite.ini | 2 +- 8 files changed, 305 insertions(+), 5 deletions(-) create mode 100644 test/replication/applier-rollback-slave.lua create mode 100644 test/replication/applier-rollback.result create mode 100644 test/replication/applier-rollback.test.lua base-commit: 95b9a48daf00b6686b8e1993a023daa0755b460e -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov @ 2020-02-14 14:03 ` Cyrill Gorcunov 2020-02-15 17:33 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov ` (3 subsequent siblings) 4 siblings, 1 reply; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw) To: tml In request_create_from_tuple and request_handle_sequence we may be unable to request memory for tuples, don't forget to setup diag error otherwise diag_raise will lead to nil dereference. Acked-by: Sergey Ostanevich <sergos@tarantool.org> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/request.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/box/request.c b/src/box/request.c index 82232a155..994f2da62 100644 --- a/src/box/request.c +++ b/src/box/request.c @@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space, * the tuple data to WAL on commit. */ char *buf = region_alloc(&fiber()->gc, size); - if (buf == NULL) + if (buf == NULL) { + diag_set(OutOfMemory, size, "region_alloc", "tuple"); return -1; + } memcpy(buf, data, size); request->tuple = buf; request->tuple_end = buf + size; @@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space) size_t buf_size = (request->tuple_end - request->tuple) + mp_sizeof_uint(UINT64_MAX); char *tuple = region_alloc(&fiber()->gc, buf_size); - if (tuple == NULL) + if (tuple == NULL) { + diag_set(OutOfMemory, buf_size, "region_alloc", "tuple"); return -1; + } char *tuple_end = mp_encode_array(tuple, len); if (unlikely(key != data)) { -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov @ 2020-02-15 17:33 ` Konstantin Osipov 0 siblings, 0 replies; 13+ messages in thread From: Konstantin Osipov @ 2020-02-15 17:33 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tml * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > In request_create_from_tuple and request_handle_sequence > we may be unable to request memory for tuples, don't > forget to setup diag error otherwise diag_raise will > lead to nil dereference. > > Acked-by: Sergey Ostanevich <sergos@tarantool.org> > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> > --- > src/box/request.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > lgmt > diff --git a/src/box/request.c b/src/box/request.c > index 82232a155..994f2da62 100644 > --- a/src/box/request.c > +++ b/src/box/request.c > @@ -109,8 +109,10 @@ request_create_from_tuple(struct request *request, struct space *space, > * the tuple data to WAL on commit. > */ > char *buf = region_alloc(&fiber()->gc, size); > - if (buf == NULL) > + if (buf == NULL) { > + diag_set(OutOfMemory, size, "region_alloc", "tuple"); > return -1; > + } > memcpy(buf, data, size); > request->tuple = buf; > request->tuple_end = buf + size; > @@ -199,8 +201,10 @@ request_handle_sequence(struct request *request, struct space *space) > size_t buf_size = (request->tuple_end - request->tuple) + > mp_sizeof_uint(UINT64_MAX); > char *tuple = region_alloc(&fiber()->gc, buf_size); > - if (tuple == NULL) > + if (tuple == NULL) { > + diag_set(OutOfMemory, buf_size, "region_alloc", "tuple"); > return -1; > + } > char *tuple_end = mp_encode_array(tuple, len); > > if (unlikely(key != data)) { > -- > 2.20.1 -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov @ 2020-02-14 14:03 ` Cyrill Gorcunov 2020-02-15 17:34 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov ` (2 subsequent siblings) 4 siblings, 1 reply; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw) To: tml In case if we're hitting memory limit allocating triggers we should setup diag error to prevent nil dereference in diag_raise call (for example from applier_apply_tx). Note that there are region_alloc_xc helpers which are throwing errors but as far as I understand we need the rollback action to process first instead of immediate throw/catch thus we use diag_set. Acked-by: Sergey Ostanevich <sergos@tarantool.org> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/applier.cc | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index ae3d281a5..2ed5125d0 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -796,8 +796,11 @@ applier_apply_tx(struct stailq *rows) sizeof(struct trigger)); on_commit = (struct trigger *)region_alloc(&txn->region, sizeof(struct trigger)); - if (on_rollback == NULL || on_commit == NULL) + if (on_rollback == NULL || on_commit == NULL) { + diag_set(OutOfMemory, sizeof(struct trigger), + "region_alloc", "on_rollback/on_commit"); goto rollback; + } trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL); txn_on_rollback(txn, on_rollback); -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov @ 2020-02-15 17:34 ` Konstantin Osipov 0 siblings, 0 replies; 13+ messages in thread From: Konstantin Osipov @ 2020-02-15 17:34 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tml * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > In case if we're hitting memory limit allocating triggers > we should setup diag error to prevent nil dereference > in diag_raise call (for example from applier_apply_tx). > > Note that there are region_alloc_xc helpers which are > throwing errors but as far as I understand we need the > rollback action to process first instead of immediate > throw/catch thus we use diag_set. > > Acked-by: Sergey Ostanevich <sergos@tarantool.org> > Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> lgtm -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov @ 2020-02-14 14:03 ` Cyrill Gorcunov 2020-02-15 17:37 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov 2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov 4 siblings, 1 reply; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw) To: tml Currently when transaction rollback happens we just drop an existing error setting ClientError to the replicaset.applier.diag. This action leaves current fiber with diag=nil, which in turn leads to sigsegv once diag_raise() called right after applier_apply_tx(): | applier_f | try { | applier_subscribe | applier_apply_tx | // error happens | txn_rollback | diag_set(ClientError, ER_WAL_IO) | diag_move(&fiber()->diag, &replicaset.applier.diag) | // fiber->diag = nil | applier_on_rollback | diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag) | fiber_cancel(applier->reader); | diag_raise() -> NULL dereference | } catch { ... } The applier_f works in try/catch cycle and handles errors depending on what exactly happened during transaction application. It might reconnect appliers in some cases, the applier is simply cancelled and reaped out in others. The problem is that the shared replicaset.applier.diag is handled on FiberIsCancelled exception only (while it is set inside transaction rollback action) and we never trigger this specific exception. But even if we would the former error which has been causing the applier abort is vanished by ClientError which is too general. Thus: - log the former error, but leave ClientError as a new one to be preserved in replicaset diag instance (I don't want to make an intrusive patch which would change a logic since I'm far from being expert in this code); - put fixme mark into the code: we need to rework it in a more sense way. Part-of #4730 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/applier.cc | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/src/box/applier.cc b/src/box/applier.cc index 2ed5125d0..a4a73d383 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -692,9 +692,33 @@ static int applier_txn_rollback_cb(struct trigger *trigger, void *event) { (void) trigger; + /* + * FIXME: Do not clear fiber()->diag since it + * cause nil dereference + * + * applier_subscribe + * applier_apply_tx + * diag_raise + * + * In turn we need to redesign this code: + * - preserve original error or log it somewhere + * - make the error path more clear + * + * We must never reach this point with clean diag + * area, if we do it means we're simply screwed + * somewhere and there is a bug. + */ + + if (!diag_is_empty(diag_get())) + diag_log(); + else + say_warn_ratelimited("applier_txn_rollback_cb: empty diag"); + /* Setup shared applier diagnostic area. */ diag_set(ClientError, ER_WAL_IO); - diag_move(&fiber()->diag, &replicaset.applier.diag); + diag_add_error(&replicaset.applier.diag, + diag_last_error(&fiber()->diag)); + /* Broadcast the rollback event across all appliers. */ trigger_run(&replicaset.applier.on_rollback, event); /* Rollback applier vclock to the committed one. */ -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov @ 2020-02-15 17:37 ` Konstantin Osipov 2020-02-15 18:24 ` Cyrill Gorcunov 0 siblings, 1 reply; 13+ messages in thread From: Konstantin Osipov @ 2020-02-15 17:37 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tml * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > + /* > + * FIXME: Do not clear fiber()->diag since it > + * cause nil dereference > + * > + * applier_subscribe > + * applier_apply_tx > + * diag_raise > + * > + * In turn we need to redesign this code: > + * - preserve original error or log it somewhere > + * - make the error path more clear > + * > + * We must never reach this point with clean diag > + * area, if we do it means we're simply screwed > + * somewhere and there is a bug. I think this comment is obsolete now with the fix below. You no longer clear fiber->diag. > + */ > + > + if (!diag_is_empty(diag_get())) > + diag_log(); > + else > + say_warn_ratelimited("applier_txn_rollback_cb: empty diag"); You can also add assert here for debug mode. It should never happen. > + > /* Setup shared applier diagnostic area. */ > diag_set(ClientError, ER_WAL_IO); > - diag_move(&fiber()->diag, &replicaset.applier.diag); > + diag_add_error(&replicaset.applier.diag, > + diag_last_error(&fiber()->diag)); > + It would be nice to explain in the comment why you want to preserve the original error in the fiber here: because when later this fiber is joined in (add call site here), we may want to check its diagnostics area. > /* Broadcast the rollback event across all appliers. */ > trigger_run(&replicaset.applier.on_rollback, event); > /* Rollback applier vclock to the committed one. */ -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback 2020-02-15 17:37 ` Konstantin Osipov @ 2020-02-15 18:24 ` Cyrill Gorcunov 2020-02-15 18:51 ` Konstantin Osipov 0 siblings, 1 reply; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-15 18:24 UTC (permalink / raw) To: Konstantin Osipov; +Cc: tml On Sat, Feb 15, 2020 at 08:37:41PM +0300, Konstantin Osipov wrote: > * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > > + /* > > + * FIXME: Do not clear fiber()->diag since it > > + * cause nil dereference > > + * > > + * applier_subscribe > > + * applier_apply_tx > > + * diag_raise > > + * > > + * In turn we need to redesign this code: > > + * - preserve original error or log it somewhere > > + * - make the error path more clear > > + * > > + * We must never reach this point with clean diag > > + * area, if we do it means we're simply screwed > > + * somewhere and there is a bug. > > I think this comment is obsolete now with the fix below. The thing is that I saw once the error path which may end up here without setting diag (well, to be honest i can't plot a call graph right now but it was related to our "light" salad helpers and [probably] bptree). That's why I don't wanted to put assert. But if you prefer sure, I may update the comment and put assert. > > You no longer clear fiber->diag. > > > + */ > > + > > + if (!diag_is_empty(diag_get())) > > + diag_log(); > > + else > > + say_warn_ratelimited("applier_txn_rollback_cb: empty diag"); > > You can also add assert here for debug mode. It should never > happen. > > > + > > /* Setup shared applier diagnostic area. */ > > diag_set(ClientError, ER_WAL_IO); > > - diag_move(&fiber()->diag, &replicaset.applier.diag); > > + diag_add_error(&replicaset.applier.diag, > > + diag_last_error(&fiber()->diag)); > > + > > It would be nice to explain in the comment why you want to > preserve the original error in the fiber here: because when later > this fiber is joined in (add call site here), we may want to check > its diagnostics area. You know I think we might need to drop replicaset.diag completely and use applier.diag instead. But again, Kostya, I don't fully understand this code yet, so I need more time to come with some clean solution (which I plan to implement on the top). > > > /* Broadcast the rollback event across all appliers. */ > > trigger_run(&replicaset.applier.on_rollback, event); > > /* Rollback applier vclock to the committed one. */ > > -- > Konstantin Osipov, Moscow, Russia > Cyrill ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback 2020-02-15 18:24 ` Cyrill Gorcunov @ 2020-02-15 18:51 ` Konstantin Osipov 0 siblings, 0 replies; 13+ messages in thread From: Konstantin Osipov @ 2020-02-15 18:51 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tml * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/15 21:26]: > > It would be nice to explain in the comment why you want to > > preserve the original error in the fiber here: because when later > > this fiber is joined in (add call site here), we may want to check > > its diagnostics area. > > You know I think we might need to drop replicaset.diag completely > and use applier.diag instead. I don't mind, but that could be too intrusive for a bug fix. I still don't see why you can't patch the crash site - i.e. the place which joins the fiber with broken diag? Let me remind you that this is just a regression from the parallel applier patch, and not a big one - just a coding bug. So let's fix the coding bug first and refactor the patch second. > understand this code yet, so I need more time to come with some > clean solution (which I plan to implement on the top). > > > > > > /* Broadcast the rollback event across all appliers. */ > > > trigger_run(&replicaset.applier.on_rollback, event); > > > /* Rollback applier vclock to the committed one. */ -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 13+ messages in thread
* [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov ` (2 preceding siblings ...) 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov @ 2020-02-14 14:03 ` Cyrill Gorcunov 2020-02-15 17:38 ` Konstantin Osipov 2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov 4 siblings, 1 reply; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:03 UTC (permalink / raw) To: tml In the test force error injection ERRINJ_REPLICA_TXN_WRITE to happen which will initiate applier transaction rollback. Without the fix it will cause SIGSEGV due to lack of error propagation. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> --- src/box/applier.cc | 12 ++ src/lib/core/errinj.h | 1 + test/box/errinj.result | 1 + test/replication/applier-rollback-slave.lua | 16 ++ test/replication/applier-rollback.result | 160 ++++++++++++++++++++ test/replication/applier-rollback.test.lua | 79 ++++++++++ test/replication/suite.ini | 2 +- 7 files changed, 270 insertions(+), 1 deletion(-) create mode 100644 test/replication/applier-rollback-slave.lua create mode 100644 test/replication/applier-rollback.result create mode 100644 test/replication/applier-rollback.test.lua diff --git a/src/box/applier.cc b/src/box/applier.cc index a4a73d383..574d086e0 100644 --- a/src/box/applier.cc +++ b/src/box/applier.cc @@ -51,6 +51,7 @@ #include "txn.h" #include "box.h" #include "scoped_guard.h" +#include "errinj.h" STRS(applier_state, applier_STATE); @@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows) trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL); txn_on_commit(txn, on_commit); + /* + * FIXME: Move this injection somewhere inside + * txn_write, but since it has own bug (gh 4776) + * we will cure it a bit later. + */ + ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, { + diag_set(ClientError, ER_INJECTION, + "replica txn write injection"); + goto rollback; + }); + if (txn_write(txn) < 0) goto fail; diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h index 672da2119..da6adfe97 100644 --- a/src/lib/core/errinj.h +++ b/src/lib/core/errinj.h @@ -135,6 +135,7 @@ struct errinj { _(ERRINJ_COIO_SENDFILE_CHUNK, ERRINJ_INT, {.iparam = -1}) \ _(ERRINJ_SWIM_FD_ONLY, ERRINJ_BOOL, {.bparam = false}) \ _(ERRINJ_DYN_MODULE_COUNT, ERRINJ_INT, {.iparam = 0}) \ + _(ERRINJ_REPLICA_TXN_WRITE, ERRINJ_BOOL, {.bparam = false}) \ ENUM0(errinj_id, ERRINJ_LIST); extern struct errinj errinjs[]; diff --git a/test/box/errinj.result b/test/box/errinj.result index f043c6689..60db28938 100644 --- a/test/box/errinj.result +++ b/test/box/errinj.result @@ -63,6 +63,7 @@ evals - ERRINJ_RELAY_SEND_DELAY: false - ERRINJ_RELAY_TIMEOUT: 0 - ERRINJ_REPLICA_JOIN_DELAY: false + - ERRINJ_REPLICA_TXN_WRITE: false - ERRINJ_SIO_READ_MAX: -1 - ERRINJ_SNAP_COMMIT_DELAY: false - ERRINJ_SNAP_WRITE_DELAY: false diff --git a/test/replication/applier-rollback-slave.lua b/test/replication/applier-rollback-slave.lua new file mode 100644 index 000000000..26fb10055 --- /dev/null +++ b/test/replication/applier-rollback-slave.lua @@ -0,0 +1,16 @@ +-- +-- vim: ts=4 sw=4 et +-- + +print('arg', arg) + +box.cfg({ + replication = os.getenv("MASTER"), + listen = os.getenv("LISTEN"), + memtx_memory = 107374182, + replication_timeout = 0.1, + replication_connect_timeout = 0.5, + read_only = true, +}) + +require('console').listen(os.getenv('ADMIN')) diff --git a/test/replication/applier-rollback.result b/test/replication/applier-rollback.result new file mode 100644 index 000000000..3209fc7fd --- /dev/null +++ b/test/replication/applier-rollback.result @@ -0,0 +1,160 @@ +-- test-run result file version 2 +#!/usr/bin/env tarantool + | --- + | ... +-- +-- vim: ts=4 sw=4 et +-- + +test_run = require('test_run').new() + | --- + | ... + +errinj = box.error.injection + | --- + | ... +engine = test_run:get_cfg('engine') + | --- + | ... + +-- +-- Allow replica to connect to us +box.schema.user.grant('guest', 'replication') + | --- + | ... + +-- +-- Create replica instance, we're the master and +-- start it, no data to sync yet though +test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'") + | --- + | - true + | ... +test_run:cmd("start server replica_slave") + | --- + | - true + | ... + +-- +-- Fill initial data on the master instance +test_run:cmd('switch default') + | --- + | - true + | ... + +_ = box.schema.space.create('test', {engine=engine}) + | --- + | ... +s = box.space.test + | --- + | ... + +s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}}) + | --- + | ... + +_ = s:create_index('primary', {type = 'tree', parts = {'id'}}) + | --- + | ... +s:insert({1, '1'}) + | --- + | - [1, '1'] + | ... +s:insert({2, '2'}) + | --- + | - [2, '2'] + | ... +s:insert({3, '3'}) + | --- + | - [3, '3'] + | ... + +-- +-- To make sure we're running +box.info.status + | --- + | - running + | ... + +-- +-- Wait for data from master get propagated +test_run:wait_lsn('replica_slave', 'default') + | --- + | ... + +-- +-- Now inject error into slave instance +test_run:cmd('switch replica_slave') + | --- + | - true + | ... + +-- +-- To make sure we're running +box.info.status + | --- + | - running + | ... + +errinj = box.error.injection + | --- + | ... +errinj.set('ERRINJ_REPLICA_TXN_WRITE', true) + | --- + | - ok + | ... + +-- +-- Jump back to master node and write new +-- entry which should cause error to happen +-- on slave instance +test_run:cmd('switch default') + | --- + | - true + | ... +s:insert({4, '4'}) + | --- + | - [4, '4'] + | ... + +-- +-- Wait for error to trigger +test_run:cmd('switch replica_slave') + | --- + | - true + | ... +fiber = require('fiber') + | --- + | ... +while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end + | --- + | ... + +---- +---- Such error cause the applier to be +---- cancelled and reaped, thus stop the +---- slave node and cleanup +test_run:cmd('switch default') + | --- + | - true + | ... + +-- +-- Cleanup +test_run:cmd("stop server replica_slave") + | --- + | - true + | ... +test_run:cmd("delete server replica_slave") + | --- + | - true + | ... +box.cfg{replication=""} + | --- + | ... +box.space.test:drop() + | --- + | ... +box.schema.user.revoke('guest', 'replication') + | --- + | ... diff --git a/test/replication/applier-rollback.test.lua b/test/replication/applier-rollback.test.lua new file mode 100644 index 000000000..d31eff9f0 --- /dev/null +++ b/test/replication/applier-rollback.test.lua @@ -0,0 +1,79 @@ +#!/usr/bin/env tarantool +-- +-- vim: ts=4 sw=4 et +-- + +test_run = require('test_run').new() + +errinj = box.error.injection +engine = test_run:get_cfg('engine') + +-- +-- Allow replica to connect to us +box.schema.user.grant('guest', 'replication') + +-- +-- Create replica instance, we're the master and +-- start it, no data to sync yet though +test_run:cmd("create server replica_slave with rpl_master=default, script='replication/applier-rollback-slave.lua'") +test_run:cmd("start server replica_slave") + +-- +-- Fill initial data on the master instance +test_run:cmd('switch default') + +_ = box.schema.space.create('test', {engine=engine}) +s = box.space.test + +s:format({{name = 'id', type = 'unsigned'}, {name = 'band_name', type = 'string'}}) + +_ = s:create_index('primary', {type = 'tree', parts = {'id'}}) +s:insert({1, '1'}) +s:insert({2, '2'}) +s:insert({3, '3'}) + +-- +-- To make sure we're running +box.info.status + +-- +-- Wait for data from master get propagated +test_run:wait_lsn('replica_slave', 'default') + +-- +-- Now inject error into slave instance +test_run:cmd('switch replica_slave') + +-- +-- To make sure we're running +box.info.status + +errinj = box.error.injection +errinj.set('ERRINJ_REPLICA_TXN_WRITE', true) + +-- +-- Jump back to master node and write new +-- entry which should cause error to happen +-- on slave instance +test_run:cmd('switch default') +s:insert({4, '4'}) + +-- +-- Wait for error to trigger +test_run:cmd('switch replica_slave') +fiber = require('fiber') +while test_run:grep_log('replica_slave', 'ER_INJECTION:[^\n]*') == nil do fiber.sleep(0.1) end + +---- +---- Such error cause the applier to be +---- cancelled and reaped, thus stop the +---- slave node and cleanup +test_run:cmd('switch default') + +-- +-- Cleanup +test_run:cmd("stop server replica_slave") +test_run:cmd("delete server replica_slave") +box.cfg{replication=""} +box.space.test:drop() +box.schema.user.revoke('guest', 'replication') diff --git a/test/replication/suite.ini b/test/replication/suite.ini index ed1de3140..b804b85f6 100644 --- a/test/replication/suite.ini +++ b/test/replication/suite.ini @@ -3,7 +3,7 @@ core = tarantool script = master.lua description = tarantool/box, replication disabled = consistent.test.lua -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua applier-rollback.test.lua config = suite.cfg lua_libs = lua/fast_replica.lua lua/rlimit.lua use_unix_sockets = True -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov @ 2020-02-15 17:38 ` Konstantin Osipov 2020-02-15 17:59 ` Cyrill Gorcunov 0 siblings, 1 reply; 13+ messages in thread From: Konstantin Osipov @ 2020-02-15 17:38 UTC (permalink / raw) To: Cyrill Gorcunov; +Cc: tml * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > STRS(applier_state, applier_STATE); > > @@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows) > trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL); > txn_on_commit(txn, on_commit); > > + /* > + * FIXME: Move this injection somewhere inside > + * txn_write, but since it has own bug (gh 4776) > + * we will cure it a bit later. > + */ > + ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, { > + diag_set(ClientError, ER_INJECTION, > + "replica txn write injection"); > + goto rollback; > + }); I thought yo wrote in the changes in v10 that you pushed the error down into txn layer. But here you don't do it. What am I missing? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback 2020-02-15 17:38 ` Konstantin Osipov @ 2020-02-15 17:59 ` Cyrill Gorcunov 0 siblings, 0 replies; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-15 17:59 UTC (permalink / raw) To: Konstantin Osipov; +Cc: tml On Sat, Feb 15, 2020 at 08:38:51PM +0300, Konstantin Osipov wrote: > * Cyrill Gorcunov <gorcunov@gmail.com> [20/02/14 17:06]: > > STRS(applier_state, applier_STATE); > > > > @@ -832,6 +833,17 @@ applier_apply_tx(struct stailq *rows) > > trigger_create(on_commit, applier_txn_commit_cb, NULL, NULL); > > txn_on_commit(txn, on_commit); > > > > + /* > > + * FIXME: Move this injection somewhere inside > > + * txn_write, but since it has own bug (gh 4776) > > + * we will cure it a bit later. > > + */ > > + ERROR_INJECT(ERRINJ_REPLICA_TXN_WRITE, { > > + diag_set(ClientError, ER_INJECTION, > > + "replica txn write injection"); > > + goto rollback; > > + }); > > I thought yo wrote in the changes in v10 that you pushed the error > down into txn layer. But here you don't do it. What am I missing? Not yet. Pushing it sown cause https://github.com/tarantool/tarantool/issues/4776 I plan to fix it and then get back to this issue. Look this all code is still a bit vague for me, so I've to understand every byte before doing some more deep changes. For exactly this reason the patch in the series does NOT change the current logic of code flow much but tries to leave everything as it were except a hot fix for nil dereference. That said, Kostya, I remember about your proposal and FIXME left here for exactly that. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov ` (3 preceding siblings ...) 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov @ 2020-02-14 14:12 ` Cyrill Gorcunov 4 siblings, 0 replies; 13+ messages in thread From: Cyrill Gorcunov @ 2020-02-14 14:12 UTC (permalink / raw) To: tml On Fri, Feb 14, 2020 at 05:03:35PM +0300, Cyrill Gorcunov wrote: > In v10 I tried to sanitize code a bit. Still there are few problems we will > need (well, I'll need) to solve: https://github.com/tarantool/tarantool/issues/4730 https://github.com/tarantool/tarantool/tree/gorcunov/gh-4730-diag-raise-master-10 ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2020-02-15 18:51 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-02-14 14:03 [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 1/4] box/request: add missing OutOfMemory diag_set Cyrill Gorcunov 2020-02-15 17:33 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 2/4] box/applier: add missing diag_set on region_alloc failure Cyrill Gorcunov 2020-02-15 17:34 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback Cyrill Gorcunov 2020-02-15 17:37 ` Konstantin Osipov 2020-02-15 18:24 ` Cyrill Gorcunov 2020-02-15 18:51 ` Konstantin Osipov 2020-02-14 14:03 ` [Tarantool-patches] [PATCH v10 4/4] test: add replication/applier-rollback Cyrill Gorcunov 2020-02-15 17:38 ` Konstantin Osipov 2020-02-15 17:59 ` Cyrill Gorcunov 2020-02-14 14:12 ` [Tarantool-patches] [PATCH v10 0/4] box/replication: add missing diag set and fix sigsegv Cyrill Gorcunov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox