Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 1/1] json: use cord_ibuf for encoding and decoding
@ 2021-05-23 14:06 Vladislav Shpilevoy via Tarantool-patches
  2021-05-24 10:04 ` Oleg Babin via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-23 14:06 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko, olegrok

Lua json module used to have a global buffer for all encodings. It
was reused by each next encode().

This was not correct, because during encode() might happen a GC
step, which might call encode() again and spoil the global buffer.

The same problem was already fixed for the global static buffer in
scope of #5632. Similarly to that time, the patch makes Lua json
module use cord_ibuf to prevent "concurrent" usage of the buffer
data. The global buffer is deleted.

According to a few microbenchmarks it didn't affect the perf
anyhow.

Core part of the patch is strbuf changes. Firstly, its destruction
is now optional, cord_ibuf can free itself on a next yield.
Secondly, its reallocation algorithm is kept intact - ibuf is used
as an allocator, not as the buffer itself. This is done so as not
to be too intrusive in the third party module which might need an
upgrade to the upstream in the future.

Closes #6050
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6050-lua-json-static-buf-gc
Issue: https://github.com/tarantool/tarantool/issues/6050

 .../unreleased/gh-6050-gc-global-buf-json.md  |  5 ++
 ...lua => gh-5632-6050-gc-buf-reuse.test.lua} | 61 ++++++++++++++--
 test/app-tap/json.test.lua                    | 14 +++-
 third_party/lua-cjson/lua_cjson.c             | 38 ++++++----
 third_party/lua-cjson/strbuf.c                | 70 +++++--------------
 third_party/lua-cjson/strbuf.h                | 16 ++---
 6 files changed, 121 insertions(+), 83 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6050-gc-global-buf-json.md
 rename test/app-tap/{gh-5632-gc-buf-reuse.test.lua => gh-5632-6050-gc-buf-reuse.test.lua} (70%)

diff --git a/changelogs/unreleased/gh-6050-gc-global-buf-json.md b/changelogs/unreleased/gh-6050-gc-global-buf-json.md
new file mode 100644
index 000000000..f214ea230
--- /dev/null
+++ b/changelogs/unreleased/gh-6050-gc-global-buf-json.md
@@ -0,0 +1,5 @@
+## bugfix/core
+
+* Fixed invalid results produced by `json` module's `encode` function when it
+  was used from Lua's garbage collector. For instance, in functions used as
+  `ffi.gc()` (gh-6050).
diff --git a/test/app-tap/gh-5632-gc-buf-reuse.test.lua b/test/app-tap/gh-5632-6050-gc-buf-reuse.test.lua
similarity index 70%
rename from test/app-tap/gh-5632-gc-buf-reuse.test.lua
rename to test/app-tap/gh-5632-6050-gc-buf-reuse.test.lua
index 6efddb714..bf7590a14 100755
--- a/test/app-tap/gh-5632-gc-buf-reuse.test.lua
+++ b/test/app-tap/gh-5632-6050-gc-buf-reuse.test.lua
@@ -1,16 +1,17 @@
 #!/usr/bin/env tarantool
 
 --
--- gh-5632: Lua code should not use any global buffers or objects without
--- proper ownership protection. Otherwise these items might be suddenly reused
--- during Lua GC which happens almost at any moment. That might lead to data
--- corruption.
+-- gh-5632, gh-6050: Lua code should not use any global buffers or objects
+-- without proper ownership protection. Otherwise these items might be suddenly
+-- reused during Lua GC which happens almost at any moment. That might lead to
+-- data corruption.
 --
 
 local tap = require('tap')
 local ffi = require('ffi')
 local uuid = require('uuid')
 local uri = require('uri')
+local json = require('json')
 local msgpackffi = require('msgpackffi')
 
 local function test_uuid(test)
@@ -142,10 +143,58 @@ local function test_msgpackffi(test)
     test:ok(is_success, 'msgpackffi in gc')
 end
 
-local test = tap.test('gh-5632-gc-buf-reuse')
-test:plan(3)
+local function test_json(test)
+    test:plan(1)
+
+    local encode = json.encode
+    local decode = json.decode
+    local gc_count = 100
+    local iter_count = 1000
+    local is_success = true
+    local data1 = {1, 2, 3, 4, 5}
+    local data2 = {6, 7, 8, 9, 10}
+
+    local function do_encode(data)
+        if not is_success then
+            return
+        end
+        local t = encode(data)
+        t = decode(t)
+        if #t ~= #data then
+            is_success = false
+            return
+        end
+        for i = 1, #t do
+            if t[i] ~= data[i] then
+                is_success = false
+                return
+            end
+        end
+    end
+
+    local function gc_encode()
+        return do_encode(data1)
+    end
+
+    local function create_gc()
+        for _ = 1, gc_count do
+            ffi.gc(ffi.new('char[1]'), gc_encode)
+        end
+    end
+
+    for _ = 1, iter_count do
+        create_gc()
+        do_encode(data2)
+    end
+
+   test:ok(is_success, 'json in gc')
+end
+
+local test = tap.test('gh-5632-6050-gc-buf-reuse')
+test:plan(4)
 test:test('uuid in __gc', test_uuid)
 test:test('uri in __gc', test_uri)
 test:test('msgpackffi in __gc', test_msgpackffi)
+test:test('json in __gc', test_json)
 
 os.exit(test:check() and 0 or 1)
diff --git a/test/app-tap/json.test.lua b/test/app-tap/json.test.lua
index be60e45c9..12a37d3ec 100755
--- a/test/app-tap/json.test.lua
+++ b/test/app-tap/json.test.lua
@@ -21,7 +21,7 @@ end
 
 tap.test("json", function(test)
     local serializer = require('json')
-    test:plan(57)
+    test:plan(58)
 
     test:test("unsigned", common.test_unsigned, serializer)
     test:test("signed", common.test_signed, serializer)
@@ -201,4 +201,16 @@ tap.test("json", function(test)
     _, err_msg = pcall(serializer.decode, '{"a": {a = {}}}')
     test:ok(string.find(err_msg, '{"a":  >> {a = {}}') ~= nil, 'context #6')
 
+    --
+    -- Create a big JSON string to ensure the string builder works fine with
+    -- internal reallocs.
+    --
+    local bigstr = string.rep('a', 16384)
+    local t = {}
+    for i = 1, 10 do
+        table.insert(t, bigstr)
+    end
+    local bigjson = serializer.encode(t)
+    local t_dec = serializer.decode(bigjson)
+    test:is_deeply(t_dec, t, 'encode/decode big strings')
 end)
diff --git a/third_party/lua-cjson/lua_cjson.c b/third_party/lua-cjson/lua_cjson.c
index 38e999870..85186d6d5 100644
--- a/third_party/lua-cjson/lua_cjson.c
+++ b/third_party/lua-cjson/lua_cjson.c
@@ -51,8 +51,7 @@
 #include "mp_extension_types.h" /* MP_DECIMAL, MP_UUID */
 #include "tt_static.h"
 #include "uuid/tt_uuid.h" /* tt_uuid_to_string(), UUID_STR_LEN */
-
-#define DEFAULT_ENCODE_KEEP_BUFFER 1
+#include "cord_buf.h"
 
 typedef enum {
     T_OBJ_BEGIN,
@@ -100,10 +99,6 @@ static struct luaL_serializer *luaL_json_default;
 static json_token_type_t ch2token[256];
 static char escape2char[256];  /* Decoding */
 
-/* encode_buf is only allocated and used when
- * encode_keep_buffer is set */
-static strbuf_t encode_buf;
-
 typedef struct {
     const char *data;
     const char *ptr;
@@ -182,9 +177,6 @@ static int json_destroy_config(lua_State *l)
 static void json_create_tokens()
 {
     int i;
-#if DEFAULT_ENCODE_KEEP_BUFFER > 0
-    strbuf_init(&encode_buf, 0);
-#endif
 
     /* Decoding init */
 
@@ -444,7 +436,9 @@ static int json_encode(lua_State *l) {
                   "expected 1 or 2 arguments");
 
     /* Reuse existing buffer. */
-    strbuf_reset(&encode_buf);
+    strbuf_t encode_buf;
+    struct ibuf *ibuf = cord_ibuf_take();
+    strbuf_create(&encode_buf, -1, ibuf);
     struct luaL_serializer *cfg = luaL_checkserializer(l);
 
     if (lua_gettop(l) == 2) {
@@ -458,6 +452,13 @@ static int json_encode(lua_State *l) {
 
     char *json = strbuf_string(&encode_buf, NULL);
     lua_pushlstring(l, json, strbuf_length(&encode_buf));
+    /*
+     * Even if an exception is raised above, it is fine to skip the buffer
+     * destruction. The strbuf object destructor does not free anything, and
+     * the cord_ibuf object is freed automatically on a next yield.
+     */
+    strbuf_destroy(&encode_buf);
+    cord_ibuf_put(ibuf);
     return 1;
 }
 
@@ -885,8 +886,9 @@ static void json_throw_parse_error(lua_State *l, json_parse_t *json,
                                    const char *exp, json_token_t *token)
 {
     const char *found;
-
-    strbuf_free(json->tmp);
+    struct ibuf *ibuf = json->tmp->ibuf;
+    strbuf_destroy(json->tmp);
+    cord_ibuf_put(ibuf);
 
     if (token->type == T_ERROR)
         found = token->value.string;
@@ -919,7 +921,9 @@ static void json_decode_descend(lua_State *l, json_parse_t *json, int slots)
     char err_context[ERR_CONTEXT_MAX_LENGTH + 1];
     fill_err_context(err_context, json, json->ptr - json->cur_line_ptr - 1);
 
-    strbuf_free(json->tmp);
+    struct ibuf *ibuf = json->tmp->ibuf;
+    strbuf_destroy(json->tmp);
+    cord_ibuf_put(ibuf);
     luaL_error(l, "Found too many nested data structures (%d) on line %d at cha"
                "racter %d here '%s'", json->current_depth, json->line_count,
                json->ptr - json->cur_line_ptr, err_context);
@@ -1101,7 +1105,10 @@ static int json_decode(lua_State *l)
     /* Ensure the temporary buffer can hold the entire string.
      * This means we no longer need to do length checks since the decoded
      * string must be smaller than the entire json string */
-    json.tmp = strbuf_new(json_len);
+    strbuf_t decode_buf;
+    json.tmp = &decode_buf;
+    struct ibuf *ibuf = cord_ibuf_take();
+    strbuf_create(&decode_buf, json_len, ibuf);
 
     json_next_token(&json, &token);
     json_process_value(l, &json, &token);
@@ -1112,7 +1119,8 @@ static int json_decode(lua_State *l)
     if (token.type != T_END)
         json_throw_parse_error(l, &json, "the end", &token);
 
-    strbuf_free(json.tmp);
+    strbuf_destroy(&decode_buf);
+    cord_ibuf_put(ibuf);
 
     return 1;
 }
diff --git a/third_party/lua-cjson/strbuf.c b/third_party/lua-cjson/strbuf.c
index f0f7f4b9a..22c1f2093 100644
--- a/third_party/lua-cjson/strbuf.c
+++ b/third_party/lua-cjson/strbuf.c
@@ -28,6 +28,7 @@
 #include <string.h>
 
 #include "strbuf.h"
+#include "small/ibuf.h"
 
 static void die(const char *fmt, ...)
 {
@@ -41,7 +42,7 @@ static void die(const char *fmt, ...)
     exit(-1);
 }
 
-void strbuf_init(strbuf_t *s, int len)
+void strbuf_create(strbuf_t *s, int len, struct ibuf *ibuf)
 {
     int size;
 
@@ -50,37 +51,19 @@ void strbuf_init(strbuf_t *s, int len)
     else
         size = len + 1;         /* \0 terminator */
 
-    s->buf = NULL;
+    s->buf = ibuf_reserve(ibuf, size);
     s->size = size;
     s->length = 0;
     s->increment = STRBUF_DEFAULT_INCREMENT;
-    s->dynamic = 0;
     s->reallocs = 0;
     s->debug = 0;
-
-    s->buf = malloc(size);
+    s->ibuf = ibuf;
     if (!s->buf)
         die("Out of memory");
 
     strbuf_ensure_null(s);
 }
 
-strbuf_t *strbuf_new(int len)
-{
-    strbuf_t *s;
-
-    s = malloc(sizeof(strbuf_t));
-    if (!s)
-        die("Out of memory");
-
-    strbuf_init(s, len);
-
-    /* Dynamic strbuf allocation / deallocation */
-    s->dynamic = 1;
-
-    return s;
-}
-
 void strbuf_set_increment(strbuf_t *s, int increment)
 {
     /* Increment > 0:  Linear buffer growth rate
@@ -99,36 +82,9 @@ static inline void debug_stats(strbuf_t *s)
     }
 }
 
-/* If strbuf_t has not been dynamically allocated, strbuf_free() can
- * be called any number of times strbuf_init() */
-void strbuf_free(strbuf_t *s)
-{
-    debug_stats(s);
-
-    if (s->buf) {
-        free(s->buf);
-        s->buf = NULL;
-    }
-    if (s->dynamic)
-        free(s);
-}
-
-char *strbuf_free_to_string(strbuf_t *s, int *len)
+void strbuf_destroy(strbuf_t *s)
 {
-    char *buf;
-
     debug_stats(s);
-
-    strbuf_ensure_null(s);
-
-    buf = s->buf;
-    if (len)
-        *len = s->length;
-
-    if (s->dynamic)
-        free(s);
-
-    return buf;
 }
 
 static int calculate_new_size(strbuf_t *s, int len)
@@ -171,9 +127,21 @@ void strbuf_resize(strbuf_t *s, int len)
         fprintf(stderr, "strbuf(%lx) resize: %d => %d\n",
                 (long)s, s->size, newsize);
     }
-
     s->size = newsize;
-    s->buf = realloc(s->buf, s->size);
+
+    struct ibuf *ibuf = s->ibuf;
+    /*
+     * Propagate the write position to enable memcpy() when realloc happens
+     * inside of the ibuf.
+     */
+    ibuf->wpos += s->length;
+    s->buf = ibuf_reserve(ibuf, newsize - s->length) - s->length;
+    /*
+     * But then it is reverted because memcpy() of the old data is done, and the
+     * write position + capacity are managed by strbuf itself.
+     */
+    ibuf->wpos = s->buf;
+
     if (!s->buf)
         die("Out of memory");
     s->reallocs++;
diff --git a/third_party/lua-cjson/strbuf.h b/third_party/lua-cjson/strbuf.h
index d861108c1..b2f8a6efa 100644
--- a/third_party/lua-cjson/strbuf.h
+++ b/third_party/lua-cjson/strbuf.h
@@ -25,6 +25,8 @@
 #include <stdlib.h>
 #include <stdarg.h>
 
+struct ibuf;
+
 /* Size: Total bytes allocated to *buf
  * Length: String length, excluding optional NULL terminator.
  * Increment: Allocation increments when resizing the string buffer.
@@ -36,9 +38,10 @@ typedef struct {
     int size;
     int length;
     int increment;
-    int dynamic;
     int reallocs;
     int debug;
+    /** Backend allocator for the buffer data. */
+    struct ibuf *ibuf;
 } strbuf_t;
 
 #ifndef STRBUF_DEFAULT_SIZE
@@ -49,13 +52,11 @@ typedef struct {
 #endif
 
 /* Initialise */
-extern strbuf_t *strbuf_new(int len);
-extern void strbuf_init(strbuf_t *s, int len);
+extern void strbuf_create(strbuf_t *s, int len, struct ibuf *ibuf);
 extern void strbuf_set_increment(strbuf_t *s, int increment);
 
 /* Release */
-extern void strbuf_free(strbuf_t *s);
-extern char *strbuf_free_to_string(strbuf_t *s, int *len);
+extern void strbuf_destroy(strbuf_t *s);
 
 /* Management */
 extern void strbuf_resize(strbuf_t *s, int len);
@@ -80,11 +81,6 @@ static inline void strbuf_reset(strbuf_t *s)
     s->length = 0;
 }
 
-static inline int strbuf_allocated(strbuf_t *s)
-{
-    return s->buf != NULL;
-}
-
 /* Return bytes remaining in the string buffer
  * Ensure there is space for a NULL terminator. */
 static inline int strbuf_empty_length(strbuf_t *s)
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-05-25 21:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-23 14:06 [Tarantool-patches] [PATCH 1/1] json: use cord_ibuf for encoding and decoding Vladislav Shpilevoy via Tarantool-patches
2021-05-24 10:04 ` Oleg Babin via Tarantool-patches
2021-05-24 15:49   ` Vladislav Shpilevoy via Tarantool-patches
2021-05-24 16:00     ` Oleg Babin via Tarantool-patches
2021-05-24 13:01 ` Serge Petrenko via Tarantool-patches
2021-05-24 13:05   ` Serge Petrenko via Tarantool-patches
2021-05-24 15:47     ` Vladislav Shpilevoy via Tarantool-patches
2021-05-24 15:47   ` Vladislav Shpilevoy via Tarantool-patches
2021-05-24 16:17     ` Serge Petrenko via Tarantool-patches
2021-05-25 21:20 ` Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox