From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 13E4B6EC40; Thu, 8 Jul 2021 01:42:56 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 13E4B6EC40 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1625697776; bh=gyljOvzgSVWOvLrfQKSDbgnejaTANFN9Jqzv5XBOkmU=; h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=tTBJjUPtP37MSk4rs1NJN1q/O+AznzLvwVj98E6+2N1wMSRyxcg7M+HeZjPodZMKU 9oF9AydwuC/4EvcMxzPVuXilq7MJDV2mDms/EUJKzFytGGT1j44KS6WdU9xODqOcht o8nHkuQlJ1elcsJQUuzrAUyNHCWhYHbQWx6d2lTs= Received: from smtp41.i.mail.ru (smtp41.i.mail.ru [94.100.177.101]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 5A8456EC40 for ; Thu, 8 Jul 2021 01:42:54 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 5A8456EC40 Received: by smtp41.i.mail.ru with esmtpa (envelope-from ) id 1m1GFo-0004H6-RY; Thu, 08 Jul 2021 01:42:53 +0300 Date: Thu, 8 Jul 2021 01:42:26 +0300 To: Roman Khabibov Message-ID: <20210707224226.ovlmanq2u3kv72yq@tkn_work_nb> References: <20210311054949.91263-1-roman.habibov@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20210311054949.91263-1-roman.habibov@tarantool.org> X-4EC0790: 10 X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD954DFF1DC42D673FBE6C6848F3EB6EB89AE0756428E32ECE9182A05F538085040F60457FFA37DAAB28F248A0CFA77376843D4534E6B1391B650B4FD8E40E9E26D X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7A1DB0B089319D380EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F79006374A24022C550661178638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D8EE1713AB1675364180DFD1058EDED3F7117882F4460429724CE54428C33FAD305F5C1EE8F4F765FC2EE5AD8F952D28FBA471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352033AC447995A7AD18E5D25F19253116ADD2E47CDBA5A96583BA9C0B312567BB2376E601842F6C81A19E625A9149C048EE0AC5B80A05675ACD1B780A39BCC1DD35D8FC6C240DEA7642DBF02ECDB25306B2B78CF848AE20165D0A6AB1C7CE11FEE3D23BF7408B3F9022BA3038C0950A5D36B5C8C57E37DE458B0BC6067A898B09E46D1867E19FE14079C09775C1D3CA48CF3D321E7403792E342EB15956EA79C166A417C69337E82CC275ECD9A6C639B01B78DA827A17800CE76D0F27F7E6A6C418731C566533BA786AA5CC5B56E945C8DA X-C1DE0DAB: 0D63561A33F958A560D174B06333892E1AA8BDAA53B6438C10AEC9604014410CD59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA753753CEE10E4ED4A7410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34047E322BFAFD69BD04328A3C1CD15BAB2F0DD4018C4E7704A3E20F3B1CE1BE220ABB2BAA557BF2B81D7E09C32AA3244C8B0FDD02F5E708DC6F141E365D8516789CA7333006C390A08D5DD81C2BAB7D1D X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojwWhFrYo6Pn2oyb/oKEcEXw== X-Mailru-Sender: FFAA8E4AEE17E37C3731A083A1A85ADEBD957747244CD1752E8D69FC31556DACB7EA9FE7735C3DBFC664A44C781FCEA7C77752E0C033A69EDF9F2CE1E9CF805D8CD356D4F938FF726C18EFA0BB12DBB0 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH] serializer: serialize recursive structures X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Alexander Turenko via Tarantool-patches Reply-To: Alexander Turenko Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy , Cyrill Gorcunov Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" On Thu, Mar 11, 2021 at 08:49:49AM +0300, Roman Khabibov via Tarantool-patches wrote: > Fix bug with bus error during serializing of recursive > structures. > > Closes #3228 > --- > > Branch: https://github.com/tarantool/tarantool/tree/romanhabibov/gh-3228-visited-set > Issue: https://github.com/tarantool/tarantool/issues/3228 Sorry for the long delay. It is really hard to me to explain complex topics in a more or less structured way. Please, see my comments below. ## Overview I splitted all problems we found into four ones that look atomic. I described three problems (A, B, C) in the issue of the topic (gh-3228) and moved one into its own issue (gh-6201). Here I refer those problems by given names. My comments are for 7206b41ff1456e71474c2ff84b214f770e5897af ('serializer: serialize recursive structures'). I'll give a short checklist first and my notes from meetings with Roman next. ## Patchset checklist - [ ] All refactoring changes should be separated from ones that modify user visible behaviour. Example: extracting of common parts of yaml and lua serializers. - [ ] Behaviour of table/userdata/cdata with __serialize is the same (there are test cases that verify it). - [ ] The patches should follow tarantool's code style. Example: I didn't saw enums introduced specifically to give names for return codes. - [ ] Ensure that usual nesting (non-recursive structures) is not broken. Example: a table w/ __serialize, which returns a table with __serialize (see meeting notes for a test case). - [ ] It is wishful to split solutions for different problems into separate commits (where possible). - [ ] It is wishful to structurize test cases to make testing matrix clear: {problem A, problem B, problem C} x {table, cdata, userdata}. - [ ] It is wishful to describe the overall algorithm somewhere. A comment for the corresponding header file is the good candidate for this. ## Meeting notes Paragraphs in square brackets are added later to clarify something we didn't discuss on our meetings. {{{ Serializer 2021-05-14 The problem (segfault) ---------------------- [I named it 'problem A'.] setmetatable({}, {__serialize = function(self) return self end}) Single node with recursive __serialize. Complex example: local x = {whoami = 'x'} local a = {whoami = 'a'} local b = {whoami = 'b'} setmetatable(x, {__serialize = function(_) return a end}) setmetatable(a, {__serialize = function(_) return b end}) setmetatable(b, {__serialize = function(_) return a end}) (x -> a -> b) ^ | | | +----+ yaml.encode(x) -> ? However __serialize for a node may be nested, but not recursive: local x = {whoami = 'x'} local a = {whoami = 'a'} local b = {whoami = 'b'} setmetatable(x, {__serialize = function(_) return a end}) setmetatable(a, {__serialize = function(_) return b end}) setmetatable(b, {__serialize = function(_) return {whoami = 'result'} end}) -- !! yaml.encode(x) -> {"whoami": "result"} Related (?) problem (wrong result) ---------------------------------- [I named it 'problem C'.] __serialize result do not participate in references search. local x = {whoami = 'x'} yaml.encode({ foo = x, bar = setmetatable({}, {__serialize = function(_) return x end}) }) ** should be ** --- foo: &1 whoami: x bar: *1 ... ** now ** --- foo: whoami: x bar: whoami: x ... Are those problems actually related? Can we fix the segfault separately from the wrong result case? Details ------- luaL_tofield() calls lua_field_try_serialize(), which calls luaL_tofield() for __serialize result. We should detect a loop, so we should maintain a set of visited (processed) objects. How yaml works now: find_references() tracks objects in a table (anchortable): - key: object - value: nil -- we didn't meet the object false -- seen once true -- seen twice or more At dump yaml enumerates anchortable[obj] == true objects and set anchortable[obj] = '0' ('1', '2' and so on). In the resulting patch (re wrong result): replacement objects are tracked in another table TMP: Whether loop detection set may be replaced with ref map. ### cdata / userdata How cdata/userdata with recursive __serialize works now? We should test it and the behaviour should be similar to tables. ### Non-recursive serialize for a single node [It is the test case for the 'Ensure <...>' checklist bullet.] Current master works this way: | tarantool> a = {} | tarantool> b = {} | tarantool> setmetatable(b, {__serialize = function(_) return 'string' end}) | --- | - string | ... | tarantool> setmetatable(a, {__serialize = function(_) return b end}) | --- | - string | ... Current patch has the following behaviour: | <...> | tarantool> setmetatable(a, {__serialize = function(_) return b end}) | --- | - [] | ... We should return 'string' for `a`: call __serialize for `a` and, then, for `b`. It is how the serializer works now, so backward compatibility requires it. }}} {{{ Serializer 2021-05-19 Known problems (about existing patch) ------------------------------------- * Non-recusive serialize for a single node (backward compatibility). * Naming (at least enum constants). * Test cdata / userdata in the similar cases as we test tables. * No overall description how it works: it is hard to dive into the topic. To discuss ---------- * Is it possible to split segfault and wrong result patches? * Elaborate constants naming. * Where to write serializer docs: source comments, doc/, tarantool-internals? * We should discuss a test plan. Discussion ---------- ## The second segfault problem [I named it 'problem B'.] setmetatable({}, {__serialize = function(self) return {self} end}) ## Is it possible to split segfault and wrong result patches? Seems so. Don't move __serialize calls before dumping in the first patch. But introduce visited objects set / map. It seems a set is enough here. But it depends on a behaviour in the x -> a -> b -> a -> b -> ... case. It seems that it'll simplify the patch. NB: Don't forget about cdata / userdata. It also worth to separate refactoring from user visible changes. ## We should discuss a test plan Rough testing matrix for the first patch: Cases: * a -> a -> a -> ... setmetatable({}, {__serialize = function(self) return self end}) * a -> something that contains a setmetatable({}, {__serialize = function(self) return {self} end}) * a -> b -> a -> b -> ... * x -> a -> b -> a -> b -> ... [Those cases are problems A, B, C. First axis.] Three types, where __serialize is applicable: * table * cdata * userdata * plus recursion over table + cdata or cdata + userdata [This is the second axis.] [Fourth case goes to gh-6201.] For the second patch (wrong result): TBD [The same, table / userdata / cdata? Are there other axes?] ## Elaborate constants naming [Or discard it and use usual unnamed return codes.] It is for the second patch (wrong result). Even more: it'll be an intermediate patch with refactoring. +enum try_serialize_ret_code +{ + + TRY_SERIALIZE_ERROR = -1, + TRY_SERIALIZE_RES_TABLE_NOT = 0, + TRY_SERIALIZE_RES_FROM_MAP, + TRY_SERIALIZE_DEFAULT_TABLE, +}; + * TRY_SERIALIZE_ERROR - error occurs, diag is set, the top of + * guest stack is undefined. + * TRY_SERIALIZE_RES_TABLE_NOT - __serialize field is available in + * the metatable, the result value is put in the origin slot, + * encoding is finished. + * TRY_SERIALIZE_RES_FROM_MAP - __serialize field is available in + * the metatable, the result value is table from serialized + * objects map. + * TRY_SERIALIZE_DEFAULT_TABLE - __serialize field is not + * available in the metatable, proceed with default table + * encoding. +enum get_anchor_ret_code +{ + GET_ANCHOR_NO_REFS = 0, + GET_ANCHOR_NAMED_NOT, + GET_ANCHOR_ALIASED, +}; ---- Why we need to distinguish situations, when a table remains the same (no __serialize is not called) and when we call it and replace the Lua object on the stack. It seems TRY_SERIALIZE_RES_FROM_MAP is redundant. It means that the result is from the visited set. TRY_SERIALIZE_RES_TABLE_NOT -- grammar. Is not a table. TRY_SERIALIZE_RES_TABLE_NOT -- we can receive cdata / userdata with __serialize, which'll return a table. (NB: to test cases; return self, return {self}.) What information we should return from lua_field_try_serialize() to skip or proceed with array / map guessing. 1. result not a table -> don't guess whether it is map / array 2. __serialize = 'map/seq/sequence/mapping' -> don't guess whether it is map / array 3. __serialize = function(self) <...> end -> check by 1, 2 lua_field_try_serialize() retval: 1. error 2. we set array / map (ask to don't guess it afterwards) 3. success (no __serialize, succeeded __serialize function call) I don't understand how lua_field_inspect_table() works if __serialize function returns a string. TRY_SERIALIZE_ERROR TRY_SERIALIZE_IS_SEQUENCE_OR_MAPPING_DETERMINED TRY_SERIALIZE_EVERYTHING_ELSE (Looks weird.) -1 0 1 (Maybe keep numbers?) Another variant: try_serialize(<...>, bool *known_as_array_or_map) retval: -1 (error) / 0 (success) ---- get_anchor() -- we should give information, whether we meet the object first time (and dump `&0 obj`) or next time (and dump `*0`). And 'no anchor / no refs' situation. GET_ANCHOR_NAMED_NOT -- grammar. Is not named. | Case | Retval | *anchor | | ---------- | -------------------- | ------- | | No refs | NO_REFS | NULL | | First time | GET_ANCHOR_NAMED_NOT | "0" | | Next time | GET_ANCHOR_ALIASED | "0" | Possible naming: GET_ANCHOR_NO_REFS GET_ANCHOR_FIRST_VISIT GET_ANCHOR_NEXT_VISIT Another variant: GET_ANCHOR_UNNAMED GET_ANCHOR_JUST_NAMED GET_ANCHOR_ALREADY_NAMED Another variant: void get_anchor(<...>, const char **anchor, bool *is_just_named) ## Where to write serializer docs: source comments, doc/, tarantool-internals? TBD [Let's start with a comment in the header file.] AR Alexander: create a checklist. [Done.] }}}