From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex)
	with ESMTP id 81F7E2F221 for ; Wed, 31 Oct 2018 01:25:21 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id Vnxe1cjCXRFt for ; Wed, 31 Oct 2018 01:25:21 -0400 (EDT)
Received: from smtp50.i.mail.ru (smtp50.i.mail.ru [94.100.177.110])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex)
	with ESMTPS id EBF6C2F21F for ; Wed, 31 Oct 2018 01:25:20 -0400 (EDT)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
From: Nikita Tatunov
In-Reply-To: <20181029130123.f254chdxxuwi6c4w@tkn_work_nb>
Date: Wed, 31 Oct 2018 08:25:12 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <3D4337BA-F528-425C-B352-C195C20DA282@tarantool.org>
References: <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb>
	<436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org>
	<87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org>
	<6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org>
	<58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org>
	<860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org>
	<45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org>
	<20181021035140.avx6d3rokx5ta6hi@tkn_work_nb>
	<6740948F-6C40-4C0F-B237-7C3573225FBC@tarantool.org>
	<20181029130123.f254chdxxuwi6c4w@tkn_work_nb>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help:
List-unsubscribe:
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches
List-subscribe:
List-owner:
List-post:
List-archive:
To: Alexander Turenko
Cc: tarantool-patches@freelists.org,
	korablev@tarantool.org

> On Oct 29, 2018, at 16:01, Alexander Turenko wrote:
>
> The patch is okay except one note re test case.
>
> WBR, Alexander Turenko.
>
>> --- EVIDENCE-OF: R-39414-35489 The infix GLOB operator is implemented by
>> --- calling the function glob(Y,X) and can be modified by overriding that
>> --- function.
>
> This test case was removed, while we do not have a similar one for LIKE.

I guess this concerns the second patch more. Anyway, if you mean the
tests following this comment, there are actually some similar tests for
LIKE (15.1.x).

>
>>>>> if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
>>>>>     *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>>>     return 0xffff;
>>>>
>>>> 4. I’m not sure if string data can be this long in our context.
>>>> (string length > (size_t) 0x7ffffffff)
>>>
>>> Note: not 0x7ffffffff, but 0x7fffffff.
>>>
>>> This limit seems to be some weird internal thing related to using
>>> ucnv_getNextUChar inside libicu.
>>>
>>> I propose to lie to libicu about the buffer size when it exceeds
>>> this limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can
>>> pass the following instead of pattern_end:
>>>
>>> ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end)
>>>
>>> I think this trick needs to be covered with a unit test (because it is unclear
>>> how to create a string of size >1GiB from lua). Not sure whether it is
>>> okay to allocate such an amount of memory in the test, though...
>>>
>>> Please don't do that within this patch, because it concerns another bug.
>>> File an issue with all needed information instead (you can provide a link to
>>> this message for example).
>>
>> Ok, thank you for the advice.
>> I think that’s a good idea, but there’s one thing
>> I’m getting concerned about: it will cause a lot of extra operations, especially
>> when we’re using LIKE to scan a lot of data (). I guess even if it’s
>> relevant, it’s a discussion for the issue that’s going to be filed.
>
> Filed https://github.com/tarantool/tarantool/issues/3773
>
>>>>> } else if(U_SUCCESS(*err) && c>=0) {
>>>>>     return c;
>>>>
>>>> 6. Returns the symbol (can also be 0xfffd, as it is not treated as an actual error).
>>>>
>>>> So if I’m not mistaken we will get results in our function either from
>>>> ‘return’ number 5 or number 6 and the following code will not be executed.
>>>
>>> It is not so. We'll fall through in case of a U_ILLEGAL_CHAR_FOUND or
>>> U_TRUNCATED_CHAR_FOUND error.
>>>
>>> To be honest I don't want to continue. It seems we should not lean on
>>> the fact that 0xffff always means end of the buffer, because it is not
>>> guaranteed by the API and is not clear from the code.
>>>
>>> AFAIR, the problem was to choose an appropriate symbol to mark the end of
>>> buffer situation and distinguish it from a real error. It seems we do not
>>> have one. So we should properly (and always) check the buffer bounds before a
>>> call to ucnv_getNextUChar() or check the status it provides after the
>>> call. I would prefer to check it in our code. It seems that this is how
>>> the API works.
>>>
>>> I propose to use the same code pattern for all Utf8Read calls, e.g.:
>>>
>>> if (pattern < pattern_end)
>>>     c = Utf8Read(pattern, pattern_end);
>>> else
>>>     return SQL_...;
>>> if (c == SQL_INVALID_UTF8_SYMBOL)
>>>     return SQL_...;
>>> assert(U_SUCCESS(status));
>>>
>>> Note: I have added the assert, because it is not clear what we can do
>>> with, say, U_INVALID_TABLE_FORMAT (improper libicu build /
>>> installation). Hope Nikita P.
>>> suggests the right way, but for now I think we
>>> should at least assert on that.
>>>
>>> It seems the code above can even be wrapped into a macro that will get
>>> two pointers (pattern and pattern_end / string and string_end) and two
>>> SQL_... error codes to handle the two possible errors. Yep, it is generally
>>> discouraged to return from a macro, but if it greatly improves the
>>> code readability, it is appropriate, I think. Just define the macro
>>> right before the function and undefine it after to show a reader it is
>>> some purely internal thing.
>>>
>>> Note: If you go that way, don't wrap the Utf8Read macro into another
>>> macro. Use one with the ucnv_getNextUChar call.
>>>
>>> It is a refactoring of the code and out of the scope of your issue.
>>> Please, file an issue and link this message into it (but please ask
>>> Nikita P.'s opinion before).
>>>
>>> It is not good IMHO, but it seems that for now it is worth leaving the code
>>> with the assumption that 0xffff is the end of the buffer. This is kind of
>>> splitting the problem into parts and allows us to proceed with this patch re
>>> parsing bug.
>
> Filed https://github.com/tarantool/tarantool/issues/3774
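
For reference, here is a minimal C sketch of the two ideas discussed in
this thread: clamping the limit handed to the decoder at 0x7fffffff bytes,
and checking for the end of the buffer before every read instead of relying
on 0xffff. It is only an illustration under stated assumptions, not the
patch. The names clamp_span, clamp_limit, decode_one, read_checked, the
READ_* codes and SKETCH_INVALID_SYMBOL are all hypothetical, and decode_one
is a simplified stand-in for Utf8Read / ucnv_getNextUChar (it does not
reject overlong forms or surrogates) so that the sketch builds without
libicu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; the thread elides the real SQL_... names. */
enum { READ_OK = 0, READ_END_OF_BUFFER = 1, READ_INVALID_UTF8 = 2 };

/* Hypothetical sentinel, chosen outside the Unicode code point range. */
#define SKETCH_INVALID_SYMBOL ((uint32_t)0xffffffffu)

/* The span limit quoted from the libicu check in this thread. */
#define MAX_SPAN ((size_t)0x7fffffff)

/* Largest span length we are willing to report to the decoder. */
static size_t
clamp_span(size_t len)
{
	return len > MAX_SPAN ? MAX_SPAN : len;
}

/* Limit pointer to pass instead of the real end of the buffer. */
static const char *
clamp_limit(const char *pos, const char *end)
{
	return pos + clamp_span((size_t)(end - pos));
}

/*
 * Simplified stand-in decoder. Precondition: *pos < end.
 * Advances *pos and returns a code point or SKETCH_INVALID_SYMBOL.
 */
static uint32_t
decode_one(const char **pos, const char *end)
{
	const unsigned char *p = (const unsigned char *)*pos;
	size_t avail = (size_t)(end - *pos);
	uint32_t c = p[0];
	size_t len;

	if (c < 0x80) {
		len = 1;
	} else if ((c & 0xe0) == 0xc0) {
		len = 2; c &= 0x1f;
	} else if ((c & 0xf0) == 0xe0) {
		len = 3; c &= 0x0f;
	} else if ((c & 0xf8) == 0xf0) {
		len = 4; c &= 0x07;
	} else {
		*pos += 1;		/* stray continuation or bad lead byte */
		return SKETCH_INVALID_SYMBOL;
	}
	if (len > avail) {
		*pos = end;		/* sequence truncated by the limit */
		return SKETCH_INVALID_SYMBOL;
	}
	for (size_t i = 1; i < len; i++) {
		if ((p[i] & 0xc0) != 0x80) {
			*pos += i;	/* skip the bytes consumed so far */
			return SKETCH_INVALID_SYMBOL;
		}
		c = (c << 6) | (p[i] & 0x3f);
	}
	*pos += len;
	return c;
}

/* Check the bounds first, then decode, then check for the sentinel. */
static int
read_checked(const char **pos, const char *end, uint32_t *out)
{
	if (*pos >= end)
		return READ_END_OF_BUFFER;
	*out = decode_one(pos, clamp_limit(*pos, end));
	if (*out == SKETCH_INVALID_SYMBOL)
		return READ_INVALID_UTF8;
	return READ_OK;
}
```

A caller would loop on read_checked() until it returns READ_END_OF_BUFFER,
so the end of the buffer and a malformed sequence are reported separately.
The sketch uses a function with a status code rather than the returning
macro suggested above, purely to keep the example easy to test.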