Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
From: Nikita Tatunov
Date: Fri, 26 Oct 2018 18:19:46 +0300
Message-Id: <6740948F-6C40-4C0F-B237-7C3573225FBC@tarantool.org>
In-Reply-To: <20181021035140.avx6d3rokx5ta6hi@tkn_work_nb>
Reply-To: tarantool-patches@freelists.org
List-Id: tarantool-patches
To: tarantool-patches@freelists.org
Cc: korablev@tarantool.org, Alexander
Turenko

Hello, Alexander! Please consider this reply to the review.

Issues:
https://github.com/tarantool/tarantool/issues/3251
https://github.com/tarantool/tarantool/issues/3334

Branch:
https://github.com/tarantool/tarantool/tree/N_Tatunov/gh-3251-where-like-hangs

> On Oct 21, 2018, at 06:51, Alexander Turenko wrote:
>
> Hi!
>
> Thanks for your work.
>
> The email is big, but don't be afraid. I won't push you to rewrite the
> whole thing again :)
>
> The patch is generally okay for me. I have added minor comments on it.
>
> Below I answer your investigation of the libicu code. I found
> that you were not right in some assumptions, but I propose to postpone
> the refactoring out of this task.
>
> Then I found some possible corner cases. They are out of the scope of
> your task too, so I propose to check them and file issues.
>
> WBR, Alexander Turenko.
>
> 1, 2 - ok.
>
>>>     s=*source;
>>>     if(sourceLimit<s) {
>>>         *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>         return 0xffff;
>>
>> 3. This one cannot trigger in sql_utf8_pattern_compare():
>> 1) ucnv_getNextUChar is only called when !(sourceLimit
>
> The discussion is about the patch leaning on this check, but here you
> say it cannot be triggered. A mistake? It seems it can be triggered, and
> it is the case we check in our code. So, ok.

Yes, I was mistaken. I forgot that I had changed the macro, thank you.

>
>>>     /*
>>>      * Make sure that the buffer sizes do not exceed the number range for
>>>      * int32_t because some functions use the size (in units or bytes)
>>>      * rather than comparing pointers, and because offsets are int32_t values.
>>>      *
>>>      * size_t is guaranteed to be unsigned and large enough for the job.
>>>      *
>>>      * Return with an error instead of adjusting the limits because we would
>>>      * not be able to maintain the semantics that either the source must be
>>>      * consumed or the target filled (unless an error occurs).
>>>      * An adjustment would be sourceLimit=t+0x7fffffff; for example.
>>>      */
>>>     if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
>>>         *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>         return 0xffff;
>>
>> 4. I’m not sure if string data can be this long in our context.
>> (string length > (size_t) 0x7ffffffff)
>
> Note: not 0x7ffffffff, but 0x7fffffff.
>
> This limit seems to be some weird internal thing related to using
> ucnv_getNextUChar inside libicu.
>
> I propose to lie to libicu about the buffer size when it exceeds
> this limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can
> pass the following instead of pattern_end:
>
> ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end)
>
> I think this trick needs to be covered with a unit test (because it is unclear
> how to create a string of size >1GiB from lua). I am not sure whether it is
> okay to allocate such an amount of memory in the test, though...
>
> Please don't do that within this patch, because it is about another bug.
> File an issue with all the needed information instead (you can provide a link to
> this message, for example).

Ok, thank you for the advice. I think that’s a good idea, but there’s one
thing I’m concerned about: it will cause a lot of extra operations, especially
when we’re using LIKE to scan a lot of data (). I guess even if it’s relevant,
it’s a discussion for the issue that’s going to be filed.
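
For the issue, a minimal sketch of the clamp discussed above. The helper names (icu_clamped_len, icu_clamped_end) are hypothetical and not part of the patch; only the 0x7fffffff limit comes from the libicu check quoted above. The length is clamped separately from the pointer so that the logic can be unit-tested without allocating a multi-gigabyte buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Largest source length accepted by the libicu check quoted above. */
#define ICU_MAX_SOURCE_LEN ((size_t) 0x7fffffff)

/* Hypothetical helper: the length to report to libicu for `len` bytes. */
static size_t
icu_clamped_len(size_t len)
{
    return len > ICU_MAX_SOURCE_LEN ? ICU_MAX_SOURCE_LEN : len;
}

/*
 * Hypothetical helper: the end pointer to pass instead of the real one.
 * It is relative to the current position, so it has to be recomputed
 * after every read.
 */
static const char *
icu_clamped_end(const char *pos, const char *end)
{
    assert(pos <= end);
    return pos + icu_clamped_len((size_t) (end - pos));
}
```

Since the clamped end depends on the current position, it cannot be computed once per string: a value computed at the start would look like a premature end of buffer for longer inputs. That is one comparison per character read, which is the cost to weigh in the issue.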
>
>>>     if(c<0) {
>>>         /*
>>>          * call the native getNextUChar() implementation if we are
>>>          * at a character boundary (toULength==0)
>>>          *
>>>          * unlike with _toUnicode(), getNextUChar() implementations must set
>>>          * U_TRUNCATED_CHAR_FOUND for truncated input,
>>>          * in addition to setting toULength/toUBytes[]
>>>          */
>>>         if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
>>>             c=cnv->sharedData->impl->getNextUChar(&args, err);
>>>             *source=s=args.source;
>>>             if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
>>>                 /* reset the converter without calling the callback function */
>>>                 _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
>>>                 return 0xffff; /* no output */
>>
>> 5. Occurs when trying to access unindexed data.
>
> I didn't get your note here. It seems we call ucnv_getNextUChar_UTF8 from
> ucnv_u8.c here (because of the "utf8" type of the converter in pUtf8conv
> in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s >
> sourceLimit), so it cannot occur here. In case of another error
> (U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to
> _toUnicode() as the comment (below the code pasted above) suggests.
> I didn't investigate further.
>
>>>             } else if(U_SUCCESS(*err) && c>=0) {
>>>                 return c;
>>
>> 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error).
>>
>> So if I’m not mistaken we will get results in our function either from
>> ‘return’ number 5 or number 6 and the following code will not be executed.
>
> It is not so. We'll fall through in case of a U_ILLEGAL_CHAR_FOUND or
> U_TRUNCATED_CHAR_FOUND error.
>
> To be honest, I don't want to continue. It seems we should not lean on
> the fact that 0xffff always means end of the buffer, because it is not
> guaranteed by the API and is not clear from the code.
>
> AFAIR, the problem was to choose an appropriate symbol to mark the end of the
> buffer situation and distinguish it from a real error.
> It seems we do not have
> one. So we should fairly (and always) check for the buffer before a
> call to ucnv_getNextUChar() or check the status it provides after the
> call. I would prefer to check it in our code. It seems that this is how
> the API works.
>
> I propose to use the same code pattern for all Utf8Read calls, e.g.:
>
> if (pattern < pattern_end)
>         c = Utf8Read(pattern, pattern_end);
> else
>         return SQL_...;
> if (c == SQL_INVALID_UTF8_SYMBOL)
>         return SQL_...;
> assert(U_SUCCESS(status));
>
> Note: I have added the assert, because it is not clear what we can do
> with, say, U_INVALID_TABLE_FORMAT (improper libicu build /
> installation). I hope Nikita P. suggests the right way, but for now I think we
> should at least assert on that.
>
> It seems the code above can even be wrapped into a macro that will get
> two pointers (pattern and pattern_end / string and string_end) and two
> SQL_... error codes to handle the two possible errors. Yep, it is generally
> discouraged to return from a macro, but if it greatly improves the
> code readability, it is appropriate, I think. Just define the macro
> right before the function and undefine it after to show a reader it is
> a purely internal thing.
>
> Note: If you go that way, don't wrap the Utf8Read macro into another
> macro. Use one with the ucnv_getNextUChar call.
>
> It is a refactoring of the code and out of the scope of your issue.
> Please, file an issue and link this message into it (but please ask
> Nikita P.'s opinion before).
>
> It is not good IMHO, but it seems it is now worth leaving the code with the
> assumption that 0xffff is the end of buffer. This is kind of splitting the
> problem into parts, and it allows us to proceed with this patch re parsing
> bug.
>
> About the patch
> ---------------
>
> Please, post the issue and branch links if you don't cite them.
> Sometimes it is hard to find them in a mail client history, esp. after
> some significant delay in a discussion.
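
Returning to the Utf8Read pattern proposed above: a self-contained sketch of the "check the buffer first, then check the symbol" order, for the issue to be filed. Everything here is an assumption made for illustration: toy_next_char() is a toy stand-in for ucnv_getNextUChar() so that the example builds without libicu, the read_status codes replace the SQL_... codes elided in the mail, and INVALID_SYMBOL plays the role of SQL_INVALID_UTF8_SYMBOL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel, playing the role of SQL_INVALID_UTF8_SYMBOL. */
#define INVALID_SYMBOL 0xfffdu

/* Hypothetical stand-ins for the SQL_... codes elided in the mail. */
enum read_status { READ_OK, READ_END_OF_BUFFER, READ_INVALID_UTF8 };

/*
 * Toy replacement for ucnv_getNextUChar(): decodes one UTF-8 sequence
 * and advances *pos. It does not reject overlong forms or surrogates.
 * The caller must guarantee *pos < end.
 */
static uint32_t
toy_next_char(const char **pos, const char *end)
{
    const unsigned char *s = (const unsigned char *) *pos;
    size_t avail = (size_t) (end - *pos);
    uint32_t c = s[0];
    size_t extra;

    if (c < 0x80) {
        extra = 0;
    } else if ((c >> 5) == 0x06) {
        extra = 1; c &= 0x1f;
    } else if ((c >> 4) == 0x0e) {
        extra = 2; c &= 0x0f;
    } else if ((c >> 3) == 0x1e) {
        extra = 3; c &= 0x07;
    } else {
        *pos += 1;              /* stray continuation or bad lead byte */
        return INVALID_SYMBOL;
    }
    if (extra >= avail) {
        *pos = end;             /* truncated sequence */
        return INVALID_SYMBOL;
    }
    for (size_t i = 1; i <= extra; i++) {
        if ((s[i] & 0xc0) != 0x80) {
            *pos += i;          /* skip the bytes examined so far */
            return INVALID_SYMBOL;
        }
        c = (c << 6) | (s[i] & 0x3f);
    }
    *pos += extra + 1;
    return c;
}

/*
 * The proposed pattern: the end of buffer is detected by comparing
 * pointers before the read, so no return value has to double as an
 * end-of-buffer marker. With libicu, assert(U_SUCCESS(status)) would
 * follow the symbol check; the toy decoder has no status to check.
 */
static enum read_status
checked_read(const char **pos, const char *end, uint32_t *out)
{
    assert(pos != NULL && out != NULL);
    if (*pos >= end)
        return READ_END_OF_BUFFER;
    *out = toy_next_char(pos, end);
    if (*out == INVALID_SYMBOL)
        return READ_INVALID_UTF8;
    return READ_OK;
}
```

With this shape the caller never interprets 0xffff at all, which is the point of the proposal. Whether this becomes a function or a macro with early returns is left to the issue.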
>
> I'll consider the patch as a bugfix + code style fix and will not push you
> to rewrite things in a significant way. But I'll ask you to formalize
> the found problems as issues.
>
> It rebased on 2.1 with conflicts. They need to be fixed.
>
>> -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status)
>> +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status))
>
> 'status' is not a parameter of the macro, so there is no need to enclose it in
> parentheses.
>
> I would prefer to have it as a parameter, but it seems that the code has many
> indent levels and will look even uglier than now if we increase line
> lengths. So, just remove the parentheses.

Removed.

>
>> + * @param matchOther The escape char (LIKE) or '[' (GLOB).
>
> It is a symbol from the ESCAPE parameter or garbage from the likeFunc
> stack frame. It seems worth initializing 'u32 escape' in likeFunc to
> some symbol you cannot hit within the 'c' variable in
> sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix
> it here if it is related to your changes or file an issue if it was
> already here.

Changed to SQL_END_OF_STRING.

>> + /* Next pattern and input string chars */
>> + UChar32 c, c2;
>> + /* "?" or "_" */
>> + UChar32 matchOne = pInfo->matchOne;
>> + /* "*" or "%" */
>> + UChar32 matchAll = pInfo->matchAll;
>> + /* True if uppercase==lowercase */
>> + UChar32 noCase = pInfo->noCase;
>> + /* One past the last escaped input char */
>
> Our code style suggests having a period at the end of a comment.

Fixed.

>
>> - assert(matchOther < 0x80); /* '[' is a single-byte character */
>> + assert(matchOther < 0x80);
>
> The comment was helpful, IMHO.
>
> What if we use LIKE with ESCAPE with a symbol such as, say, 'ё'? We don't have
> such test cases, as far as I see.
>
> Again, it does not seem to be the problem you solve here.
> Please, write
> a test case and file an issue if this does not work correctly (it seems
> we can hit the assert above).
>
> Ouch, now I see, it will be removed in the 2nd commit of the patchset.
> So, please, mention it in the commit message of the 1st commit so as not to
> confuse a reviewer. However, the case with a non-ASCII ESCAPE character is
> worth checking anyway.

I think I will just return the fixed comment to the first commit, as it’s
going to be deleted in the next commit anyway.
Wrote tests for non-ASCII chars in e_expr.test.lua.

>
> The code of the sql_utf8_pattern_compare function looks okay for me
> (except the things I suggested to handle separately).
>
>> test/sql-tap/gh-3251-string-pattern-comparison.test.lua
>>
>> - test_name = prefix .. "8." .. tostring(i)
>> + local test_name = prefix .. "8." .. tostring(i)
>
> It is from the 2nd patch, but I think it should be here.
>

Moved to the first patch.