From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 89C492CD1C for ; Sat, 20 Oct 2018 23:51:43 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 90Edpar6NeAO for ; Sat, 20 Oct 2018 23:51:43 -0400 (EDT) Received: from smtp29.i.mail.ru (smtp29.i.mail.ru [94.100.177.89]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 13DC82CC82 for ; Sat, 20 Oct 2018 23:51:42 -0400 (EDT) Date: Sun, 21 Oct 2018 06:51:40 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue Message-ID: <20181021035140.avx6d3rokx5ta6hi@tkn_work_nb> References: <43febf82af3702fadfea135db978ffb6426eb00d.1534436836.git.n.tatunov@tarantool.org> <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb> <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org> <87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org> <6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org> <58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org> <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org> <45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Nikita Tatunov Cc: tarantool-patches@freelists.org, korablev@tarantool.org Hi! Thanks for your work. The email is big, but don't afraid. I don't push you to rewrite the whole things again :) The patch is generally okay for me. Minor comments are added on this. Below I answered to your investigation about the libicu code. I found that you was not right in some assumptions, but propose to postpone refactoring out of this task. Then I found some possible corner cases. They are out of the scope of your task too, so I proposed to check and file issues. WBR, Alexander Turenko. 1, 2 - ok. > > s=*source; > > if(sourceLimit > *err=U_ILLEGAL_ARGUMENT_ERROR; > > return 0xffff; > > 3. This one cannot trigger in sql_utf8_pattern_compare(): > 1) ucnv_getNextUChar is only called when !(sourceLimit > /* > > * Make sure that the buffer sizes do not exceed the number range for > > * int32_t because some functions use the size (in units or bytes) > > * rather than comparing pointers, and because offsets are int32_t values. > > * > > * size_t is guaranteed to be unsigned and large enough for the job. > > * > > * Return with an error instead of adjusting the limits because we would > > * not be able to maintain the semantics that either the source must be > > * consumed or the target filled (unless an error occurs). > > * An adjustment would be sourceLimit=t+0x7fffffff; for example. > > */ > > if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) { > > *err=U_ILLEGAL_ARGUMENT_ERROR; > > return 0xffff; > > 4. I’m not sure if string data can be this long in our context. > (string length > (size_t) 0x7ffffffff) Note: not 0x7ffffffff, but 0x7fffffff. This limit seems to be some weird internal thing related to using ucnv_getNextUChar inside libicu. I propose to lie libicu about the buffer size in case when it exceeds this limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can pass the following instead of pattern_end: ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end I think this trick need to be covered with a unit test (because it is unclear how to create a string of size >1GiB from lua). Don't sure whether it is okay to allocate such amount of memory in the test, though... Please, don't do that within this patch, because it is about the another bug. File an issue with all needed information instead (you can provide a link to this message for example). > > if(c<0) { > > /* > > * call the native getNextUChar() implementation if we are > > * at a character boundary (toULength==0) > > * > > * unlike with _toUnicode(), getNextUChar() implementations must set > > * U_TRUNCATED_CHAR_FOUND for truncated input, > > * in addition to setting toULength/toUBytes[] > > */ > > if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) { > > c=cnv->sharedData->impl->getNextUChar(&args, err); > > *source=s=args.source; > > if(*err==U_INDEX_OUTOFBOUNDS_ERROR) { > > /* reset the converter without calling the callback function */ > > _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE); > > return 0xffff; /* no output */ > > 5. Occurs when trying to access unindexed data. Don't got your note here. It seems we call ucnv_getNextUChar_UTF8 from ucnv_u8.c here (because of the "utf8" type of the converter in pUtf8conv in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s > sourceLimit), so it cannot occur here. In case of an other error (U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to _toUnicode() as the comment (below the code pasted above) suggests. Don't investigated further. > > } else if(U_SUCCESS(*err) && c>=0) { > > return c; > > 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error). > > So if I’m not mistaken we will get results in our function either from > ‘return’ number 5 or number 6 and the following code will not be executed. It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND error. To be honest I don't want to continue. It seems we should not lean on the fact that 0xffff always means end of the buffer, because it does not guaranteed by the API and is not clear from the code. AFAIR, the problem was to choose appropriate symbol to mark end of the buffer situation and distinguish it from a real error. It seems we have not one. So we should fairly (and always) check for the buffer before a call to ucnv_getNextUChar() or check the status it provide after the call. I would prefer to check it in our code. It seems that it is how the API works. I propose to use the same code pattern for all Utf8Read calls, e.g.: if (pattern < pattern_end) c = Utf8Read(pattern, pattern_end) else return SQL_...; if (c == SQL_INVALID_UTF8_SYMBOL) return SQL_...; assert(U_SUCCESS(status)); Note: I have added the assert, because it is not clear what we can do with, say, U_INVALID_TABLE_FORMAT (improper libicu build / installation). Hope Nikita P. suggests right way, but now I think we should at least assert on that. It seems the code above can be even wrapped into a macro that will get two pointers (pattern and pattern_end / string and string_end) and two SQL_... error code to handle two possible errors. Yep, it is generally discouraged to return from a macro, but if it'll greatly improves the code readability, so it is appropriate, I think. Just define the macro right before the function and undefne it after to show a reader it is some pure internal thing. Note: If you will going that way, don't wrap Utf8Read macro into another macro. Use one with ucnv_getNextUChar call. It is refactoring of the code and our of the scope of your issue. Please, file an issue and link this message into it (but please ask Nikita P. opinion before). It is not good IMHO, but it seems now it worth to leave the code with assumption 0xffff is the end of buffer. This is kind of splitting the problem into parts and allow us to proceed with this patch re parsing bug. About the patch --------------- Please, post an issue and a branch links if you don't cite them. Sometimes it is hard to find them in a mail client history, esp. after some significant delay in a discussion. I'll consider the patch as bugfix + code style fix and will not push you to rewrite things in significant way. But I'll ask you to formalize found problems as issues. It rebased on 2.1 with conflicts. Need to be fixed. > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) 'status' is not a parameter of the macro, no need to enclose it into parentheses. I would prefer to have it as the parameter, but it seems that the code has many indent levels and will look even more ugly then now if we'll increase lines lengths. So, just remove the parentheses. > + * @param matchOther The escape char (LIKE) or '[' (GLOB). It is a symbol from ESCAPE parameter or a garbage from the likeFunc stack frame. It seems it worth to initialize 'u32 escape' in likeFunc to some symbol you cannot hit within 'c' variable in sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix it here if it is related to your changes or file an issue if it was already here. > + /* Next pattern and input string chars */ > + UChar32 c, c2; > + /* "?" or "_" */ > + UChar32 matchOne = pInfo->matchOne; > + /* "*" or "%" */ > + UChar32 matchAll = pInfo->matchAll; > + /* True if uppercase==lowercase */ > + UChar32 noCase = pInfo->noCase; > + /* One past the last escaped input char */ Our code style suggests to have a period at end of a comment. > - assert(matchOther < 0x80); /* '[' is a single-byte character */ > + assert(matchOther < 0x80); The comment was helpful, IMHO. What if we'll use LIKE with ESCAPE with symbol, say, 'ё'? We have not such tests cases as I see. Again, it does not seems to be the problem you solve here. Please, write a test case and file an issue if this does not work correctly (it seems we can hit the assert above). Ouch, now I see, it will be removed in the 2nd commit of the patchset. So, please, comment it in the commit message of the 1st commit to don't confuse a reviewer. However the case with non-ASCII ESCAPE character is worth to be checked anyway. The code of the sql_utf8_pattern_compare function looks okay for me (except things I suggested to handle separately). > test/sql-tap/gh-3251-string-pattern-comparison.test.lua > > - test_name = prefix .. "8." .. tostring(i) > + local test_name = prefix .. "8." .. tostring(i) It is from the 2nd patch, but I think should be here.