From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id E7D7921663 for ; Fri, 27 Jul 2018 09:06:03 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xibRSN-l7yr4 for ; Fri, 27 Jul 2018 09:06:03 -0400 (EDT) Received: from smtp61.i.mail.ru (smtp61.i.mail.ru [217.69.128.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 671592157F for ; Fri, 27 Jul 2018 09:06:03 -0400 (EDT) Date: Fri, 27 Jul 2018 16:06:01 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue Message-ID: <20180727130601.b2oby7dleapd5upg@tkn_work_nb> References: <1530190036-10105-1-git-send-email-hollow653@gmail.com> <20180718024314.be245cmsgklxuvnk@tkn_work_nb> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Nikita Tatunov Cc: tarantool-patches@freelists.org Hi, See comments below. Don't sure we should return 'no match' situation when a left hand expression in not correct unicode string. How other DBs handle that? WBR, Alexander Turenko. > sql: LIKE & GLOB pattern comparison issue > > Currently function that compares pattern and string for GLOB & LIKE > operators doesn't work properly. It uses ICU reading function which > perhaps was working differently before and the implementation for the > comparison ending isn't paying attention to some special cases, hence > in those cases it works improperly. > It used non-ICU function before, consider commit 198d59ce. > With the patch applied an error will be returned in case there's > invalid UTF-8 symbol in pattern & pattern will not be matched > with the string that contains it. Some minor corrections to function > were made as well. > Nitpicking: 'error will be returned' is third situatuon, other then 'matched' and 'not matched'. > Сloses #3251 > Сloses #3334 > Part of #3572 > > 2. The thing you test is not related to a table and other columns. > > Please, convert the tests to the next format: {[[select '' like > > '_B';]], {1}]]}. > > To make it more readable, you can do it like `like_testcases` in > > `sql-tap/collation.test.lua`. > > > > Done. You are still perform comparisons with table columns. > -/* > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > - * character is exactly one byte in size. Also, provde the Utf8Read() > - * macro for fast reading of the next character in the common case where > - * the next character is ASCII. > +/** > + * Providing there are symbols in string s this > + * macro returns UTF-8 code of character and > + * promotes pointer to the next symbol in the string. > + * Otherwise return code is SQL_END_OF_STRING. > */ > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ?\ > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) I suggest to add a whitespace before backslash to make it looks better. > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = { '%', > '_', 0, 0 }; > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_PROHIBITED_PATTERN 3 Update comment before that has a reference to 'patternMatch' function. > * > * Globbing rules: > * > - * '*' Matches any sequence of zero or more characters. > + * '*' Matches any sequence of zero or more characters. Why? > * > - * '?' Matches exactly one character. > + * '?' Matches exactly one character. > * > - * [...] Matches one character from the enclosed list of > - * characters. > + * [...] Matches one character from the enclosed list of > + * characters. > * > - * [^...] Matches one character not in the enclosed list. > + * [^...] Matches one character not in the enclosed list. > * The same question. > + * @retval SQLITE_MATCH: Match. > + * SQLITE_NOMATCH: No match. > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > + * or % wildcards. > + * SQL_PROHIBITED_PATTERN Pattern contains invalid > + * symbol. Nitpicking: no colon after SQL_PROHIBITED_PATTERN. > */ > static int > -patternCompare(const char * pattern, /* The glob pattern */ > - const char * string, /* The string to compare against the glob */ > - const struct compareInfo *pInfo, /* Information about how to do > the compare */ > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ > - ) > +sql_utf8_pattern_compare(const char * pattern, > + const char * string, > + const struct compareInfo *pInfo, > + UChar32 matchOther) Don't get why indent is tabulation + 7 spaces. The function itself looks good except lines longer then 80 columns. > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob > pattern */ > int > sqlite3_strglob(const char *zGlobPattern, const char *zString) > { > - return patternCompare(zGlobPattern, zString, &globInfo, > - '['); > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > } > > /* > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char > *zString) > int > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) > { > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > } > Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for these two functions? > - {"LIKE", "like"}, > - {"GLOB", "glob"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +-- type restrictions for LIKE. > +-- {"LIKE", "like"}, > +-- {"GLOB", "glob"}, I would add related issue number. > {"AND", "and"}, > {"OR", "or"}, > {"MATCH", "match"}, > @@ -96,7 +98,7 @@ operations = { > {"+", "-"}, > {"<<", ">>", "&", "|"}, > {"<", "<=", ">", ">="}, > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, Leave comment before MATCH and REGEXP to avoid surprises after removing your comment. > --- /dev/null > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua > @@ -0,0 +1,238 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(106) > + > +local prefix = "like-test-" > + > +-- Unciode byte sequences. > + 1. Typo: unicode. 2. Extra empty line. > +-- Invalid testcases. > + > +for i, tested_string in ipairs(invalid_testcases) do > + local test_name = prefix .. "2." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + > +-- We should raise an error if pattern contains invalid characters. > + 1. Trailing whitespaces. 2. Why comments are not indented as the other code? 3. test_name and test_itself before the section comment looks strange for me here.