From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id C749F26E84 for ; Wed, 18 Jul 2018 11:24:57 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id DVNt7YWEg03v for ; Wed, 18 Jul 2018 11:24:57 -0400 (EDT) Received: from mail-lj1-f194.google.com (mail-lj1-f194.google.com [209.85.208.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 3C288266F5 for ; Wed, 18 Jul 2018 11:24:57 -0400 (EDT) Received: by mail-lj1-f194.google.com with SMTP id q127-v6so4447411ljq.11 for ; Wed, 18 Jul 2018 08:24:57 -0700 (PDT) MIME-Version: 1.0 References: <1530190036-10105-1-git-send-email-hollow653@gmail.com> <20180718024314.be245cmsgklxuvnk@tkn_work_nb> In-Reply-To: <20180718024314.be245cmsgklxuvnk@tkn_work_nb> From: Nikita Tatunov Date: Wed, 18 Jul 2018 18:24:44 +0300 Message-ID: Subject: [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue Content-Type: multipart/alternative; boundary="000000000000364e55057147aa5d" Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: alexander.turenko@tarantool.org Cc: avkhatskevich@tarantool.org, tarantool-patches@freelists.org --000000000000364e55057147aa5d Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable =D1=81=D1=80, 18 =D0=B8=D1=8E=D0=BB. 2018 =D0=B3. =D0=B2 5:43, Alexander Tu= renko < alexander.turenko@tarantool.org>: > i Nikita! > > I don't have objections against this patch in general, just several > minor comments and side notes. > > Alexey, as I see you are familiar with the code. Can you give a review > for the patch? > > WBR, Alexander Turenko. > > On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote: > > Currently function that compares pattern and > > string for GLOB & LIKE operators doesn't work properly. > > It uses ICU reading function which perhaps was working > > differently before and the implementation for the > > comparison ending isn't paying attention to some > > special cases, hence in those cases it works improperly. > > Now the checks for comparison should work fine. > > > > =D0=A1loses: #3251 > > =D0=A1loses: #3334 > > Use 'Closes #xxxx' form. Don't sure the form with a colon will work. > > Fixed it. > Utf8Read macro comment now completely irrelevant to its definition. > Compare to sqlite3: > > 584 /* > 585 ** For LIKE and GLOB matching on EBCDIC machines, assume that every > 586 ** character is exactly one byte in size. Also, provde the Utf8Read= () > 587 ** macro for fast reading of the next character in the common case > where > 588 ** the next character is ASCII. > 589 */ > 590 #if defined(SQLITE_EBCDIC) > 591 # define sqlite3Utf8Read(A) (*((*A)++)) > 592 # define Utf8Read(A) (*(A++)) > 593 #else > 594 # define Utf8Read(A) > (A[0]<0x80?*(A++):sqlite3Utf8Read(&A)) > 595 #endif > > Yeah, you're right. Fixed it. > > --- > > src/box/sql/func.c | 25 ++++---- > > test/sql-tap/like1.test.lua | 152 > ++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 165 insertions(+), 12 deletions(-) > > create mode 100755 test/sql-tap/like1.test.lua > > > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > > index c06e3bd..dcbd7e0 100644 > > --- a/src/box/sql/func.c > > +++ b/src/box/sql/func.c > > @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt =3D { '= %', > '_', 0, 0 }; > > #define SQLITE_MATCH 0 > > #define SQLITE_NOMATCH 1 > > #define SQLITE_NOWILDCARDMATCH 2 > > +#define SQL_NO_SYMBOLS_LEFT 65535 > > > > 0xffff looks better. > Isn't actual with the fix. > > I would use name like ICU_ERROR. By the way, it can mean not only end of > a string, but also other errors. It is okay to give 'not matched' in the > case, I think. > > Hmm, we can misinterpret some error and give 'matched' result, I > guess... maybe we really should check start < end in the macro to > distinguish this cases. Don't sure about the test case. As I see from > ICU source code it can be some internal buffer overflow error. We can do > the check like so: > > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ? ucnv_getNextUChar(pUtf8conv, &(s), > (e), \ > + &(status)) : 0) > > Yes, nice idea, I wouldn't guess it, thank you for advice. Therefore changed SQL_NO_SYMBOLS_LEFT to SQL_END_OF_STRING, I think it is more understandable. -#define SQL_NO_SYMBOLS_LEFT 65535 > +#define SQL_NO_SYMBOLS_LEFT 0 > /* > > * Compare two UTF-8 strings for equality where the first string is > > @@ -698,29 +699,28 @@ patternCompare(const char * pattern, /* The > glob pattern */ > > const char * string_end =3D string + strlen(string); > > UErrorCode status =3D U_ZERO_ERROR; > > > > - while (pattern < pattern_end){ > > - c =3D Utf8Read(pattern, pattern_end); > > + while ((c =3D Utf8Read(pattern, pattern_end)) !=3D > SQL_NO_SYMBOLS_LEFT) { > > Here and everywhere below we lean on the fact that ucnv_getNextUChar > compares pointers and that is so: > > 1860 s=3D*source; > 1861 if(sourceLimit 1862 *err=3DU_ILLEGAL_ARGUMENT_ERROR; > 1863 return 0xffff; > 1864 } > > By the way, the patch reverts partially implemented approach to > preliminary check start < end introduced in 198d59ce93. I think it is > okay too, because some checks were missed and now they exist again. > > > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua > > new file mode 100755 > > index 0000000..42b4d43 > > --- /dev/null > > +++ b/test/sql-tap/like1.test.lua > > @@ -0,0 +1,152 @@ > > +#!/usr/bin/env tarantool > > +test =3D require("sqltester") > > +test:plan(13) > > + > > Maybe case with % at the end? I tried, that works, but it would be good > to have a regression test. > Yeah, sure. --000000000000364e55057147aa5d Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=D1=81= =D1=80, 18 =D0=B8=D1=8E=D0=BB. 2018 =D0=B3. =D0=B2 5:43, Alexander Turenko = <alexander.turenko@ta= rantool.org>:
i Nikita!

I don't have objections against this patch in general, just several
minor comments and side notes.

Alexey, as I see you are familiar with the code. Can you give a review
for the patch?

WBR, Alexander Turenko.

On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote:
> Currently function that compares pattern and
> string for GLOB & LIKE operators doesn't work properly.
> It uses ICU reading function which perhaps was working
> differently before and the implementation for the
> comparison ending isn't paying attention to some
> special cases, hence in those cases it works improperly.
> Now the checks for comparison should work fine.
>
> =D0=A1loses: #3251
> =D0=A1loses: #3334

Use 'Closes #xxxx' form. Don't sure the form with a colon will = work.


Fixed it.
=C2=A0
Utf8Read macro comment now completely irrelevant to its definition.
Compare to sqlite3:

=C2=A0584 /*
=C2=A0585 ** For LIKE and GLOB matching on EBCDIC machines, assume that eve= ry
=C2=A0586 ** character is exactly one byte in size.=C2=A0 Also, provde the = Utf8Read()
=C2=A0587 ** macro for fast reading of the next character in the common cas= e where
=C2=A0588 ** the next character is ASCII.
=C2=A0589 */
=C2=A0590 #if defined(SQLITE_EBCDIC)
=C2=A0591 # define sqlite3Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 (*((*A)++)= )
=C2=A0592 # define Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0(*(A++))
=C2=A0593 #else
=C2=A0594 # define Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0(A[0]<0x80?*(A++):sqlite3Utf8Read(&A))
=C2=A0595 #endif


Yeah, you're right. Fixed it.
=C2=A0
> ---
>=C2=A0 src/box/sql/func.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 25 = ++++----
>=C2=A0 test/sql-tap/like1.test.lua | 152 ++++++++++++++++++++++++++++++= ++++++++++++++
>=C2=A0 2 files changed, 165 insertions(+), 12 deletions(-)
>=C2=A0 create mode 100755 test/sql-tap/like1.test.lua
>
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index c06e3bd..dcbd7e0 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt =3D { = '%', '_', 0, 0 };
>=C2=A0 #define SQLITE_MATCH=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A00
>=C2=A0 #define SQLITE_NOMATCH=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A01=
>=C2=A0 #define SQLITE_NOWILDCARDMATCH=C2=A0 =C2=A02
> +#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 65535
>=C2=A0

0xffff looks better.

Isn't actual w= ith the fix.
=C2=A0

I would use name like ICU_ERROR. By the way, it can mean not only end of a string, but also other errors. It is okay to give 'not matched' i= n the
case, I think.

Hmm, we can misinterpret some error and give 'matched' result, I guess... maybe we really should check start < end in the macro to
distinguish this cases. Don't sure about the test case. As I see from ICU source code it can be some internal buffer overflow error. We can do the check like so:

-#define Utf8Read(s, e)=C2=A0 =C2=A0 ucnv_getNextUChar(pUtf8conv, &s, e= , &status)
+#define Utf8Read(s, e) (((s) < (e)) ? ucnv_getNextUChar(pUtf8conv, &= ;(s), (e), \
+=C2=A0 =C2=A0 =C2=A0 =C2=A0&(status)) : 0)


Yes, nice idea, I wouldn't guess= it, thank you for advice.
Therefore changed SQL_NO_SYMBOLS_LEFT to SQL_END_OF_STRI= NG,
I think it is mo= re understandable.
=C2=A0

-#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 65535
+#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 0
>=C2=A0 /*
>=C2=A0 =C2=A0* Compare two UTF-8 strings for equality where the first s= tring is
> @@ -698,29 +699,28 @@ patternCompare(const char * pattern,=C2=A0 =C2= =A0 /* The glob pattern */
>=C2=A0 =C2=A0 =C2=A0 =C2=A0const char * string_end =3D string + strlen(= string);
>=C2=A0 =C2=A0 =C2=A0 =C2=A0UErrorCode status =3D U_ZERO_ERROR;
>=C2=A0
> -=C2=A0 =C2=A0 =C2=A0while (pattern < pattern_end){
> -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0c =3D Utf8Read(patter= n, pattern_end);
> +=C2=A0 =C2=A0 =C2=A0while ((c =3D Utf8Read(pattern, pattern_end)) != =3D SQL_NO_SYMBOLS_LEFT) {

Here and everywhere below we lean on the fact that ucnv_getNextUChar
compares pointers and that is so:

1860=C2=A0 =C2=A0 =C2=A0s=3D*source;
1861=C2=A0 =C2=A0 =C2=A0if(sourceLimit<s) {
1862=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*err=3DU_ILLEGAL_ARGUMENT_ERROR;
1863=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return 0xffff;
1864=C2=A0 =C2=A0 =C2=A0}

By the way, the patch reverts partially implemented approach to
preliminary check start < end introduced in 198d59ce93. I think it is okay too, because some checks were missed and now they exist again.

> diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua=
> new file mode 100755
> index 0000000..42b4d43
> --- /dev/null
> +++ b/test/sql-tap/like1.test.lua
> @@ -0,0 +1,152 @@
> +#!/usr/bin/env tarantool
> +test =3D require("sqltester")
> +test:plan(13)
> +

Maybe case with % at the end? I tried, that works, but it would be good
to have a regression test.

Yeah, sure.= =C2=A0
--000000000000364e55057147aa5d--