From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id C749F26E84
	for <tarantool-patches@freelists.org>; Wed, 18 Jul 2018 11:24:57 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id DVNt7YWEg03v for <tarantool-patches@freelists.org>;
	Wed, 18 Jul 2018 11:24:57 -0400 (EDT)
Received: from mail-lj1-f194.google.com (mail-lj1-f194.google.com [209.85.208.194])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 3C288266F5
	for <tarantool-patches@freelists.org>; Wed, 18 Jul 2018 11:24:57 -0400 (EDT)
Received: by mail-lj1-f194.google.com with SMTP id q127-v6so4447411ljq.11
        for <tarantool-patches@freelists.org>; Wed, 18 Jul 2018 08:24:57 -0700 (PDT)
MIME-Version: 1.0
References: <1530190036-10105-1-git-send-email-hollow653@gmail.com> <20180718024314.be245cmsgklxuvnk@tkn_work_nb>
In-Reply-To: <20180718024314.be245cmsgklxuvnk@tkn_work_nb>
From: Nikita Tatunov <hollow653@gmail.com>
Date: Wed, 18 Jul 2018 18:24:44 +0300
Message-ID: <CAEi+_apXUVK8QBrBDm05JYcNxYfB43R0vvsQ_2LHK5j+tv25tg@mail.gmail.com>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue
Content-Type: multipart/alternative; boundary="000000000000364e55057147aa5d"
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: alexander.turenko@tarantool.org
Cc: avkhatskevich@tarantool.org, tarantool-patches@freelists.org

--000000000000364e55057147aa5d
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

=D1=81=D1=80, 18 =D0=B8=D1=8E=D0=BB. 2018 =D0=B3. =D0=B2 5:43, Alexander Tu=
renko <
alexander.turenko@tarantool.org>:

> i Nikita!
>
> I don't have objections against this patch in general, just several
> minor comments and side notes.
>
> Alexey, as I see you are familiar with the code. Can you give a review
> for the patch?
>
> WBR, Alexander Turenko.
>
> On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote:
> > Currently function that compares pattern and
> > string for GLOB & LIKE operators doesn't work properly.
> > It uses ICU reading function which perhaps was working
> > differently before and the implementation for the
> > comparison ending isn't paying attention to some
> > special cases, hence in those cases it works improperly.
> > Now the checks for comparison should work fine.
> >
> > =D0=A1loses: #3251
> > =D0=A1loses: #3334
>
> Use 'Closes #xxxx' form. Don't sure the form with a colon will work.
>
>
Fixed it.


> Utf8Read macro comment now completely irrelevant to its definition.
> Compare to sqlite3:
>
>  584 /*
>  585 ** For LIKE and GLOB matching on EBCDIC machines, assume that every
>  586 ** character is exactly one byte in size.  Also, provde the Utf8Read=
()
>  587 ** macro for fast reading of the next character in the common case
> where
>  588 ** the next character is ASCII.
>  589 */
>  590 #if defined(SQLITE_EBCDIC)
>  591 # define sqlite3Utf8Read(A)        (*((*A)++))
>  592 # define Utf8Read(A)               (*(A++))
>  593 #else
>  594 # define Utf8Read(A)
>  (A[0]<0x80?*(A++):sqlite3Utf8Read(&A))
>  595 #endif
>
>
Yeah, you're right. Fixed it.


> > ---
> >  src/box/sql/func.c          |  25 ++++----
> >  test/sql-tap/like1.test.lua | 152
> ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 165 insertions(+), 12 deletions(-)
> >  create mode 100755 test/sql-tap/like1.test.lua
> >
> > diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> > index c06e3bd..dcbd7e0 100644
> > --- a/src/box/sql/func.c
> > +++ b/src/box/sql/func.c
> > @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt =3D { '=
%',
> '_', 0, 0 };
> >  #define SQLITE_MATCH             0
> >  #define SQLITE_NOMATCH           1
> >  #define SQLITE_NOWILDCARDMATCH   2
> > +#define SQL_NO_SYMBOLS_LEFT      65535
> >
>
> 0xffff looks better.
>

Isn't actual with the fix.


>
> I would use name like ICU_ERROR. By the way, it can mean not only end of
> a string, but also other errors. It is okay to give 'not matched' in the
> case, I think.
>
> Hmm, we can misinterpret some error and give 'matched' result, I
> guess... maybe we really should check start < end in the macro to
> distinguish this cases. Don't sure about the test case. As I see from
> ICU source code it can be some internal buffer overflow error. We can do
> the check like so:
>
> -#define Utf8Read(s, e)    ucnv_getNextUChar(pUtf8conv, &s, e, &status)
> +#define Utf8Read(s, e) (((s) < (e)) ? ucnv_getNextUChar(pUtf8conv, &(s),
> (e), \
> +       &(status)) : 0)
>
>
Yes, nice idea, I wouldn't guess it, thank you for advice.
Therefore changed SQL_NO_SYMBOLS_LEFT to SQL_END_OF_STRING,
I think it is more understandable.


-#define SQL_NO_SYMBOLS_LEFT      65535
> +#define SQL_NO_SYMBOLS_LEFT      0

>  /*
> >   * Compare two UTF-8 strings for equality where the first string is
> > @@ -698,29 +699,28 @@ patternCompare(const char * pattern,    /* The
> glob pattern */
> >       const char * string_end =3D string + strlen(string);
> >       UErrorCode status =3D U_ZERO_ERROR;
> >
> > -     while (pattern < pattern_end){
> > -             c =3D Utf8Read(pattern, pattern_end);
> > +     while ((c =3D Utf8Read(pattern, pattern_end)) !=3D
> SQL_NO_SYMBOLS_LEFT) {
>
> Here and everywhere below we lean on the fact that ucnv_getNextUChar
> compares pointers and that is so:
>
> 1860     s=3D*source;
> 1861     if(sourceLimit<s) {
> 1862         *err=3DU_ILLEGAL_ARGUMENT_ERROR;
> 1863         return 0xffff;
> 1864     }
>
> By the way, the patch reverts partially implemented approach to
> preliminary check start < end introduced in 198d59ce93. I think it is
> okay too, because some checks were missed and now they exist again.
>
> > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua
> > new file mode 100755
> > index 0000000..42b4d43
> > --- /dev/null
> > +++ b/test/sql-tap/like1.test.lua
> > @@ -0,0 +1,152 @@
> > +#!/usr/bin/env tarantool
> > +test =3D require("sqltester")
> > +test:plan(13)
> > +
>
> Maybe case with % at the end? I tried, that works, but it would be good
> to have a regression test.
>

Yeah, sure.

--000000000000364e55057147aa5d
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><br><div class=3D"gmail_quote"><div dir=3D"ltr">=D1=81=
=D1=80, 18 =D0=B8=D1=8E=D0=BB. 2018 =D0=B3. =D0=B2 5:43, Alexander Turenko =
&lt;<a href=3D"mailto:alexander.turenko@tarantool.org">alexander.turenko@ta=
rantool.org</a>&gt;:<br></div><blockquote class=3D"gmail_quote" style=3D"ma=
rgin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">i Nikita!<br>
<br>
I don&#39;t have objections against this patch in general, just several<br>
minor comments and side notes.<br>
<br>
Alexey, as I see you are familiar with the code. Can you give a review<br>
for the patch?<br>
<br>
WBR, Alexander Turenko.<br>
<br>
On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote:<br>
&gt; Currently function that compares pattern and<br>
&gt; string for GLOB &amp; LIKE operators doesn&#39;t work properly.<br>
&gt; It uses ICU reading function which perhaps was working<br>
&gt; differently before and the implementation for the<br>
&gt; comparison ending isn&#39;t paying attention to some<br>
&gt; special cases, hence in those cases it works improperly.<br>
&gt; Now the checks for comparison should work fine.<br>
&gt; <br>
&gt; =D0=A1loses: #3251<br>
&gt; =D0=A1loses: #3334<br>
<br>
Use &#39;Closes #xxxx&#39; form. Don&#39;t sure the form with a colon will =
work.<br>
<br></blockquote><div><br></div><div>Fixed it.</div><div>=C2=A0</div><block=
quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc=
 solid;padding-left:1ex">
Utf8Read macro comment now completely irrelevant to its definition.<br>
Compare to sqlite3:<br>
<br>
=C2=A0584 /*<br>
=C2=A0585 ** For LIKE and GLOB matching on EBCDIC machines, assume that eve=
ry<br>
=C2=A0586 ** character is exactly one byte in size.=C2=A0 Also, provde the =
Utf8Read()<br>
=C2=A0587 ** macro for fast reading of the next character in the common cas=
e where<br>
=C2=A0588 ** the next character is ASCII.<br>
=C2=A0589 */<br>
=C2=A0590 #if defined(SQLITE_EBCDIC)<br>
=C2=A0591 # define sqlite3Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 (*((*A)++)=
)<br>
=C2=A0592 # define Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0(*(A++))<br>
=C2=A0593 #else<br>
=C2=A0594 # define Utf8Read(A)=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0(A[0]&lt;0x80?*(A++):sqlite3Utf8Read(&amp;A))<br>
=C2=A0595 #endif<br>
<br></blockquote><div><br></div><div>Yeah, you&#39;re right. Fixed it.</div=
><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .=
8ex;border-left:1px #ccc solid;padding-left:1ex">
&gt; ---<br>
&gt;=C2=A0 src/box/sql/func.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 25 =
++++----<br>
&gt;=C2=A0 test/sql-tap/like1.test.lua | 152 ++++++++++++++++++++++++++++++=
++++++++++++++<br>
&gt;=C2=A0 2 files changed, 165 insertions(+), 12 deletions(-)<br>
&gt;=C2=A0 create mode 100755 test/sql-tap/like1.test.lua<br>
&gt; <br>
&gt; diff --git a/src/box/sql/func.c b/src/box/sql/func.c<br>
&gt; index c06e3bd..dcbd7e0 100644<br>
&gt; --- a/src/box/sql/func.c<br>
&gt; +++ b/src/box/sql/func.c<br>
&gt; @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt =3D { =
&#39;%&#39;, &#39;_&#39;, 0, 0 };<br>
&gt;=C2=A0 #define SQLITE_MATCH=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A00<br>
&gt;=C2=A0 #define SQLITE_NOMATCH=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A01=
<br>
&gt;=C2=A0 #define SQLITE_NOWILDCARDMATCH=C2=A0 =C2=A02<br>
&gt; +#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 65535<br>
&gt;=C2=A0 <br>
<br>
0xffff looks better.<br></blockquote><div><br></div><div>Isn&#39;t actual w=
ith the fix.</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=
=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
I would use name like ICU_ERROR. By the way, it can mean not only end of<br=
>
a string, but also other errors. It is okay to give &#39;not matched&#39; i=
n the<br>
case, I think.<br>
<br>
Hmm, we can misinterpret some error and give &#39;matched&#39; result, I<br=
>
guess... maybe we really should check start &lt; end in the macro to<br>
distinguish this cases. Don&#39;t sure about the test case. As I see from<b=
r>
ICU source code it can be some internal buffer overflow error. We can do<br=
>
the check like so:<br>
<br>
-#define Utf8Read(s, e)=C2=A0 =C2=A0 ucnv_getNextUChar(pUtf8conv, &amp;s, e=
, &amp;status)<br>
+#define Utf8Read(s, e) (((s) &lt; (e)) ? ucnv_getNextUChar(pUtf8conv, &amp=
;(s), (e), \<br>
+=C2=A0 =C2=A0 =C2=A0 =C2=A0&amp;(status)) : 0)<br>
<br></blockquote><div><br></div><div><span style=3D"font-size:small;backgro=
und-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-co=
lor:initial;float:none;display:inline">Yes, nice idea, I wouldn&#39;t guess=
 it, thank you for advice.</span></div><div><div style=3D"font-size:small;b=
ackground-color:rgb(255,255,255);text-decoration-style:initial;text-decorat=
ion-color:initial">Therefore changed SQL_NO_SYMBOLS_LEFT to SQL_END_OF_STRI=
NG,</div><div style=3D"font-size:small;background-color:rgb(255,255,255);te=
xt-decoration-style:initial;text-decoration-color:initial">I think it is mo=
re understandable.</div>=C2=A0</div><div><br></div><blockquote class=3D"gma=
il_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-lef=
t:1ex">
-#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 65535<br>
+#define SQL_NO_SYMBOLS_LEFT=C2=A0 =C2=A0 =C2=A0 0</blockquote><blockquote =
class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid=
;padding-left:1ex">
&gt;=C2=A0 /*<br>
&gt;=C2=A0 =C2=A0* Compare two UTF-8 strings for equality where the first s=
tring is<br>
&gt; @@ -698,29 +699,28 @@ patternCompare(const char * pattern,=C2=A0 =C2=
=A0 /* The glob pattern */<br>
&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0const char * string_end =3D string + strlen(=
string);<br>
&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0UErrorCode status =3D U_ZERO_ERROR;<br>
&gt;=C2=A0 <br>
&gt; -=C2=A0 =C2=A0 =C2=A0while (pattern &lt; pattern_end){<br>
&gt; -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0c =3D Utf8Read(patter=
n, pattern_end);<br>
&gt; +=C2=A0 =C2=A0 =C2=A0while ((c =3D Utf8Read(pattern, pattern_end)) !=
=3D SQL_NO_SYMBOLS_LEFT) {<br>
<br>
Here and everywhere below we lean on the fact that ucnv_getNextUChar<br>
compares pointers and that is so:<br>
<br>
1860=C2=A0 =C2=A0 =C2=A0s=3D*source;<br>
1861=C2=A0 =C2=A0 =C2=A0if(sourceLimit&lt;s) {<br>
1862=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*err=3DU_ILLEGAL_ARGUMENT_ERROR;<br>
1863=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return 0xffff;<br>
1864=C2=A0 =C2=A0 =C2=A0}<br>
<br>
By the way, the patch reverts partially implemented approach to<br>
preliminary check start &lt; end introduced in 198d59ce93. I think it is<br=
>
okay too, because some checks were missed and now they exist again.<br>
<br>
&gt; diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua=
<br>
&gt; new file mode 100755<br>
&gt; index 0000000..42b4d43<br>
&gt; --- /dev/null<br>
&gt; +++ b/test/sql-tap/like1.test.lua<br>
&gt; @@ -0,0 +1,152 @@<br>
&gt; +#!/usr/bin/env tarantool<br>
&gt; +test =3D require(&quot;sqltester&quot;)<br>
&gt; +test:plan(13)<br>
&gt; +<br>
<br>
Maybe case with % at the end? I tried, that works, but it would be good<br>
to have a regression test.<br></blockquote><div><br></div><div>Yeah, sure.=
=C2=A0</div></div></div>

--000000000000364e55057147aa5d--