From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 00CD12C5E5
	for <tarantool-patches@freelists.org>; Mon, 29 Oct 2018 09:01:25 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id rfseDN5LFxgW for <tarantool-patches@freelists.org>;
	Mon, 29 Oct 2018 09:01:24 -0400 (EDT)
Received: from smtp55.i.mail.ru (smtp55.i.mail.ru [217.69.128.35])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id B28672BA55
	for <tarantool-patches@freelists.org>; Mon, 29 Oct 2018 09:01:24 -0400 (EDT)
Date: Mon, 29 Oct 2018 16:01:23 +0300
From: Alexander Turenko <alexander.turenko@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern
 comparison issue
Message-ID: <20181029130123.f254chdxxuwi6c4w@tkn_work_nb>
References: <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb>
 <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org>
 <87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org>
 <6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org>
 <58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org>
 <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org>
 <45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org>
 <BE06C955-2ED3-42D7-87B8-93F138448E3D@tarantool.org>
 <20181021035140.avx6d3rokx5ta6hi@tkn_work_nb>
 <6740948F-6C40-4C0F-B237-7C3573225FBC@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <6740948F-6C40-4C0F-B237-7C3573225FBC@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: Nikita Tatunov <n.tatunov@tarantool.org>
Cc: tarantool-patches@freelists.org, korablev@tarantool.org

The patch is okay except one note re test case.

WBR, Alexander Turenko.

> --- EVIDENCE-OF: R-39414-35489 The infix GLOB operator is implemented by
> --- calling the function glob(Y,X) and can be modified by overriding that
> --- function.

This test case was removed, while we have not a similar one for LIKE.

> >>>    if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
> >>>        *err=U_ILLEGAL_ARGUMENT_ERROR;
> >>>        return 0xffff;
> >> 
> >> 4. I’m not sure if string data can be this long in our context. 
> >>   (string length > (size_t) 0x7ffffffff)
> > 
> > Note: not 0x7ffffffff, but 0x7fffffff.
> > 
> > This limit seems to be some weird internal thing related to using
> > ucnv_getNextUChar inside libicu.
> > 
> > I propose to lie libicu about the buffer size in case when it exceeds
> > this limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can
> > pass the following instead of pattern_end:
> > 
> > ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end
> > 
> > I think this trick need to be covered with a unit test (because it is unclear
> > how to create a string of size >1GiB from lua). Don't sure whether it is
> > okay to allocate such amount of memory in the test, though...
> > 
> > Please, don't do that within this patch, because it is about the another bug.
> > File an issue with all needed information instead (you can provide a link to
> > this message for example).
> 
> Ok, thank you for advice. I think that’s a good idea, but there’s one thing
> I’m getting concerned about: it will cause a lot of operations especially
> in case we’re using LIKE for scanning a lot of data (). I guess even if it’s
> relevant it’s a discussion inside of an issue that’s going to be filed.

Filed https://github.com/tarantool/tarantool/issues/3773

> >>>            } else if(U_SUCCESS(*err) && c>=0) {
> >>>                return c;
> >> 
> >> 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error).
> >> 
> >> So if I’m not mistaken we will get results in our function either from
> >> ‘return’ number 5 or number 6 and the following code will not be executed.
> > 
> > It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or
> > U_TRUNCATED_CHAR_FOUND error.
> > 
> > To be honest I don't want to continue. It seems we should not lean on
> > the fact that 0xffff always means end of the buffer, because it does not
> > guaranteed by the API and is not clear from the code.
> > 
> > AFAIR, the problem was to choose appropriate symbol to mark end of the
> > buffer situation and distinguish it from a real error. It seems we have
> > not one. So we should fairly (and always) check for the buffer before a
> > call to ucnv_getNextUChar() or check the status it provide after the
> > call. I would prefer to check it in our code. It seems that it is how
> > the API works.
> > 
> > I propose to use the same code pattern for all Utf8Read calls, e.g.:
> > 
> > if (pattern < pattern_end)
> > 	c = Utf8Read(pattern, pattern_end)
> > else
> > 	return SQL_...;
> > if (c == SQL_INVALID_UTF8_SYMBOL)
> > 	return SQL_...;
> > assert(U_SUCCESS(status));
> > 
> > Note: I have added the assert, because it is not clear what we can do
> > with, say, U_INVALID_TABLE_FORMAT (improper libicu build /
> > installation). Hope Nikita P. suggests right way, but now I think we
> > should at least assert on that.
> > 
> > It seems the code above can be even wrapped into a macro that will get
> > two pointers (pattern and pattern_end / string and string_end) and two
> > SQL_...  error code to handle two possible errors. Yep, it is generally
> > discouraged to return from a macro, but if it'll greatly improves the
> > code readability, so it is appropriate, I think. Just define the macro
> > right before the function and undefne it after to show a reader it is
> > some pure internal thing.
> > 
> > Note: If you will going that way, don't wrap Utf8Read macro into another
> > macro. Use one with ucnv_getNextUChar call.
> > 
> > It is refactoring of the code and our of the scope of your issue.
> > Please, file an issue and link this message into it (but please ask
> > Nikita P. opinion before).
> > 
> > It is not good IMHO, but it seems now it worth to leave the code with
> > assumption 0xffff is the end of buffer. This is kind of splitting the
> > problem into parts and allow us to proceed with this patch re parsing
> > bug.

Filed https://github.com/tarantool/tarantool/issues/3774