From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id BAA4021B81 for ; Fri, 27 Apr 2018 20:56:45 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id IvBw2fbj7Pow for ; Fri, 27 Apr 2018 20:56:45 -0400 (EDT) Received: from smtp37.i.mail.ru (smtp37.i.mail.ru [94.100.177.97]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 2886921B4B for ; Fri, 27 Apr 2018 20:56:44 -0400 (EDT) Date: Sat, 28 Apr 2018 03:56:53 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH 1/7] lua: expose ICU upper/lower functions to Lua Message-ID: <20180428005653.5ssaax3llh6xzlw4@tkn_work_nb> References: <4964845f82fc37f46f28b1713adf4527c219cb0d.1524698920.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <4964845f82fc37f46f28b1713adf4527c219cb0d.1524698920.git.v.shpilevoy@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Vladislav Shpilevoy Cc: kostja@tarantool.org, tarantool-patches@freelists.org Hi Vlad! Please, consider comments below. WBR, Alexander Turenko. On Thu, Apr 26, 2018 at 02:29:01AM +0300, Vladislav Shpilevoy wrote: > Lua can not work with unicode - in Lua it is enterpreted as a > binary. On such string built-in upper/lower functions do not > work. But Tarantool links with ICU that can solve the problem. > Lets expose ICU upper/lower function into Lua to enable correct > case transformations. > > Closes #3290 > --- > src/lua/init.c | 2 +- > src/lua/string.lua | 140 +++++++++++++++++++++++++++++++++++++++++++ > test/app-tap/string.test.lua | 34 ++++++++++- > 3 files changed, 174 insertions(+), 2 deletions(-) > > <...> > diff --git a/src/lua/string.lua b/src/lua/string.lua > index 5ff64c9f6..1c7226143 100644 > --- a/src/lua/string.lua > +++ b/src/lua/string.lua > <...> > +-- > +-- Find cached UCaseMap for @a locale, or create a new one and > +-- cache it. > +-- @param locale String locale or box.NULL for default. > +-- @retval nil Can neither get or create a UCaseMap. > +-- @retval not nil Needed UCaseMap. > +-- There are two special values for a locale: NULL is a 'default locale' means system-wide one (LANG=...) and "" (empty string) means 'root' locale that is some kind of compromise for language/country neural applications. Two things should be considered here: 1. Want we use system-default locale in tarantool by default? A system locale which is dependent of environment and can be different on a developer system and on a staging/production system. Tarantool is not end-user application, but some kind of system software. 2. We should desribe both special values in doxygen-style comment (and, then, in documentation) and explicitly state which one is the default (and why). Side note: it seems that box.NULL will be returned in case or an error, but not nil. > +local function ucasemap_retrieve(locale) > + local ret = ucasemap_cache[locale] > + if not ret then > + ret = ffi.C.ucasemap_open(c_char_ptr(locale), 0, errcode) > + if ret ~= nil then > + ffi.gc(ret, ffi.C.ucasemap_close) > + ucasemap_cache[locale] = ret > + end > + end > + return ret > +end Root locale will be returned in case of non-existent locale (please note that it can differs from 'default', i.e. system locale). Proposed to at least document that behaviour. But maybe it would be better to give explicit error in the case. > +-- > +-- Check ICU options for string.u_upper/u_lower. > +-- @param opts Options. Can contain only one option - locale. > +-- @param usage_err What to throw if opts types are violated. > +-- @retval String locale if found. > +-- @retval box.NULL if locale is not found. > +-- Proposed wording: > box.NULL if 'locale' field is not found. To clarify that no lookup for available locales is performed. > +local function icu_check_case_opts(opts, usage_err) > <...> > diff --git a/test/app-tap/string.test.lua b/test/app-tap/string.test.lua > index 852a7923c..004e149e9 100755 > --- a/test/app-tap/string.test.lua > +++ b/test/app-tap/string.test.lua > <...> > +test:test("unicode", function(test) > + test:plan(12) > + local str = 'хеЛлоу вОрЛд ё Ё я Я э Э ъ Ъ hElLo WorLd 1234 i I İ 勺#☢༺' Your implementation correctly handle the empty string. Proposed to add corresponsing test case. > + local upper_res = 'ХЕЛЛОУ ВОРЛД Ё Ё Я Я Э Э Ъ Ъ HELLO WORLD 1234 I I İ 勺#☢༺' > + local upper_turkish = 'ХЕЛЛОУ ВОРЛД Ё Ё Я Я Э Э Ъ Ъ HELLO WORLD 1234 İ I İ 勺#☢༺' > + local lower_res = 'хеллоу ворлд ё ё я я э э ъ ъ hello world 1234 i i i̇ 勺#☢༺' > + local lower_turkish = 'хеллоу ворлд ё ё я я э э ъ ъ hello world 1234 i ı i 勺#☢༺' > + local s = string.u_upper(str) > + test:is(s, upper_res, 'default locale upper') The test is system-locale dependent (you can try it like `LANG=tr_TR ./src/tarantool ./test/app-tap/string.test.lua`). Proposed to set LANG explicitly in the test.