From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 8886270152; Thu, 2 Dec 2021 15:12:54 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 8886270152 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1638447174; bh=s2vh2vBjJtV3Xhb8LcHGo2jfVbVua4C4mMQhyosyVxw=; h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=n82ZrKN72TPdVK22ipzx0OwzqAfcDUMQ0dSJqPSJYy+OLwlZztA3u0Vyf7Rm0S6Zb 91+MVEkM+GX4wagDmhfKUCdWq1rx8YE+cJKqvHFHKXtNTPtQIaaRjh3dCE3XHxFgHp H4O9IGURJHF/Na+V9XmPJIUPLSD2jVhi7ue+2HIo= Received: from smtpng1.i.mail.ru (smtpng1.i.mail.ru [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id F342C70152 for ; Thu, 2 Dec 2021 15:12:52 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org F342C70152 Received: by smtpng1.m.smailru.net with esmtpa (envelope-from ) id 1mskxI-0005qh-Bj; Thu, 02 Dec 2021 15:12:52 +0300 Date: Thu, 2 Dec 2021 15:12:50 +0300 To: Lord , tarantool-patches@dev.tarantool.org Message-ID: <20211202121250.GA34471@tarantool.org> References: <20211114091310.4147-1-lord.nemo@protonmail.com> <20211202120657.GB21659@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20211202120657.GB21659@tarantool.org> X-7564579A: EEAE043A70213CC8 X-77F55803: 4F1203BC0FB41BD93822B471089FF64DA800EFA0B45871060D552154ECD16BE5182A05F538085040EE5BCA64F8C25DC995C6652731F76C3B1D7FBC57567566B3A11924105D8999B1 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7BC08626EA5717D14EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F790063788758EA7442DD2858638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D8A00F479C94413A6227C4E01561652B7B117882F4460429724CE54428C33FAD305F5C1EE8F4F765FCF1175FABE1C0F9B6A471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352033AC447995A7AD18CB629EEF1311BF91D2E47CDBA5A96583BA9C0B312567BB231DD303D21008E29813377AFFFEAFD269A417C69337E82CC2E827F84554CEF50127C277FBC8AE2E8BA83251EDC214901ED5E8D9A59859A8B6B1CFA6D474D4A6A4089D37D7C0E48F6C5571747095F342E88FB05168BE4CE3AF X-C1DE0DAB: 0D63561A33F958A5E7F031CDE8FE72F4BF53EAD00A91BDF7A10A51C40D6CB210D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA759F66ED85EB5F25FD410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D343D1F112031EF3D62B93BC9E7D57B7288FBD9BE3D22069925623CA1AA31A3113EB5E7DD2B978204CC1D7E09C32AA3244C2EB699D4CEB368B14B114A630A4E13111E098CBE561D6343729B2BEF169E0186 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojMMFVqzO9sR3LWDLpH4qSeg== X-Mailru-Sender: 689FA8AB762F7393C37E3C1AEC41BA5D7A6E29D9519DC823D26E4D7AF47EB56283D72C36FC87018B9F80AB2734326CD2FB559BB5D741EB96352A0ABBE4FDA4210A04DAD6CC59E3365FEEDEB644C299C0ED14614B50AE0675 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH] sql: Replace naive position alg with KMP X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Mergen Imeev via Tarantool-patches Reply-To: Mergen Imeev Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" Also, I forgot to add: please add me(imeevma@tarantool.org) to the copy when you submit the SQL patch. On Thu, Dec 02, 2021 at 03:06:57PM +0300, Mergen Imeev via Tarantool-patches wrote: > Thank you for the patch! See 13 comments below. > > 1. Please rebase this patch to current tarantool/master. I think it will be > easier to rebase if you fix 3rd and 4th comments. > > 2. The branch name must start with your ID, for example all my branches are > named as 'imeevma/gh-...' > > 3. The main idea behind this issue was to add a function that takes two strings > and a collation and searches one string within the other using the collation. > This function should be used in POSITION() and REPLACE(). Both functions should > see this new function as a black box. Could you do this? > > 4. Overall, I find your implementation to be too complex. Could you make it > simpler? I think it will become simpler if you use coll->cmp() and U8_NEXT(). > Also, how is string comparison different from varbinary string comparison if > coll == NULL? Do you need to make two functions instead of one? > > On Sun, Nov 14, 2021 at 12:13:10PM +0300, Lord via Tarantool-patches wrote: > > From: Gleb > > > > Replaces positioning logic in POSITION() and REPLACE() SQL functions > > with Knuth-Morris-Pratt algorithm by adding two different versions > > of the latter for STRING and VARBINARY arguments and corresponding > > prefix table generators. > > > > Improved asymptotics of positioning logic to: > > time O(haystack_len + needle_len), mem O(needle_len) > > As one of collating types is a Unicode string, > > ICU UCharIterators had to be used and thus Unicode > > letters are compared as UChar32 values and normalization > > is not done ('a' and 'รค' are treated as different) > > > 5. Why length of lines here is less then above? > > > > > Closes tarantool/tarantool#6492 > 6. Please use just `Closes #6492`. > > > --- > > src/box/sql/func.c | 402 ++++++++++++++++++++++++++------- > > 1 file changed, 321 insertions(+), 90 deletions(-) > > > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > > index 29d713fd0..badfbc176 100644 > > --- a/src/box/sql/func.c > > +++ b/src/box/sql/func.c > > @@ -291,6 +291,144 @@ absFunc(sql_context * context, int argc, sql_value ** argv) > > } > > } > > > > +/** > > + * Implementation of prefix table generator > 7. Trailing whitespaces here and below. > > > + * prefix_table_gen() is a part of Knuth-Morris-Pratt algorithm > > + * that is used in position() and replaceFunc() to find > > + * instances of `needle` in `haystack` bin data > > + */ > 8. In my opinion, you shouldn't make too short lines in comments. Please make it > longer, but no more than 80 characters. > > > +inline static void > > +prefix_table_gen_bin(const unsigned char *needle, > > + const size_t n_needle_bytes, > > + size_t *table) > 9. Please use our code style here and below. See find_compatible() for example. > > > +{ > > + size_t i = 1, j = 0; > > + table[0] = 0; > > + while (i < n_needle_bytes) { > > + if (needle[i] != needle[j]) { > > + if (j == 0) { > > + table[i] = 0; > > + ++i; > > + } else { > > + j = table[j - 1]; > > + } > > + } else { > > + table[i] = j + 1; > > + ++i; > > + ++j; > > + } > > + } > > +} > > + > > +/** > > + * Implementation of prefix table generator using UCharIterators > > + * prefix_table_gen() is a part of Knuth-Morris-Pratt algorithm > > + * that is used in position() to find instances of `needle` > > + * in `haystack` unicode string > > + */ > > +inline static void > > +prefix_table_gen_ustr(const UCharIterator *needle_begin, > > + const unsigned char *needle_str, > > + const size_t n_needle_bytes, > > + const size_t n_needle_chars, > > + size_t table[]) > > +{ > > + table[0] = 0; > > + UCharIterator needle_iter; > > + UCharIterator second_iter; > > + uiter_setUTF8(&needle_iter, needle_str, n_needle_bytes); > > + uiter_setUTF8(&second_iter, needle_str, n_needle_bytes); > > + needle_iter.next(&needle_iter); > > + while (needle_iter.getIndex(&needle_iter, UITER_CURRENT) < n_needle_chars) { > > + if (needle_iter.current(&needle_iter) != second_iter.current(&second_iter)) { > > + if (second_iter.getIndex(&second_iter, UITER_CURRENT) == 0) { > > + table[needle_iter.getIndex(&needle_iter, UITER_CURRENT)] = 0; > > + needle_iter.next(&needle_iter); > > + } else { > > + second_iter.move(&second_iter, table[second_iter.getIndex(&second_iter, UITER_CURRENT) - 1], UITER_START); > 10. Length of lines should be no more than 80. Here and below. > > > @@ -1280,24 +1447,31 @@ zeroblobFunc(sql_context * context, int argc, sql_value ** argv) > > } > > > > /* > > - * The replace() function. Three arguments are all strings: call > > + * The replace() function. Three arguments are all strings: call > > * them A, B, and C. The result is also a string which is derived > > * from A by replacing every occurrence of B with C. The match > > * must be exact. Collating sequences are not used. > > */ > > static void > > -replaceFunc(sql_context * context, int argc, sql_value ** argv) > > +replaceFunc(sql_context *context, int argc, sql_value **argv) > > { > > - const unsigned char *zStr; /* The input string A */ > > - const unsigned char *zPattern; /* The pattern string B */ > > - const unsigned char *zRep; /* The replacement string C */ > > - unsigned char *zOut; /* The output */ > > - int nStr; /* Size of zStr */ > > - int nPattern; /* Size of zPattern */ > > - int nRep; /* Size of zRep */ > > - i64 nOut; /* Maximum size of zOut */ > > - int loopLimit; /* Last zStr[] that might match zPattern[] */ > > - int i, j; /* Loop counters */ > > + const unsigned char *zStr; /* The input string A */ > 11. Do you really need to change this line? Please discard the refactoring > changes from this patch. You can make another patch that only contains the > refactoring changes. Also, if you plan to do such a patch, use our style - we do > not write comments on the same line as the variable definition. Moreover, we > usually define a variable only immediately before using it, and not at the > beginning of the function. > > > + const unsigned char *zPattern; /* The pattern string B */ > > + const unsigned char *zRep; /* The replacement string C */ > > + unsigned char *zOut; /* The output */ > > + int nStr; /* Size of zStr */ > > + int nPattern; /* Size of zPattern */ > > + int nRep; /* Size of zRep */ > > + i64 nOut; /* Maximum size of zOut */ > > + int loopLimit; /* Last zStr[] that might match zPattern[] */ > > + int i, j; /* Loop counters */ > > + int nCnt = 0; /* counter for number of Pattern in Str */ > > + int iStr = 0; /* zStr iterator for finding Patter with KMP */ > > + int nCopiedOut = 0; /* zOut iterator for already copied mem */ > > + int nCopiedStr = 0; /* zStr iterator for already copied mem */ > > + int nPatternPos = 0; /* zStr iterator for found Pattern position */ > > + size_t *prefix_table = NULL; /* prefix table for kmp alg */ > > + > > > > assert(argc == 3); > > UNUSED_PARAMETER(argc); > > @@ -1309,7 +1483,7 @@ replaceFunc(sql_context * context, int argc, sql_value ** argv) > > zPattern = mem_as_ustr(argv[1]); > > if (zPattern == 0) { > > assert(mem_is_null(argv[1]) > > - || sql_context_db_handle(context)->mallocFailed); > > + || sql_context_db_handle(context)->mallocFailed); > 12. Why did you change this? > > > return; > > } > > nPattern = mem_len_unsafe(argv[1]); > > @@ -1331,40 +1505,89 @@ replaceFunc(sql_context * context, int argc, sql_value ** argv) > > return; > > } > > loopLimit = nStr - nPattern; > > - for (i = j = 0; i <= loopLimit; i++) { > > - if (zStr[i] != zPattern[0] > > - || memcmp(&zStr[i], zPattern, nPattern)) { > > - zOut[j++] = zStr[i]; > > - } else { > > - u8 *zOld; > > - sql *db = sql_context_db_handle(context); > > - nOut += nRep - nPattern; > > - testcase(nOut - 1 == db->aLimit[SQL_LIMIT_LENGTH]); > > - testcase(nOut - 2 == db->aLimit[SQL_LIMIT_LENGTH]); > > - if (nOut - 1 > db->aLimit[SQL_LIMIT_LENGTH]) { > > - diag_set(ClientError, ER_SQL_EXECUTE, "string "\ > > - "or binary string is too big"); > > - context->is_aborted = true; > > - sql_free(zOut); > > - return; > > - } > > - zOld = zOut; > > - zOut = sql_realloc64(zOut, (int)nOut); > > - if (zOut == 0) { > > - context->is_aborted = true; > > - sql_free(zOld); > > - return; > > + > > + if (prefix_table != NULL) { > > + /* > > + * optimized replacement logic using the KMP algorithm > > + * requires additional O(nPattern) memory for prefix table > > + * replacing: time O(n + m) > > + */ > > + > > + /* precalculating prefix table */ > > + prefix_table = calloc(nPattern, sizeof(size_t)); > > + prefix_table_gen_bin(zPattern, nPattern, prefix_table); > > + /* counting the number of Pattern occurrences */ > > + for (nCnt = iStr = nPatternPos = 0;;) { > > + nPatternPos = kmp_bin(zStr, zPattern, nStr, nPattern, iStr, prefix_table); > > + iStr = nPatternPos + nPattern; > > + if (nPatternPos == 0) > > + break; > > + ++nCnt; > > + } > > + /* reallocationg mem for zOut */ > > + nOut += (nRep - nPattern) * nCnt; > > + assert(nOut < SQL_MAX_LENGTH); > > + zOut = contextMalloc(context, (i64) nOut); > > + if (zOut == 0) { > > + return; > > + } > > + /* coping and replacing in-place */ > > + for (iStr = nCopiedStr = nCopiedOut = nPatternPos = 0;;) { > > + nPatternPos = kmp_bin(zStr, zPattern, nStr, nPattern, iStr, prefix_table); > > + iStr = nPatternPos + nPattern; > > + if (nPatternPos == 0) > > + break; > > + assert(nPatternPos >= nCopiedStr); > > + memcpy(zOut + nCopiedOut, zStr + nCopiedStr, nPatternPos - nCopiedStr); > > + nCopiedOut += nPatternPos - nCopiedStr; > > + memcpy(zOut + nCopiedOut, zRep, nRep); > > + nCopiedOut += nRep; > > + nCopiedStr = nPatternPos + nPattern; > > + } > > + memcpy(zOut + nCopiedOut, zStr + nCopiedStr, nStr - nCopiedStr); > > + zOut[nOut - 1] = 0; > > + free(prefix_table); > > + } else { > > + /* > > + * Naive replacing substring algorithm > > + * replacing: time O(n * m), mem O(1) > > + * used if cannot allocate mem for prefix table > > + */ > > + for (i = j = 0; i <= loopLimit; i++) { > > + if (zStr[i] != zPattern[0] > > + || memcmp(&zStr[i], zPattern, nPattern)) { > > + zOut[j++] = zStr[i]; > > + } else { > > + u8 *zOld; > > + sql *db = sql_context_db_handle(context); > > + nOut += nRep - nPattern; > > + testcase(nOut - 1 == db->aLimit[SQL_LIMIT_LENGTH]); > > + testcase(nOut - 2 == db->aLimit[SQL_LIMIT_LENGTH]); > > + if (nOut - 1 > db->aLimit[SQL_LIMIT_LENGTH]) { > 13. This high level of nesting doesn't sound like good code. Please try to avoid > it. > > > + diag_set(ClientError, ER_SQL_EXECUTE, "string "\ > > + "or binary string is too big"); > > + context->is_aborted = true; > > + sql_free(zOut); > > + return; > > + } > > + zOld = zOut; > > + zOut = sql_realloc64(zOut, (int)nOut); > > + if (zOut == 0) { > > + context->is_aborted = true; > > + sql_free(zOld); > > + return; > > + } > > + memcpy(&zOut[j], zRep, nRep); > > + j += nRep; > > + i += nPattern - 1; > > } > > - memcpy(&zOut[j], zRep, nRep); > > - j += nRep; > > - i += nPattern - 1; > > } > > + assert(j + nStr - i + 1 == nOut); > > + memcpy(&zOut[j], &zStr[i], nStr - i); > > + j += nStr - i; > > + assert(j <= nOut); > > + zOut[j] = 0; > > } > > - assert(j + nStr - i + 1 == nOut); > > - memcpy(&zOut[j], &zStr[i], nStr - i); > > - j += nStr - i; > > - assert(j <= nOut); > > - zOut[j] = 0; > > if (context->func->def->returns == FIELD_TYPE_STRING) > > mem_set_str_dynamic(context->pOut, (char *)zOut, j); > > else