From: Mergen Imeev via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Lord <kashkin.nsaa@gmail.com>, tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH] sql: Replace naive position alg with KMP
Date: Thu, 2 Dec 2021 15:12:50 +0300 [thread overview]
Message-ID: <20211202121250.GA34471@tarantool.org> (raw)
In-Reply-To: <20211202120657.GB21659@tarantool.org>
Also, I forgot to add: please add me(imeevma@tarantool.org) to the copy when you
submit the SQL patch.
On Thu, Dec 02, 2021 at 03:06:57PM +0300, Mergen Imeev via Tarantool-patches wrote:
> Thank you for the patch! See 13 comments below.
>
> 1. Please rebase this patch to current tarantool/master. I think it will be
> easier to rebase if you fix 3rd and 4th comments.
>
> 2. The branch name must start with your ID, for example all my branches are
> named as 'imeevma/gh-...'
>
> 3. The main idea behind this issue was to add a function that takes two strings
> and a collation and searches one string within the other using the collation.
> This function should be used in POSITION() and REPLACE(). Both functions should
> see this new function as a black box. Could you do this?
>
> 4. Overall, I find your implementation to be too complex. Could you make it
> simpler? I think it will become simpler if you use coll->cmp() and U8_NEXT().
> Also, how is string comparison different from varbinary string comparison if
> coll == NULL? Do you need to make two functions instead of one?
>
> On Sun, Nov 14, 2021 at 12:13:10PM +0300, Lord via Tarantool-patches wrote:
> > From: Gleb <kashkin.nsaa@gmail.com>
> >
> > Replaces positioning logic in POSITION() and REPLACE() SQL functions
> > with Knuth-Morris-Pratt algorithm by adding two different versions
> > of the latter for STRING and VARBINARY arguments and corresponding
> > prefix table generators.
> >
> > Improved asymptotics of positioning logic to:
> > time O(haystack_len + needle_len), mem O(needle_len)
> > As one of collating types is a Unicode string,
> > ICU UCharIterators had to be used and thus Unicode
> > letters are compared as UChar32 values and normalization
> > is not done ('a' and 'ä' are treated as different)
> >
> 5. Why length of lines here is less then above?
>
> >
> > Closes tarantool/tarantool#6492
> 6. Please use just `Closes #6492`.
>
> > ---
> > src/box/sql/func.c | 402 ++++++++++++++++++++++++++-------
> > 1 file changed, 321 insertions(+), 90 deletions(-)
> >
> > diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> > index 29d713fd0..badfbc176 100644
> > --- a/src/box/sql/func.c
> > +++ b/src/box/sql/func.c
> > @@ -291,6 +291,144 @@ absFunc(sql_context * context, int argc, sql_value ** argv)
> > }
> > }
> >
> > +/**
> > + * Implementation of prefix table generator
> 7. Trailing whitespaces here and below.
>
> > + * prefix_table_gen() is a part of Knuth-Morris-Pratt algorithm
> > + * that is used in position() and replaceFunc() to find
> > + * instances of `needle` in `haystack` bin data
> > + */
> 8. In my opinion, you shouldn't make too short lines in comments. Please make it
> longer, but no more than 80 characters.
>
> > +inline static void
> > +prefix_table_gen_bin(const unsigned char *needle,
> > + const size_t n_needle_bytes,
> > + size_t *table)
> 9. Please use our code style here and below. See find_compatible() for example.
>
> > +{
> > + size_t i = 1, j = 0;
> > + table[0] = 0;
> > + while (i < n_needle_bytes) {
> > + if (needle[i] != needle[j]) {
> > + if (j == 0) {
> > + table[i] = 0;
> > + ++i;
> > + } else {
> > + j = table[j - 1];
> > + }
> > + } else {
> > + table[i] = j + 1;
> > + ++i;
> > + ++j;
> > + }
> > + }
> > +}
> > +
> > +/**
> > + * Implementation of prefix table generator using UCharIterators
> > + * prefix_table_gen() is a part of Knuth-Morris-Pratt algorithm
> > + * that is used in position() to find instances of `needle`
> > + * in `haystack` unicode string
> > + */
> > +inline static void
> > +prefix_table_gen_ustr(const UCharIterator *needle_begin,
> > + const unsigned char *needle_str,
> > + const size_t n_needle_bytes,
> > + const size_t n_needle_chars,
> > + size_t table[])
> > +{
> > + table[0] = 0;
> > + UCharIterator needle_iter;
> > + UCharIterator second_iter;
> > + uiter_setUTF8(&needle_iter, needle_str, n_needle_bytes);
> > + uiter_setUTF8(&second_iter, needle_str, n_needle_bytes);
> > + needle_iter.next(&needle_iter);
> > + while (needle_iter.getIndex(&needle_iter, UITER_CURRENT) < n_needle_chars) {
> > + if (needle_iter.current(&needle_iter) != second_iter.current(&second_iter)) {
> > + if (second_iter.getIndex(&second_iter, UITER_CURRENT) == 0) {
> > + table[needle_iter.getIndex(&needle_iter, UITER_CURRENT)] = 0;
> > + needle_iter.next(&needle_iter);
> > + } else {
> > + second_iter.move(&second_iter, table[second_iter.getIndex(&second_iter, UITER_CURRENT) - 1], UITER_START);
> 10. Length of lines should be no more than 80. Here and below.
>
> > @@ -1280,24 +1447,31 @@ zeroblobFunc(sql_context * context, int argc, sql_value ** argv)
> > }
> >
> > /*
> > - * The replace() function. Three arguments are all strings: call
> > + * The replace() function. Three arguments are all strings: call
> > * them A, B, and C. The result is also a string which is derived
> > * from A by replacing every occurrence of B with C. The match
> > * must be exact. Collating sequences are not used.
> > */
> > static void
> > -replaceFunc(sql_context * context, int argc, sql_value ** argv)
> > +replaceFunc(sql_context *context, int argc, sql_value **argv)
> > {
> > - const unsigned char *zStr; /* The input string A */
> > - const unsigned char *zPattern; /* The pattern string B */
> > - const unsigned char *zRep; /* The replacement string C */
> > - unsigned char *zOut; /* The output */
> > - int nStr; /* Size of zStr */
> > - int nPattern; /* Size of zPattern */
> > - int nRep; /* Size of zRep */
> > - i64 nOut; /* Maximum size of zOut */
> > - int loopLimit; /* Last zStr[] that might match zPattern[] */
> > - int i, j; /* Loop counters */
> > + const unsigned char *zStr; /* The input string A */
> 11. Do you really need to change this line? Please discard the refactoring
> changes from this patch. You can make another patch that only contains the
> refactoring changes. Also, if you plan to do such a patch, use our style - we do
> not write comments on the same line as the variable definition. Moreover, we
> usually define a variable only immediately before using it, and not at the
> beginning of the function.
>
> > + const unsigned char *zPattern; /* The pattern string B */
> > + const unsigned char *zRep; /* The replacement string C */
> > + unsigned char *zOut; /* The output */
> > + int nStr; /* Size of zStr */
> > + int nPattern; /* Size of zPattern */
> > + int nRep; /* Size of zRep */
> > + i64 nOut; /* Maximum size of zOut */
> > + int loopLimit; /* Last zStr[] that might match zPattern[] */
> > + int i, j; /* Loop counters */
> > + int nCnt = 0; /* counter for number of Pattern in Str */
> > + int iStr = 0; /* zStr iterator for finding Patter with KMP */
> > + int nCopiedOut = 0; /* zOut iterator for already copied mem */
> > + int nCopiedStr = 0; /* zStr iterator for already copied mem */
> > + int nPatternPos = 0; /* zStr iterator for found Pattern position */
> > + size_t *prefix_table = NULL; /* prefix table for kmp alg */
> > +
> >
> > assert(argc == 3);
> > UNUSED_PARAMETER(argc);
> > @@ -1309,7 +1483,7 @@ replaceFunc(sql_context * context, int argc, sql_value ** argv)
> > zPattern = mem_as_ustr(argv[1]);
> > if (zPattern == 0) {
> > assert(mem_is_null(argv[1])
> > - || sql_context_db_handle(context)->mallocFailed);
> > + || sql_context_db_handle(context)->mallocFailed);
> 12. Why did you change this?
>
> > return;
> > }
> > nPattern = mem_len_unsafe(argv[1]);
> > @@ -1331,40 +1505,89 @@ replaceFunc(sql_context * context, int argc, sql_value ** argv)
> > return;
> > }
> > loopLimit = nStr - nPattern;
> > - for (i = j = 0; i <= loopLimit; i++) {
> > - if (zStr[i] != zPattern[0]
> > - || memcmp(&zStr[i], zPattern, nPattern)) {
> > - zOut[j++] = zStr[i];
> > - } else {
> > - u8 *zOld;
> > - sql *db = sql_context_db_handle(context);
> > - nOut += nRep - nPattern;
> > - testcase(nOut - 1 == db->aLimit[SQL_LIMIT_LENGTH]);
> > - testcase(nOut - 2 == db->aLimit[SQL_LIMIT_LENGTH]);
> > - if (nOut - 1 > db->aLimit[SQL_LIMIT_LENGTH]) {
> > - diag_set(ClientError, ER_SQL_EXECUTE, "string "\
> > - "or binary string is too big");
> > - context->is_aborted = true;
> > - sql_free(zOut);
> > - return;
> > - }
> > - zOld = zOut;
> > - zOut = sql_realloc64(zOut, (int)nOut);
> > - if (zOut == 0) {
> > - context->is_aborted = true;
> > - sql_free(zOld);
> > - return;
> > +
> > + if (prefix_table != NULL) {
> > + /*
> > + * optimized replacement logic using the KMP algorithm
> > + * requires additional O(nPattern) memory for prefix table
> > + * replacing: time O(n + m)
> > + */
> > +
> > + /* precalculating prefix table */
> > + prefix_table = calloc(nPattern, sizeof(size_t));
> > + prefix_table_gen_bin(zPattern, nPattern, prefix_table);
> > + /* counting the number of Pattern occurrences */
> > + for (nCnt = iStr = nPatternPos = 0;;) {
> > + nPatternPos = kmp_bin(zStr, zPattern, nStr, nPattern, iStr, prefix_table);
> > + iStr = nPatternPos + nPattern;
> > + if (nPatternPos == 0)
> > + break;
> > + ++nCnt;
> > + }
> > + /* reallocationg mem for zOut */
> > + nOut += (nRep - nPattern) * nCnt;
> > + assert(nOut < SQL_MAX_LENGTH);
> > + zOut = contextMalloc(context, (i64) nOut);
> > + if (zOut == 0) {
> > + return;
> > + }
> > + /* coping and replacing in-place */
> > + for (iStr = nCopiedStr = nCopiedOut = nPatternPos = 0;;) {
> > + nPatternPos = kmp_bin(zStr, zPattern, nStr, nPattern, iStr, prefix_table);
> > + iStr = nPatternPos + nPattern;
> > + if (nPatternPos == 0)
> > + break;
> > + assert(nPatternPos >= nCopiedStr);
> > + memcpy(zOut + nCopiedOut, zStr + nCopiedStr, nPatternPos - nCopiedStr);
> > + nCopiedOut += nPatternPos - nCopiedStr;
> > + memcpy(zOut + nCopiedOut, zRep, nRep);
> > + nCopiedOut += nRep;
> > + nCopiedStr = nPatternPos + nPattern;
> > + }
> > + memcpy(zOut + nCopiedOut, zStr + nCopiedStr, nStr - nCopiedStr);
> > + zOut[nOut - 1] = 0;
> > + free(prefix_table);
> > + } else {
> > + /*
> > + * Naive replacing substring algorithm
> > + * replacing: time O(n * m), mem O(1)
> > + * used if cannot allocate mem for prefix table
> > + */
> > + for (i = j = 0; i <= loopLimit; i++) {
> > + if (zStr[i] != zPattern[0]
> > + || memcmp(&zStr[i], zPattern, nPattern)) {
> > + zOut[j++] = zStr[i];
> > + } else {
> > + u8 *zOld;
> > + sql *db = sql_context_db_handle(context);
> > + nOut += nRep - nPattern;
> > + testcase(nOut - 1 == db->aLimit[SQL_LIMIT_LENGTH]);
> > + testcase(nOut - 2 == db->aLimit[SQL_LIMIT_LENGTH]);
> > + if (nOut - 1 > db->aLimit[SQL_LIMIT_LENGTH]) {
> 13. This high level of nesting doesn't sound like good code. Please try to avoid
> it.
>
> > + diag_set(ClientError, ER_SQL_EXECUTE, "string "\
> > + "or binary string is too big");
> > + context->is_aborted = true;
> > + sql_free(zOut);
> > + return;
> > + }
> > + zOld = zOut;
> > + zOut = sql_realloc64(zOut, (int)nOut);
> > + if (zOut == 0) {
> > + context->is_aborted = true;
> > + sql_free(zOld);
> > + return;
> > + }
> > + memcpy(&zOut[j], zRep, nRep);
> > + j += nRep;
> > + i += nPattern - 1;
> > }
> > - memcpy(&zOut[j], zRep, nRep);
> > - j += nRep;
> > - i += nPattern - 1;
> > }
> > + assert(j + nStr - i + 1 == nOut);
> > + memcpy(&zOut[j], &zStr[i], nStr - i);
> > + j += nStr - i;
> > + assert(j <= nOut);
> > + zOut[j] = 0;
> > }
> > - assert(j + nStr - i + 1 == nOut);
> > - memcpy(&zOut[j], &zStr[i], nStr - i);
> > - j += nStr - i;
> > - assert(j <= nOut);
> > - zOut[j] = 0;
> > if (context->func->def->returns == FIELD_TYPE_STRING)
> > mem_set_str_dynamic(context->pOut, (char *)zOut, j);
> > else
prev parent reply other threads:[~2021-12-02 12:12 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-14 9:13 Lord via Tarantool-patches
2021-12-02 12:06 ` Mergen Imeev via Tarantool-patches
2021-12-02 12:12 ` Mergen Imeev via Tarantool-patches [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211202121250.GA34471@tarantool.org \
--to=tarantool-patches@dev.tarantool.org \
--cc=imeevma@tarantool.org \
--cc=kashkin.nsaa@gmail.com \
--subject='Re: [Tarantool-patches] [PATCH] sql: Replace naive position alg with KMP' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox