From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 6946E22BA3 for ; Tue, 24 Apr 2018 08:51:58 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6dSVxh-r1YIn for ; Tue, 24 Apr 2018 08:51:58 -0400 (EDT) Received: from smtp3.mail.ru (smtp3.mail.ru [94.100.179.58]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id F309A22BA0 for ; Tue, 24 Apr 2018 08:51:57 -0400 (EDT) Subject: [tarantool-patches] Re: [PATCH 4/4] sql: move SQL statistics to server References: <291aa5961a366c81ffa7ea465636125617a78dad.1524515002.git.korablev@tarantool.org> From: Vladislav Shpilevoy Message-ID: <1dd6cfc3-25f1-8184-df64-2e1638ad55a0@tarantool.org> Date: Tue, 24 Apr 2018 15:51:55 +0300 MIME-Version: 1.0 In-Reply-To: <291aa5961a366c81ffa7ea465636125617a78dad.1524515002.git.korablev@tarantool.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Nikita Pettik , tarantool-patches@freelists.org Hello. The patch is great, but see 17 comments below %) Most of them are minor except one. The key idea is to use region allocator to build statistics on analyze. See details below. On 23/04/2018 23:29, Nikita Pettik wrote: > SQLite provides kind of statistics concerning data holding in each index > and space. It is used to help query planer build optimized plan. Such > statistics don't depend on internal index implementation (e.g. B-tree or > LSM tree) and contain information concerning tuple distribution. > This patch moves members resbonsible statistics from original SQLite > structs to Tarantool ones. To be more precise, now index's opts contains > statistics from _stat1 space (arrays of integers and their loarithm 1. resbonsible -> responsible, loarithm -> logarithm. > approximation) and more sophisticated one from _stat4 (array of samples). > It also contains set of flags to turn on/off different optimizations > (such as skip-scan). > > After ANALYZE command is executed, statistics are saved into _sql_stat1 > and _sql_stat4 spaces (in order to persist them). However, it should also > be loaded into in-memory structs representing indexes. This patch > reworks original SQLite routine to load it directly to newly introduced > stat struct of index. > > Closes #3253 > --- > src/box/index_def.c | 92 ++++++ > src/box/index_def.h | 90 ++++++ > src/box/sql/analyze.c | 742 +++++++++++++++++++++++------------------------- > src/box/sql/build.c | 61 ---- > src/box/sql/parse.y | 2 - > src/box/sql/pragma.c | 3 +- > src/box/sql/prepare.c | 7 +- > src/box/sql/sqliteInt.h | 72 +++-- > src/box/sql/vdbe.c | 5 +- > src/box/sql/where.c | 190 ++++++++----- > src/box/sql/whereexpr.c | 4 +- > 11 files changed, 716 insertions(+), 552 deletions(-) > > diff --git a/src/box/index_def.c b/src/box/index_def.c > index eda39a622..14ff94192 100644 > --- a/src/box/index_def.c > +++ b/src/box/index_def.c > @@ -154,9 +157,98 @@ index_def_dup(const struct index_def *def) > return NULL; > } > } > + if (def->opts.stat != NULL) { > + dup->opts.stat = malloc(sizeof(*dup->opts.stat)); 2. Lets reduce count of mallocs and allocate all on the single huge malloc. It eliminates index_stat_destroy_samples, and memory fragmentation. See how it is done for space_def. You must define an index_stat_sizeof function, that takes what is needed to calculate summary size, calculates offsets on different parts of the result memory etc. I know, that it is built in load_stat_from_space, but seems like I found a solution using region. See the comments below. > diff --git a/src/box/index_def.h b/src/box/index_def.h > index 6167570b7..52e9a0548 100644 > --- a/src/box/index_def.h > +++ b/src/box/index_def.h > @@ -113,6 +181,27 @@ index_opts_create(struct index_opts *opts) > *opts = index_opts_default; > } > > +/** > + * Release memory allocated for samples, statistics they include, > + * sample keys and avg_eq array. > + */ > +void > +index_stat_destroy_samples(struct index_stat *stat); > + > +/** > + * Destroy index stat. > + */ > +static inline void > +destroy_stat(struct index_stat *stat) { 3. It will be removed after review fixes, but why you did not just inlined it in a single usage place? > + if (stat == NULL) > + return; > + free(stat->tuple_stat1); > + free(stat->tuple_log_est); > + index_stat_destroy_samples(stat); > + TRASH(stat); > + free(stat); > +} > + > /** > * Destroy index options > */ > @@ -120,6 +209,7 @@ static inline void > index_opts_destroy(struct index_opts *opts) > { > free(opts->sql); > + destroy_stat(opts->stat); 4. Will be turned into just free(). > diff --git a/src/box/sql/analyze.c b/src/box/sql/analyze.c > index 4965e899a..14dfbd551 100644 > --- a/src/box/sql/analyze.c > +++ b/src/box/sql/analyze.c > @@ -104,8 +104,8 @@ > * large nEq values. > * > */ > -#ifndef SQLITE_OMIT_ANALYZE 5. The definition still exists in mkkeywordhash.c > @@ -1217,49 +1217,46 @@ struct analysisInfo { > sqlite3 *db; > }; > > -/* > - * The first argument points to a nul-terminated string containing a > - * list of space separated integers. Read the first nOut of these into > - * the array aOut[]. > +/** > + * The first argument points to a nul-terminated string > + * containing a list of space separated integers. Load > + * the first stat_size of these into the output arrays. > + * > + * @param stat_string String containing array of integers. > + * @param stat_size Size of output arrays. > + * @param[out] stat_exact Decoded array of statistics. > + * @param[out] stat_log Decoded array of stat logariphms. > + * @param[out] index Handle extra flags for this index, > + * if not NULL. > */ > static void > -decodeIntArray(char *zIntArray, /* String containing int array to decode */ > - int nOut, /* Number of slots in aOut[] */ > - tRowcnt * aOut, /* Store integers here */ > - LogEst * aLog, /* Or, if aOut==0, here */ > - Index * pIndex /* Handle extra flags for this index, if not NULL */ > - ) > -{ > - char *z = zIntArray; > - int c; > - int i; > - tRowcnt v; > - > - if (z == 0) > +decode_stat_string(const char *stat_string, int stat_size, tRowcnt *stat_exact, > + LogEst *stat_log, struct index *index) { 6. Please, create a ticket to store stat_string in a stat table as map or array of stat values instead of string. It allows to extremely simplify all of this pure absolute shit using raw MessagePack array of integers iteration. > + char *z = (char *) stat_string; > + if (z == NULL) > z = ""; > - > - for (i = 0; *z && i < nOut; i++) { > - v = 0; > + for (int i = 0; *z && i < stat_size; i++) { > + tRowcnt v = 0; > + int c; > while ((c = z[0]) >= '0' && c <= '9') { > v = v * 10 + c - '0'; > z++; > } 7. sscanf? > - if (aOut) > - aOut[i] = v; > - if (aLog) > - aLog[i] = sqlite3LogEst(v); > + if (stat_exact != NULL) > + stat_exact[i] = v; > + if (stat_log != NULL) > + stat_log[i] = sqlite3LogEst(v); > if (*z == ' ') > z++; > } > - > - if (pIndex) { > - pIndex->bUnordered = 0; > - pIndex->noSkipScan = 0; > + if (index != NULL) { 8. As I can see, you use this function 3 times with index == NULL and 1 time with not NULL. Lets just inline these things where the index is not NULL, and remove the index argument. > @@ -1269,369 +1266,374 @@ decodeIntArray(char *zIntArray, /* String containing int array to decode */ > } > } > > -/* > +/** > * This callback is invoked once for each index when reading the > * _sql_stat1 table. > * > * argv[0] = name of the table > * argv[1] = name of the index (might be NULL) > - * argv[2] = results of analysis - on integer for each column > + * argv[2] = results of analysis - array of integers > + * > + * Entries for which argv[1]==NULL simply record the number > + * of rows in the table. This routine also allocates memory > + * for stat struct itself and statistics which are not related > + * to stat4 samples. > * > - * Entries for which argv[1]==NULL simply record the number of rows in > - * the table. > + * @param data Additional analysis info. > + * @param argc Number of arguments. > + * @param argv Callback arguments. > + * @retval 0 on success, -1 otherwise. > */ > static int > -analysisLoader(void *pData, int argc, char **argv, char **NotUsed) > +analysis_loader(void *data, int argc, char **argv, char **unused) > { > - analysisInfo *pInfo = (analysisInfo *) pData; > - Index *pIndex; > - Table *pTable; > - const char *z; > - > assert(argc == 3); > - UNUSED_PARAMETER2(NotUsed, argc); > - > - if (argv == 0 || argv[0] == 0 || argv[2] == 0) { > + UNUSED_PARAMETER(data); > + UNUSED_PARAMETER2(unused, argc); > + if (argv == 0 || argv[0] == 0 || argv[2] == 0) > return 0; > - } > - pTable = sqlite3HashFind(&pInfo->db->pSchema->tblHash, argv[0]); > - if (pTable == NULL) > + uint32_t space_id = box_space_id_by_name(argv[0], strlen(argv[0])); > + assert(space_id != BOX_ID_NIL); > + struct space *space = space_by_id(space_id); > + assert(space != NULL); > + struct index *index; > + if (space == NULL) > return 0; > - if (argv[1] == 0) { > - pIndex = 0; > + if (argv[1] == NULL) { > + index = NULL; 9. Lets return 0 here and remove 'if index != NULL' below. The code will look more compact. > } else if (sqlite3_stricmp(argv[0], argv[1]) == 0) { > - pIndex = sqlite3PrimaryKeyIndex(pTable); > + index = space_index(space, 0); > + assert(index != NULL); > } else { > - pIndex = sqlite3HashFind(&pTable->idxHash, argv[1]); > + uint32_t iid = box_index_id_by_name(space_id, argv[1], > + strlen(argv[1])); > + assert(iid != BOX_ID_NIL); > + index = space_index(space, iid); > } > - z = argv[2]; > - > - if (pIndex) { > - tRowcnt *aiRowEst = 0; > - int nCol = index_column_count(pIndex) + 1; > - /* Index.aiRowEst may already be set here if there are duplicate > - * _sql_stat1 entries for this index. In that case just clobber > - * the old data with the new instead of allocating a new array. > + if (index != NULL) { > + /* > + * Additional field is used to describe total > + * count of tuples in index. Although now all > + * indexes feature the same number of tuples, > + * partial indexes are going to be implemented > + * someday. > */ > - if (pIndex->aiRowEst == 0) { > - pIndex->aiRowEst = > - (tRowcnt *) sqlite3MallocZero(sizeof(tRowcnt) * > - nCol); > - if (pIndex->aiRowEst == 0) > - sqlite3OomFault(pInfo->db); > + int column_count = index->def->key_def->part_count + 1; > + /* > + * Stat arrays may already be set here if there > + * are duplicate _sql_stat1 entries for this > + * index. In that case just clobber the old data > + * with the new instead of allocating a new array. > + */ > + struct index_stat *stat = index->def->opts.stat; > + if (stat == NULL) { > + stat = calloc(1, sizeof(*stat)); > + if (stat == NULL) { > + diag_set(OutOfMemory, sizeof(*stat), "calloc", > + "stat"); > + return -1; > + } > + index->def->opts.stat = stat; > } > - aiRowEst = pIndex->aiRowEst; > - pIndex->bUnordered = 0; > - decodeIntArray((char *)z, nCol, aiRowEst, pIndex->aiRowLogEst, > - pIndex); > - } > - > - return 0; > -} > - > -/* > - * If the Index.aSample variable is not NULL, delete the aSample[] array > - * and its contents. > - */ > -void > -sqlite3DeleteIndexSamples(sqlite3 * db, Index * pIdx) > -{ > - if (pIdx->aSample) { > - int j; > - for (j = 0; j < pIdx->nSample; j++) { > - IndexSample *p = &pIdx->aSample[j]; > - sqlite3DbFree(db, p->p); > + if (stat->tuple_stat1 == NULL) { 10. Can it be NULL, if index->def->opts.stat was not NULL above? And as I can see, if analysis loader fails not on the first call, then the statistics is updated partially. I propose to make it atomic. For instance, you can create for each index a new index_stat object, fill it with new stat, and on success replace the old one. See below how to do it simple. > -/* > +/** > * Load the content from the _sql_stat4 table > - * into the relevant Index.aSample[] arrays. > + * into the relevant index->stat->samples[] arrays. > + * > + * Arguments must point to SQL statements that return > + * data equivalent to the following: > * > - * Arguments zSql1 and zSql2 must point to SQL statements that return > - * data equivalent to the following (statements are different for stat4, > - * see the caller of this function for details): > + * prepare: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx; > + * load: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4; > * > - * zSql1: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx > - * zSql2: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4 > + * 'prepare' statement is used to allocate enough memory for > + * statistics (i.e. arrays lt, dt, dlt and avg_eq). 'load' query > + * is needed for > * > + * @param db Database handler. > + * @param sql_select_prepare SELECT statement, see above. > + * @param sql_select_load SELECT statement, see above. > + * @retval SQLITE_OK on success, smth else otherwise. > */ > static int > -loadStatTbl(sqlite3 * db, /* Database handle */ > - Table * pStatTab, /* Stat table to load to */ > - const char *zSql1, /* SQL statement 1 (see above) */ > - const char *zSql2 /* SQL statement 2 (see above) */ > - ) > +load_stat_from_space(struct sqlite3 *db, const char *sql_select_prepare, > + const char *sql_select_load) 11. As I can see, it is the core function of the patchset. And now it has two major problems: * partial statistics update on OOM or a non-first prepare() fail - new samples are allocated, but not filled, for example, due to malloc() error on the next index processing; * leaks on opts.stat->samples = malloc(alloc_size); - you do not delete the old memory. These and analysis_loader problems can be solved using region. I propose to use in this function and analysis_loader region allocations only. The array of new index_stat structures is created on a region in sql_analysis_load. Analysis loader takes this array via void *data parameter, and fills index_stats with more region allocations for tuple_stat1, tuple_log_est etc. Then load_stat_from_space takes the same array and process its index_stat instead of original stats in structs index. At the end of load_stat_from_space you are sure, that all statistics for all indexes are completely full, and replace original index stats in structs index using index_stat_dup and its super malloc. Region can be truncated right before return from sql_analysis_load. > @@ -1646,84 +1648,52 @@ sql_space_tuple_log_count(struct Table *tab) > return sqlite3LogEst(pk->vtab->size(pk)); > } > > -/* > - * Load the content of the _sql_stat1 and sql_stat4 tables. The > - * contents of _sql_stat1 are used to populate the Index.aiRowEst[] > -*\* arrays. The contents of sql_stat4 are used to populate the > - * Index.aSample[] arrays. > - * > - * If the _sql_stat1 table is not present in the database, SQLITE_ERROR > - * is returned. In this case, even if the sql_stat4 table is present, > - * no data is read from it. > - * > - * If the _sql_stat4 table is not present in the database, SQLITE_ERROR > - * is returned. However, in this case, data is read from the _sql_stat1 > - * table (if it is present) before returning. > - * > - * If an OOM error occurs, this function always sets db->mallocFailed. > - * This means if the caller does not care about other errors, the return > - * code may be ignored. > - */ > -int > -sqlite3AnalysisLoad(sqlite3 * db) > +log_est 12. Lets name this type log_est_t. > +index_field_tuple_est(struct Index *idx, uint32_t field) > { > - analysisInfo sInfo; > - HashElem *i, *j; > - char *zSql; > - int rc = SQLITE_OK; > - > - /* Clear any prior statistics */ > - for (j = sqliteHashFirst(&db->pSchema->tblHash); j; > - j = sqliteHashNext(j)) { > - Table *pTab = sqliteHashData(j); > - for (i = sqliteHashFirst(&pTab->idxHash); i; > - i = sqliteHashNext(i)) { > - Index *pIdx = sqliteHashData(i); > - pIdx->aiRowLogEst[0] = 0; > - sqlite3DeleteIndexSamples(db, pIdx); > - pIdx->aSample = 0; > - } > + struct space *space = space_by_id(SQLITE_PAGENO_TO_SPACEID(idx->tnum)); > + if (space == NULL) > + return idx->aiRowLogEst[field]; > + struct index *tnt_idx = > + space_index(space, SQLITE_PAGENO_TO_INDEXID(idx->tnum)); > + assert(tnt_idx != NULL); > + assert(field <= tnt_idx->def->key_def->part_count); > + if (tnt_idx->def->opts.stat == NULL) { > + if (field == 0) 13. Why? Field 0 is an ordinary field number. And why do you return 0, if a field is equal to part_count and the index is unique? > diff --git a/src/box/sql/sqliteInt.h b/src/box/sql/sqliteInt.h > index d026a213d..c2ea066d6 100644 > --- a/src/box/sql/sqliteInt.h > +++ b/src/box/sql/sqliteInt.h > @@ -1473,7 +1473,6 @@ typedef struct FuncDef FuncDef; > typedef struct FuncDefHash FuncDefHash; > typedef struct IdList IdList; > typedef struct Index Index; > -typedef struct IndexSample IndexSample; 14. I still see IndexSample mention in analyze.c on 553 line. > @@ -2173,14 +2171,41 @@ struct Index { > * or _NONE > */ > unsigned idxType:2; /* 1==UNIQUE, 2==PRIMARY KEY, 0==CREATE INDEX */ > - unsigned bUnordered:1; /* Use this index for == or IN queries only */ > - unsigned noSkipScan:1; /* Do not try to use skip-scan if true */ > - int nSample; /* Number of elements in aSample[] */ > - int nSampleCol; /* Size of IndexSample.anEq[] and so on */ > - tRowcnt *aAvgEq; /* Average nEq values for keys not in aSample */ > - IndexSample *aSample; /* Samples of the left-most key */ > - tRowcnt *aiRowEst; /* Non-logarithmic stat1 data for this index */ > }; > +/** > + * default_tuple_est[] array contains default information > + * which is used when we don't have real space, e.g. temporary > + * objects representing result set of nested SELECT or VIEW. > + * > + * First number is supposed to contain the number of elements > + * in the index. Since we do not know, guess 1 million. > + * Second one is an estimate of the number of rows in the > + * table that match any particular value of the first column of > + * the index. Third one is an estimate of the number of > + * rows that match any particular combination of the first 2 > + * columns of the index. And so on. It must always be true: > + * > + * default_tuple_est[N] <= default_tuple_est[N-1] > + * default_tuple_est[N] >= 1 > + * > + * Apart from that, we have little to go on besides intuition > + * as to how default values should be initialized. The numbers > + * generated here are based on typical values found in actual > + * indices. > + */ > +extern const log_est default_tuple_est[]; 15. Why is it extern? Looks like it is used in analyze.c only, and can be static within the file. > + > +/** > + * Fetch statistics concerning tuples to be selected. > + * If there is no appropriate Tarantool's index, > + * return one of default values. > + * > + * @param idx Index. > + * @param field Number of field to be examined. > + * @retval Estimate logarithm of tuples selected by given field. 16. All tuples? Or unique by this field? > diff --git a/src/box/sql/where.c b/src/box/sql/where.c > index e6f9ef431..975e7059f 100644 > --- a/src/box/sql/where.c > +++ b/src/box/sql/where.c > @@ -977,15 +986,15 @@ whereKeyStats(Parse * pParse, /* Database connection */ > > pRec->nField = n; > res = > - sqlite3VdbeRecordCompareMsgpack(aSample[iSamp].n, > - aSample[iSamp].p, pRec); > + sqlite3VdbeRecordCompareMsgpack(samples[iSamp].key_size, > + samples[iSamp].sample_key, 17. Lets make sample_key has char * type instead of void *. It is common practice for raw MessagePack in Tarantool.