[tarantool-patches] Re: [PATCH 4/4] sql: move SQL statistics to server
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Tue Apr 24 15:51:55 MSK 2018
Hello. The patch is great, but see 17 comments below %)
Most of them are minor, except one. The key idea is to
use the region allocator to build statistics on ANALYZE. See
details below.
On 23/04/2018 23:29, Nikita Pettik wrote:
> SQLite provides kind of statistics concerning data holding in each index
> and space. It is used to help query planer build optimized plan. Such
> statistics don't depend on internal index implementation (e.g. B-tree or
> LSM tree) and contain information concerning tuple distribution.
> This patch moves members resbonsible statistics from original SQLite
> structs to Tarantool ones. To be more precise, now index's opts contains
> statistics from _stat1 space (arrays of integers and their loarithm
1. resbonsible -> responsible, loarithm -> logarithm.
> approximation) and more sophisticated one from _stat4 (array of samples).
> It also contains set of flags to turn on/off different optimizations
> (such as skip-scan).
>
> After ANALYZE command is executed, statistics are saved into _sql_stat1
> and _sql_stat4 spaces (in order to persist them). However, it should also
> be loaded into in-memory structs representing indexes. This patch
> reworks original SQLite routine to load it directly to newly introduced
> stat struct of index.
>
> Closes #3253
> ---
> src/box/index_def.c | 92 ++++++
> src/box/index_def.h | 90 ++++++
> src/box/sql/analyze.c | 742 +++++++++++++++++++++++-------------------------
> src/box/sql/build.c | 61 ----
> src/box/sql/parse.y | 2 -
> src/box/sql/pragma.c | 3 +-
> src/box/sql/prepare.c | 7 +-
> src/box/sql/sqliteInt.h | 72 +++--
> src/box/sql/vdbe.c | 5 +-
> src/box/sql/where.c | 190 ++++++++-----
> src/box/sql/whereexpr.c | 4 +-
> 11 files changed, 716 insertions(+), 552 deletions(-)
>
> diff --git a/src/box/index_def.c b/src/box/index_def.c
> index eda39a622..14ff94192 100644
> --- a/src/box/index_def.c
> +++ b/src/box/index_def.c
> @@ -154,9 +157,98 @@ index_def_dup(const struct index_def *def)
> return NULL;
> }
> }
> + if (def->opts.stat != NULL) {
> + dup->opts.stat = malloc(sizeof(*dup->opts.stat));
2. Let's reduce the number of mallocs and allocate everything
in a single big malloc. It eliminates index_stat_destroy_samples
and reduces memory fragmentation.
See how it is done for space_def. You should define an
index_stat_sizeof function that takes whatever is needed to
calculate the total size, calculates the offsets of the different
parts inside the resulting memory, etc.
I know that the stat is built in load_stat_from_space, but it
seems I have found a solution using a region. See the comments below.
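To make the idea concrete, here is a minimal self-contained sketch of the single-allocation pattern; the struct names idx_stat/idx_sample and the functions idx_stat_sizeof()/idx_stat_new() are illustrative stand-ins for the real index_stat machinery in index_def.h:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-ins for the real structs from index_def.h. */
struct idx_sample {
	uint32_t key_size;
	char *sample_key;
};

struct idx_stat {
	uint32_t *tuple_stat1;
	int16_t *tuple_log_est;
	struct idx_sample *samples;
	uint32_t sample_count;
	uint32_t field_count;
};

/*
 * Size of a single block able to hold the header and all its
 * arrays, analogous to what an index_stat_sizeof() would do.
 * The most strictly aligned array (samples) goes first.
 */
static size_t
idx_stat_sizeof(uint32_t sample_count, uint32_t field_count)
{
	return sizeof(struct idx_stat) +
	       sample_count * sizeof(struct idx_sample) +
	       (field_count + 1) * sizeof(uint32_t) +	/* tuple_stat1 */
	       (field_count + 1) * sizeof(int16_t);	/* tuple_log_est */
}

/*
 * One calloc() for everything: the internal pointers are set to
 * offsets inside the same block, so a single free() releases the
 * whole stat and no separate destructor for samples is needed.
 */
static struct idx_stat *
idx_stat_new(uint32_t sample_count, uint32_t field_count)
{
	struct idx_stat *s =
		calloc(1, idx_stat_sizeof(sample_count, field_count));
	if (s == NULL)
		return NULL;
	char *pos = (char *)(s + 1);
	s->samples = (struct idx_sample *)pos;
	pos += sample_count * sizeof(struct idx_sample);
	s->tuple_stat1 = (uint32_t *)pos;
	pos += (field_count + 1) * sizeof(uint32_t);
	s->tuple_log_est = (int16_t *)pos;
	s->sample_count = sample_count;
	s->field_count = field_count;
	return s;
}
```

With this layout a single free(stat) replaces destroy_stat() entirely.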
> diff --git a/src/box/index_def.h b/src/box/index_def.h
> index 6167570b7..52e9a0548 100644
> --- a/src/box/index_def.h
> +++ b/src/box/index_def.h
> @@ -113,6 +181,27 @@ index_opts_create(struct index_opts *opts)
> *opts = index_opts_default;
> }
>
> +/**
> + * Release memory allocated for samples, statistics they include,
> + * sample keys and avg_eq array.
> + */
> +void
> +index_stat_destroy_samples(struct index_stat *stat);
> +
> +/**
> + * Destroy index stat.
> + */
> +static inline void
> +destroy_stat(struct index_stat *stat) {
3. It will be removed after the review fixes, but why didn't you
just inline it at its single usage place?
> + if (stat == NULL)
> + return;
> + free(stat->tuple_stat1);
> + free(stat->tuple_log_est);
> + index_stat_destroy_samples(stat);
> + TRASH(stat);
> + free(stat);
> +}
> +
> /**
> * Destroy index options
> */
> @@ -120,6 +209,7 @@ static inline void
> index_opts_destroy(struct index_opts *opts)
> {
> free(opts->sql);
> + destroy_stat(opts->stat);
4. Will be turned into just free().
> diff --git a/src/box/sql/analyze.c b/src/box/sql/analyze.c
> index 4965e899a..14dfbd551 100644
> --- a/src/box/sql/analyze.c
> +++ b/src/box/sql/analyze.c
> @@ -104,8 +104,8 @@
> * large nEq values.
> *
> */
> -#ifndef SQLITE_OMIT_ANALYZE
5. The definition still exists in mkkeywordhash.c
> @@ -1217,49 +1217,46 @@ struct analysisInfo {
> sqlite3 *db;
> };
>
> -/*
> - * The first argument points to a nul-terminated string containing a
> - * list of space separated integers. Read the first nOut of these into
> - * the array aOut[].
> +/**
> + * The first argument points to a nul-terminated string
> + * containing a list of space separated integers. Load
> + * the first stat_size of these into the output arrays.
> + *
> + * @param stat_string String containing array of integers.
> + * @param stat_size Size of output arrays.
> + * @param[out] stat_exact Decoded array of statistics.
> + * @param[out] stat_log Decoded array of stat logariphms.
> + * @param[out] index Handle extra flags for this index,
> + * if not NULL.
> */
> static void
> -decodeIntArray(char *zIntArray, /* String containing int array to decode */
> - int nOut, /* Number of slots in aOut[] */
> - tRowcnt * aOut, /* Store integers here */
> - LogEst * aLog, /* Or, if aOut==0, here */
> - Index * pIndex /* Handle extra flags for this index, if not NULL */
> - )
> -{
> - char *z = zIntArray;
> - int c;
> - int i;
> - tRowcnt v;
> -
> - if (z == 0)
> +decode_stat_string(const char *stat_string, int stat_size, tRowcnt *stat_exact,
> + LogEst *stat_log, struct index *index) {
6. Please create a ticket to store stat_string in the stat table as a map or an
array of stat values instead of a string. It would allow us to drastically
simplify all of this mess by iterating over a raw MessagePack array of integers.
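To illustrate how much simpler the loader would get, here is a toy decoder for the simplest MessagePack case (a fixarray of positive fixints); load_stat_from_mp is a hypothetical name, and the real code would use msgpuck's mp_decode_array()/mp_decode_uint() instead of hand-decoding the tags:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy decoder for a MessagePack fixarray of positive fixints:
 * with statistics stored as a raw MessagePack array the whole
 * decode_stat_string() machinery collapses into a trivial loop.
 */
static int
load_stat_from_mp(const uint8_t *data, int stat_size, uint32_t *stat_exact)
{
	assert((data[0] & 0xf0) == 0x90);	/* fixarray tag */
	int count = data[0] & 0x0f;
	data++;
	int n = count < stat_size ? count : stat_size;
	for (int i = 0; i < n; i++) {
		assert(data[i] < 0x80);		/* positive fixint */
		stat_exact[i] = data[i];
	}
	return n;
}
```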
> + char *z = (char *) stat_string;
> + if (z == NULL)
> z = "";
> -
> - for (i = 0; *z && i < nOut; i++) {
> - v = 0;
> + for (int i = 0; *z && i < stat_size; i++) {
> + tRowcnt v = 0;
> + int c;
> while ((c = z[0]) >= '0' && c <= '9') {
> v = v * 10 + c - '0';
> z++;
> }
7. sscanf?
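For illustration, the hand-rolled digit loop could be replaced by sscanf()-based parsing; a minimal sketch (decode_stat_string_scanf is a hypothetical name), where %n reports how many characters were consumed:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical replacement for the digit loop in
 * decode_stat_string(): sscanf() does the integer parsing and
 * skips the separating whitespace itself. Returns the number of
 * integers actually decoded.
 */
static int
decode_stat_string_scanf(const char *z, int stat_size, unsigned *stat_exact)
{
	int i = 0;
	int consumed;
	unsigned v;
	while (i < stat_size && sscanf(z, "%u%n", &v, &consumed) == 1) {
		stat_exact[i++] = v;
		z += consumed;
	}
	return i;
}
```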
> - if (aOut)
> - aOut[i] = v;
> - if (aLog)
> - aLog[i] = sqlite3LogEst(v);
> + if (stat_exact != NULL)
> + stat_exact[i] = v;
> + if (stat_log != NULL)
> + stat_log[i] = sqlite3LogEst(v);
> if (*z == ' ')
> z++;
> }
> -
> - if (pIndex) {
> - pIndex->bUnordered = 0;
> - pIndex->noSkipScan = 0;
> + if (index != NULL) {
8. As I can see, you call this function 3 times with index == NULL and once
with a non-NULL index. Let's just inline those bits at the call site where the
index is not NULL and remove the index argument.
> @@ -1269,369 +1266,374 @@ decodeIntArray(char *zIntArray, /* String containing int array to decode */
> }
> }
>
> -/*
> +/**
> * This callback is invoked once for each index when reading the
> * _sql_stat1 table.
> *
> * argv[0] = name of the table
> * argv[1] = name of the index (might be NULL)
> - * argv[2] = results of analysis - on integer for each column
> + * argv[2] = results of analysis - array of integers
> + *
> + * Entries for which argv[1]==NULL simply record the number
> + * of rows in the table. This routine also allocates memory
> + * for stat struct itself and statistics which are not related
> + * to stat4 samples.
> *
> - * Entries for which argv[1]==NULL simply record the number of rows in
> - * the table.
> + * @param data Additional analysis info.
> + * @param argc Number of arguments.
> + * @param argv Callback arguments.
> + * @retval 0 on success, -1 otherwise.
> */
> static int
> -analysisLoader(void *pData, int argc, char **argv, char **NotUsed)
> +analysis_loader(void *data, int argc, char **argv, char **unused)
> {
> - analysisInfo *pInfo = (analysisInfo *) pData;
> - Index *pIndex;
> - Table *pTable;
> - const char *z;
> -
> assert(argc == 3);
> - UNUSED_PARAMETER2(NotUsed, argc);
> -
> - if (argv == 0 || argv[0] == 0 || argv[2] == 0) {
> + UNUSED_PARAMETER(data);
> + UNUSED_PARAMETER2(unused, argc);
> + if (argv == 0 || argv[0] == 0 || argv[2] == 0)
> return 0;
> - }
> - pTable = sqlite3HashFind(&pInfo->db->pSchema->tblHash, argv[0]);
> - if (pTable == NULL)
> + uint32_t space_id = box_space_id_by_name(argv[0], strlen(argv[0]));
> + assert(space_id != BOX_ID_NIL);
> + struct space *space = space_by_id(space_id);
> + assert(space != NULL);
> + struct index *index;
> + if (space == NULL)
> return 0;
> - if (argv[1] == 0) {
> - pIndex = 0;
> + if (argv[1] == NULL) {
> + index = NULL;
9. Let's return 0 here and remove the 'if index != NULL' below. The code will
look more compact.
> } else if (sqlite3_stricmp(argv[0], argv[1]) == 0) {
> - pIndex = sqlite3PrimaryKeyIndex(pTable);
> + index = space_index(space, 0);
> + assert(index != NULL);
> } else {
> - pIndex = sqlite3HashFind(&pTable->idxHash, argv[1]);
> + uint32_t iid = box_index_id_by_name(space_id, argv[1],
> + strlen(argv[1]));
> + assert(iid != BOX_ID_NIL);
> + index = space_index(space, iid);
> }
> - z = argv[2];
> -
> - if (pIndex) {
> - tRowcnt *aiRowEst = 0;
> - int nCol = index_column_count(pIndex) + 1;
> - /* Index.aiRowEst may already be set here if there are duplicate
> - * _sql_stat1 entries for this index. In that case just clobber
> - * the old data with the new instead of allocating a new array.
> + if (index != NULL) {
> + /*
> + * Additional field is used to describe total
> + * count of tuples in index. Although now all
> + * indexes feature the same number of tuples,
> + * partial indexes are going to be implemented
> + * someday.
> */
> - if (pIndex->aiRowEst == 0) {
> - pIndex->aiRowEst =
> - (tRowcnt *) sqlite3MallocZero(sizeof(tRowcnt) *
> - nCol);
> - if (pIndex->aiRowEst == 0)
> - sqlite3OomFault(pInfo->db);
> + int column_count = index->def->key_def->part_count + 1;
> + /*
> + * Stat arrays may already be set here if there
> + * are duplicate _sql_stat1 entries for this
> + * index. In that case just clobber the old data
> + * with the new instead of allocating a new array.
> + */
> + struct index_stat *stat = index->def->opts.stat;
> + if (stat == NULL) {
> + stat = calloc(1, sizeof(*stat));
> + if (stat == NULL) {
> + diag_set(OutOfMemory, sizeof(*stat), "calloc",
> + "stat");
> + return -1;
> + }
> + index->def->opts.stat = stat;
> }
> - aiRowEst = pIndex->aiRowEst;
> - pIndex->bUnordered = 0;
> - decodeIntArray((char *)z, nCol, aiRowEst, pIndex->aiRowLogEst,
> - pIndex);
> - }
> -
> - return 0;
> -}
> -
> -/*
> - * If the Index.aSample variable is not NULL, delete the aSample[] array
> - * and its contents.
> - */
> -void
> -sqlite3DeleteIndexSamples(sqlite3 * db, Index * pIdx)
> -{
> - if (pIdx->aSample) {
> - int j;
> - for (j = 0; j < pIdx->nSample; j++) {
> - IndexSample *p = &pIdx->aSample[j];
> - sqlite3DbFree(db, p->p);
> + if (stat->tuple_stat1 == NULL) {
10. Can it be NULL if index->def->opts.stat was not NULL above? Also, as I can
see, if the analysis loader fails on a non-first call, the statistics end up
updated only partially. I propose to make the update atomic. For instance, you
can create a new index_stat object for each index, fill it with the new stat,
and replace the old one only on success.
See below how to do it simply.
> -/*
> +/**
> * Load the content from the _sql_stat4 table
> - * into the relevant Index.aSample[] arrays.
> + * into the relevant index->stat->samples[] arrays.
> + *
> + * Arguments must point to SQL statements that return
> + * data equivalent to the following:
> *
> - * Arguments zSql1 and zSql2 must point to SQL statements that return
> - * data equivalent to the following (statements are different for stat4,
> - * see the caller of this function for details):
> + * prepare: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx;
> + * load: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4;
> *
> - * zSql1: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx
> - * zSql2: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4
> + * 'prepare' statement is used to allocate enough memory for
> + * statistics (i.e. arrays lt, dt, dlt and avg_eq). 'load' query
> + * is needed for
> *
> + * @param db Database handler.
> + * @param sql_select_prepare SELECT statement, see above.
> + * @param sql_select_load SELECT statement, see above.
> + * @retval SQLITE_OK on success, smth else otherwise.
> */
> static int
> -loadStatTbl(sqlite3 * db, /* Database handle */
> - Table * pStatTab, /* Stat table to load to */
> - const char *zSql1, /* SQL statement 1 (see above) */
> - const char *zSql2 /* SQL statement 2 (see above) */
> - )
> +load_stat_from_space(struct sqlite3 *db, const char *sql_select_prepare,
> + const char *sql_select_load)
11. As I can see, this is the core function of the patchset. Right now it has
two major problems:
* partial statistics update on OOM or a non-first prepare() failure: new
samples are allocated but not filled, for example due to a malloc() error
while processing the next index;
* a leak on opts.stat->samples = malloc(alloc_size); you do not free the
old memory.
These problems, and the analysis_loader ones, can be solved using a region.
I propose to use only region allocations in this function and in
analysis_loader. The array of new index_stat structures is created on a
region in sql_analysis_load. The analysis loader takes this array via the
void *data parameter and fills the index_stats with more region allocations
for tuple_stat1, tuple_log_est, etc.
Then load_stat_from_space takes the same array and processes its index_stat
entries instead of the original stats in the index structs.
At the end of load_stat_from_space you are sure that the statistics for all
indexes are completely filled, and you replace the original index stats in
the index structs using index_stat_dup and its single big malloc.
The region can be truncated right before returning from sql_analysis_load.
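The final swap step of such a scheme could look roughly like this toy sketch (toy_stat, toy_stat_dup, and stat_swap are illustrative names, with the region-built scratch stat reduced to a plain struct): the new statistics are built completely in scratch memory first, and only then duplicated into one malloc and swapped in, so a mid-way failure never leaves an index with half-filled data.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a fully built statistics object. */
struct toy_stat {
	uint32_t row_count;
	uint32_t sample_count;
};

/*
 * In the real patch this would be index_stat_dup() copying all
 * the arrays from the region into one big allocation.
 */
static struct toy_stat *
toy_stat_dup(const struct toy_stat *src)
{
	struct toy_stat *copy = malloc(sizeof(*copy));
	if (copy == NULL)
		return NULL;
	memcpy(copy, src, sizeof(*copy));
	return copy;
}

/*
 * Atomic replacement: on any failure the old stats stay valid
 * and complete; the new ones are installed only when fully built.
 */
static int
stat_swap(struct toy_stat **current, const struct toy_stat *scratch)
{
	struct toy_stat *copy = toy_stat_dup(scratch);
	if (copy == NULL)
		return -1;
	free(*current);
	*current = copy;
	return 0;
}
```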
> @@ -1646,84 +1648,52 @@ sql_space_tuple_log_count(struct Table *tab)
> return sqlite3LogEst(pk->vtab->size(pk));
> }
>
> -/*
> - * Load the content of the _sql_stat1 and sql_stat4 tables. The
> - * contents of _sql_stat1 are used to populate the Index.aiRowEst[]
> -*\* arrays. The contents of sql_stat4 are used to populate the
> - * Index.aSample[] arrays.
> - *
> - * If the _sql_stat1 table is not present in the database, SQLITE_ERROR
> - * is returned. In this case, even if the sql_stat4 table is present,
> - * no data is read from it.
> - *
> - * If the _sql_stat4 table is not present in the database, SQLITE_ERROR
> - * is returned. However, in this case, data is read from the _sql_stat1
> - * table (if it is present) before returning.
> - *
> - * If an OOM error occurs, this function always sets db->mallocFailed.
> - * This means if the caller does not care about other errors, the return
> - * code may be ignored.
> - */
> -int
> -sqlite3AnalysisLoad(sqlite3 * db)
> +log_est
12. Let's name this type log_est_t.
> +index_field_tuple_est(struct Index *idx, uint32_t field)
> {
> - analysisInfo sInfo;
> - HashElem *i, *j;
> - char *zSql;
> - int rc = SQLITE_OK;
> -
> - /* Clear any prior statistics */
> - for (j = sqliteHashFirst(&db->pSchema->tblHash); j;
> - j = sqliteHashNext(j)) {
> - Table *pTab = sqliteHashData(j);
> - for (i = sqliteHashFirst(&pTab->idxHash); i;
> - i = sqliteHashNext(i)) {
> - Index *pIdx = sqliteHashData(i);
> - pIdx->aiRowLogEst[0] = 0;
> - sqlite3DeleteIndexSamples(db, pIdx);
> - pIdx->aSample = 0;
> - }
> + struct space *space = space_by_id(SQLITE_PAGENO_TO_SPACEID(idx->tnum));
> + if (space == NULL)
> + return idx->aiRowLogEst[field];
> + struct index *tnt_idx =
> + space_index(space, SQLITE_PAGENO_TO_INDEXID(idx->tnum));
> + assert(tnt_idx != NULL);
> + assert(field <= tnt_idx->def->key_def->part_count);
> + if (tnt_idx->def->opts.stat == NULL) {
> + if (field == 0)
13. Why? Field 0 is an ordinary field number. And why do you return 0 if the
field is equal to part_count and the index is unique?
> diff --git a/src/box/sql/sqliteInt.h b/src/box/sql/sqliteInt.h
> index d026a213d..c2ea066d6 100644
> --- a/src/box/sql/sqliteInt.h
> +++ b/src/box/sql/sqliteInt.h
> @@ -1473,7 +1473,6 @@ typedef struct FuncDef FuncDef;
> typedef struct FuncDefHash FuncDefHash;
> typedef struct IdList IdList;
> typedef struct Index Index;
> -typedef struct IndexSample IndexSample;
14. I still see a mention of IndexSample in analyze.c on line 553.
> @@ -2173,14 +2171,41 @@ struct Index {
> * or _NONE
> */
> unsigned idxType:2; /* 1==UNIQUE, 2==PRIMARY KEY, 0==CREATE INDEX */
> - unsigned bUnordered:1; /* Use this index for == or IN queries only */
> - unsigned noSkipScan:1; /* Do not try to use skip-scan if true */
> - int nSample; /* Number of elements in aSample[] */
> - int nSampleCol; /* Size of IndexSample.anEq[] and so on */
> - tRowcnt *aAvgEq; /* Average nEq values for keys not in aSample */
> - IndexSample *aSample; /* Samples of the left-most key */
> - tRowcnt *aiRowEst; /* Non-logarithmic stat1 data for this index */
> };
> +/**
> + * default_tuple_est[] array contains default information
> + * which is used when we don't have real space, e.g. temporary
> + * objects representing result set of nested SELECT or VIEW.
> + *
> + * First number is supposed to contain the number of elements
> + * in the index. Since we do not know, guess 1 million.
> + * Second one is an estimate of the number of rows in the
> + * table that match any particular value of the first column of
> + * the index. Third one is an estimate of the number of
> + * rows that match any particular combination of the first 2
> + * columns of the index. And so on. It must always be true:
> + *
> + * default_tuple_est[N] <= default_tuple_est[N-1]
> + * default_tuple_est[N] >= 1
> + *
> + * Apart from that, we have little to go on besides intuition
> + * as to how default values should be initialized. The numbers
> + * generated here are based on typical values found in actual
> + * indices.
> + */
> +extern const log_est default_tuple_est[];
15. Why is it extern? It looks like it is used only in analyze.c and
can be static within that file.
> +
> +/**
> + * Fetch statistics concerning tuples to be selected.
> + * If there is no appropriate Tarantool's index,
> + * return one of default values.
> + *
> + * @param idx Index.
> + * @param field Number of field to be examined.
> + * @retval Estimate logarithm of tuples selected by given field.
16. All tuples? Or only those unique by this field?
> diff --git a/src/box/sql/where.c b/src/box/sql/where.c
> index e6f9ef431..975e7059f 100644
> --- a/src/box/sql/where.c
> +++ b/src/box/sql/where.c
> @@ -977,15 +986,15 @@ whereKeyStats(Parse * pParse, /* Database connection */
>
> pRec->nField = n;
> res =
> - sqlite3VdbeRecordCompareMsgpack(aSample[iSamp].n,
> - aSample[iSamp].p, pRec);
> + sqlite3VdbeRecordCompareMsgpack(samples[iSamp].key_size,
> + samples[iSamp].sample_key,
17. Let's make sample_key have the char * type instead of void *. It is common
practice for raw MessagePack in Tarantool.