[tarantool-patches] Re: [PATCH 4/4] sql: move SQL statistics to server

Tue Apr 24 15:51:55 MSK 2018

Hello. The patch is great, but see 17 comments below %)

Most of them are minor except one. The key idea is to
use region allocator to build statistics on analyze. See
details below.

On 23/04/2018 23:29, Nikita Pettik wrote:
> SQLite provides kind of statistics concerning data holding in each index
> and space. It is used to help query planer build optimized plan. Such
> statistics don't depend on internal index implementation (e.g. B-tree or
> LSM tree) and contain information concerning tuple distribution.
> This patch moves members resbonsible statistics from original SQLite
> structs to Tarantool ones. To be more precise, now index's opts contains
> statistics from _stat1 space (arrays of integers and their loarithm

1. resbonsible -> responsible, loarithm -> logarithm.

> approximation) and more sophisticated one from _stat4 (array of samples).
> It also contains set of flags to turn on/off different optimizations
> (such as skip-scan).
> 
> After ANALYZE command is executed, statistics are saved into _sql_stat1
> and _sql_stat4 spaces (in order to persist them). However, it should also
> be loaded into in-memory structs representing indexes. This patch
> reworks original SQLite routine to load it directly to newly introduced
> stat struct of index.
> 
> Closes #3253
> ---
>   src/box/index_def.c     |  92 ++++++
>   src/box/index_def.h     |  90 ++++++
>   src/box/sql/analyze.c   | 742 +++++++++++++++++++++++-------------------------
>   src/box/sql/build.c     |  61 ----
>   src/box/sql/parse.y     |   2 -
>   src/box/sql/pragma.c    |   3 +-
>   src/box/sql/prepare.c   |   7 +-
>   src/box/sql/sqliteInt.h |  72 +++--
>   src/box/sql/vdbe.c      |   5 +-
>   src/box/sql/where.c     | 190 ++++++++-----
>   src/box/sql/whereexpr.c |   4 +-
>   11 files changed, 716 insertions(+), 552 deletions(-)
> 
> diff --git a/src/box/index_def.c b/src/box/index_def.c
> index eda39a622..14ff94192 100644
> --- a/src/box/index_def.c
> +++ b/src/box/index_def.c
> @@ -154,9 +157,98 @@ index_def_dup(const struct index_def *def)
>   			return NULL;
>   		}
>   	}
> +	if (def->opts.stat != NULL) {
> +		dup->opts.stat = malloc(sizeof(*dup->opts.stat));

2. Lets reduce count of mallocs and allocate all on the single
huge malloc. It eliminates index_stat_destroy_samples, and memory
fragmentation.

See how it is done for space_def. You must define an index_stat_sizeof
function, that takes what is needed to calculate summary size,
calculates offsets on different parts of the result memory etc.

I know, that it is built in load_stat_from_space, but seems like I
found a solution using region. See the comments below.

> diff --git a/src/box/index_def.h b/src/box/index_def.h
> index 6167570b7..52e9a0548 100644
> --- a/src/box/index_def.h
> +++ b/src/box/index_def.h
> @@ -113,6 +181,27 @@ index_opts_create(struct index_opts *opts)
>   	*opts = index_opts_default;
>   }
>   
> +/**
> + * Release memory allocated for samples, statistics they include,
> + * sample keys and avg_eq array.
> + */
> +void
> +index_stat_destroy_samples(struct index_stat *stat);
> +
> +/**
> + * Destroy index stat.
> + */
> +static inline void
> +destroy_stat(struct index_stat *stat) {

3. It will be removed after review fixes, but why you did not
just inlined it in a single usage place?

> +	if (stat == NULL)
> +		return;
> +	free(stat->tuple_stat1);
> +	free(stat->tuple_log_est);
> +	index_stat_destroy_samples(stat);
> +	TRASH(stat);
> +	free(stat);
> +}
> +
>   /**
>    * Destroy index options
>    */
> @@ -120,6 +209,7 @@ static inline void
>   index_opts_destroy(struct index_opts *opts)
>   {
>   	free(opts->sql);
> +	destroy_stat(opts->stat);

4. Will be turned into just free().

> diff --git a/src/box/sql/analyze.c b/src/box/sql/analyze.c
> index 4965e899a..14dfbd551 100644
> --- a/src/box/sql/analyze.c
> +++ b/src/box/sql/analyze.c
> @@ -104,8 +104,8 @@
>    * large nEq values.
>    *
>    */
> -#ifndef SQLITE_OMIT_ANALYZE

5. The definition still exists in mkkeywordhash.c

> @@ -1217,49 +1217,46 @@ struct analysisInfo {
>   	sqlite3 *db;
>   };
>   
> -/*
> - * The first argument points to a nul-terminated string containing a
> - * list of space separated integers. Read the first nOut of these into
> - * the array aOut[].
> +/**
> + * The first argument points to a nul-terminated string
> + * containing a list of space separated integers. Load
> + * the first stat_size of these into the output arrays.
> + *
> + * @param stat_string String containing array of integers.
> + * @param stat_size Size of output arrays.
> + * @param[out] stat_exact Decoded array of statistics.
> + * @param[out] stat_log Decoded array of stat logariphms.
> + * @param[out] index Handle extra flags for this index,
> + *                   if not NULL.
>    */
>   static void
> -decodeIntArray(char *zIntArray,	/* String containing int array to decode */
> -	       int nOut,	/* Number of slots in aOut[] */
> -	       tRowcnt * aOut,	/* Store integers here */
> -	       LogEst * aLog,	/* Or, if aOut==0, here */
> -	       Index * pIndex	/* Handle extra flags for this index, if not NULL */
> -    )
> -{
> -	char *z = zIntArray;
> -	int c;
> -	int i;
> -	tRowcnt v;
> -
> -	if (z == 0)
> +decode_stat_string(const char *stat_string, int stat_size, tRowcnt *stat_exact,
> +		   LogEst *stat_log, struct index *index) {

6. Please, create a ticket to store stat_string in a stat table as map or array of stat
values instead of string. It allows to extremely simplify all of this pure absolute shit
using raw MessagePack array of integers iteration.

> +	char *z = (char *) stat_string;
> +	if (z == NULL)
>   		z = "";
> -
> -	for (i = 0; *z && i < nOut; i++) {
> -		v = 0;
> +	for (int i = 0; *z && i < stat_size; i++) {
> +		tRowcnt v = 0;
> +		int c;
>   		while ((c = z[0]) >= '0' && c <= '9') {
>   			v = v * 10 + c - '0';
>   			z++;
>   		}

7. sscanf?

> -		if (aOut)
> -			aOut[i] = v;
> -		if (aLog)
> -			aLog[i] = sqlite3LogEst(v);
> +		if (stat_exact != NULL)
> +			stat_exact[i] = v;
> +		if (stat_log != NULL)
> +			stat_log[i] = sqlite3LogEst(v);
>   		if (*z == ' ')
>   			z++;
>   	}
> -
> -	if (pIndex) {
> -		pIndex->bUnordered = 0;
> -		pIndex->noSkipScan = 0;
> +	if (index != NULL) {

8. As I can see, you use this function 3 times with index == NULL and 1
time with not NULL. Lets just inline these things where the index is not NULL,
and remove the index argument.

> @@ -1269,369 +1266,374 @@ decodeIntArray(char *zIntArray,	/* String containing int array to decode */
>   	}
>   }
>   
> -/*
> +/**
>    * This callback is invoked once for each index when reading the
>    * _sql_stat1 table.
>    *
>    *     argv[0] = name of the table
>    *     argv[1] = name of the index (might be NULL)
> - *     argv[2] = results of analysis - on integer for each column
> + *     argv[2] = results of analysis - array of integers
> + *
> + * Entries for which argv[1]==NULL simply record the number
> + * of rows in the table. This routine also allocates memory
> + * for stat struct itself and statistics which are not related
> + * to stat4 samples.
>    *
> - * Entries for which argv[1]==NULL simply record the number of rows in
> - * the table.
> + * @param data Additional analysis info.
> + * @param argc Number of arguments.
> + * @param argv Callback arguments.
> + * @retval 0 on success, -1 otherwise.
>    */
>   static int
> -analysisLoader(void *pData, int argc, char **argv, char **NotUsed)
> +analysis_loader(void *data, int argc, char **argv, char **unused)
>   {
> -	analysisInfo *pInfo = (analysisInfo *) pData;
> -	Index *pIndex;
> -	Table *pTable;
> -	const char *z;
> -
>   	assert(argc == 3);
> -	UNUSED_PARAMETER2(NotUsed, argc);
> -
> -	if (argv == 0 || argv[0] == 0 || argv[2] == 0) {
> +	UNUSED_PARAMETER(data);
> +	UNUSED_PARAMETER2(unused, argc);
> +	if (argv == 0 || argv[0] == 0 || argv[2] == 0)
>   		return 0;
> -	}
> -	pTable = sqlite3HashFind(&pInfo->db->pSchema->tblHash, argv[0]);
> -	if (pTable == NULL)
> +	uint32_t space_id = box_space_id_by_name(argv[0], strlen(argv[0]));
> +	assert(space_id != BOX_ID_NIL);
> +	struct space *space = space_by_id(space_id);
> +	assert(space != NULL);
> +	struct index *index;
> +	if (space == NULL)
>   		return 0;
> -	if (argv[1] == 0) {
> -		pIndex = 0;
> +	if (argv[1] == NULL) {
> +		index = NULL;

9. Lets return 0 here and remove 'if index != NULL' below. The code will
look more compact.

>   	} else if (sqlite3_stricmp(argv[0], argv[1]) == 0) {
> -		pIndex = sqlite3PrimaryKeyIndex(pTable);
> +		index = space_index(space, 0);
> +		assert(index != NULL);
>   	} else {
> -		pIndex = sqlite3HashFind(&pTable->idxHash, argv[1]);
> +		uint32_t iid = box_index_id_by_name(space_id, argv[1],
> +						    strlen(argv[1]));
> +		assert(iid != BOX_ID_NIL);
> +		index = space_index(space, iid);
>   	}
> -	z = argv[2];
> -
> -	if (pIndex) {
> -		tRowcnt *aiRowEst = 0;
> -		int nCol = index_column_count(pIndex) + 1;
> -		/* Index.aiRowEst may already be set here if there are duplicate
> -		 * _sql_stat1 entries for this index. In that case just clobber
> -		 * the old data with the new instead of allocating a new array.
> +	if (index != NULL) {
> +		/*
> +		 * Additional field is used to describe total
> +		 * count of tuples in index. Although now all
> +		 * indexes feature the same number of tuples,
> +		 * partial indexes are going to be implemented
> +		 * someday.
>   		 */
> -		if (pIndex->aiRowEst == 0) {
> -			pIndex->aiRowEst =
> -			    (tRowcnt *) sqlite3MallocZero(sizeof(tRowcnt) *
> -							  nCol);
> -			if (pIndex->aiRowEst == 0)
> -				sqlite3OomFault(pInfo->db);
> +		int column_count = index->def->key_def->part_count + 1;
> +		/*
> +		 * Stat arrays may already be set here if there
> +		 * are duplicate _sql_stat1 entries for this
> +		 * index. In that case just clobber the old data
> +		 * with the new instead of allocating a new array.
> +		 */
> +		struct index_stat *stat = index->def->opts.stat;
> +		if (stat == NULL) {
> +			stat = calloc(1, sizeof(*stat));
> +			if (stat == NULL) {
> +				diag_set(OutOfMemory, sizeof(*stat), "calloc",
> +					 "stat");
> +				return -1;
> +			}
> +			index->def->opts.stat = stat;
>   		}
> -		aiRowEst = pIndex->aiRowEst;
> -		pIndex->bUnordered = 0;
> -		decodeIntArray((char *)z, nCol, aiRowEst, pIndex->aiRowLogEst,
> -			       pIndex);
> -	}
> -
> -	return 0;
> -}
> -
> -/*
> - * If the Index.aSample variable is not NULL, delete the aSample[] array
> - * and its contents.
> - */
> -void
> -sqlite3DeleteIndexSamples(sqlite3 * db, Index * pIdx)
> -{
> -	if (pIdx->aSample) {
> -		int j;
> -		for (j = 0; j < pIdx->nSample; j++) {
> -			IndexSample *p = &pIdx->aSample[j];
> -			sqlite3DbFree(db, p->p);
> +		if (stat->tuple_stat1 == NULL) {

10. Can it be NULL, if index->def->opts.stat was not NULL above? And as I can see,
if analysis loader fails not on the first call, then the statistics is updated
partially. I propose to make it atomic. For instance, you can create for each index
a new index_stat object, fill it with new stat, and on success replace the old one.
See below how to do it simple.

> -/*
> +/**
>    * Load the content from the _sql_stat4 table
> - * into the relevant Index.aSample[] arrays.
> + * into the relevant index->stat->samples[] arrays.
> + *
> + * Arguments must point to SQL statements that return
> + * data equivalent to the following:
>    *
> - * Arguments zSql1 and zSql2 must point to SQL statements that return
> - * data equivalent to the following (statements are different for stat4,
> - * see the caller of this function for details):
> + *    prepare: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx;
> + *    load: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4;
>    *
> - *    zSql1: SELECT tbl,idx,count(*) FROM _sql_stat4 GROUP BY tbl,idx
> - *    zSql2: SELECT tbl,idx,neq,nlt,ndlt,sample FROM _sql_stat4
> + * 'prepare' statement is used to allocate enough memory for
> + * statistics (i.e. arrays lt, dt, dlt and avg_eq). 'load' query
> + * is needed for
>    *
> + * @param db Database handler.
> + * @param sql_select_prepare SELECT statement, see above.
> + * @param sql_select_load SELECT statement, see above.
> + * @retval SQLITE_OK on success, smth else otherwise.
>    */
>   static int
> -loadStatTbl(sqlite3 * db,	/* Database handle */
> -	    Table * pStatTab,	/* Stat table to load to */
> -	    const char *zSql1,	/* SQL statement 1 (see above) */
> -	    const char *zSql2	/* SQL statement 2 (see above) */
> -    )
> +load_stat_from_space(struct sqlite3 *db, const char *sql_select_prepare,
> +		     const char *sql_select_load)

11. As I can see, it is the core function of the patchset. And now it has
two major problems:
* partial statistics update on OOM or a non-first prepare() fail - new samples
   are allocated, but not filled, for example, due to malloc() error on the next
   index processing;
* leaks on opts.stat->samples = malloc(alloc_size); - you do not delete
   the old memory.

These and analysis_loader problems can be solved using region. I propose
to use in this function and analysis_loader region allocations only.

The array of new index_stat structures is created on a region in
sql_analysis_load. Analysis loader takes this array via
void *data parameter, and fills index_stats with more region allocations
for tuple_stat1, tuple_log_est etc.

Then load_stat_from_space takes the same array and process its index_stat
instead of original stats in structs index.

At the end of load_stat_from_space you are sure, that all statistics for
all indexes are completely full, and replace original index stats in
structs index using index_stat_dup and its super malloc.

Region can be truncated right before return from sql_analysis_load.

> @@ -1646,84 +1648,52 @@ sql_space_tuple_log_count(struct Table *tab)
>   	return sqlite3LogEst(pk->vtab->size(pk));
>   }
>   
> -/*
> - * Load the content of the _sql_stat1 and sql_stat4 tables. The
> - * contents of _sql_stat1 are used to populate the Index.aiRowEst[]
> -*\* arrays. The contents of sql_stat4 are used to populate the
> - * Index.aSample[] arrays.
> - *
> - * If the _sql_stat1 table is not present in the database, SQLITE_ERROR
> - * is returned. In this case, even if the sql_stat4 table is present,
> - * no data is read from it.
> - *
> - * If the _sql_stat4 table is not present in the database, SQLITE_ERROR
> - * is returned. However, in this case, data is read from the _sql_stat1
> - * table (if it is present) before returning.
> - *
> - * If an OOM error occurs, this function always sets db->mallocFailed.
> - * This means if the caller does not care about other errors, the return
> - * code may be ignored.
> - */
> -int
> -sqlite3AnalysisLoad(sqlite3 * db)
> +log_est

12. Lets name this type log_est_t.

> +index_field_tuple_est(struct Index *idx, uint32_t field)
>   {
> -	analysisInfo sInfo;
> -	HashElem *i, *j;
> -	char *zSql;
> -	int rc = SQLITE_OK;
> -
> -	/* Clear any prior statistics */
> -	for (j = sqliteHashFirst(&db->pSchema->tblHash); j;
> -	     j = sqliteHashNext(j)) {
> -		Table *pTab = sqliteHashData(j);
> -		for (i = sqliteHashFirst(&pTab->idxHash); i;
> -		     i = sqliteHashNext(i)) {
> -			Index *pIdx = sqliteHashData(i);
> -			pIdx->aiRowLogEst[0] = 0;
> -			sqlite3DeleteIndexSamples(db, pIdx);
> -			pIdx->aSample = 0;
> -		}
> +	struct space *space = space_by_id(SQLITE_PAGENO_TO_SPACEID(idx->tnum));
> +	if (space == NULL)
> +		return idx->aiRowLogEst[field];
> +	struct index *tnt_idx =
> +		space_index(space, SQLITE_PAGENO_TO_INDEXID(idx->tnum));
> +	assert(tnt_idx != NULL);
> +	assert(field <= tnt_idx->def->key_def->part_count);
> +	if (tnt_idx->def->opts.stat == NULL) {
> +		if (field == 0)

13. Why? Field 0 is an ordinary field number. And why do you return 0, if
a field is equal to part_count and the index is unique?

> diff --git a/src/box/sql/sqliteInt.h b/src/box/sql/sqliteInt.h
> index d026a213d..c2ea066d6 100644
> --- a/src/box/sql/sqliteInt.h
> +++ b/src/box/sql/sqliteInt.h
> @@ -1473,7 +1473,6 @@ typedef struct FuncDef FuncDef;
>   typedef struct FuncDefHash FuncDefHash;
>   typedef struct IdList IdList;
>   typedef struct Index Index;
> -typedef struct IndexSample IndexSample;

14. I still see IndexSample mention in analyze.c on 553 line.
> @@ -2173,14 +2171,41 @@ struct Index {
>   				 * or _NONE
>   				 */
>   	unsigned idxType:2;	/* 1==UNIQUE, 2==PRIMARY KEY, 0==CREATE INDEX */
> -	unsigned bUnordered:1;	/* Use this index for == or IN queries only */
> -	unsigned noSkipScan:1;	/* Do not try to use skip-scan if true */
> -	int nSample;		/* Number of elements in aSample[] */
> -	int nSampleCol;		/* Size of IndexSample.anEq[] and so on */
> -	tRowcnt *aAvgEq;	/* Average nEq values for keys not in aSample */
> -	IndexSample *aSample;	/* Samples of the left-most key */
> -	tRowcnt *aiRowEst;	/* Non-logarithmic stat1 data for this index */
>   };
> +/**
> + * default_tuple_est[] array contains default information
> + * which is used when we don't have real space, e.g. temporary
> + * objects representing result set of nested SELECT or VIEW.
> + *
> + * First number is supposed to contain the number of elements
> + * in the index. Since we do not know, guess 1 million.
> + * Second one is an estimate of the number of rows in the
> + * table that match any particular value of the first column of
> + * the index. Third one is an estimate of the number of
> + * rows that match any particular combination of the first 2
> + * columns of the index. And so on. It must always be true:
> + *
> + *           default_tuple_est[N] <= default_tuple_est[N-1]
> + *           default_tuple_est[N] >= 1
> + *
> + * Apart from that, we have little to go on besides intuition
> + * as to how default values should be initialized. The numbers
> + * generated here are based on typical values found in actual
> + * indices.
> + */
> +extern const log_est default_tuple_est[];

15. Why is it extern? Looks like it is used in analyze.c only, and
can be static within the file.

> +
> +/**
> + * Fetch statistics concerning tuples to be selected.
> + * If there is no appropriate Tarantool's index,
> + * return one of default values.
> + *
> + * @param idx Index.
> + * @param field Number of field to be examined.
> + * @retval Estimate logarithm of tuples selected by given field.

16. All tuples? Or unique by this field?
> diff --git a/src/box/sql/where.c b/src/box/sql/where.c
> index e6f9ef431..975e7059f 100644
> --- a/src/box/sql/where.c
> +++ b/src/box/sql/where.c
> @@ -977,15 +986,15 @@ whereKeyStats(Parse * pParse,	/* Database connection */
>   
>   		pRec->nField = n;
>   		res =
> -		    sqlite3VdbeRecordCompareMsgpack(aSample[iSamp].n,
> -						    aSample[iSamp].p, pRec);
> +		    sqlite3VdbeRecordCompareMsgpack(samples[iSamp].key_size,
> +						    samples[iSamp].sample_key,

17. Lets make sample_key has char * type instead of void *. It is common
practice for raw MessagePack in Tarantool.