[tarantool-patches] Re: [PATCH 2/2] sql: make aggregate functions types more strict

Wed Apr 17 16:19:16 MSK 2019

> On 17 Apr 2019, at 15:50, i.koptelov <ivan.koptelov at tarantool.org> wrote:
> 
> Thank you for the comments! I agree with all your suggestions and will
> send fixes a little later. For now I have one thing to discuss.
> 
>> On 5 Apr 2019, at 22:48, Konstantin Osipov <kostja at tarantool.org> wrote:
>> 
>> * Ivan Koptelov <ivan.koptelov at tarantool.org> [19/04/05 18:02]:
>> 
>> Besides, I guess you can get rid of this check for most common
>> cases - averaging a column of the same type - so this is perhaps
>> better to make a separate opcode, not part of the main opcode, and
>> emit only when we're not sure the type is going to be the same
>> across all values. I don't know how hard this is to do, however -
>> perhaps should be moved into a separate patch, but I'd guess
>> detecting that the aggregate function argument has a non-mutable
>> type is not hard. 
>> 
>> -- 
>> Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
>> http://tarantool.io - www.twitter.com/kostja_osipov
> 
> I do not quite understand why you use the word ‘opcode’.
> Functions are implemented as C code.

I guess Konstantin meant adding an opcode that implements the
logic of a primitive aggregate function. We operate in
terms of opcodes because of our abstraction model: it is divided
into a compilation stage and a runtime stage. To invoke a ‘C function’, we
need a corresponding opcode that does the initial preparation,
passes arguments to the function, saves results to the given registers, etc.

The original SQLite routines that implement aggregates handle the
most general case (any function, any types, any number of arguments, etc.).
We can significantly simplify these procedures for several quite common cases.
For instance, SELECT sum(a) FROM t;

Within one opcode, we can open an iterator, fetch the values and quickly evaluate
the sum (without involving OP_Next, OP_Column etc).
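
To make the idea concrete, here is a rough sketch of such a fused opcode
body in plain C. All names in it (struct toy_iterator, toy_iter_next,
op_sum_column) are hypothetical and exist neither in Tarantool nor in
SQLite; an array stands in for the real space iterator, and NULL and
overflow handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a space iterator: walks an array of ints. */
struct toy_iterator {
	const int64_t *values;
	size_t count;
	size_t pos;
};

/* Fetch the next value; returns 0 when the iterator is exhausted. */
static int
toy_iter_next(struct toy_iterator *it, int64_t *out)
{
	if (it->pos >= it->count)
		return 0;
	*out = it->values[it->pos++];
	return 1;
}

/*
 * Hypothetical fused opcode body for SELECT sum(a) FROM t: the whole
 * loop runs inside one handler, so the VM does not have to dispatch
 * OP_Next, OP_Column and the like for every row.
 * Assumption: overflow and NULL handling are omitted for brevity.
 */
static int64_t
op_sum_column(struct toy_iterator *it)
{
	assert(it != NULL);
	int64_t sum = 0;
	int64_t value;
	while (toy_iter_next(it, &value))
		sum += value;
	return sum;
}
```

The point is only that the per-row work shrinks to one fetch and one
addition; whether this pays off in the real VM has to be measured.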

> Considering your suggestion (one ‘opcode’ for simple cases, another one
> - for complex) I want to do the following:
> 1) Add a bunch of INTERNAL functions, for example max_number, max_text and max_scalar.
> max_number and max_text would not have excess type checks, while max_scalar would
> have all necessary type checks. So a bunch of INTERNAL functions would implement one EXTERNAL function
> (just max() in this example).

Anyway, at some point you have to dispatch the function call depending
on the runtime value:

switch (mem->type) {
	case MEM_Str:
		return max_str(mem);
	/* ... and so on for the other types. */
}

Otherwise, we have to add a separate opcode for each
(aggregate, type) pair, which in turn would decrease performance.

Nevertheless, I like this approach. For instance, it is used in PostgreSQL:
for each operation (like ‘+’) they have a function with the corresponding argument
types. When you try to add incompatible values (like a string and a double), there’s
no function implementing addition with char * and double arguments, so an
error like ‘operator does not exist: text * integer’ is raised. What is more, such an
approach is quite suitable for JITing. On the other hand, we should carefully
benchmark changes like this. I’m not sure whether the priority of this task is high
or low.
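
The lookup described above can be sketched roughly as follows. This is
not PostgreSQL code: the type enum, the table and find_add_impl() are
made up for illustration, and numeric arguments are assumed to be
already widened to double.

```c
#include <stddef.h>

/* Hypothetical value types, for illustration only. */
enum toy_type { TOY_INTEGER, TOY_DOUBLE, TOY_TEXT };

typedef double (*add_impl_f)(double, double);

static double
add_numbers(double a, double b)
{
	return a + b;
}

/* One entry per supported (left type, right type) pair. */
struct add_entry {
	enum toy_type left;
	enum toy_type right;
	add_impl_f impl;
};

static const struct add_entry add_table[] = {
	{ TOY_INTEGER, TOY_INTEGER, add_numbers },
	{ TOY_INTEGER, TOY_DOUBLE,  add_numbers },
	{ TOY_DOUBLE,  TOY_INTEGER, add_numbers },
	{ TOY_DOUBLE,  TOY_DOUBLE,  add_numbers },
	/* No entry with TOY_TEXT: such an addition does not exist. */
};

/*
 * Find the function implementing '+' for the given argument types.
 * NULL means there is no such function, i.e. the caller should raise
 * an 'operator does not exist' kind of error.
 */
static add_impl_f
find_add_impl(enum toy_type left, enum toy_type right)
{
	size_t n = sizeof(add_table) / sizeof(add_table[0]);
	for (size_t i = 0; i < n; i++) {
		if (add_table[i].left == left && add_table[i].right == right)
			return add_table[i].impl;
	}
	return NULL;
}
```

The type error falls out of the lookup itself, so the implementing
functions need no type checks of their own.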

> 2) At runtime, determine the proper INTERNAL function (max_number, max_text or max_scalar) to implement
> the given function. It would be done only once (not on every step of the aggregate function) using
> information about the column type.
> 
> For example:
> SELECT MAX(b) FROM test_table;
> 
> If test_table.b has TEXT type we would use max_text. If test_table.b has SCALAR type
> we would use max_scalar.
> 
> If this question seems too ‘low-level’ to you, I can just send the code
> for the next review round.
> 
>
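
A rough sketch of how I read step 2), just to check that we mean the
same thing. The names max_number, max_text and max_scalar are taken
from the proposal; the signatures, struct toy_value, the enums and
choose_max_impl() are hypothetical. Treating a mixed pair in
max_scalar as an error is also only an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical declared column types and runtime value kinds. */
enum toy_field_type { TOY_FIELD_NUMBER, TOY_FIELD_TEXT, TOY_FIELD_SCALAR };
enum toy_value_kind { TOY_VALUE_NUMBER, TOY_VALUE_TEXT };

struct toy_value {
	enum toy_value_kind kind;
	double num;
	const char *str;
};

/* Returns the larger of two values, or NULL on a type error. */
typedef const struct toy_value *(*max_step_f)(const struct toy_value *,
					      const struct toy_value *);

/* Trusts the column type: no per-step type check. */
static const struct toy_value *
max_number(const struct toy_value *a, const struct toy_value *b)
{
	return a->num >= b->num ? a : b;
}

/* Trusts the column type: no per-step type check. */
static const struct toy_value *
max_text(const struct toy_value *a, const struct toy_value *b)
{
	return strcmp(a->str, b->str) >= 0 ? a : b;
}

/* SCALAR column: the kind has to be checked on every step. */
static const struct toy_value *
max_scalar(const struct toy_value *a, const struct toy_value *b)
{
	if (a->kind != b->kind)
		return NULL; /* Assumption: a mixed pair is an error. */
	return a->kind == TOY_VALUE_NUMBER ? max_number(a, b) : max_text(a, b);
}

/* Called once per query; the result is then reused for every row. */
static max_step_f
choose_max_impl(enum toy_field_type column_type)
{
	switch (column_type) {
	case TOY_FIELD_NUMBER:
		return max_number;
	case TOY_FIELD_TEXT:
		return max_text;
	default:
		return max_scalar;
	}
}
```

If this matches what you have in mind, the runtime dispatch I mentioned
above happens once per query instead of once per row.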