[Tarantool-discussions] SQL built-in functions position

Thu Oct 1 17:46:31 MSK 2020

Hello,

My comments/answers inlined.

On 28 сен 22:07, Vladislav Shpilevoy wrote:
> Hi!
> 
> On 27.09.2020 17:18, Mergen Imeev wrote:
> > Hi all. I have a question that I would like to discuss.
> > 
> > The question is about SQL built-in functions. At the moment these functions are
> > partially described in _func and partially in src/box/sql/func.c. I received two
> > completely different suggestions from my reviewers on what to do with these
> > functions:
> > 1) Move definitions completely to _func. Remove definitions from func.c.
> > 2) Move definitions completely to func.c. Remove definitions from _func.
> > 
> > In the first case, users will be able to see the function definitions. Also, in
> > the future, we may allow these functions to be called from Lua (although not
> > sure if this is necessary). The main idea is 'all functions have the same
> > interface'.
> > 
> > In the second case, the implementation is simpler, and we can more easily
> > implement some features, such as "virtual" functions. For users, the definition
> > can only be seen in the documentation. The main idea is 'SQL built-in functions
> > are part of SQL'.
> > 
> > Which of these approaches do you think is more beneficial to us?
> 
> I am the one who proposed the second option. I proposed it because of
> several reasons.
> 
> ==================================================
> ## Reason 1
> 
> The implementation is ugly by design. Since SQL functions are strictly typed,
> and some of them may take more than one type, we were forced to implement some
> kind of function overload by mangling function names. We are not trying to
> implement C++ here, it goes against the _func basic schema.

We are implementing SQL built-ins here and that is what syntax dictates. Originally,
wheren we were removing hard code it was intentded to make both languages
interoperable. I see no reason, why _func should exclusively belong to Lua.

We are free to extend _func schema.

> To workaround that there were added a new ugly option: is_overloaded. Overloaded
> means that there is still one function but declared twice, just to check types.

This option is useless since we might implement stable mangler, which won't change
a name if there're no types of args and returns are specified.

> For example, LENGTH and LENGTH_VARBINARY, TRIM, TRIM_VARBINARY, and so on. That
> leads to an issue, that we have not existing functions in _func. For example:
> 
> 	tarantool> box.space._func.index.name:select({'LENGTH_VARBINARY'})
> 	---
> 	- - [68, 1, 'LENGTH_VARBINARY', 1, 'SQL_BUILTIN', '', 'function', ['varbinary'], 'integer',
> 	    'none', 'none', true, false, true, ['SQL'], {'is_overloaded': true}, '', '2020-08-14
> 	      16:27:52', '2020-08-14 16:27:52']
> 	...
> 
> 	tarantool> box.execute("SELECT LENGTH_VARBINARY('abc')")
> 	---
> 	- null
> 	- Function 'LENGTH_VARBINARY' does not exist
> 	...
> 
> Doesn't this look bad? That is -1 point to the argument about 'it is good to
> have function visible in _func'. They are visible, but it does not mean anything.

_func is a service space and it is unusual to use it that way. Just imagine, you're
doing `nm` on some C++ object and trying to invoke mangled names from there. We should
deal with it anyway, since user might create she's own function and there'll be no
way to hardcode it.

> ====================
> ## Reason 2
> 
> SQL has vararg functions - the ones which take unknown number of arguments. That
> led to addition of one another ugly option: has_vararg. When it is true, all
> arguments after last declared argument are forced to have the same type as the
> last argument. This looks wrong from all possible sides.

That is the way all compilers work. They add hidden attribute to function declaration
which manifests it has va_arg in it.

> Firstly, if a function has variable argument count, why the hell should they all
> be of the same type? It is a silly restriction motivated entirely by SQL
> specifics like SUM() taking all arguments of the same type, or LEAST(), or GREATEST()
> and so on. That is very-SQL thing. I can't imagine when such a restriction may
> be needed in other supported languages.

They won't. va_arg must be typeless. Type of actual values must be checked inside
given routine. Just look inside printf().

> Secondly, like with the option 'is_overloaded' it complicates output of _func and
> makes it harder for a user to create a function, when he sees so many irrelevant
> options, needed for SQL only. Just take a look at the _func output in the end of
> this email to see how bad is it [1].

Complication of the output of select() from service space is not an argument at all.

> Thirdly, it simply does not work for functions which are really vararg, and not
> some kind of aggregates, ironically. For example, PRINTF(). To workaround the
> workaround-option has_vararg such functions simply turn off their type checks by
> making the last argument SCALAR. Well, this looks really bad. It would become a
> bit simpler if we would split that option in two - has_vararg to make an
> equivalent of ... in C with values of any type, and is_aggregate to check all
> values have the same type. But still ugly.

See first comment in this chapter.

> ====================
> ## Reason 3
> 
> SQL built-in functions are a part of the language. When we store them separately,
> we risk to get SQL implementation different from the built-in SQL functions schema.
> For example, we will add a new built-in function, and a user will upgrade, but
> won't be able to use it until he upgrades the schema, and we will need to support
> that in SQL implementation - that some built-in functions actually may not exist.

I think this is a good idea to upgrade before use _new_ functionality.

> ====================
> ## Reason 4
> 
> Some of the functions are supposed to be used for aggregated values, and this makes
> their implementation not reusable in other languages. That in turn makes their presence
> in the common _func space, used for all functions, irrelevant. I am talking about
> SUM(), COUNT(), AVG(), probably there are more.

For now, error should be emitted, but in future nothing blocks us from enabling
such aggregates, e.g. for arrays.

> ================================================================================
> Now talking of the points I received about how good would it be to have these
> functions in _func.
> 
> ====================
> ## Built-in functions require privileges? - No.
> 
> Cite from Peter's email:
> 
> 	The current built-in functions are harmless, except perhaps for RANDOMBLOB with a huge value.
> 	However, in future maybe there will be built-in functions that should require privileges.
> 	In that case, I believe, they will have to be in _func (and someday in _vfunc)
> 	so that grant() will work for them. 
> 
> Cite from Nikita's email:
> 
> 	allows to grant and verify priveleges in the same way and so forth
> 
> What are examples of such built-in functions? Built-in functions are a part
> of the language. Restricting access to SQL built-ins is like restricting access
> to 'next()', 'string.upper()', 'os.time()' and other functions in Lua. Or
> restricting usage of 'operator+' in C++. They are a part of the language. If
> something needs restrictions, it does not seem to be a part of the language.
> If a function is sandboxed, so it does not affect other users, I don't see why
> would it need restrictions. All built-ins are sandboxed.
> 
> ====================
> ## Built-in functions prevent duplicates in _func? - No.
> 
> Cite from Nikita's email:
> 
> 	It makes name collisions check simple
> 
> This is a matter of a few code lines to add that check to _func triggers
> to check new function name in all languages, if necessary. Or we could just make
> the functions hash do the checking. So we won't even need to involve SQL-specific
> code.
> 
> That is by the way questionable. I would think more if we want to forbid
> definition of functions clashing with built-ins. That would be a 'legal' overload,
> if we would look into _func first, and then into built-ins, but I don't really
> know about that. Didn't think of it properly yet.
> 
> ====================
> ## Storage in _func does not change _func schema and documentation? - No.
> 
> Cite from Peter's email:
> 
> 	I did not document in the manual's SQL section that built-in functions will
> 	be in _func, so removing them is not a regression from documented behaviour. 
> 
> It is good that their presence in _func is not documented. Because even if
> it would, and we would go for _func extension, we would change _func schema
> and options anyway, because of the new options 'is_overloaded' and 'has_vararg',
> and new fake functions with _VARBINARY suffix.
> 
> Cite from Nikita's email:
> 
> 	Built-ins are already declaraed in _func, so reverting this thing would
> 	result in another one unnecessary schema change and upgrade (so I doubt that
> 	implementation would be somehow 'simpler')
> 
> This is also wrong. Both versions change _func. But the version with extending
> _func makes it bigger, adds new ugly SQL-specific options, and adds not existing
> functions with fake names. In the version about functions-as-part-of-language _func
> is cleared from these functions, and nothing more happens with the schema. So the
> patch will be simpler for sure, even though may get bigger due to more removals.
> 
> Besides, their removal from _func would also allow to get rid of the crutch with
> language 'SQL_BUILTIN' in _func. It will be present in C, but won't exist in _func.
> This is not a language. Languages are SQL and Lua. SQL_BUILTIN was a crutch from
> the beginning, to make the functions work anyhow.

I'd like to make call mechanism for SQL built-ins and use-defined routines the same.

> ====================
> ## Users benefit from seeing SQL-specific functions in _func? - No.
> 
> Look at [1]. The output format is hardly readable. Not only because of the
> new options (partially because of them), but also because 1) some functions
> here don't really exist - LENGTH_VARBINARY, POSITION_VARBINARY, TRIM_VARBINARY,
> SUBSTR_VARBINARY, 2) because of lots of other fields of _func, which make
> the output super hard to read and understand what is what, even now.
> 
> _func is a dead end if someone wants to see function definitions. For that it
> would be better to introduce a new pretty printer, somewhere in box.func maybe.

That is strange argument (and I guess it is duplicate of one ofprevious). _func
is a service space. It should be read with care.

> ====================
> ## Storage in _func unifies functions? - _func output is unified, yes. Other things - no.
> 
> That is the only unification. Code still is going to separate them. By
> introduction of the new _func options you just move the complexity from a
> private place related to SQL only to the public space, related to the whole
> _func and other languages.

We should end up with the same mechanisms.

> ====================
> ## Reuse SQL functions in Lua and other languages? - No.
> 
> Cite from Nikita's email:
> 
> 	Finally part of functions can turn out to be really
> 	usefull in Lua someday such as date()/time()
> 
> It is not a secret, that all the languages we support already have
> date/time functions. In Lua these are 'os.date', 'os.time', 'fiber.clock',
> 'fiber.time' and more. In C these are all the standard C functions such
> as 'time()', 'gettimeofdat()', 'clock_gettime()', 'clock()', and more. Also
> lots of functions for time formatting, but I don't remember their names.
> 
> So what exactly are the SQL built-in functions so much needed in Lua
> and C? Looking at the unreadable _func output below, I can't imagine
> why somebody need any of these functions out of SQL.

Third time about non-readable output? Well, ok. It is unreadable. But that
is not an argument.

That said. I see we have 3 voices against 1.

--
Regards, Kirill Yukhin