[Tarantool-patches] [PATCH 13/15] sql: introduce cache for prepared statemets

Tue Nov 12 11:50:52 MSK 2019

* Georgy Kirichenko <georgy at tarantool.org> [19/11/12 11:03]:
> > > value. So the new proposal is:
> > A back of the envelope calculation shows that if you have 40
> > nodes, 32 cores, 100 connections on each instance, it gives you
> > 
> > 40*32*100*5mb = 625 GB of prepared statement cache!
> Why did you make this fake up?
> 1. why did you pack instances so dense?

What you should have done is looked at the worst case. 

The case above is not the worst. The desnity doesn't matter much 
- halve it and you get 300GB, even 50G for a cache on 40 nodes 
is too much.

> 2. why did you use 5mb instead of 1-2-5kb, for instance.

That's the default cache size. And it's a small size, too, a
single compiled AST for a SELECT can easily take 16-64kb (measure
it!), so it's a cache for ~100-200 statements.

> 4. why did you count connections because the only thing is matter is count of 
> unique sql statements.

With a per-session cache, you need to count connections. With a
global cache the numbers are not bad.

> 4. why did you use the word `cache`.
> Prepared statements and statement cache are the completely different 
> substances. Obviously, if somebody prepared a statement then they wish the 
> statement to do not be evicted from the prepared statements list.

This is some kind of 1980's myth. The reason everybody does it
this way is that virtually all databases that support prepared
statements are vertically scalable process-per-connection.

Otherwise, the approach has no benefits. In particular, there is
no value in making the user explicitly deallocate a prepared
statement. It's a pain to not be able to re-use a prepared
statement handle across connections. 

There are some risks in having a global transparent cache, too:
i.e. one has to watch out for security issues, if the same
statement can be reused across connections with different
credentials. But these risks do not materialize in case of
tarantool.

> > 
> > If you make the cache opaque to the client you don't have
> > this mess with explicit referencing and dereferencing from
> > sessions.
> User should explicitly unprepare any statement they prepared before. Or it 
> would be done by on_session_stop trigger. As we do not want to evict a 
> prepared statement from prepared statement list, really.

There is no reason for this, except copying someone else's design.
You're just making users do more work, and doing more work on the
server side on disconnect.

> Yes, and it is the feature requirement - if a statement was prepared then it 
> should be prepared until user did not request the statement unprepare.

There is no such requirement. 

> So I would like to explain my own vision of the feature:
> 1. Each instance has its own prepared statements table from which a statement 
> vm could be fetched using the statement key (sql source string hash, for 
> instance). Also the table may contain some meta: list of accessed objects (to 
> prevent us from access checking on each invocation), count of invocation, sql 
> source, reference counter.
> 2. Each session has a limit which limits the amount of resources, count of 
> prepared statement for instance.

So users have double trouble: not only they need to explicitly
deallocate prepare, they also need to be able to handle an
out-of-resources error, such as
out-of-prepared-statements-cache-memory. 

This "vision" happens to copy all the bugs of Oracle. 

- connect/disconnect is slow. 
- users have to do more work
- there is a new kind of error - out of statement cache - which
  one has to be prepared for. There is no way to work this error
  around on application side
- the server side still has to handle concurrent execution 
  of the same prepared statement, the protocol allows it today,
  so the implementaiton either has to ban it or create a copy on
  the fly (i.e. implicitly prepare another vdbe anyway).

The only benefit of explicit allocate/deallocate is that there is
some kind of guarantee that the statement is not going to get
parsed again when it is executed. This is very little value: I'm
not aware of cases when this is needed, and one can create a view instead. 

> 3. Each session contain the list of prepared statement identifiers used by this 
> session.
> 4. Statement cache could be used by the feature but it is not essential. At 
> least it relaxes the feature implementation complexity - we should not 
> implement a good statement cache, eviction politics and other stuff like this 
> right now.
> 5. If a client loses their session (crash, network failure) then it should 
> reestablish a session and reprepare all statement it wish to use. It is more 
> simply to implement in comparison with any caching, eviction, tracking and 
> other things you suggested.

No, it's not simple to implement, certainly not simpler than a
global cache with its own opaque eviction strategy. The simplest
implementation of the global cache is not caching anything at all
and preparing before every execute - which is good enough for JDBC
and ODBC.

> 6. Right now a connection could contain only one session and it is not a stop-
> factor - the only thing we should keep in mind - prepared statement relates to 
> a session not to a connection. Then, when we will implement streams and 
> multiple sessions per connection, we should not do any more changes to support 
> prepared statements in this circumstances.

If you suggest that a single session can execute only one
statement concurrently, you should be explicit about it - and it's
a huge limitation.

-- 
Konstantin Osipov, Moscow, Russia