[Tarantool-patches] [PATCH 13/15] sql: introduce cache for prepared statemets

Tue Nov 12 12:30:38 MSK 2019

On Tuesday, November 12, 2019 11:50:52 AM MSK Konstantin Osipov wrote:
> * Georgy Kirichenko <georgy at tarantool.org> [19/11/12 11:03]:
> > > > value. So the new proposal is:
> > > A back of the envelope calculation shows that if you have 40
> > > nodes, 32 cores, 100 connections on each instance, it gives you
> > > 
> > > 40*32*100*5mb = 625 GB of prepared statement cache!
> > 
> > Why did you make this fake up?
> > 1. why did you pack instances so dense?
> 
> What you should have done is looked at the worst case.
> 
> The case above is not the worst. The desnity doesn't matter much
> - halve it and you get 300GB, even 50G for a cache on 40 nodes
> is too much.
> 
> > 2. why did you use 5mb instead of 1-2-5kb, for instance.
> 
> That's the default cache size. And it's a small size, too, a
> single compiled AST for a SELECT can easily take 16-64kb (measure
> it!), so it's a cache for ~100-200 statements.
I do not talk about cache size. Which cache, why cache? I consider only a 
prepared statement resource footprint and count of prepared statements on a 
single instance. There is no cache yet.
> 
> > 4. why did you count connections because the only thing is matter is count
> > of unique sql statements.
> 
> With a per-session cache, you need to count connections. With a
> global cache the numbers are not bad.
We will have more than one session per connection, please keep it in mind. All 
statements in a session would execute sequentially otherwise there is no 
transaction consistency in a session boundaries. 
> 
> > 4. why did you use the word `cache`.
> > Prepared statements and statement cache are the completely different
> > substances. Obviously, if somebody prepared a statement then they wish the
> > statement to do not be evicted from the prepared statements list.
> 
> This is some kind of 1980's myth. 
And it does not mean that this approach is bad.
> The reason everybody does it
> this way is that virtually all databases that support prepared
> statements are vertically scalable process-per-connection.
> 
> Otherwise, the approach has no benefits. In particular, there is
> no value in making the user explicitly deallocate a prepared
> statement. It's a pain to not be able to re-use a prepared
> statement handle across connections.
Your approach suggested there is no more prepared statements because there is 
no difference between prepared and unprepared except its argument marshaling 
which could be done using client library. Because simple statement also should 
be put into a statement cache. And if I prepared a statement I definitely want 
my prepared statement still persist without any dependencies how many other 
statement were executed.  
Anyway, in case on reconnect a session should be reestablished so there is no 
big deal to reprepare all statement. And it is no a big deal to iterate over 
session prepared statement maps to decrement all corresponding statement 
reference counts.
Also my approach allows to reuse a statement prepared by other session because 
they both would share the prepared statement reference (sql source hash).
Additionally, it is a very strange opinion that a client should be able to 
reuse prepared statement event in cases when a server crashes or a client 
initiates a connection to another server or event cluster. Otherwise there is 
no difference between prepared statements and simple unprepared statements.
> 
> There are some risks in having a global transparent cache, too:
> i.e. one has to watch out for security issues, if the same
> statement can be reused across connections with different
> credentials. But these risks do not materialize in case of
> tarantool.
Wrong, please keep in mind multisession connection, vshard and proxy.
Also a prepared statement sql could contain session-local variables. 
> 
> > > If you make the cache opaque to the client you don't have
> > > this mess with explicit referencing and dereferencing from
> > > sessions.
> > 
> > User should explicitly unprepare any statement they prepared before. Or it
> > would be done by on_session_stop trigger. As we do not want to evict a
> > prepared statement from prepared statement list, really.
> 
> There is no reason for this, except copying someone else's design.
> You're just making users do more work, and doing more work on the
> server side on disconnect.
Amount of work I suggest to do is much less in comparison with a good 
statement cache (simple LRU is completely not enough), statement repreparation 
if it is unknown or evicted, cache statistic, global cache sharing and access 
checking. My approach is straightforward and predictable.
> 
> > Yes, and it is the feature requirement - if a statement was prepared then
> > it should be prepared until user did not request the statement unprepare.
> There is no such requirement.
> 
> > So I would like to explain my own vision of the feature:
> > 1. Each instance has its own prepared statements table from which a
> > statement vm could be fetched using the statement key (sql source string
> > hash, for instance). Also the table may contain some meta: list of
> > accessed objects (to prevent us from access checking on each invocation),
> > count of invocation, sql source, reference counter.
> > 2. Each session has a limit which limits the amount of resources, count of
> > prepared statement for instance.
> 
> So users have double trouble: not only they need to explicitly
> deallocate prepare, they also need to be able to handle an
> out-of-resources error, such as
> out-of-prepared-statements-cache-memory.
Yes, and it is completely Ok.
> 
> This "vision" happens to copy all the bugs of Oracle.
You compare my approach with the most popular business database, thanks a lot.
> 
> - connect/disconnect is slow.
Please estimate costs of disconnect in terms of network pings, sql planning 
and wal latency (the only thing the server should do is to decrement around 
100-200 integers)
> - users have to do more work
Please estimate amount of work in comparison with users business logic 
implementation
> - there is a new kind of error - out of statement cache - which
I would like repeat: we do not talk about statement cache.
>   one has to be prepared for. There is no way to work this error
>   around on application side
> - the server side still has to handle concurrent execution
>   of the same prepared statement, the protocol allows it today,
>   so the implementaiton either has to ban it or create a copy on
>   the fly (i.e. implicitly prepare another vdbe anyway).
I do not see how did you make it up. Any prepared statement implementation 
should be ably to handle many prepared statement executions in parallel.
The other thing my approach suggests: vdbe should be able executed in 
parallel, but this limitation is applicable anyway.
> 
> The only benefit of explicit allocate/deallocate is that there is
> some kind of guarantee that the statement is not going to get
> parsed again when it is executed. This is very little value: I'm
> not aware of cases when this is needed, and one can create a view instead.
Please, make a proof.
> 
> > 3. Each session contain the list of prepared statement identifiers used by
> > this session.
> > 4. Statement cache could be used by the feature but it is not essential.
> > At
> > least it relaxes the feature implementation complexity - we should not
> > implement a good statement cache, eviction politics and other stuff like
> > this right now.
> > 5. If a client loses their session (crash, network failure) then it should
> > reestablish a session and reprepare all statement it wish to use. It is
> > more simply to implement in comparison with any caching, eviction,
> > tracking and other things you suggested.
> 
> No, it's not simple to implement, certainly not simpler than a
> global cache with its own opaque eviction strategy. The simplest
> implementation of the global cache is not caching anything at all
> and preparing before every execute - which is good enough for JDBC
> and ODBC.
Well, so you suggested not to implement prepared statements.
> 
> > 6. Right now a connection could contain only one session and it is not a
> > stop- factor - the only thing we should keep in mind - prepared statement
> > relates to a session not to a connection. Then, when we will implement
> > streams and multiple sessions per connection, we should not do any more
> > changes to support prepared statements in this circumstances.
> 
> If you suggest that a single session can execute only one
> statement concurrently, you should be explicit about it - and it's
> a huge limitation.
Please make a distinction between connection and session, we already discussed 
it some month before.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191112/5dc0e027/attachment.sig>