Re: Cached/global query plans, autopreparation

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Cached/global query plans, autopreparation
Date: 2018-02-15 08:20:00
Message-ID: 18f5ba5b-5d76-7e48-886c-1206a1135c00@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13.02.2018 20:13, Shay Rojansky wrote:
> Hi all,
>
> Was wondering if anyone has a reaction to my email below about
> statement preparation, was it too long? :)
>
> (and sorry for top-posting)
>
> On Tue, Feb 6, 2018 at 9:27 PM, Shay Rojansky <roji(at)roji(dot)org
> <mailto:roji(at)roji(dot)org>> wrote:
>
> Hi all.
>
> Various versions of having PostgreSQL caching and/or autopreparing
> statement plans have been discussed
> (https://www.postgresql.org/message-id/op.t9ggb3wacigqcu%40apollo13.peufeu.com
> <https://www.postgresql.org/message-id/op.t9ggb3wacigqcu%40apollo13.peufeu.com>,
> https://www.postgresql.org/message-id/8e76d8fc-8b8c-14bd-d4d1-e9cf193a74f5%40postgrespro.ru
> <https://www.postgresql.org/message-id/8e76d8fc-8b8c-14bd-d4d1-e9cf193a74f5%40postgrespro.ru>),
> without clear conclusions or even an agreement on what might be
> worthwhile to implement. I wanted to bring this up again from a
> PostgreSQL driver maintainer's perspective (I'm the owner of
> Npgsql, the open source .NET driver), apologies in advance if I'm
> repeating things or I've missed crucial information. Below I'll
> describe three relevant issues and what I've done to deal with them.
>
> When the same statement is rerun, preparing it has a very
> significant performance boost. However, in short-lived connection
> scenarios it's frequently not possible to benefit from this -
> think of a typical webapp which allocates a connection from a
> pool, run a query and then return the connection. To make sure
> prepared statements are used, Npgsql's connection pool doesn't
> send DISCARD ALL when a connection is returned (to avoid wiping
> out the connections), and maintains an internal table mapping SQL
> (and parameter types) to a PostgreSQL statement name. The next
> time the application attempts to prepare the same SQL, the
> prepared statement is found in the table and no preparation needs
> to occur. This means that prepared statements persist across
> pooled connection open/close, and are never discarded unless the
> user uses a specific API. While this works, the disadvantages are
> that:
> 1. This kind of mechanism needs to be implemented again and again,
> in each driver:
> 2. It relies on Npgsql's internal pooling, which can track
> persistent prepared statements on physical connections. If an
> external pool is used (e.g. pgpool), this isn't really possible.
> 1. It complicates resetting the session state (instead of DISCARD
> ALL, a combination of all other reset commands except DEALLOCATE
> ALL needs be sent). This is minor.
>
> The second issue is that many applications don't work directly
> against the database API (ADO.NET <http://ADO.NET> in .NET, JDBC
> in Java). If any sort of O/RM or additional layer is used, there's
> a good chance that that layer doesn't prepare in any way, and
> indeed hide your access to the database API's preparation method.
> Two examples from the .NET world is dapper (a very popular
> micro-O/RM) and Entity Framework. In order to provide the best
> possible performance in these scenarios, Npgsql has an opt-in
> feature whereby it tracks how many times a given statement was
> executed, and once it passes a certain threshold automatically
> prepares it. An LRU cache is then used to determine which prepared
> statements to discard, to avoid explosion. In effect, statement
> auto-preparation is implemented in the driver. I know that the
> JDBC driver also implements such a mechanism (it was actually the
> inspiration for the Npgsql feature). The issues with this are:
>
> 1. As above, this has to be implemented by every driver (and is
> quite complex to do well)
> 2. There's a possible missed opportunity in having a single plan
> on the server, as each connection has its own (the "global plan"
> option). Many apps out there send the same statements across many
> connections so this seems relevant - but I don't know if the gains
> outweigh the contention impact in PostgreSQL.
>
> Finally, since quite a few (most?) other databases include
> autopreparation (SQL Server, Oracle...), users porting their
> applications - which don't explicitly prepare - experience a big
> performance drop. It can rightly be said that porting an
> application across databases isn't a trivial task and that
> adjustments need to be made, but from experience I can say that
> PostgreSQL is losing quite a few users to this.
>
> The above issues could be helped by having PostgreSQL cache on its
> side (especially the second issue, which is the most important).
> Ideally, any adopted solution would be transparent and not require
> any modification to applications. It would also not impact
> explicitly-prepared statements in any way.
>
> Note that I'm not arguing for any specific implementation on the
> PostgreSQL side (e.g. global or not), but just describing a need
> and hoping to restart a conversation that will lead somewhere.
>
> (and thanks for reading this overly long message!)
>
> Shay
>
>

I am an author of one of the proposal (autoprepare which is in commit
fest now), but I think that sooner or later Postgres has to come to
solution with shared DB caches/prepared plans.
Please correct me if I am wrong, but it seems to me that most of all
other top DBMSes having something like this.
Such decision can provide a lot of different advantages:
1. Better memory utilization: no need to store the same data N times
where N is number of backends and spend time for warming cache.
2. Optimizer can spend more time choosing better plan which then can be
used by all clients. Even now time of compilation of some queries
several times exceeds time of their execution.
3. It is simpler to add facilities for query plan tuning and maintaining
(storing, comparing,...)
4. It make is possible to control size of memory used by caches. Right
now catalog cache for DB with hundred thousands and tables and indexes
multiplied by hundreds of backends can consume terabytes of memory.
5. Shared caches can simplify invalidation mechanism.
6. Almost all enterprise systems working with Postgres has to use some
kind of connection pooling (pgbouncer, pgpool,...). It almost exclude
possibility to use prepared statements. Which can slow down performance
up to two times.

There is just one (but very important) problem which needs to be solved:
access to shared cache should be synchronized.
But there are a lot of other shared resources in Postgres (procarray,
shared buffers,...). So  I do not think that it is unsolvable problem
and that it can cause degrade of performance.

So it seems to be obvious that shared caches/plans can provide a lot of
advantages. But it is still not clear to me the value of this advantages
for real customers.
Using -M prepared  protocol in pgbench workload can improve speed up to
two times. But I have asked real Postgres users in Avito, Yandex,
MyOffice and them told me
that on their workloads advantage of prepared statements is about 10%.
10% performance improvement is definitely not a good compensation for
rewriting substantial part of Postgres core...

Another aspect is that Java, .Net and other languages has their own
mechanism for preparing statements. I still do not think that the
question whether to prepare statement or not can be solved just at API
level, without interaction of database engine. Language binding can
compare costs of generic and specialized plans, shared prepared
statements between all database connections,... But I think that it is
more natural and efficient to implement this logic in one place, rather
than try to reimplement it several times for different APIs.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2018-02-15 08:59:46 Re: JIT compiling with LLVM v10.1
Previous Message Amit Langote 2018-02-15 07:25:54 Re: FOR EACH ROW triggers on partitioned tables