Re: Poor plan choice in prepared statement

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "david(at)lang(dot)hm" <david(at)lang(dot)hm>
Cc: Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, bricklen <bricklen(at)gmail(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Poor plan choice in prepared statement
Date: 2009-01-30 19:41:24
Message-ID: C5A897E4.2182%scott@richrelevance.com
Lists: pgsql-performance

> david(at)lang(dot)hm writes:
>> the poster who started this thread had a query where the parsing phase
>> took significantly longer than the planning stage.
>
> That was an anecdote utterly unsupported by evidence.
>
> regards, tom lane

The issue of prepared statements having atrocious query plans has hit me again. I feel very strongly about this topic and about the need for Postgres to have a workable, user-friendly option that lets prepared statements re-plan based on their inputs. Pardon my bluntness, but in the current situation the system is brain dead in many important use cases and data sets.

I believe the statement referenced above was about parse time relative to the remaining time, not parsing compared to planning. Either way, it's a minor detail, and it isn't needed to justify the enhancement requested here.
Yes, it's anecdotal from your perspective. Go ahead and ignore all of that if you wish.
**** I am making several points in this message that are independent of such evidence, IMHO.
I have rearranged this message so that the anecdotal narrative comes at the end, after the first dashed line.

Unnamed prepared statements do solve much of the problem in theory, since the most common issue is poor execution plans rather than expensive parsing; the other common motivation for prepared statements is dealing cleanly with SQL injection and writing less bug-prone client code. Parsing being very expensive is rarer. But there IS a performance savings, not insignificant for many workloads, to be had by avoiding the utterly avoidable re-parsing.

HOWEVER:
What is overlooked WRT unnamed prepared statements is that they are hard to use, and changing client code or behavior is difficult, error prone, and sometimes impossible. Not all client APIs play nicely with them at the moment (see Postgres' JDBC driver). There are some global tweaks to the behavior, but these are useless in the many situations where you need behavior that varies.

Every time the answer to a problem is "change the client behavior", I ask myself whether the DB could have a better default or a configuration parameter so that clients don't have to change. Some database instances have dozens of applications and hundreds of client types, including ad-hoc usage. Some client code is legacy code that simply can't be changed.
Changing the clients is a lot harder than changing a db parameter or configuring a new default for a particular db user. If the client must change, adding something like SET prepare_level = 'parse_only' is the least intrusive and the easiest to test -- but I stress again that in many real-world cases the client is not flexible.
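To make that concrete, here is a sketch of what I have in mind. prepare_level is entirely hypothetical (it does not exist today), and legacy_app is a made-up role name; ALTER ROLE ... SET is the existing mechanism for per-user defaults, which is exactly why a GUC-style control would be so administrable:

    -- Hypothetical sketch only; prepare_level is not an existing setting.
    -- A DBA could fix one application's behavior with no client changes:
    ALTER ROLE legacy_app SET prepare_level = 'parse_only';
    -- Or a flexible client could opt in per session:
    SET prepare_level = 'parse_only';  -- cache parse trees, re-plan each EXECUTE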

A session-level parameter that controls prepared statement behavior defaults (never cache by default? parse cache only? parse cache and plan cache?) would be a blessing. A DBA could administer a fix to certain problems without having to force clients to change behavior or wait for new client API versions with fixes.

That reminds me: isn't there a global parameter that can force prepared statements never to be cached? Does that make them all behave as if they were unnamed, or are they simply re-created each time? I believe I tried it in the past, and the prepared statements were still unable to use the parameter values for partition selection, which suggests the latter.
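At minimum you can see what a session has cached: the pg_prepared_statements system view (a real view, present since 8.2) lists the named prepared statements in the current session. As far as I can tell, the unnamed protocol-level statement does not appear there, which is part of what makes it hard to inspect:

    -- Real view (8.2+): named prepared statements in this session.
    SELECT name, statement, prepare_time FROM pg_prepared_statements;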

Typically, I run into the issue with queries in a back-end process that operates on large data sets or partitioned tables. Prepared statements essentially kill performance there by several orders of magnitude (think: scan 1 partition versus scan 5000 partitions), because the generic plan is built before the parameter values are known; a sketch follows.
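To illustrate (the events table and its date-ranged child partitions are made up, but this mirrors what I see with constraint_exclusion on):

    -- With a literal, constraint exclusion prunes to the one child table:
    EXPLAIN SELECT * FROM events WHERE event_date = DATE '2009-01-15';
    -- With a prepared statement, the plan is built before $1 is known,
    -- so nothing is pruned and every child table is scanned:
    PREPARE ev(date) AS SELECT * FROM events WHERE event_date = $1;
    EXPLAIN EXECUTE ev(DATE '2009-01-15');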
My recent issue, however, is brutally simple. I need an auto-complete web form, and thus a query with a portion like
WHERE name LIKE 'userenteredtext%'
Thus, I make an index with varchar_pattern_ops and off we go! ... Or not. It works fine with literal queries, but not with a prepared query. Unfortunately, building queries from user-entered strings is highly prone to SQL injection, and the industry-standard way to deal with that is parameterization.
http://www.owasp.org/index.php/Guide_to_SQL_Injection
(That site is a wealth of information, tools, and tests on the topic; for example: http://www.owasp.org/index.php/Testing_for_SQL_Injection).
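A minimal sketch of the conflict (table and index names are illustrative):

    -- The pattern-ops index supports LIKE with a left-anchored literal:
    CREATE INDEX people_name_pat_idx ON people (name varchar_pattern_ops);
    EXPLAIN SELECT name FROM people WHERE name LIKE 'smi%';   -- index scan
    -- Prepared, the pattern is a parameter. At plan time nothing
    -- guarantees it is left-anchored, so the generic plan cannot use
    -- the index:
    PREPARE ac(text) AS SELECT name FROM people WHERE name LIKE $1;
    EXPLAIN EXECUTE ac('smi%');                               -- seq scan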

Prepared statements are a blessing from a client-code perspective, preventing all sorts of bugs and catching many others early, on the client side. Not being able to use them because they cause the database to choose very bad query plans is a flaw in the database, not the client.

------------------------------------------------------------------------------------
Unnamed prepared statements did not work for me when I tried them as a solution (on 8.3.2, supposedly after the fix). I was in a hurry to fix my issue and just moved on when they plainly were not working; it is possible I did something wrong back then. They are also poorly supported by many client APIs -- when they did not work for me, I assumed a JDBC issue or perhaps user error, but maybe it was server side? The documentation on both sides is not entirely clear about what should really happen, and I could not figure out how to debug where the problem was. Can you even test an unnamed prepared statement in psql?
I had no time to waste, so I changed a lot of the client code instead, and was less interested afterward until the topic came up in this thread -- and then, after letting this message sit as a draft for a month, I ran into the "varchar_pattern_ops + prepared statement = index? what index?" issue.

--------
On queries that are overly parse-heavy:

It is very difficult to tease apart parse time from plan time in Postgres (generally speaking, getting performance data out of Postgres is much harder than with the commercial DBs I've used). However, my experience with commercial databases that keep running counters of time spent in various operations is that parsing can be upwards of 50% of total CPU usage in some workloads. Decision-support work, where many lightweight queries are executed, is where I have seen that with Oracle. In Oracle you can find out exactly how much CPU time was spent parsing versus planning versus everything else, in aggregate over a time interval. I don't suspect that Postgres has a parser that is orders of magnitude faster, but it is true that I have no direct evidence teasing apart plan time from parse time to share right now.
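For what it's worth, Postgres does have one crude instrument here. The per-phase statistics settings (real settings, superuser-only I believe) write rusage timings for each phase of each statement to the server log; it is per-statement output rather than an aggregate counter, but it does separate parsing from planning:

    -- Each setting logs CPU timings for its phase, per statement:
    SET log_parser_stats = on;    -- parse, parse analysis, rewrite
    SET log_planner_stats = on;   -- planner
    SET log_executor_stats = on;  -- executor
    -- Run the query in question, then read the timings from the log.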

So yes, unnamed prepared statements are potentially part (but not all) of a solution, provided there is good control over their use in client APIs (which there isn't, and won't be, since they are non-standard) or a useful, non-global way to configure them.

Planning takes long on most queries that take long to parse, but this varies. (We have several queries that access tables with only primary-key indexes or no indexes at all, containing lots of embedded 'stuff' that is naturally more parser-heavy than planner-heavy -- CASE expressions, and value constraints/checks/modifiers on columns that have no indexes and are not in the WHERE clause -- and some are roughly 2k-character behemoths.)

-----
On Performance Improvements gained by avoiding prepared statements:

I gained about a factor of 5x to 50x in performance by changing a lot of code to avoid prepared statements. Some of the worst cases involved partitioned tables, where planning issues are particularly troublesome. (Let's scan 1000 tables for no good reason! Hooray!)

I never posted here about my struggles with prepared statement execution performance; I knew it wasn't going to change, and I had to fix the problem in a few days' time. One of the queries changed had about 7 sub-selects in it, but the eventual plan can be very fast for some inputs, returning almost no data, and we run this query repeatedly with slightly different parameters. So for some parameter combinations the execution time dominates by orders of magnitude, and for most combinations execution is almost none of the total. Of course, now we just write a new query for each of the variants, or else the performance is unacceptable. This was not too difficult a change, because these queries are not a SQL injection worry, though it has complicated the client code and the test cases.

---
Being able to avoid these problems and let programmers use prepared statements would be a good thing, so I think a solution more usable and flexible than the current unnamed prepared statements would be great. And if the avoidable, redundant re-parsing can be eliminated too, it's win-win.

Being able to cut parsing out of the loop in many cases to improve performance should stand on its own as a legitimate improvement. If parsing is permanently bound to planning, it is permanently bound to significant caveats. Decoupling the two, and providing a means of control over them WRT prepared statements and related features, has much merit IMO.
