Re: Avoiding bad prepared-statement plans.

From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alex Hunsaker <badalex(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeroen Vermeulen <jtv(at)xs4all(dot)nl>, Greg Stark <gsstark(at)mit(dot)edu>, Bart Samwel <bart(at)samwel(dot)tk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Avoiding bad prepared-statement plans.
Date: 2010-02-26 08:13:31
Message-ID: 4B8782AB.4050504@mark.mielke.cc
Lists: pgsql-hackers

My preference is to deal with the specific value vs generic value issue.

This issue can affect performance even if PREPARE/EXECUTE is executed
exactly once.

In the last case I saw, a certain query was executing once every second,
and with a specific value it would take < 1 ms, and with a generic value
it would take > 50 ms. That's 5% system load for one CPU core to do
nothing. After analysis, it was clearly a "common value" vs "not common
value" problem. For this particular table, it stored an integer, but
only used two values across something like 100k rows. The query was for
a third value that did not exist. The difference was a sequential scan
vs an index lookup.
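
A minimal sketch of the situation, with hypothetical table and column
names (the real schema differed):

  CREATE TABLE events (kind integer, payload text);
  INSERT INTO events              -- ~100k rows, only two distinct kinds
    SELECT CASE WHEN i % 2 = 0 THEN 1 ELSE 2 END, 'x'
    FROM generate_series(1, 100000) AS i;
  CREATE INDEX events_kind_idx ON events (kind);
  ANALYZE events;

  -- Specific value: the planner sees that kind = 3 matches ~0 rows.
  EXPLAIN SELECT * FROM events WHERE kind = 3;
  --  -> Index Scan using events_kind_idx  (< 1 ms)

  -- Generic value: planned without the value, the planner assumes an
  -- "average" kind, which is estimated to match about half the table.
  PREPARE q(integer) AS SELECT * FROM events WHERE kind = $1;
  EXPLAIN EXECUTE q(3);
  --  -> Seq Scan on events  (> 50 ms)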

I do not know whether the application was doing PREPARE/EXECUTE each
time, or whether it was doing PREPARE once in advance and then EXECUTE
each time after that, but I don't think it matters, either, as I think
both cases deserve attention, and the problem is the same in both cases.
Even one run of the generic plan costs 50+ times the combined cost of
planning and execution with the specific value.
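
For concreteness, the two usage patterns, re-using the hypothetical
events table above:

  -- Pattern A: prepare and execute every time:
  PREPARE q(integer) AS SELECT * FROM events WHERE kind = $1;
  EXECUTE q(3);
  DEALLOCATE q;

  -- Pattern B: prepare once up front ...
  PREPARE q(integer) AS SELECT * FROM events WHERE kind = $1;
  -- ... then, once per second:
  EXECUTE q(3);

  -- Either way, EXECUTE runs the generic plan chosen without the value.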

Re-planning a generic plan only to get another generic plan may yield
zero benefit, at a measurable cost. More on this below.

All the points about measuring milliseconds seem invalid to me. There
are many reasons why execution time could increase, and many of them
have nothing to do with plan efficiency. Re-planning because execution
time is high, or because the ratio of actual to expected time is high,
does not indicate that re-planning will produce a better plan. The
planning process neither measures nor predicts execution time.

My idea of an optimal system is as follows:

1) Prepare gathers and caches data about the tables involved in the
query, including column statistics that are likely to be required during
the planning process, but prepare does not run the planning process.

2) Execute runs the planning process re-using data cached by prepare,
and then executes the plan.

3) Advanced: Execute may cache the selected plan for re-use, but only if
it can identify a set of criteria that allow the plan to be tested and
invalidated when the nature of the parameters has changed enough that
re-planning would likely choose another plan. Execute may cache multiple
plans against a prepared statement, provided that each cached plan
identifies its invalidation criteria.

4) Even more advanced: Prepare may identify elements of the plan that
will always be the same, no matter what parameters are specified, and
cache these results for substitution into the planning phase when
execute is run. (Effectively lifting that planning from execute to
prepare, but only where it makes obvious [= cheap to detect] sense.)

This treats statement planning and execution as a pipeline, lengthening
the pipeline and shifting some of its stages from prepare to execute.
It has the benefit of fast prepare/execute whether execute is invoked
only once or many times. The effect is that all statements are planned
with specific values, but specific plans are re-used wherever possible.
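
To make points 1 and 3 concrete, here is how a session might behave
under this scheme, re-using the hypothetical events table from the
earlier sketch. The plan choices and the MCV-based invalidation
criterion are illustrative, not current PostgreSQL behavior:

  -- Point 1: table data and column statistics cached; no plan yet.
  PREPARE q(integer) AS SELECT * FROM events WHERE kind = $1;

  EXECUTE q(3);  -- plans with kind = 3; caches the Index Scan plan
                 -- with the criterion "parameter not in the MCV list"
  EXECUTE q(5);  -- 5 also satisfies the criterion: cached plan re-used
  EXECUTE q(1);  -- 1 is a most-common value, the criterion fails:
                 -- re-plan, and cache a second (Seq Scan) plan whose
                 -- criterion is "parameter is in the MCV list"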

To support the case of changing data, I think the analyze process should
be able to force invalidation of cached plans, and force the cached
column statistics for prepared statements to be invalidated and
re-queried on demand, or push new statistics directly into the prepared
statements. It makes no sense (to me) to re-plan for the same parameters
until an analyze is done, so this tells me that analyze is the event
that should cause the re-plan to occur.
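
Again purely illustrative of the proposed behavior:

  EXECUTE q(3);    -- cached plan re-used; no re-planning
  ANALYZE events;  -- invalidates cached plans and cached column
                   -- statistics for statements touching events
  EXECUTE q(3);    -- re-planned here, with fresh statistics, rather
                   -- than on some execution-time heuristic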

I think anything less than the above will increase the performance of
some queries while decreasing the performance of others. It might be
possible to guess which queries are more valuable to people than others,
and hard code solutions for these specific queries, but hard-coded
solutions will probably always be a "lowest hanging fruit" solution.

After writing this, I'm pretty sure that implementation of the above
into PostgreSQL would be difficult, and it could be a valid concern that
the investment is not worth the benefit at this time. It's a tough problem.

My $0.01 CDN. :-)

Cheers,
mark
