A costing analysis tool

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: A costing analysis tool
Date: 2005-10-12 23:28:47
Message-ID: s34d55ef.014@gwmta.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'm looking at trying to fix some clear flaws in costing which cause
of our real-world queries to choose sub-optimal plans under PostgreSQL.
It's clear that there needs to be a tool to analyze the accuracy of
costing for a variety of queries, both to direct any efforts to fix
problems and to test for possible costing regressions. As far as I can
tell, no such tool currently exists. If I've missed something, please
let me know, even if it's ad hoc or incomplete.

Note that I'm talking about a tool strictly to check the accuracy of
the estimated costs of plans chosen by the planner, nothing else.

I'm at the rough planning stage, and would appreciate any feedback on
my thoughts before I start actual development. Considerations, in no
particular order (but numbered for easy reference):

(1) Most of the time, on a well-configured database, PostgreSQL
chooses a plan which performs very well. Many test cases need to cover
these normal cases, both to set a baseline and to catch any regression.

(2) A large database must be created for these tests, since many
issues don't show up in small tables. The same data must be generated
in every database, so results are comparable and reproducable.

(3) Developers should be able to easily add test cases, either for
their own use or contributed to the community.

(4) The same test query should be able to easily run in different
permutations of the following:

(a) With no data cached before the run, or as fully cached as possible.
(b) With various enable_xxx settings on or off.
(c) With or without significant dead space in the tables/indexes.

(5) The tool needs to be able to run in a variety of OS environments.
At a minimum, this means some way to pick up configuration information
to specify how to start and stop the back end, and how to flush the
system cache.

(6) The relative costs of various plans shifts dramatically when C
asserts are enabled. To avoid misleading results, the tool should warn
the user when run on a build configured with --enable-cassert, and all
results from such an environment should be conspicuously identified as
such.

(7) I envision a process to create a test database, populate it, run a
series of test cases with EXPLAIN ANALYZE, capture the results, parse
the results and store them in a database, analyze the results to find
means and standard deviations both overall and for each type of plan,
and report summaries and outliers -- with references to the test cases.
The primary statistic of interest is actual time divided by cost. This
seems like it would be of interest overall, and within the permutations
mentioned above for a single query.

A reasonable set of test cases would illuminate where costing
adjustments would provide the most benefit for the least risk. By
reining in the most extreme outliers, we could allow the planner to
recognize the best plan among the options it is considering with
greater accuracy. This tool could be used as a regression test to
ensure that a costing adjustment didn't distort the cost of something
which had been accurately costed.

So, what do you think?

-Kevin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Neil Conway 2005-10-12 23:36:30 Re: pgsql: Do all accesses to shared buffer headers
Previous Message Kevin Grittner 2005-10-12 23:23:46 Re: Are cost estimates based on asserts?