Re: Correlation in cost_index()

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Correlation in cost_index()
Date: 2003-08-08 18:06:56
Message-ID: 20030808180656.GL94710@perrin.int.nxad.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > Hrm, after an hour of searching and reading, I think one of the
> > better papers on the subject can be found here:
> > http://www.cs.ust.hk/faculty/dimitris/PAPERS/TKDE-NNmodels.pdf
>
> Interesting paper, but I don't see the connection to index order
> correlation?

Nothing that I found was nearly that specific, as close as I could
find was the paper above on calculating the cost of fetching data from
a disk, which I thought was the bigger problem at hand, but I
digress...

In one paper about large dimension index searches, they did suggest
that cost was cumulative for the number of disk reads or nodes in the
tree that weren't held in cache, which was the biggest hint that I had
found on this specific topic. With that as a guiding light (or
something faintly resembling it), it'd seem as though an avg depth of
nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
would be closer than where we are now... but now also think I/we're
barking up the right tree with this thread.

It's very possible that cost_index() is wrong, but it seems as though
after some testing as if PostgreSQL _overly_ _favors_ the use of
indexes:

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO: cost_seqscan: run_cost: 21472.687500
startup_cost: 0.000000

INFO: cost_index: run_cost: 21154.308116
startup_cost: 0.000000
indexCorrelation: 0.999729
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc (cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 loops=1)
Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
Total runtime: 11292.68 msec
(3 rows)

# SET enable_seqscan = true; SET enable_indexscan = false;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO: cost_seqscan: run_cost: 21472.687500
startup_cost: 0.000000

INFO: cost_index: run_cost: 21154.308116
startup_cost: 100000000.000000
indexCorrelation: 0.999729
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on report_user_cat_count rucc (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19 rows=704840 loops=1)
Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
Total runtime: 10506.44 msec
(3 rows)

Which I find surprising and humorous given the popular belief is, mine
included, contrary to those results. I can say with pretty high
confidence that the patch to use a geometric mean isn't correct after
having done real world testing as its break even point is vastly
incorrect and only uses an index when there are less than 9,000 rows
to fetch, a far cry from the 490K break even I found while testing.
What I did find interesting, however, was that it does work better at
determining the use of multi-column indexes, but I think that's
because the geometric mean pessimizes the value of indexCorrelation,
which gets pretty skewed when using a multi-column index.

# CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id,utc_date);
# CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
# ANALYZE report_user_cat_count;
# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO: cost_seqscan: run_cost: 23685.025000
startup_cost: 0.000000

INFO: cost_index: run_cost: 366295.018684
startup_cost: 0.000000
indexCorrelation: 0.500000
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on report_user_cat_count rucc (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85 rows=129941 loops=1)
Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
Total runtime: 6649.21 msec
(3 rows)

# SET enable_seqscan = false; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO: cost_seqscan: run_cost: 23685.025000
startup_cost: 100000000.000000

INFO: cost_index: run_cost: 366295.018684
startup_cost: 0.000000
indexCorrelation: 0.500000
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc (cost=0.00..366295.02 rows=133918 width=64) (actual time=53.91..3110.42 rows=129941 loops=1)
Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
Total runtime: 3667.47 msec
(3 rows)

If I manually set the indexCorrelation to 1.0, however, the planner
chooses the right plan on its own, which is in effect what setting a
higher random_page_cost had been compensating for, a poorly determined
indexCorrelation.

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO: cost_seqscan: run_cost: 23685.025000
startup_cost: 0.000000

INFO: cost_index: run_cost: 4161.684248
startup_cost: 0.000000
indexCorrelation: 1.000000
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc (cost=0.00..4161.68 rows=133918 width=64) (actual time=0.67..1176.63 rows=129941 loops=1)
Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
Total runtime: 1705.40 msec
(3 rows)

Which suggests to me that line 3964 in
./src/backend/utils/adt/selfuncs.c isn't right for multi-column
indexes, esp for indexes that are clustered. I don't know how to
address this though... Tom, any hints?

FWIW, this is an old data/schema from a defunct project that I can
give people access to if they'd like. -sc

--
Sean Chittenden

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephan Szabo 2003-08-08 18:24:30 Re: consistency check on SPI tuple count failed
Previous Message Alvaro Herrera 2003-08-08 18:01:04 Re: new psql \d command