Re: Use of index in 7.0 vs 6.5

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ryan Bradetich <ryan_bradetich(at)hp(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Use of index in 7.0 vs 6.5
Date: 2000-05-25 06:14:38
Message-ID: 21530.959235278@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Ryan Bradetich <ryan_bradetich(at)hp(dot)com> writes:
> procman=# explain select count(catagory) from medusa where host_id = 404
> and catagory like 'A%';

> Here is my analysis of the stastics (based on the examples in the
> archive).

> The most common value host_id in the table is 446 with row fraction of
> ~ 2.8%. The estimated number of rows in the index is 6704. This
> table has 4,630,229 entries in the table.
> I do not understand why the planner would choose a seqscan over the
> index scan because 6704/4,630,229 is ~ 0.15%.

I see at least part of the answer. The 'rows=' number in the EXPLAIN
output is the planner's estimate of the net number of rows out after
applying all available WHERE restrictions. In this case we've got a
clause "host_id = 404" which can be used with the index on host_id,
and then we have another clause "catagory like 'A%'" which will be
applied on-the-fly to the tuples returned by the indexscan. The
rows number tells you about the estimated number of rows out after
that second filter step. However, the cost of the indexscan depends
mainly on the number of tuples that have to be fetched, and that is
determined by the selectivity of just the "host_id = 404" clause.

I made a dummy table with the schema you showed and then inserted
the statistics you reported into the system tables (who's afraid
of "update pg_class ..." ;-)). If I didn't mess up, you should
be able to reproduce these EXPLAIN results:

set enable_seqscan = off;
explain select count(catagory) from medusa where host_id = 404;
NOTICE: QUERY PLAN:

Aggregate (cost=206943.69..206943.69 rows=1 width=12)
-> Index Scan using medusa_host_id_key on medusa (cost=0.00..206781.97 rows=64690 width=12)

set enable_seqscan = on;
explain select count(catagory) from medusa where host_id = 404;
NOTICE: QUERY PLAN:

Aggregate (cost=178115.59..178115.59 rows=1 width=12)
-> Seq Scan on medusa (cost=0.00..177953.86 rows=64690 width=12)

This shows that the planner is actually estimating that the indexscan
will fetch about 64690 rows (of which it guesses only 6704 will remain
after applying the catagory clause, but that's not really relevant to
the cost estimate). Since there are 120076 pages in the table, that
would mean pretty nearly one separate page fetch for each retrieved
tuple, if the matching tuples are randomly distributed --- and that
very probably *would* take more time than reading the whole table
sequentially. So the planner's chain of logic holds up if all these
assumptions are correct.

Since you find that in reality the indexscan method is very quick,
I'm guessing that there are actually fairly few tuples matching
host_id = 404. Could you run a quick "select count(*)" to check?

This seems to point up (once again) the deficiency of assuming that
the most-common value in the table is a good guide to the frequency
of typical values. You showed that host_id = 446 occurs in 2.8% of
the rows in this table; a search for 446 very probably would be faster
as a seqscan than as an indexscan (you might care to try it and see).
But that's probably a statistical outlier that's not got much to do
with the frequency of typical values in the table.

The only really good answer to this problem is to collect more-detailed
statistics in VACUUM ANALYZE, which I hope to see us doing in a release
or so. In the meantime I am wondering about deliberately skewing the
cost model in favor of indexscans, because I sure haven't heard many
complaints about erroneous selection of indexscans...

One way to put a thumb on the scales is to reduce the value of the SET
variable random_page_cost. The default value is 4.0, which seems to
correspond more or less to reality, but reducing it to 3 or so would
shift the planner pretty nicely in the direction of indexscans.

regards, tom lane

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Hiroshi Inoue 2000-05-25 07:19:54 RE: Use of index in 7.0 vs 6.5
Previous Message Michael A. Mayo 2000-05-25 04:54:38 Re: [SQL]