Re: SQL command speed

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kate Collins <klcollins(at)wsicorp(dot)com>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-19 16:06:16
Message-ID: 8221.958752376@sss.pgh.pa.us
Lists: pgsql-sql

Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> The table I am using has 114600 total rows. The full query returns 1129
> rows. Right now the table is static, i.e. I am not modifying it while I am
> running these tests.

> Here are the results of the EXPLAIN with the different numbers of OR's.

> ---QUERY 1, returns 1129 rows---
> pbi=> [ 32-or variant ]
> Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)

> --- QUERY 2, returns 11 rows ---
> pbi=> [ just one WHERE clause ]
> Index Scan using notam_details_item_a on notam_details (cost=0.00..2739.57
> rows=927 width=12)

> --- QUERY 3, returns 11 rows ---
> pbi=> [ 2 ORed clauses ]
> Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)

OK, the problem here is the disconnect between the planner's estimate
of the number of returned rows (the "rows" number in EXPLAIN) and the
actual result row count. If the query really did need to pull nearly
a thousand rows for each OR'd key, then using a seqscan for more than a
couple of ORs would make sense. But in reality you have far fewer rows
per OR key, so an indexscan should win.
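
For concreteness, the queries being compared above presumably have
shapes like these (a hypothetical reconstruction; the real queries and
key values are elided in the quote, and 'ZZZZ'/'YYYY' are invented
placeholders):

    -- Query 2 shape: a single key; the planner chooses the indexscan.
    SELECT * FROM notam_details WHERE item_a = 'ZZZZ';

    -- Query 3 shape: two ORed keys; the planner falls back to a seqscan.
    SELECT * FROM notam_details WHERE item_a = 'ZZZZ' OR item_a = 'YYYY';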

That estimate is largely driven by a single statistic collected by
VACUUM ANALYZE: the frequency of the most common value (MCV) in the
item_a column. If the MCV is pretty common, the planner assumes that
the column doesn't have very many distinct values, so you get a high
estimate of the number of returned rows. Evidently you have a very
common MCV for item_a, but the key values you are looking for aren't
nearly that common.
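
(In later PostgreSQL releases the gathered per-column statistics are
exposed through the pg_stats system view; a sketch, assuming a release
that provides it:

    -- null_frac, n_distinct, and the most-common-value data are the
    -- inputs that drive the planner's row-count estimates.
    SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'notam_details' AND attname = 'item_a';

In the release this thread discusses you can't see the numbers so
directly, but the same statistic is at work.)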

I have seen a number of cases where someone was using a dummy value
(like 'N/A', or an empty string, etc.) to indicate unknown data, and
there were so many of these entries that they not only became the MCV,
but drove the MCV's frequency statistic far above the frequency of
occurrence of any "real" value. This fools the planner into thinking
that *all* the values are like that, and so it generates plans
accordingly. The long-run solution is to gather more-detailed
statistics, and that's in the TODO list for a future version. In the
meantime, there is a workaround that you may be able to use: instead
of a dummy value, store NULL for unknown entries. The statistics
gatherer already accounts for NULLs separately, so a large fraction
of NULLs won't fool the planner about the frequency of non-null values.
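
The conversion itself is a one-time UPDATE; a minimal sketch, assuming
the dummy value is the string 'N/A' (yours may differ):

    -- Replace the assumed dummy value with NULL so the statistics
    -- gatherer counts these rows in the null fraction instead of
    -- letting them dominate the most-common-value frequency.
    UPDATE notam_details SET item_a = NULL WHERE item_a = 'N/A';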

Note: if you try this, be sure to re-run VACUUM ANALYZE after you
replace the dummies with NULLs. The plans won't change until there's
a more accurate statistical entry for your table...
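
That is, something along the lines of:

    -- Refresh the statistics so the planner sees the new NULL fraction.
    VACUUM ANALYZE notam_details;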

regards, tom lane
