Thank you for the explanation.
You are correct in your assessment of the distribution of values in the
item_a column: some values match no rows or only a few, while other values
match many (100 or more). It is not an even distribution. I had never really
considered what effect this would have on searching before.
Tom Lane wrote:
> Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> > The table I am using has 114600 total rows. The full query returns 1129
> > rows. Right now the table is static, i.e. I am not modifying it while I am
> > running these tests.
> > Here are the results of the EXPLAIN with the different numbers of OR's.
> > ---QUERY 1, returns 1129 rows---
> > pbi=> [ 32-or variant ]
> > Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)
> > --- QUERY 2, returns 11 rows ---
> > pbi=> [ just one WHERE clause ]
> > Index Scan using notam_details_item_a on notam_details (cost=0.00..2739.57
> > rows=927 width=12)
> > --- QUERY 3, returns 11 rows ---
> > pbi=> [ 2 ORed clauses ]
> > Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)
> OK, the problem here is the disconnect between the planner's estimate
> of the number of returned rows (the "rows" number in EXPLAIN) and the
> actual result row count. If the query actually did need to pull nearly
> a thousand rows for each OR'd key, then using a seqscan for more than a
> couple of ORs would make sense. But in reality you don't have nearly
> that many rows per OR key, so an indexscan is needed.
> That estimate is largely driven by a single statistic collected by
> VACUUM ANALYZE, which is the frequency of the most common value in
> the item_a column. If the MCV is pretty common then the planner assumes
> that the column doesn't have very many distinct values, so you get a
> high estimate of the number of returned rows. Evidently you have
> a very common MCV for item_a, but the key values you are looking for
> aren't nearly that common.
> I have seen a number of cases where someone was using a dummy value
> (like 'N/A', or an empty string, etc) to indicate unknown data, and
> there were so many of these entries as to not merely be the MCV,
> but drive the MCV's frequency statistic far above the frequency of
> occurrence of any "real" value. This fools the planner into thinking
> that *all* the values are like that, and so it generates plans
> accordingly. The long-run solution is to gather more-detailed
> statistics, and that's in the TODO list for a future version. In the
> meantime, there is a workaround that you may be able to use: instead
> of a dummy value, store NULL for unknown entries. The statistics
> gatherer already accounts for NULLs separately, so a large fraction
> of NULLs won't fool the planner about the frequency of non-null values.
> Note: if you try this, be sure to re-run VACUUM ANALYZE after you
> replace the dummies with NULLs. The plans won't change until there's
> a more accurate statistical entry for your table...
> regards, tom lane
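
[Editor's note: the workaround Tom describes can be sketched as SQL. The
table and column names come from the thread; the dummy value 'N/A' and the
lookup keys are assumptions, since the actual placeholder and keys are not
shown.]

```sql
-- Replace the dummy placeholder with NULL so VACUUM ANALYZE counts those
-- rows separately instead of inflating the most-common-value frequency.
-- ('N/A' is an assumed placeholder; substitute the real dummy value.)
UPDATE notam_details SET item_a = NULL WHERE item_a = 'N/A';

-- Re-gather statistics; plans will not change until this runs.
VACUUM ANALYZE notam_details;

-- With a more realistic MCV frequency, the OR'd-key query should now
-- favor the index scan (keys here are illustrative):
EXPLAIN SELECT * FROM notam_details
WHERE item_a = 'KBOS' OR item_a = 'KJFK';
```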
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
PHONE: (978) 670-5110
FAX: (978) 670-5100