Re: Query performance PLEASE HELP

From: Dmitry Tkach <dmitry(at)openratings(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Query performance PLEASE HELP
Date: 2003-01-31 21:08:47
Message-ID: 3E3AE5DF.4070408@openratings.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Tom Lane wrote:

>Dmitry Tkach <dmitry(at)openratings(dot)com> writes:
>
>>Well... Yes. I am doing that daily. Actually, I was wonderring about
>>those estimates too, but that's not my primary concern right now -
>>perhaps, it should be, but, as far as I understand, the estimate only
>>matter for the query plan selection, and I don't have a problem with the
>>query plan
>>
>
>You should. If the query can be done any faster, it will be by picking
>a different query plan. I'm not sure what would be a better plan, but
>certainly a large part of the problem is that the planner is so far off
>about the rowcount estimates.
>
So, what would be the alternative query plan be to make it quicker?
I tried forcing it to do the join the other way around (by doing set
enable_sort=off; - this forces it to use tradestyle for the outer loop),
but that doesn't make it any better...
What other possibilities are there that would be likely to improve things?

>
>
>One thing I'm wondering is if the index on tradestyle.name could be helpful.
>How selective is "ts.name like 'POST%'", exactly --- does that eliminate
>a lot of rows, or not? Is the thing able to use that as an indexqual
>(ie, are you in C locale)?
>
Yes. There is an index on ts.name. I mentioned that in the 'PS' of the
original message - if I use another parameter for the name criteria, it
decides to use that index, but, as I said, it does not help very much.
In that particular case I mentioned, the execution took about 5 minutes
- still a lot, although a little better then the 7 I had for the first
query, but I believe, this "improvement" is because of caching, not
really due to a different query plan.

>
>To tell you the truth, I do not believe your assertion that these tables
>have been analyzed. I don't see how the rowcount estimates could be so
>small if the planner were aware of the true table statistics.
>
Well... It doesn't actually COUNT all the rows when you do the analyze,
right? Those are just estimates, based on subsampling...
Business names isn't something perfectly ditributed statistically, so,
those estimates being off doesn't surprise me much really...

> What does
>pg_stats show for the columns used in the query?
>
select * from pg_stats where tablename in ('tradestyle',
'managed_supplier') and attname in ('name', 'duns', 'subscriber');
-[ RECORD 1
]-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tablename | tradestyle
attname | duns
null_frac | 0
avg_width | 4
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds |
{1145200,22833532,49988608,79446436,102189545,135526259,196661250,612015735,806455895,876539727,969917566}
correlation | -0.18264
-[ RECORD 2
]-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tablename | tradestyle
attname | name
null_frac | 0
avg_width | 24
n_distinct | 385825
most_common_vals | {"UNITED STATES POSTAL SERVICE","G N C","H & R
BLOCK","CARL'S JR","EDWARD D JONES & CO L P","FARMERS INSURANCE","UNITED
STATES DEPARTMENT OF THE AIR FORCE","WOMENS INTERNATIONAL BOWL"}
most_common_freqs |
{0.00233333,0.001,0.001,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
histogram_bounds | {".COM LLC","BEAU OLSON JOHN","COLONIAL LIFE &
ACCIDENT INSURANCE COMPANY INC","EISENHOWER PARTNERSHIP","GROWTH &
OPPORTUNITY INC","JUNCTION AUTOLAND","MC EXPORT","PANTS
HANGER","SALISBURY NEWS","THE BACK STORE II LLC","ZR ENTERPRISES, INC"}
correlation | 0.010456
-[ RECORD 3
]-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tablename | managed_supplier
attname | duns
null_frac | 0
avg_width | 4
n_distinct | -0.688082
most_common_vals |
{1021435,1213214,1307974,2190528,2593051,3292620,17543877,27399237,41869355,49591852}
most_common_freqs |
{0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
histogram_bounds |
{1002229,6557045,22423490,50341502,73508350,112086095,168224033,252940564,627776784,847204914,2100000158}
correlation | -0.151041
-[ RECORD
4]-----+-----------------------------------------------------------------------------------------------
tablename | managed_supplier
attname | subscriber
null_frac | 0
avg_width | 4
n_distinct | 44
most_common_vals | {74,81,20,111,67,45,66,108,75,68}
most_common_freqs |
{0.181333,0.169667,0.138333,0.124,0.0396667,0.0386667,0.0376667,0.0336667,0.0333333,0.0303333}
histogram_bounds | {3,63,72,78,98,107,110,112,118,124,1001}
correlation | -0.912398

What does NEGATIVE n_distinct mean (for managed_supplier.duns)? :-)
It lies about many things... For example - n_distinct for
tradestyle.name = 385825 is about 100 times small than the actual number
(which is a little over 30 million)

Thanks!

Dima

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message pilsl 2003-01-31 21:20:52 SQL-question (JOIN)
Previous Message Tom Lane 2003-01-31 21:07:43 Re: Query performance PLEASE HELP