From: Rajarshi Guha <rguha(at)indiana(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Adam Rich <adam(dot)r(at)sbcglobal(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: suggestions on improving a query
Date: 2007-02-14 13:22:42
Message-ID: 1171459362.7305.27.camel@localhost
Lists: pgsql-general
On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam(dot)r(at)sbcglobal(dot)net> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp
> > (cost=0.00..16733229.92 rows=4669988 width=80)
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows. However, there were only 25,000.
Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze. Rerunning the query gives:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
-> Nested Loop (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
-> Nested Loop (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
-> Index Scan using plp_total_idx on dockscore_plp (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
-> Index Scan using dock_pkey on dock (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
Index Cond: (dock.dockid = "outer".id)
Filter: (target = '1YC1'::text)
-> Index Scan using pubchem_compound_pkey on pubchem_compound (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
Total runtime: 171722.964 ms
(10 rows)
Clearly a big improvement in performance.
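For reference, the plan above is consistent with a query of roughly this shape. The table and column names (dockid, cid, total, target) are read off the plan; the SELECT list and sort direction are assumptions, since the original query is not shown in this message:

```sql
-- Hypothetical reconstruction of the query behind the plan above.
SELECT d.dockid, d.cid, s.total
FROM dockscore_plp s
JOIN dock d ON d.dockid = s.id          -- the dock_pkey index scan
JOIN pubchem_compound p ON p.cid = d.cid -- the pubchem_compound_pkey scan
WHERE d.target = '1YC1'                  -- the Filter on dock
ORDER BY s.total                         -- drives the plp_total_idx scan
LIMIT 10;                                -- the Limit node on top
```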
(One question not directly related to the problem: when looking at the
output of EXPLAIN ANALYZE, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on
pubchem_compound is being performed first? Or should I start from the
innermost line?)
However, it seems that it could still be improved:
-> Index Scan using plp_total_idx on dockscore_plp (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
It looks like there's still a big mismatch between the estimated and
observed row counts (4,669,988 vs 25,197) and times.
> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order. So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended. The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect.
I think that increasing the statistics has improved that.
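(Concretely, the change was along these lines; the columns shown and the
target value of 100 are illustrative, not the exact commands I ran:)

```sql
-- Raise the per-column statistics target (the 8.2 default is 10),
-- then re-gather statistics so the planner sees the new histograms.
ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;
ALTER TABLE dockscore_plp ALTER COLUMN total SET STATISTICS 100;
VACUUM ANALYZE dock;
VACUUM ANALYZE dockscore_plp;
```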
> For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.
This is not the case. The value '1YC1' will be associated with both high
and low values of dockscore_plp.total.
What I would like my query to do is this:
1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp
3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound
4. Sort and take the top 10 from step 2 (and associated rows in step 3)
However now that I have written this it seems that what I really want to
do is:
1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp
3. Sort and take the top 10
4. Get the corresponding rows from pubchem_compound.cid
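In SQL, that reordering might look something like the sketch below: do
the sort+limit in a subquery over dock and dockscore_plp only, then join
just the 10 surviving rows to pubchem_compound. Column names follow the
plan above; the SELECT lists and the ascending sort are assumptions:

```sql
-- Steps 1-3: filter on target, join to the scores, sort and limit.
-- Step 4: join only the 10 surviving rows to pubchem_compound.
SELECT t.dockid, t.total, p.*
FROM (
    SELECT d.dockid, d.cid, s.total
    FROM dockscore_plp s
    JOIN dock d ON d.dockid = s.id
    WHERE d.target = '1YC1'
    ORDER BY s.total
    LIMIT 10
) AS t
JOIN pubchem_compound p ON p.cid = t.cid
ORDER BY t.total;  -- the outer join does not guarantee the inner order
```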
The problem with this is that step 4 is represented by the
dock.cid = pubchem_compound.cid
join clause. It seems that if I had the cid column in dockscore_plp,
then I could do the sort+limit on dockscore_plp alone and then simply
look up the corresponding (10) rows in pubchem_compound (rather than
looking up 960K rows). The downside is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them,
which seems redundant.
Is it useful to increase redundancy to improve performance?
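(If I did go the redundant-column route, the change would be something
like the following; this is a sketch that assumes dockscore_plp.id
matches dock.dockid as in the plan above, and guesses text as the cid
type:)

```sql
-- Copy cid from dock into dockscore_plp, so both the sort+limit and
-- the pubchem_compound lookup can run off dockscore_plp alone.
ALTER TABLE dockscore_plp ADD COLUMN cid text;

UPDATE dockscore_plp s
   SET cid = d.cid
  FROM dock d
 WHERE d.dockid = s.id;

CREATE INDEX dockscore_plp_cid_idx ON dockscore_plp (cid);
```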
Thanks for the pointers,
-------------------------------------------------------------------
Rajarshi Guha <rguha(at)indiana(dot)edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make
it worse.
-Calvin