Re: suggestions on improving a query

From: Rajarshi Guha <rguha(at)indiana(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Adam Rich <adam(dot)r(at)sbcglobal(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: suggestions on improving a query
Date: 2007-02-14 13:22:42
Message-ID: 1171459362.7305.27.camel@localhost
Lists: pgsql-general

On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam(dot)r(at)sbcglobal(dot)net> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp
> > (cost=0.00..16733229.92 rows=4669988 width=80)
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows. However, there were only 25,000.

Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze, along the lines sketched below.
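
The columns shown and the value of 100 here are illustrative:

  -- raise the per-column statistics target (the default in 8.2 is 10),
  -- then re-analyze so the planner sees the new histograms
  ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;
  ALTER TABLE dockscore_plp ALTER COLUMN total SET STATISTICS 100;
  VACUUM ANALYZE;

Rerunning the query gives: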

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)

Clearly a big improvement in performance.
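
For anyone reading this part of the thread in isolation, the query is
essentially of this shape (the select list here is a guess; the joins,
filter and ordering are taken from the plan above):

  SELECT pubchem_compound.*, dockscore_plp.total
    FROM dock, dockscore_plp, pubchem_compound
   WHERE dock.target = '1YC1'               -- restrict to one target
     AND dock.dockid = dockscore_plp.id     -- dock-to-score join
     AND dock.cid = pubchem_compound.cid    -- dock-to-compound join
   ORDER BY dockscore_plp.total             -- the order plp_total_idx provides
   LIMIT 10;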

(One question not directly related to the problem: when looking at the
output of explain analyze, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on pubchem_compound
is being performed first? Or should I start from the innermost line?)

However, it seems that it could still be improved:

-> Index Scan using plp_total_idx on dockscore_plp (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's a big mismatch between the expected and observed
costs and times.

> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order. So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended. The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect.

I think that increasing the statistics has improved that.

> For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.

This is not the case: the value '1YC1' will be associated with both high
and low values of dockscore_plp.total.

What I would like my query to do is this:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound

4. Sort and take the top 10 from step 2 (and associated rows in step 3)

However, now that I have written this, it seems that what I really want
to do is:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Sort and take the top 10

4. Get the corresponding rows from pubchem_compound.cid

The problem with this is that step 4 is represented by the

dock.cid = pubchem_compound.cid

clause. It seems that if I had the cid column in dockscore_plp, then I
could do a sort+limit in dockscore_plp and then simply look up the
corresponding (10) rows in pubchem_compound (rather than looking up 960K
rows). The downside to this is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them -
which seems redundant.
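
In other words, something like this sketch, where s.cid stands for the
hypothetical duplicated column:

  SELECT p.*, best.total
    FROM (SELECT s.cid, s.total            -- s.cid: hypothetical copy of dock.cid
            FROM dockscore_plp s
            JOIN dock d ON d.dockid = s.id -- dock still needed for the target filter
           WHERE d.target = '1YC1'
           ORDER BY s.total
           LIMIT 10) best
    JOIN pubchem_compound p ON p.cid = best.cid
   ORDER BY best.total;

The subselect would produce the 10 best rows before pubchem_compound is
touched at all, so the final join does only 10 lookups.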

Is it useful to increase redundancy to improve performance?

Thanks for the pointers,

-------------------------------------------------------------------
Rajarshi Guha <rguha(at)indiana(dot)edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make
it worse.
-Calvin
