Re: query optimization

From: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: query optimization
Date: 2009-11-25 22:54:46
Message-ID: alpine.DEB.2.00.0911251446530.14636@orwell.homelinux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Hi Robert,

Thanks very much for your suggestions.

On Wed, 25 Nov 2009, Robert Haas wrote:

> On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>>
>> Hi everybody,
>>
>> I've got two queries that needs optimizing. Actually, there are others,
>> but these are pretty representative.
>>
>> You can see the queries and the corresponding plans at
>>
>> http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf
>>
>> or
>>
>> http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex
>>
>> if you prefer text (latex file, effectively text in this case)
>>
>> The background to this is at
>> http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf
>>
>> If more details are required, let me know and I can add them. I'd appreciate
>> suggestions about how to make these queries go faster.
>>
>> Please CC this email address on any replies.
>
> I've found that a good way to approach optimizing queries of this type
> is to look at the EXPLAIN ANALYZE results and figure out which parts
> of the query are slow. Then simplify the rest of the query as much as
> possible without eliminating the slowness. Then try to figure out how
> to optimize the simplified query: rewrite the logic, add indices,
> change the schema, etc. Lastly start adding the other bits back in.

Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough
to figure out which bits are slow. :-)

> It looks like the dedup_patient_anno CTE is part of your problem. Try
> pulling that piece out and optimizing it separately. I wonder if that
> could be rewritten to use SELECT DISTINCT ON (...) and whether that
> would be any faster.

Isn't SELECT DISTINCT supposed to be evil, since in general the result is
not deterministic? I think I had SELECT DISTINCT earlier, and removed it
because of that, with the help of Andrew (RhodiumToad on #postgresql) I
didn't compare the corresponding subqueries separately, so don't know what
speed difference this made.

> If not, you might want to look at some way of pre-marking the
> non-duplicate rows so that you don't have to recompute that each time.

What are the options re pre-marking?

> Then you might be able to use the underlying table directly in the next
> CTE, which will usually permit better optimization, more use of indices,
> etc. It seems pretty unfortunate that dedup_patient_anno joins against
> geno and then patient_geno does what appears to be the same join again.
> Is there some way to eliminate that? If so it will probably help.

You don't say whether you are looking at the PED or TPED query, so I'll
assume PED. They are similar anyway.

I see your point re the joins. You mean

anno INNER JOIN geno

followed by

geno INNER JOIN dedup_patient_anno

? I think the point of the first join is to reduce the anno table based on
information from the geno table. The result is basically a subset of the
anno table with some potential duplication removed, which is then
re-joined to the geno table. I agree this seems a bit suboptimal, and
there might be a better way to do this.

> Once you've got those parts of the query as well-optimized as you can,
> add the next pieces in and start hacking on those.

Regards, Faheem.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2009-11-25 23:26:50 Re: query optimization
Previous Message Robert Haas 2009-11-25 19:01:38 Re: Query times change by orders of magnitude as DB ages