Re: query optimization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: query optimization
Date: 2009-11-30 15:40:48
Message-ID: 603c8f070911300740h6cead2aao55793a449a72a233@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, Nov 27, 2009 at 4:47 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>>>> If not, you might want to look at some way of pre-marking the
>>>> non-duplicate rows so that you don't have to recompute that each time.
>>>
>>> What are the options re pre-marking?
>>
>> Well, what I usually do is - if I'm going to do the same
>> distinct-ification frequently, I add an extra column (say, a boolean)
>> and set it to true for all and only those rows which will pass the
>> distinct-ification filter.  Then I can just say WHERE <that column
>> name>.
>
> Yes, I see. The problem with is premarking is that the selection is somewhat
> dynamic, in the sense that this depends on the idlink table, which depends
> on patient data, which can change.

Yeah. For things like this I find you have to think hard about how to
organize your schema so that you can optimize the queries you care
about. There are no "just do this and it works" solutions to
performance problems of this type. Still, many of them are solvable
by making the right decisions elsewhere. Sometimes you can use
triggers to recompute your premarks when the data in the other table
changes. Another strategy is to keep a cache of precomputed results
somewhere. When the underlying data changes, you use triggers to
invalidate anything in the cache that might now be wrong, and set
things up so that it will be recomputed when next it is used. But in
either case you have to figure out the right place to do the
computation so that it gains you more than it saves you, and adjusting
your schema is often necessary.

>>>> Then you might be able to use the underlying table directly in the next
>>>> CTE, which will usually permit better optimization, more use of indices,
>>>> etc.  It seems pretty unfortunate that dedup_patient_anno joins against
>>>> geno
>>>> and then patient_geno does what appears to be the same join again. Is
>>>> there
>>>> some way to eliminate that?  If so it will probably help.
>>>
>>> You don't say whether you are looking at the PED or TPED query, so I'll
>>> assume PED. They are similar anyway.
>>>
>>> I see your point re the joins. You mean
>>>
>>> anno INNER JOIN geno
>>>
>>> followed by
>>>
>>> geno INNER JOIN dedup_patient_anno
>>>
>>> ? I think the point of the first join is to reduce the anno table based
>>> on
>>> information from the geno table. The result is basically a subset of the
>>> anno table with some potential duplication removed, which is then
>>> re-joined
>>> to the geno table. I agree this seems a bit suboptimal, and there might
>>> be a
>>> better way to do this.
>>
>> Yeah, I didn't think about it in detail, but it looks like it should
>> be possible.  Eliminating joins can sometimes have *dramatic* effects
>> on query performance, and it never hurts.
>
> Failing all else, couldn't I smoosh together the two queries and do a triple
> join? For reference, the two CTEs in question, from the PED query, are as
> follows.
>
>    dedup_patient_anno AS
>     ( SELECT *
>     FROM
>             (SELECT  *,
>                      row_number() OVER(PARTITION BY anno.rsid ORDER BY
> anno.id)
>             FROM     anno
>                      INNER JOIN geno
>                      ON       anno.id = geno.anno_id
>             WHERE    idlink_id        =
>                      (SELECT MIN(id)
>                      FROM    idlink
>                      )
>             ) AS s
>     WHERE   row_number = '1'
>     ),
>     patient_geno AS
>     ( SELECT geno.idlink_id AS idlink_id,
>       geno.anno_id AS anno_id,
>       geno.snpval_id AS snpval_id,
>       allelea_id, alleleb_id
>       FROM    geno
>             INNER JOIN dedup_patient_anno
>             ON      geno.anno_id = dedup_patient_anno.id
>     ),

If that will give the same results, which I'm not immediately certain
about, then I highly recommend it. In general I would recommend only
using CTEs to express concepts that can't sensibly be expressed in
other ways, not to beautify your queries. Keep in mind that joins can
be reordered and/or executed using different methods but most other
operations can't be, so trying to get your joins together in one place
is usually a good strategy, in my experience. And of course if that
lets you reduce the total number of joins, that's even better.

...Robert

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Ron Mayer 2009-11-30 15:48:32 Re: SSD + RAID
Previous Message Ivan Voras 2009-11-30 12:33:46 Re: Any have tested ZFS like PostgreSQL installation filesystem?