Re: huge disparities in =/IN/BETWEEN performance

From: "George Pavlov" <gpavlov(at)mynewplace(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-sql(at)postgresql(dot)org>
Subject: Re: huge disparities in =/IN/BETWEEN performance
Date: 2007-02-09 19:21:41
Message-ID: 8C5B026B51B6854CBE88121DBF097A868160D3@ehost010-33.exch010.intermedia.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Thanks all for the various useful thoughts. Let me backtrack a bit and
state my real underlying issue a bit with actual examples. Hope not to
bore you with the length of this. Looks to me like an optimizer issue
unless I am missing something.

So, suppose I have a query:

select *
from stuff
inner join
( -- just getting the distinct
-- user-stuff associations
-- since there may be multiple;
-- ultimately I need to query by user
select stuff_id, user_id
from stuff_user
group by 1,2 -- GROUP BY outperforms DISTINCT
) su
using (stuff_id)
left join
( -- this obtains summary statistics
-- about each stuff item
select stuff_id, count(*)
from stuff_events
group by 1
) se
using (stuff_id)
where user_id = 41

This is a very pared down version of what I have. And yes this specific
query can be rewritten as a single GROUP BY, but in the real world I am
gathering the aggregate statistics from several tables, so I actually
have several sub-recordsets similar to the one called "se" above.
Rewriting ALL those as a single GROUP BY is not feasible. I know, all
this cries for a single summarized rollup table, but let's not go there
(for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to the stuff_id
it runs in 25ms). When I query based on stuff_id the optimizer uses an
index on stuff_events.stuff_id. However, when I query by user_id it does
a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to
first figure out which stuff_ids are related to the user_id that is
being asked for and then look ONLY those up in the stuff_events table
using the index on stuff_id.

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join
(select stuff_id, count(*)
from stuff_events
group by 1
) se
using (stuff_id)
where stuff_id in
(select distinct stuff_id
from stuff_user
where user_id = 41
)

You'd think that the subquery in the IN would be (very quickly) resolved
to a list of stuff_ids and then stuff_events would be accessed via its
stuff_id index. Instead, the Seq Scan on stuff_events still happens and
the query actually is even slower than the original, running in ~5500ms.

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join
(select stuff_id, user_id
from stuff_user
group by 1,2
) su
using (stuff_id)
left join
(select stuff_id, user_id, count(*)
from stuff_events
inner join
( -- same subquery as above
select stuff_id, user_id
from stuff_user
group by 1,2
) su2
using (stuff_id)
group by 1,2
) se
using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user to 3ms (!), but it is quite ugly and not fast enough for
me for a user_id with lots of associated stuff_ids.

George

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2007-02-09 22:38:39 Re: huge disparities in =/IN/BETWEEN performance
Previous Message Ezequias Rodrigues da Rocha 2007-02-09 18:50:15 Have anyone this man e-mail ?