Re: bad plan and LIMIT

From: James Nelson <james(at)photoshelter(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org, "Thomas S(dot) Chin" <thom(at)genx(dot)net>
Subject: Re: bad plan and LIMIT
Date: 2009-05-01 15:42:46
Message-ID: 1A9C9705-6F4D-4BA2-B4CE-17339E040F03@photoshelter.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


I looked into the distribution of the filenames, in particular I ran a
query to see how for into the table the 1st filename would be found.

photoshelter=# select count(*) from ps_image where lower(file_name) <
'a-400-001.jpg';
count
---------
8915832

As you can see the first row is almost 9 million rows into the table.
(a-400-001.jpg is the first filename returned by the query) which
implies the distribution is heavily non-uniform. (For uniform
distribution the first row should have been within the first 500 rows,
give or take)

I tried the query you suggest below but it did not work well, but
using it as inspiration the following query does work:

photoshelter=# explain analyze select * from (
SELECT ID, lower(file_name) as lfn FROM ps_image WHERE id IN
(SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G00007ejKGoWS_cY')
offset 0
) ss
ORDER BY lfn ASC
limit 1;
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=158946.43..158946.43 rows=1 width=52) (actual
time=1539.615..1539.615 rows=1 loops=1)
-> Sort (cost=158946.43..159044.80 rows=39350 width=52) (actual
time=1539.613..1539.613 rows=1 loops=1)
Sort Key: (lower((ps_image.file_name)::text))
Sort Method: top-N heapsort Memory: 17kB
-> Limit (cost=43197.34..158356.18 rows=39350 width=36)
(actual time=74.530..1499.328 rows=50237 loops=1)
-> Nested Loop (cost=43197.34..158356.18 rows=39350
width=36) (actual time=74.529..1475.378 rows=50237 loops=1)
-> HashAggregate (cost=43197.34..43590.84
rows=39350 width=17) (actual time=74.468..110.638 rows=50237 loops=1)
-> Index Scan using gi_gallery_id on
ps_gallery_image (cost=0.00..43072.80 rows=49816 width=17) (actual
time=0.049..46.926 rows=50237 loops=1)
Index Cond: ((gallery_id)::text =
'G00007ejKGoWS_cY'::text)
-> Index Scan using ps_image_pkey on ps_image
(cost=0.00..2.90 rows=1 width=36) (actual time=0.025..0.025 rows=1
loops=50237)
Index Cond: ((ps_image.id)::text =
(ps_gallery_image.image_id)::text)
Total runtime: 1540.032 ms
(12 rows)

Interestingly to me, while the 'offest 0' did not work as an
optimization fence in the query you provided, it works as one in the
query above. I had tried removing it from the above query, and the
plan reverted back to the bad form.

The non-uniform distribution leads me to another question, would it be
possible to use partial indexes or some other technique to help the
planner. Or would the fact that the relevant information, gallery ids
and filenames, are split across two tables foil any attempt?

In any case, I'd like to thank everyone for their input. The query
above will be a big help.

be well,

James

On May 1, 2009, at 10:57 AM, Tom Lane wrote:

> James Nelson <james(at)photoshelter(dot)com> writes:
>> Hi, I'm hoping you guys can help with improving this query I'm having
>> a problem with. The main problem is that the query plan changes
>> depending on the value of the LIMIT clause, with small values using a
>> poor plan and running very slowly. The two times are roughly 5
>> minutes
>> for the bad plan and 1.5 secs for the good plan.
>
>> photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN
>> (SELECT image_id FROM ps_gallery_image WHERE
>> gallery_id='G00007ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;
>
> The problem here is an overoptimistic assessment of how long it will
> take to find a match to gallery_id='G00007ejKGoWS_cY' while searching
> in file_name order. You might be able to fix that by increasing the
> statistics target for gallery_id. However, if the issue is not so
> much how many occurrences of 'G00007ejKGoWS_cY' there are as that
> they're all associated with high values of file_name, that won't
> help. In that case I think it would work to restructure the query
> along the lines of
>
> select * from (
> SELECT ID FROM ps_image WHERE id IN
> (SELECT image_id FROM ps_gallery_image WHERE
> gallery_id='G00007ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC
> offset 0
> ) ss
> limit 1;
>
> The OFFSET should act as an optimization fence to prevent the LIMIT
> from being used in the planning of the subquery.
>
> regards, tom lane

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2009-05-01 16:14:10 Re: Many left outer joins with limit performance
Previous Message Robert Haas 2009-05-01 15:27:04 Re: Transparent table partitioning in future version of PG?