Re: vacuum analyze slows sql query

From: patrick ~ <sidsrr(at)yahoo(dot)com>
To: John Meinel <john(at)johnmeinel(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: vacuum analyze slows sql query
Date: 2004-11-08 23:19:51
Message-ID: 20041108231951.23927.qmail@web52105.mail.yahoo.com
Lists: pgsql-performance

Hi John,

Thanks for your reply and analysis.

--- John Meinel <john(at)johnmeinel(dot)com> wrote:

> patrick ~ wrote:
> [...]
> > pkk=# explain analyze execute pkk_01( 241 );
> >                                        QUERY PLAN
> > -----------------------------------------------------------------------------------------
> >  Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213 rows=1 loops=1)
> >    InitPlan
> >      ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual time=2872.189..2872.189 rows=0 loops=1)
> >            ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983 width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
> >                  Filter: ((offer_id = $1) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
> >  Total runtime: 2872.339 ms
> > (6 rows)
> >
> > Time: 2873.479 ms
> >
>
> [...]
>
> > So, is this the ultimate solution to this issue?
> >
> > --patrick
>
> It's not so much that correlation is < 0.5. It sounds like you're
> running into the same issue that I ran into in the past. You have a
> column with lots of repeated values, and a few exceptional ones. Notice
> this part of the query:
> -> Seq Scan on pkk_purchase p0 (cost rows=13983) (actual rows=0)
>
> For a general number, it thinks it might return 14,000 rows, hence the
> sequential scan. Before you do ANALYZE, it uses whatever defaults exist,
> which are probably closer to reality.
>
> The problem is that you probably have some values for pkk_purchase where
> it could return 14,000 rows (possibly much much more). And for those,
> seq scan is the best plan. However, for the particular value that you
> are testing, there are very few (no) entries in the table.

You are absolutely correct:

pkk=# select offer_id, count(*) from pkk_purchase group by offer_id order by count;
 offer_id | count
----------+--------
     1019 |      1
     1018 |      1
     1016 |      1      (many of these)
      ... |    ...
     2131 |      6
      844 |      6
     1098 |      6      (a dozen or so of these)
      ... |    ...
     2263 |    682
     2145 |    723
     2258 |    797
     2091 |    863
      ... |    ...
     1153 |  96330      (the few heavy weights)
      244 | 122163
      242 | 255719
      243 | 273427
      184 | 348476
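
(As an aside, the planner's view of this skew can be inspected directly;
a sketch, output omitted:)

pkk=# select n_distinct, most_common_vals, most_common_freqs
pkk-#   from pg_stats
pkk-#  where tablename = 'pkk_purchase' and attname = 'offer_id';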

> With a prepared statement (or a function) it has to determine ahead of
> time what the best query is without knowing what value you are going to
> ask for.
>
> Let's say for a second that you manage to trick it into using an index
> scan, and then you actually call the function with one of the values
> that returns 1,000s of rows. Probably it will take 10-100 times longer
> than if it used a seq scan.

Hmm... The fact is that I am selecting (in this example, anyway) over all
values in the pkk_offer table and calling the stored function with each
pkk_offer.offer_id, which in turn does a select on the pkk_purchase table.
Note that offer_id is a foreign key in pkk_purchase referencing the
pkk_offer table.

I don't know if it matters (I suspect that it does), but I am using
LIMIT 1 in the sub-query/stored function. All I need is one single
row meeting any of the criteria laid out in the stored procedure to
establish that an offer_id is "pending".
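
(For reference, the stored function looks essentially like this; a
simplified sketch, and the function name here is illustrative:)

-- a simplified sketch; the real definition may differ in details
create or replace function pkk_offer_has_pending( integer )
returns boolean as '
    select case when ( select p0.purchase_id
                         from pkk_purchase p0
                        where p0.offer_id = $1
                          and ( p0.pending = true
                                or ( ( p0.expire_time > now()
                                       or p0.expire_time isnull )
                                     and p0.cancel_date isnull ) )
                        limit 1 ) isnull
                then false
                else true
           end ;
' language 'sql' stable ;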

> So what is the solution? The only one I'm aware of is to turn your
> static function into a dynamic one.
>
> So somewhere within the function you build up a SQL query string and
> call EXECUTE str. This forces the query planner to be run every time you
> call the function. This means that if you call it with a "nice" value,
> you will get the fast index scan, and if you call it with a "bad" value,
> it will switch back to seq scan.
>
> The downside is you don't get much of a benefit from using a stored
> procedure, as it has to run the query planner all the time (as though
> you issued the query manually each time). But it still might be better
> for you in the long run.
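
(For the record, the dynamic variant you describe would look roughly like
this in plpgsql; a sketch, untested, with an illustrative function name:)

create or replace function pkk_offer_has_pending( integer )
returns boolean as '
declare
    v_offer_id alias for $1;
    v_row      record;
begin
    -- EXECUTE re-plans the query on every call with the actual
    -- offer_id value, so frequent and rare values can each get
    -- their own best plan.
    for v_row in execute
        ''select p0.purchase_id
            from pkk_purchase p0
           where p0.offer_id = '' || v_offer_id || ''
             and ( p0.pending = true
                   or ( ( p0.expire_time > now()
                          or p0.expire_time isnull )
                        and p0.cancel_date isnull ) )
           limit 1''
    loop
        -- reaching the loop body means at least one qualifying row exists
        return true;
    end loop;
    return false;
end;
' language 'plpgsql';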

Well, running the query without the stored function, i.e. typing out the
stored function as a sub-query, shows me:

pkk=# explain analyze
pkk-# select o0.offer_id,
pkk-#        ( select case when ( select p0.purchase_id
pkk-#                               from pkk_purchase p0
pkk-#                              where p0.offer_id = o0.offer_id
pkk-#                                and ( p0.pending = true
pkk-#                                      or ( ( p0.expire_time > now()
pkk-#                                             or p0.expire_time isnull )
pkk-#                                           and p0.cancel_date isnull ) )
pkk-#                              limit 1 ) isnull
pkk-#                     then false else true end )
pkk-#   from pkk_offer o0 ;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual time=2171.287..2171.289 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual time=2171.264..2171.266 rows=1 loops=618)
                   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
                         Filter: ((offer_id = $0) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

(Each of the 618 sub-plan executions takes about 2.17 seconds, which
accounts for the ~22-minute total runtime.)

whereas after deleting all statistics on the pkk_% tables I get:

pkk=# delete from pg_statistic
pkk-#  where pg_statistic.starelid = pg_class.oid
pkk-#    and pg_class.relname like 'pkk_%';
DELETE 11

pkk=# explain analyze
pkk-# select o0.offer_id,
pkk-#        ( select case when ( select p0.purchase_id
pkk-#                               from pkk_purchase p0
pkk-#                              where p0.offer_id = o0.offer_id
pkk-#                                and ( p0.pending = true
pkk-#                                      or ( ( p0.expire_time > now()
pkk-#                                             or p0.expire_time isnull )
pkk-#                                           and p0.cancel_date isnull ) )
pkk-#                              limit 1 ) isnull
pkk-#                     then false else true end )
pkk-#   from pkk_offer o0 ;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual time=0.190..799.930 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual time=1.266..1.267 rows=1 loops=618)
                   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1 loops=618)
                         Index Cond: (offer_id = $0)
                         Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see, this query (over all values of pkk_offer) without any
pg_statistic rows for the pkk_purchase table is extremely fast: about 0.8
seconds instead of 22 minutes.

Is this a bug in the PostgreSQL planner, misjudging the best choice when
statistics are at hand?
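
(In case it helps, one thing I may try next is raising the per-column
statistics target so ANALYZE records more of the common offer_id values;
a sketch, untested here:)

pkk=# alter table pkk_purchase alter column offer_id set statistics 1000;
ALTER TABLE
pkk=# analyze pkk_purchase;
ANALYZE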

--patrick

