Re: vacuum analyze slows sql query

From: John Meinel <john(at)johnmeinel(dot)com>
To: patrick ~ <sidsrr(at)yahoo(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: vacuum analyze slows sql query
Date: 2004-11-09 00:01:30
Message-ID: 419008DA.3000007@johnmeinel.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

patrick ~ wrote:
> Hi John,
>
> Thanks for your reply and analysis.
>

No problem. It just happens that this is a problem we ran into recently.

>
> --- John Meinel <john(at)johnmeinel(dot)com> wrote:
>
>
>>patrick ~ wrote:
[...]

>
> Hmm... The fact is I am selecting (in this example anyway) over all
> values in pkk_offer table and calling the stored function with each
> pkk_offer.offer_id which in turn does a select on pkk_purchase table.
> Note that offer_id is a foreign key in pkk_purchase referencing
> pkk_offer table.
>
> I don't know if it matters (I suspect that it does) but I am using
> LIMIT 1 in the sub-query/stored function. All I need is one single
> row meeting any of the criteria laid out in the stored procedure to
> establish an offer_id is "pending".
>

If you are trying to establish existence, we also had a whole thread on
this. Basically what we found was that adding an ORDER BY clause, helped
tremendously in getting the planner to switch to an Index scan. You
might try something like:

SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1;

There seems to be a big difference between the above statement and:

SELECT column FROM mytable WHERE column='myval' LIMIT 1;

>
>
>>So what is the solution? The only one I'm aware of is to turn your
>>static function into a dynamic one.
>>
>>So somewhere within the function you build up a SQL query string and
>>call EXECUTE str. This forces the query planner to be run every time you
>>call the function. This means that if you call it will a "nice" value,
>>you will get the fast index scan, and if you call it with a "bad" value,
>>it will switch back to seq scan.
>>
>>The downside is you don't get much of a benefit from using as stored
>>procedure, as it has to run the query planner all the time (as though
>>you issue the query manually each time.) But it still might be better
>>for you in the long run.
>
>
>
> Well, running the query without the stored function, basically typing
> out the stored function as a sub-query shows me:
>
>
> pkk=# explain analyze select o0.offer_id, ( select case when ( select
> p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and (
> p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
> p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
> pkk_offer o0 ;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on pkk_offer o0 (cost=0.00..1834.11 rows=618 width=4) (actual
> time=2413.398..1341885.084 rows=618 loops=1)
> SubPlan
> -> Result (cost=2.94..2.95 rows=1 width=0) (actual
> time=2171.287..2171.289 rows=1 loops=618)
> InitPlan
> -> Limit (cost=0.00..2.94 rows=1 width=4) (actual
> time=2171.264..2171.266 rows=1 loops=618)
> -> Seq Scan on pkk_purchase p0 (cost=0.00..37225.83
> rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
> Filter: ((offer_id = $0) AND
> (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR
> (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
> Total runtime: 1341887.523 ms
> (8 rows)
>
>
> while deleting all statistics on the pkk_% tables I get:
>
> pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and
> pg_class.relname like 'pkk_%';
> DELETE 11
>
> pkk=# explain analyze select o0.offer_id, ( select case when ( select
> p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and (
> p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
> p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
> pkk_offer o0 ;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on pkk_offer o0 (cost=0.00..6646.94 rows=618 width=4) (actual
> time=0.190..799.930 rows=618 loops=1)
> SubPlan
> -> Result (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278
> rows=1 loops=618)
> InitPlan
> -> Limit (cost=0.00..10.73 rows=1 width=4) (actual
> time=1.266..1.267 rows=1 loops=618)
> -> Index Scan using pur_offer_id_idx on pkk_purchase p0
> (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1
> loops=618)
> Index Cond: (offer_id = $0)
> Filter: ((((expire_time)::timestamp with time zone >
> now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL)
> OR (pending = true)))
> Total runtime: 801.234 ms
> (9 rows)
>
>
> As you can see this query (over all values of pkk_offer) with out
> any pg_statistics on the pkk_purchase table is extremely fast.
>
> Is this a bug in the PostgreSQL planner that misjudges the best
> choice with pg_statistics at hand?
>
> --patrick
>

In order to understand your query I broke it up and restructured it as
follows.
You might try to add the ORDER BY line, and see what you get.

EXPLAIN ANALYZE
SELECT o0.offer_id,
( SELECT CASE WHEN
( SELECT p0.purchase_id FROM pkk_purchase p0
WHERE p0.offer_id = o0.offer_id
AND ( p0.pending = true
OR ( p0.cancel_date ISNULL
AND ( p0.expire_time > NOW() or p0.expire_time
ISNULL )
)
)
ORDER BY p0.purchase_id --Insert this line
LIMIT 1
) ISNULL THEN false
ELSE true
END
) FROM pkk_offer o0 ;

I also wonder about some parts of your query. I don't know your business
logic but you are tacking a lot of the query into the WHERE, and I
wonder if postgres just thinks it's going to need to analyze all the
data before it gets a match.

I also don't remember what columns you have indices on. Or whether it is
common to have cancel_date null, or expire_time > NOW() or expire_time
null, etc.

So is your function just everything within the CASE statement?

You might try rewriting it as a loop using a cursor, as I believe using
a cursor again lends itself to index scans (as it is even more likely
that you will not get all the data.)

Something like (this is untested)

create function is_pending(int) returns bool as '
declare
p_id alias for $1;
begin

DECLARE is_pending_cursor CURSOR FOR
SELECT p0.purchase_id FROM pkk_purchase p0
WHERE p0.offer_id = p_id;
FOR READ ONLY;

FOR FETCH NEXT is_pending_cursor
IF row.pending = true or ...
RETURN true;

RETURN false;
END;
';

I don't know cursors terribly well, but this might get you going.
Probably in your case you also have a large portion of the records with
pending = true, which means that with an index scan it doesn't have to
hit very many records. Either you have a low record count for a
particular purchase_id, or you have a lot of pendings. seq scan just
hurts because it has to sift out all the other id's that you don't care
about.

But remember, I'm not a guru, just someone who has been hit by the
inequal distribution problem.

John
=:->

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Pierre-Frédéric Caillaud 2004-11-09 00:04:25 Re: vacuum analyze slows sql query
Previous Message Tom Lane 2004-11-08 23:57:58 Re: [PERFORM] poor performance in migrated database