Re: speed up query with max() and odd estimates

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: newz(at)bearfruit(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: speed up query with max() and odd estimates
Date: 2005-04-26 21:51:09
Message-ID: 426EB7CD.2020406@arbash-meinel.com
Lists: pgsql-performance

Matthew Nuzum wrote:
> I have this query that takes a little over 8 min to run:
> select client,max(atime) as atime from usage_access where atime >=
> (select atime - '1 hour'::interval from usage_access order by atime
> desc limit 1) group by client;
>
> I think it can go a lot faster. Any suggestions on improving this? DB
> is 7.3.4 I think. (There is no index on client because it is very big
> and this data is used infrequently.)
Switch to Postgres 8.0.2 :)

Actually, I think one problem you are running into is that postgres has
(or at least used to have) trouble estimating the selectivity of
date/timestamp columns when the comparison value is not a constant.

So it isn't switching over to using an index, even though you are
restricting the access time.

I would guess that creating a multi-column index on (client, atime)
*might* get you the best performance.
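
Something like (the index name here is just an example):

CREATE INDEX usage_access_client_atime_idx ON usage_access (client, atime);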

Try adding the index, and then doing this query:

select atime from usage_access where client = <client_id>
order by atime desc limit 1;
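
To see whether it actually hits the index, run it under EXPLAIN ANALYZE
with a real client id substituted in:

explain analyze select atime from usage_access where client = <client_id>
order by atime desc limit 1;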

If you can get that query to use an index, then you can put it in a
loop. Something like:

CREATE FUNCTION last_client_access() RETURNS SETOF timestamp AS '
DECLARE
    client_id    INT;
    client_time  TIMESTAMP;  -- should match the type of atime
BEGIN
    FOR client_id IN SELECT id FROM <client_list> LOOP
        -- grab the most recent access time for this client
        SELECT INTO client_time atime FROM usage_access
            WHERE client = client_id
            ORDER BY atime DESC LIMIT 1;
        RETURN NEXT client_time;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;
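
Once it is created you can call it like any other set-returning function
(note that, as written, it returns only the timestamps, not which client
each one belongs to):

select * from last_client_access();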

If you really need high speed, you could create a partial index for each
client id, something like:
CREATE INDEX usage_access_atime_client1_idx ON usage_access(atime)
WHERE client = client1;

But that is a lot of indexes to maintain.

I'm hoping that the multi-column index would be enough.

You might also try something like:

SELECT client, max(atime) FROM usage_access
WHERE atime > now() - '1 hour'::interval
GROUP BY client;

now() is more of a constant, so postgres might have an easier time
figuring out the selectivity. I don't know your table, but I assume you
are constantly inserting new rows, so the largest atime value will be
close to now(). Remember, in this query (and in your original query)
clients whose last access is more than 1 hour before the max atime
(across all clients) will not be shown. (Example: client 1 accessed
yesterday and client 2 accessed just now; your original query's cutoff
would be today, which would hide client 1.)

Also, if it is simply a problem of the planner mis-estimating the
selectivity of the row, you can alter the statistics for atime.

ALTER TABLE usage_access ALTER COLUMN atime SET STATISTICS 1000;
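
The new statistics target only takes effect after the table is
re-analyzed, so follow it with:

ANALYZE usage_access;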

I'm not really sure what else to try, but you might start there.

Also, I still recommend upgrading to postgres 8, as I think it handles a
lot of these things better. (7.3 is pretty old).

John
=:->

>
> explain ANALYZE select client,max(atime) as atime from usage_access
> where atime >= (select atime - '1 hour'::interval from usage_access
> order by atime desc limit 1) group by client;
>
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate  (cost=3525096.28..3620450.16 rows=1271385 width=20) (actual time=482676.95..482693.69 rows=126 loops=1)
>   InitPlan
>     ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual time=0.40..0.41 rows=1 loops=1)
>           ->  Index Scan Backward using usage_access_atime on usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.39..0.40 rows=2 loops=1)
>   ->  Group  (cost=3525096.28..3588665.53 rows=12713851 width=20) (actual time=482676.81..482689.29 rows=3343 loops=1)
>         ->  Sort  (cost=3525096.28..3556880.90 rows=12713851 width=20) (actual time=482676.79..482679.16 rows=3343 loops=1)
>               Sort Key: client
>               ->  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343 loops=1)
>                     Filter: (atime >= $0)
> Total runtime: 482694.65 msec
>
>
> I'm starting to understand this, which is quite frightening to me. I
> thought that maybe if I shrink the number of rows down I could improve
> things a bit, but my first attempt didn't work. I thought I'd replace
> the "from usage_access" with this query instead:
> select * from usage_access where atime >= (select atime - '1
> hour'::interval from usage_access order by atime desc limit 1);
>
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851 width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
>   Filter: (atime >= $0)
>   InitPlan
>     ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual time=0.41..0.42 rows=1 loops=1)
>           ->  Index Scan Backward using usage_access_atime on usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.40..0.41 rows=2 loops=1)
> Total runtime: 481842.47 msec
>
> It doesn't look like this will help at all.
>
> This table is primarily append, however I just recently deleted a few
> million rows from the table, if that helps anyone.
>
