query performance, though it was timestamps,maybe just table size?

From: Henry Drexler <alonup8tb(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: query performance, though it was timestamps,maybe just table size?
Date: 2012-11-30 13:22:07
Message-ID: CAAtgU9TfeMJ==48ZjKAGJ8tptPEjyEE_bOxCWeh=wzbSD_W1TA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello, and thank you in advance.

Beyond the date vs timestamp troubleshooting I did, I am not sure what else
to look for, I know the increase of rows will have some affect but I just
don't think the query should go from 4 minutes to over 50.

system:

laptop - ubuntu 12.04 lts

"PostgreSQL 9.2.1 on x86_64-unknown-linux-gnu, compiled by gcc
(Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit"

Summary:
I have two tables. I run a function called massive_expansion that will
look at customer_id and the_range - it will use these to query the table
massive to find all those dates for the customer_id that are in the_range
and insert these into another table.

Problem:
The table massive is about 65 million rows (about double what it was)
The query now takes 50 minutes (it used to take 4 minutes with data 1/2 the
size).

History:
When (I first started out) I had everything as date the query went
quite quickly - 4 minutes when the table massive is about 30 million rows.
I then re-did everything but this time instead of date I used the
actual timestamps (as a result there were a few more rows but only 2
million more) - the query went quickly - 5 minutes when the table massive
is about 32 million rows.
So I did not see an appreciable difference between the data having date
vs timestamp (all other things kept the same.) So I supposed I can rule
out date vs timestamp being an issue.

tsrange is always an interval of 7 days.

I have two tables:

First table:
CREATE TABLE massive
(
source character varying,
dateof timestamp without time zone,
customer_id bigint,
count_raw bigint
);

CREATE INDEX customer_id_dateof
ON massive
USING btree
(customer_id, dateof);

Second table:
CREATE TABLE critical_visitors
(
customer_id bigint,
dateof timestamp without time zone,
the_range tsrange
);

CREATE INDEX customer_id_range
ON critical_visitors
USING btree
(customer_id, the_range);

The function:
CREATE or replace FUNCTION massive_expansion(customer_id
bigint,the_range tsrange) RETURNS void AS $$
BEGIN

INSERT INTO massive_expansion_from_critical_visitors
(
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = $1
and
massive.dateof <@ the_range) ;

END;
$$ LANGUAGE plpgsql;

The query:
select
massive_expansion(customer_id,the_range)
from
critical_visitors;

Additional Detail:
I did a query against the 30 million volume with this query:
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = '<a customer_id goes here>'::bigint--$1
and
massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

With a query plan of:
"Index Scan using customer_id_sourcee on massive_m (cost=0.00..113.98
rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29
17:00:00"]'::tsrange)"

Then on the 65 million volume table I did the same query and got a plan of:
"Index Scan using customer_id_source on massive (cost=0.00..189.38
rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29
17:00:00"]'::tsrange)"

Responses

Browse pgsql-general by date

  From Date Subject
Next Message ivan marchesini 2012-11-30 13:25:24 Re: difference in query performance due to the inclusion of a polygon geometry field
Previous Message Kevin Grittner 2012-11-30 13:08:06 Re: postgres timestamp data errors