Auto-clustering?

From: Royce Ausburn <royce(at)inomial(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Auto-clustering?
Date: 2010-12-16 23:49:02
Message-ID: E03CB9AF-169C-47CA-B82D-645C8D6E94D9@inomial.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

I have a table that in the typical case holds two minute sample data for a few thousand sources. Often we need to report on these data for a particular source over a particular time period and we're finding this query tends to get a bit slow.

The structure of the table:

Table "public.sample"
Column | Type | Modifiers
-------------------+--------------------------+-------------------------------------------------
client | integer | not null
aggregateid | bigint | not null
sample | bigint | not null default nextval('samplekey'::regclass)
customer | integer |
period | integer | not null
starttime | integer | not null
duration | integer | not null
ip | text |
tariff | integer |
bytessentrate | bigint |
bytessent | bigint |
bytesreceived | bigint |
packets | integer | not null
queuetype | integer | not null default 0
collection | integer |
bytesreceivedrate | bigint |
greatestrate | bigint |
invalidated | timestamp with time zone |
Indexes:
"sample_pkey" PRIMARY KEY, btree (sample)
"sample_collection_starttime_idx" btree (collection, starttime)
"sample_customer_starttime_idx" btree (customer, starttime)
"sample_sample_idx" btree (client, sample)
Foreign-key constraints:
"sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)

fc=# explain analyse select collection, period, tariff, sum(bytesSent), sum(bytesReceived), sum(packets), max(sample), (starttime / 3600) * 3600 as startchunk from sample_20101001 where starttime between 1287493200 and 1290171599 and collection=128 and ip = '10.9.125.207' group by startchunk, tariff, collection, period; QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
-> Bitmap Heap Scan on sample_20101001 (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
Filter: (ip = '10.9.125.207'::text)
-> Bitmap Index Scan on sample_20101001_collection_starttime_idx (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
Total runtime: 67048.201 ms
(7 rows)

I figure at most there should only be ~20,000 rows to be read from disk, and I expect that the index is doing a pretty good job of making sure only the rows that need reading are read. inclusion of the ip in the query is almost redundant as most of the time an ip has its own collection.... My suspicion is that the rows that we're interested in are very sparsely distributed on disk, so we're having to read too many pages for the query...

All of the queries on this table are reporting on a single collection, so ideally a collection's data would all be stored in the same part of the disk... or at least clumped together. This can be achieved using "cluster", however as far as I know there's no automated, non-cronesque means of clustering and having the table become unusable during the cluster is not ideal.

I've considered partitioning, but I don't think that's going to give the effect I need. Apparently clustering is only going to scale to a few dozen child tables, so that's only going to give one order of magnitude performance for significant complexity.

Are there any other options?

Cheers!

--Royce

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Eric Comeau 2010-12-17 06:53:56 Re: How to get FK to use new index without restarting the database
Previous Message Mladen Gogala 2010-12-16 19:27:08 Re: Index Bloat - how to tell?