Re: Auto-clustering?

From: Filip Rembiałkowski <filip(dot)rembialkowski(at)gmail(dot)com>
To: Royce Ausburn <royce(at)inomial(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Auto-clustering?
Date: 2010-12-17 09:27:57
Message-ID: AANLkTinN8bZSK-iXOZRsDd-N2=DGoyQ26Wc_nJRP2UYD@mail.gmail.com
Lists: pgsql-performance

2010/12/17 Royce Ausburn <royce(at)inomial(dot)com>

> Hi all,
>
> I have a table that in the typical case holds two-minute sample data for a
> few thousand sources. Often we need to report on these data for a
> particular source over a particular time period, and we're finding this
> query tends to get a bit slow.
>
> The structure of the table:
>
> Table "public.sample"
> Column | Type |
> Modifiers
>
> -------------------+--------------------------+-------------------------------------------------
> client | integer | not null
> aggregateid | bigint | not null
> sample | bigint | not null default
> nextval('samplekey'::regclass)
> customer | integer |
> period | integer | not null
> starttime | integer | not null
> duration | integer | not null
> ip | text |
> tariff | integer |
> bytessentrate | bigint |
> bytessent | bigint |
> bytesreceived | bigint |
> packets | integer | not null
> queuetype | integer | not null default 0
> collection | integer |
> bytesreceivedrate | bigint |
> greatestrate | bigint |
> invalidated | timestamp with time zone |
> Indexes:
> "sample_pkey" PRIMARY KEY, btree (sample)
> "sample_collection_starttime_idx" btree (collection, starttime)
> "sample_customer_starttime_idx" btree (customer, starttime)
> "sample_sample_idx" btree (client, sample)
> Foreign-key constraints:
> "sample_client_fkey" FOREIGN KEY (client) REFERENCES client(client)
>
>
> fc=# explain analyse
> select collection, period, tariff, sum(bytesSent), sum(bytesReceived),
>        sum(packets), max(sample), (starttime / 3600) * 3600 as startchunk
> from sample_20101001
> where starttime between 1287493200 and 1290171599
>   and collection = 128
>   and ip = '10.9.125.207'
> group by startchunk, tariff, collection, period;
>
>                                                       QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=34959.01..34959.03 rows=1 width=44) (actual time=67047.850..67047.850 rows=0 loops=1)
>    ->  Bitmap Heap Scan on sample_20101001  (cost=130.56..34958.91 rows=5 width=44) (actual time=67047.847..67047.847 rows=0 loops=1)
>          Recheck Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
>          Filter: (ip = '10.9.125.207'::text)
>          ->  Bitmap Index Scan on sample_20101001_collection_starttime_idx  (cost=0.00..130.56 rows=9596 width=0) (actual time=9806.115..9806.115 rows=6830 loops=1)
>                Index Cond: ((collection = 128) AND (starttime >= 1287493200) AND (starttime <= 1290171599))
>  Total runtime: 67048.201 ms
> (7 rows)
>
>
How about (auto)vacuuming? If dead rows have bloated the table, every query
ends up reading more pages than it needs to.
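One quick way to check (a sketch; pg_stat_user_tables is the standard
statistics view, n_live_tup/n_dead_tup exist since 8.3, and I am assuming
the table from your plan, sample_20101001, is the one of interest):

  -- when was the table last (auto)vacuumed, and how many dead tuples remain?
  SELECT relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
  WHERE relname = 'sample_20101001';

A large n_dead_tup relative to n_live_tup would point at bloat.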

>
> I figure at most there should only be ~20,000 rows to be read from disk,
> and I expect that the index is doing a pretty good job of making sure only
> the rows that need reading are read. Inclusion of the ip in the query is
> almost redundant, as most of the time an ip has its own collection... My
> suspicion is that the rows we're interested in are very sparsely
> distributed on disk, so we're having to read too many pages for the query...
>

You can test this suspicion in a very simple way:
- create a test table (like yours, including indexes and constraints, but
with no data)
- insert into the test table, selecting everything from the original,
ordered by the column(s) you would cluster on (here: collection, starttime)
- analyze the test table
- run the query against the new table (a concrete sketch follows)
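In SQL, the test could look like this (a sketch only; LIKE ... INCLUDING ALL
needs 9.0, on older versions spell out INCLUDING DEFAULTS INCLUDING
CONSTRAINTS INCLUDING INDEXES; note also that LIKE never copies foreign-key
constraints, so re-add sample_client_fkey by hand if it matters for the test):

  -- copy structure, defaults, CHECK constraints and indexes, but no rows
  CREATE TABLE sample_test (LIKE sample_20101001 INCLUDING ALL);

  -- load the rows in the physical order the reports want
  INSERT INTO sample_test
  SELECT * FROM sample_20101001
  ORDER BY collection, starttime;

  -- refresh planner statistics for the new table
  ANALYZE sample_test;

Then re-run the problem query against sample_test and compare timings.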

If the new query is much faster, and you have intensive random
UPDATE/DELETE/INSERT activity, a periodic CLUSTER could be a good idea...
but it depends on actual usage patterns (SELECT/modify ratio, types of
updates, and so on).
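A sketch of the manual step (CLUSTER ... USING is 8.3+ syntax; earlier
versions use CLUSTER indexname ON tablename; either way the table is locked
exclusively while it is rewritten):

  -- rewrite the table in the order of the chosen index
  CLUSTER sample_20101001 USING sample_collection_starttime_idx;

  -- later runs can simply repeat the last clustering order
  CLUSTER sample_20101001;

  -- refresh statistics afterwards so the planner sees the new layout
  ANALYZE sample_20101001;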

>
> All of the queries on this table are reporting on a single collection, so
> ideally a collection's data would all be stored in the same part of the
> disk... or at least clumped together. This can be achieved using "cluster",
> however as far as I know there's no automated, non-cronesque means of
> clustering, and having the table become unusable during the cluster is not
> ideal.
>

cron is a form of automation, isn't it :-)
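A hypothetical crontab entry (the database name is taken from your fc=#
prompt, the schedule and connection details are placeholders, and CLUSTER
blocks access to the table while it runs, so pick a quiet window):

  # re-cluster the sample table every Sunday at 03:00, then refresh stats
  0 3 * * 0  psql -d fc -c 'CLUSTER sample_20101001 USING sample_collection_starttime_idx; ANALYZE sample_20101001;'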

>
>
> I've considered partitioning, but I don't think that's going to give the
> effect I need. Apparently partitioning only scales to a few dozen child
> tables, so that would give only one order of magnitude of improvement for
> significant complexity.
>
>

regarding partitioning: I guess it starts to make sense around 10M rows or
10 GB in one table.
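To see where your table stands against that rule of thumb (both functions
exist since 8.1):

  -- total on-disk size of the table, its indexes and TOAST data
  SELECT pg_size_pretty(pg_total_relation_size('sample_20101001'));

  -- estimated row count from planner statistics (no full scan needed)
  SELECT reltuples::bigint AS estimated_rows
  FROM pg_class WHERE relname = 'sample_20101001';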

regarding clustering: it does not help with index bloat.
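If bloated indexes turn out to be the problem, REINDEX is the direct remedy
(note it locks out writes to the table while it rebuilds, so again schedule
it carefully):

  -- rebuild every index on the table from scratch
  REINDEX TABLE sample_20101001;

  -- or rebuild just one index
  REINDEX INDEX sample_collection_starttime_idx;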

and finally, you did not specify which PostgreSQL version you are using.
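That is always worth including in a performance report; a simple

  SELECT version();

shows the full server version string.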

cheers,
Filip
