Re: Recommendations for partitioning?

From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Recommendations for partitioning?
Date: 2013-12-19 16:53:35
Message-ID: CAAcYxUcb0NFfMDsMOCL5scNRrUL7=9hKxjz021JMQp0r7f5sCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Sorry for the delay response. We had some hardware/configuration issues
that appear to be solved now, so now we're starting to actually play with
modifying the database.

On Sat, Dec 7, 2013 at 1:29 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Thu, Dec 5, 2013 at 7:36 AM, Dave Johansen <davejohansen(at)gmail(dot)com>wrote:
>
>> I'm managing a database that is adding about 10-20M records per day to a
>> table and time is a core part of most queries,
>>
>
>
> What is the nature of how the time column is used in the queries?
> Depending on how it is used, you might not get much improvement at all, or
> you might get N fold improvement, or you might find that re-designing your
> indexes could get you the same query improvement that partitioning would,
> but with less headache.
>

The time column is usually used to calculate statistics, find/analyze
duplicates, analyze data contents, etc on a specific time window. So there
will be queries with GROUP BY and WINDOWs with a specific time filter in
the where clause.

>
> so I've been looking into seeing if I need to start using partitioning
>> based on the time column and I've found these general guidelines:
>>
>> Don't use more than about 50 paritions (
>> http://www.postgresql.org/message-id/17147.1271696670@sss.pgh.pa.us )
>> Use triggers to make the interface easier (
>> https://wiki.postgresql.org/wiki/Table_partitioning#Trigger-based and
>> http://stackoverflow.com/questions/16049396/postgres-partition-by-week )
>>
>
> Using triggers slows INSERTs down by a lot (unless they were already slow
> due to the need to hit disk to maintain the indexes or something like
> that). Are you sure you can handle that slow down, given your insertion
> rate? You could get the best of both worlds by having your bulk loaders
> target the correct partition directly, but also have the triggers on the
> parent table for any programs that don't get the message.
>

Inserting directly into the correct partition whenever possible and leaving
the trigger on the parent table seems like the best option.

>
>> The only data I found fell inline with what you'd expect (i.e. speeds up
>> selects but slows down inserts/updates
>> http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/)
>>
>
>
> One of the big benefits of partitioning can be to speed up insertions
> tremendously, by keeping the hot part of the indices that need to be
> maintained upon insertion together in shared_buffers.
>

We insert lots of new data, but rarely modify existing data once it's in
the database, so it sounds like this would be a big benefit for us.

>
>
>> So I was thinking that partitioning based on month to keep the number of
>> partitions low, so that would mean about 0.5G records in each table. Does
>> that seem like a reasonable number of records in each partition? Is there
>> anything else that I should consider or be aware of?
>>
>
> How will data be expired? Hows does the size of one of your intended
> partitions compare to your RAM and shared_buffers.
>

We add about 10-20 million records per day with each being about 200 bytes
in size (there's a bytea in there with that being the average size) to each
table and there's 64 GB of RAM on the machine.

>
> Cheers,
>
> Jeff
>

On Sat, Dec 7, 2013 at 1:37 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>wrote:
> I'll add that you can use assymetric partitioning if you tend to do a
> lot of more fine grained queries on recent data and more big roll up
> on older ones. I.e. partition by month except for the last 30 days, do
> it by day etc. Then at the end of the month roll all the days into a
> month partition and delete them.

This sounds like a great solution for us. Is there some trick to roll the
records from one partition to another? Or is the only way just a SELECT
INTO followed by a DELETE?

Thanks,
Dave

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Dave Johansen 2013-12-19 17:00:38 Unexpected pgbench result
Previous Message Marc Cousin 2013-12-19 15:33:24 query plan not optimal