Re-partitioning huge schema

From: Erik Jones <erik(at)myemma(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re-partitioning huge schema
Date: 2007-12-03 22:32:30
Message-ID: 39686278-5E30-43EE-8E5D-5898918CD22E@myemma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi, I've inherited <smirk> a database schema wherein the original
developers took the inheritance mechanism to an extreme where new
client accounts get 13 different tables of their own created for
them. We're at the many tens of thousands of tables mark (well over
100K) and I'm going to be re-partitioning most of those 13 table
types. Basically, what I'm looking for here is advice on whether or
not my proposed method of repartitioning the data is valid or if I
should be thinking differently.

We're an email marketing campaign creation, delivery, and response
tracking delivery service and so I'll use the opens tracking tables
for this.

Here's the opens tables' layout:

Column | Type |
Modifiers
---------------------------+-----------------------------
+------------------------
message_user_agent | character varying(255) | not null
message_open_ts | timestamp without time zone | not null
default now()
message_is_text_only |
smallint | not null default 0
message_id |
bigint | not null
mailing_id |
bigint |
member_id |
bigint |
Indexes:
"u_mesopens_pkey" PRIMARY KEY, btree (emma_message_id)
"u_mesopens_emma_message_open_ts" btree (emma_message_open_ts)
"u_mesopens_emma_mailing_id_idx" btree (emma_mailing_id)
"u_mesopens_emma_member_id_idx" btree (emma_member_id)

All of the other types will follow this general style, i.e. id based
primary key and indexes along with a timestamp somewhere in there
recording the original entry time. The majority of the queries we
run against these are solely id based with some being time (or id +
time) based, probably less than 10-20% for the latter.

In order to get a ballpark idea of the current table stats I ran the
following query against pg_class for all of these type of tables
(this one is just for the opens):

select substring(relname from '[0-9]+_(.*)$') as ttype,
count(relname) as total_tables,
sum(reltuples) as total_tuples,
max(reltuples) as max_tuples,
to_char(avg(reltuples), '999999.99') as avg_num_tuples,
to_char(stddev_pop(reltuples), '9999999.99') as std_dev_from_avg
from pg_class
where relname ~ 'opens$'
and substring(relname from '[0-9]+_(.*)$') is not null
group by ttype;

With the following results:

-[ RECORD 1 ]----+-----------------------
ttype | messages_history_opens
total_tables | 14027
total_tuples | 139284528
max_tuples | 2760599
avg_num_tuples | 9929.84
std_dev_from_avg | 59945.51

Now, for this discussion, let me also point out that we've had a
pretty steady growth rate of about 150% per year since the company
opened 5 years ago and, for the sake of argument, we'll use that here
although we really don't have any guarantees past personal faith that
we can maintaing that :)

So, I'm looking at both a bin partiioning system or range based on
the date timestamps and both seem to have their pros and cons. For
the bin example, for this table type if I set it at 300 partitions it
will take approximately 5 years before any of the partitions reaches
the size of our current largest opens table. This is obviously very
attractive from a management perspective and has the added advantage
that I could probably expect the spread of data (volume per table) to
be pretty even over time. However, it's those times somebody wants
to ask, "Show me all my members who have opened in the last year"
that it becomes a problem as that data would most likely be spread
over a radically varying number of partitions. I could probably
solve that by making the partition based on modulo account_id.

The other option, of course, is to go with the "standard" (my quotes
based on what I've seen on this and the performance list) range based
partitioning. However, it really wouldn't make a lot of sense to
make those partitions based on the timestamp fields as the data for
one mailing could then span many of the partitions and the same
question I noted in the last paragraph would again result in hitting
potentially many (although not as many as before) partitions. So,
what I'm left with here is to partition on id ranges but, in order to
maintain a somewhat stable maximum partition size, I'd have to play a
statistical guessing game wrt the size previous month's (or week's)
partitions grew to v. their defined id ranges. That's doable, but
not very attractive. To be honest, I'm not even convinced that
that's something I'd really need to worry about on a month-to-month
basis.

Seen from one perspective, partitioning on id ranges v. using the bin
partitioning method are kind of similar. The biggest differences
being that with the range setup the majority of the working set of
data will be in the last couple of month's partitions whereas with
the bin method it will be spread pretty evenly across all of them,
i.e. lots of pages for a couple pages constantly being worked with v.
a couple pages from a lot of tables constantly being worked with.

Any thoughts or advice?

Erik Jones

Software Developer | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Erik Jones 2007-12-03 22:38:24 Re: Moving pgstat.stat and pgstat.tmp
Previous Message Alvaro Herrera 2007-12-03 22:17:30 Re: postmaster logfile