Re: Questions about Partitioning

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Questions about Partitioning
Date: 2011-04-19 20:40:29
Message-ID: 4DADF33D.8060705@pinpointresearch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 04/19/2011 08:56 AM, Phoenix Kiula wrote:
> While I fix some bigger DB woes, I have learned a lesson. Huge indexes
> and tables are a pain.
>
> Which makes me doubly keen on looking at partitioning.
Before jumping into partitioning it would be useful to know specifically
what pain you are having with your current tables and indexes.
Maintenance? Performance? Other? Question zero is "What issues are
currently causing you pain with large tables?" and after that
determining if the partitioning is an appropriate solution. There is
pain associated with partitioning, as well, so you need to be sure that
you will achieve a net pain reduction.

Carefully read
http://www.postgresql.org/docs/current/static/ddl-partitioning.html, it
has examples that answer several of your questions. Pay extra attention
to 5.9.6 "Caveats".

Some places where partitioning work well:

1. The partition can substitute for an index and the resulting child
tables will have somewhat comparable sizes. If you had contact
information where state was typically required in queries you might
partition the data into tables for each state so a typical query would
only touch a smaller data set and the partitioning/child-table
constraints substitute for an index on state.

2. You frequently drop data in bulk and can group that data in such a
way that you can drop or truncate a child-table. Among the places I've
used partitioning is for validation codes. I partition them by like
expiration and when the date arrives, I just drop the partition with the
expired codes - way faster than delete-from and the necessary follow-up
maintenance when deleting millions of codes.

3. The nature of your data is such that it can be partitioned into a
small part that is accessed frequently and parts that are relatively
rarely accessed.

> Most examples I see online are partitioned by date. As in months, or
> quarter, and so on. This doesn't work for me as I don't have too much
> logic required based on time.

Time-based data often satisfies all of the above (log data you can
partition by month, typically only look at the current month and drop
data that is a year old, for example) so that's what ends up being in
most examples.

> The biggest, highest volume SELECT in my database happens through an
> "alias" column. This is an alphanumeric column. The second-biggest
> SELECT happens through the "userid" column -- because many users check
> their account every day.
>
> A rough table definition can be considered as follows:
>
> CREATE TABLE maintable
> id SERIAL primary key
> alias VARCHAR(42)
> ...
> user_id VARCHAR(30)
> user_registered BOOLEAN
> status VARCHAR(1)
>
>
> My questions:
>
> 1. Which column should I partition by -- the "alias" because it's the
> largest contributor of queries? This should be OK, but my concern is
> that when user_id queries are happening, then the data for the same
> user will come through many subtables that are partitioned by "alias"
> -- will this happen automatically (presuming "constraint exclusion" is
> on)? How does partitioning by one column affect queries on others.
> Will there be subtable-by-subtable indexes on both alias and

Answer question zero, above, first. But beware - the primary key is not
inherited. You run the risk of duplicating the primary key (or other
unique identifier) across child tables unless you implement the
appropriate constraints on the child tables to prevent this. It's also
pointless to have a primary key on the parent table in most situations.

> 2. How does SERIAL type work with partitions? Will INSERT data go into
> the respective partitions and yet maintain an overall sequence -- I
> mean, the *same* overall sequence for the parent table distributed
> automagically across subtables?
This depends on how you set up your triggers, constraints, child tables
etc. but by default a basic "create table thechild () inherits
(theparent);" will result in a child table that shares the same sequence
as the parent.
> 3. If I partition using "a%", "b%" etc up to "z%" as the partition
> condition, is this an issue -- are about 26 subtables too many
> partitions? Mine are static partitions as in they will be the same
> forever, unlike data-based partitions. And each partition will
> continue to grow. If I include that "alias"es can begin with numbers
> and allowed symbols too, then this may be 45 partitions? What's the
> limit of partitions -- not only official limit, but practical limit in
> terms of performance?

As always, the answer is "depends" but I wouldn't typically see 45 as
too many. See primary-key warning above. It's less an absolute number of
tables and more whether the design of your tables and queries results in
execution efficiency gains that outweigh the additional planner costs.

> 4. Given that it's a wildcard LIKE condition (with a "%") will this
> affect the index and subsequent SELECT speed? Are partition conditions
> recommended to be "=" or "<" type operators only or is LIKE ok??
The restriction in your where clause should closely match the constraint
definition. I don't know precisely how clever or forgiving the planner
is in different situations but have the best results when my where
clauses include a condition that exactly matches my constraint
definitions. I don't know the details of how the planner decides when to
use constraint exclusion but I have observed that a constraint of
"the_data ~ '^a'",... does not seem to cause the planner to use
constraints but "substr(the_data,1,1)='a'", ... does.
> 5. Does partitioning need to happen only through one column? Can I
> have a condition containing two columns instead?
>
> CREATE TABLE subtable_a (
> PRIMARY KEY (id)
> CHECK ( user_id LIKE 'a%' and user_registered IS TRUE)
> ) INHERITS (maintable);
>
> CREATE TABLE subtable_b (
> PRIMARY KEY (id),
> CHECK ( user_id LIKE 'b%' and user_registered IS TRUE)
> ) INHERITS (maintable);
>
> ......etc....
You can use multiple columns, expressions, etc. For this and the prior
question, use "explain" to verify that your design and queries are
operating as desired.

> 7. "Constraint exclusion" - is it recommended to have this in the
> pg.conf, or will I need to do this before every SQL? I prefer the
> pg.conf way, but want to confirm that there are no downsides for other
> regular SQL operations with this setting?
The default, constraint_exclusion = partition, should be fine. The
planner will only check constraints in situations where it might be useful.
> 8. How will JOIN work? I have different tables JOINing with the parent
> table now. With partitioned subtables, will constraint exclusion
> automatically do what's needed and my SQL does not need to change? Or
> will there be triggers required for each and every query I currently
> have?

They should "work", just fine. But they will probably not gain in
efficiency unless they are of a type where constraint exclusion will be
beneficial and are rewritten so the planner will use that exclusion.
Note, also, that the planner will not use constraint exclusion on
prepared queries.

>
> Eight questions is enough for my first post in this partitioning thread :)
>
> Thanks much!
>

Partitioning is a fabulous solution to problems for which it is suited.
I'm not convinced that yours is one of them. You may be looking at lots
of pain trying to get it to work correctly and it may not even be of
benefit. Which returns us to question zero.

Cheers,
Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Andrej 2011-04-19 22:13:28 Re: Using column aliasses in the same query
Previous Message Daniele Varrazzo 2011-04-19 19:45:19 Re: setting connection/ query timeout