Storage Model for Partitioning

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Storage Model for Partitioning
Date: 2008-01-11 10:25:40
Message-ID: 1200047140.4266.972.camel@ebony.site
Lists: pgsql-hackers

In my striving towards more effective partitioning for Postgres, I see
we have one main decision to make and that all other sub-tasks are
driven directly by this one issue. The issue is: at what point do we
store the data within the existing storage model? We discussed this in
2005, when I first raised what became constraint exclusion, and it
remains the core issue from which all other tasks are driven.

If we can establish the basics of how a table can be split into
partitions, then that allows work to progress on the other issues.

I'd like some guidance from the senior crew on this, which is hopefully
possible without getting embroiled in all the details of partitioning,
most of which are more straightforward technical issues.

The current storage model for a table is that above the smgr layer a
table looks like a single continuous range of blocks, while below the
smgr layer it is in fact a set of segment files.
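
As a rough illustration of that split, the sketch below maps a logical
block number (the view above smgr) to a segment file and a byte offset
within it (the view below smgr). It assumes the default build settings
of 8 kB blocks and 1 GB segments, i.e. 131072 blocks per segment; the
function name locate_block is hypothetical, not a PostgreSQL API.

```python
BLCKSZ = 8192          # bytes per block (assumed default build setting)
RELSEG_SIZE = 131072   # blocks per segment file (assumed default: 1 GB / 8 kB)

def locate_block(blkno):
    """Return (segment number, byte offset in that segment) for a logical block."""
    if blkno < 0:
        raise ValueError("block number must be non-negative")
    # Above smgr the table is one continuous block range; below it,
    # each run of RELSEG_SIZE blocks lives in its own segment file.
    segno, blk_in_seg = divmod(blkno, RELSEG_SIZE)
    return segno, blk_in_seg * BLCKSZ
```

Under these assumptions block 131073 is the second block of the second
segment file, so locate_block(131073) gives segment 1 at offset 8192.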

Given where we are now, how should we change the storage model to
support partitioning, if at all? We have two types of requirement:
- manageability - tablespaces for each partition, etc.
- performance - excluding partitions, etc.

I've argued from multiple sides of the fence, so I'm trying to present a
neutral view to allow us to take the best route forward.

The basic options are these:

0. Do Nothing - we don't want any of the other options.

1. Partitions are Contiguous Ranges of Blocks
As proposed for segment exclusion based partitioning.

2. Partitions are Tables
As used by current constraint exclusion based partitioning.

3. Partitions are RelFileNodes, but not Tables

4. Some Other Choice

In more detail...

1. Partitions are Contiguous Ranges of Blocks

Partitions are a simple subset of a table, i.e. a contiguous range of
blocks within the main block range of the table. That allows us to
maintain the current smgr model, which in turn:

- allows RI via SHARE locks
- avoids the need for complex planner changes
- allows unique indexes
- allows global indexes (because we only have one table)
- works naturally with synchronous scans and buffer recycling

Doing partitioning this way means we would (as a trivial example) assign
blocks 1-10 as partition 1, blocks 11-20 as partition 2, and so on. There
are two sub-options of this basic idea:

a) Dynamic Partitioning - we define partitioning based around what is
already in the table, rather than trying to force the data to a
"correct" partition. No changes to the current storage model. Useful,
but it doesn't do all that everybody wants.

- allows automated table extension, so works automatically with Slony
- allows partition-wise merge joins
- but not easily usable with declarative partitioning
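
One way to picture the dynamic option is to keep a small summary for
each contiguous block range, describing what is already stored there,
and to skip ranges whose summary cannot match the query. The sketch
below assumes a min/max summary on a single integer column; the names
RangeSummary and ranges_to_scan are illustrative only and do not
describe any actual patch.

```python
from dataclasses import dataclass

@dataclass
class RangeSummary:
    first_block: int   # first block of the contiguous range
    last_block: int    # last block of the range, inclusive
    min_val: int       # smallest column value currently stored in the range
    max_val: int       # largest column value currently stored in the range

def ranges_to_scan(summaries, lo, hi):
    """Return the block ranges that may hold values in [lo, hi]."""
    return [(s.first_block, s.last_block)
            for s in summaries
            if s.max_val >= lo and s.min_val <= hi]
```

The partition boundaries here follow the data rather than the other way
round, which is why nothing in the storage model has to change.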

b) Fixed partitioning - we define partitions as static ranges of blocks,
which may leave us with holes in the range of BlockNumbers, plus each
partition has a maximum size that it cannot expand beyond. Probably
unacceptable.
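
The fixed option can be sketched as simple arithmetic on block numbers.
The example below assumes 10 blocks per partition and 1-based block
numbers, to match the trivial example above; all names are hypothetical.

```python
BLOCKS_PER_PARTITION = 10   # assumption: matches the 1-10, 11-20 example

def partition_of(blkno):
    """Partition number (1-based) that owns a 1-based block number."""
    if blkno < 1:
        raise ValueError("block numbers start at 1 in this example")
    return (blkno - 1) // BLOCKS_PER_PARTITION + 1

def block_range(partition):
    """First and last block (inclusive) reserved for a partition."""
    first = (partition - 1) * BLOCKS_PER_PARTITION + 1
    return first, first + BLOCKS_PER_PARTITION - 1

def next_block(partition, blocks_used):
    """Next block to allocate in a partition; fails once the range is full."""
    if blocks_used >= BLOCKS_PER_PARTITION:
        raise OverflowError("partition is full and cannot expand")
    first, _ = block_range(partition)
    return first + blocks_used
```

If partition 1 has only used 3 blocks, blocks 4-10 stay reserved but
empty, which is the hole in the BlockNumber range; a partition that has
used all 10 blocks has nowhere to grow.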

2. Partitions are Tables

Current situation.

This means we have to change
- nested loop joins, so they work with partitions: an IndexScan must be
able to cross partitions within the target table
- indexes, so they can refer to more than one partition
- share locking in the executor
- synchronous scans and buffer recycling, so they work with partitions
- partition creation, which would need to be automatic
- DDL, so a single declaration can be made from the parent table
3. Partitions are RelFileNodes, but not Tables

We allow a table to have multiple RelFileNodes, when explicitly declared
that way.

This means we have to change
- DDL, so that TABLE level changes apply to all RelFileNodes, while
PARTITION level changes apply to only one RelFileNode
- indexes, so they can refer to more than one partition
- share locking in the executor
- synchronous scans and buffer recycling, so they work with partitions
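
The DDL scoping in the first item can be sketched as one table object
that owns several per-partition settings. This is a toy model, not
catalog code: PartitionedTable, alter_table and alter_partition are
hypothetical names, and the settings are just dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionedTable:
    name: str
    # partition name -> settings of the RelFileNode behind it
    partitions: dict = field(default_factory=dict)

    def alter_table(self, **settings):
        """TABLE level change: applies to every RelFileNode."""
        for opts in self.partitions.values():
            opts.update(settings)

    def alter_partition(self, partition, **settings):
        """PARTITION level change: applies to one RelFileNode only."""
        self.partitions[partition].update(settings)
```

For example, moving one partition to a different tablespace would go
through alter_partition, while a change meant for the whole table would
go through alter_table and reach every partition.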

There *are* other changes required for partitioning that are not
mentioned here; although complex, they are less in doubt.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
