Re: Table Partitioning Feature

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)frogthinker(dot)org>, pgsql-hackers(at)postgresql(dot)org, swm(at)alcove(dot)com(dot)au
Subject: Re: Table Partitioning Feature
Date: 2009-02-10 05:16:00
Message-ID: 603c8f070902092116j276836f2p2d8e243e98a4a20e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 9, 2009 at 9:16 AM, Amit Gupta <amit(dot)pc(dot)gupta(at)gmail(dot)com> wrote:
> Hi Emmanuel,
>
> We are considering to following approach:
> 1. metadata table pg_partitions is defined as follows:
> CATALOG(pg_partitions,2336) BKI_WITHOUT_OIDS
> {
> Oid partrelid; // partition table Oid
> Oid parentrelid; // Parent table Oid
> int4 parttype; // Type of partition, list, hash, range
> Oid partkey; // partition key Oid
> Oid keytype; /// type of partition key.
> int4 keyorder /// order of the key in multi-key partitions.
> text min;
> text max; // min and max for range parti
> text[] list;
> int hash; // hash value
> } FormData_pg_partitions;
>
>
> 2. C triggers will fire a query on this table to get the relevant
> partition of the inserted/updated data using SPI interface. The query
> will look something like (for range partitioning)
>
> select min(partrelid)
> from pg_partitions
> where parentrelid = 2934 // we know this value
> and (
> ( $1 between to_int(min ) and to_int(max) and
> keyorder = 1) OR
> ($2 between to_date (min) and to_date (max) and
> keyorder =2 )
> ....
> )
> group by
> parentrelid
> having
> count(*) = <number of partition keys>
>
> $1, $2, ... are the placeholders of the actual partition key values of
> trigger tuple.
>
> Since we know the type of partition keys, and the parentrelid, this
> kind of query string can be saved in another table say, pg_part_map.
> And its plan can be parsed once and saved in cache to be reused.
> Do you see any issue with using SPI interface within triggers?
>
> The advantage of this kind of approah is that trigger code can be made
> genric for any kind of partition table.

I am a little fuzzy on what you're proposing here, but I think you're
saying that you're only going to support range partitioning on
integers or dates and that you plan to use the text type to store the
integer or date values. FWIW, those don't seem like very good
decisions to me. I think you should aim to support range partitioning
on any combination of a datatype and a less-than operator, similar to
what pg_statistic does for statistics. pg_statistic uses anyarray to
store the datums.

I am also somewhat skeptical about the idea of using triggers for
this. I haven't scrutinized the issue in detail, so I may be all
wet... but ISTM that the concerns raised elsewhere about the order in
which triggers can be expected to fire may bite you fairly hard. ISTM
the right semantics are something like this:

- fire all of the row-level BEFORE triggers on the parent table
(giving up if any return NULL)
- determine the correct child table based on the resulting tuple
- fire all of the row-level BEFORE triggers on the child table (giving
up if any return NULL)
- insert the tuple into the child table
- fire all of the row-level AFTER triggers on the child table... and
possibly also the parent table... not sure about the order

You will also need to fire statement-level triggers on the appropriate
tables, which is a little tricky. Presumably you want the tables on
which the AFTER triggers fire to be the same ones as those on which
the BEFORE triggers fire, but you don't know which child tables you're
actually going to hit until you actually perform the action. Maybe
the right thing to do is fire both sets of triggers on the parent
table and those child tables not excluded by constraint exclusion...?
But I'm not sure about that.

Anyway, getting these types of behavior via triggers may be tricky.
But then again maybe not: I haven't read the code.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message K, Niranjan (NSN - IN/Bangalore) 2009-02-10 06:25:57 Re: Synch Replication
Previous Message Euler Taveira de Oliveira 2009-02-10 05:02:22 Re: [COMMITTERS] pgsql: Update autovacuum to use reloptions instead of a system catalog,