Re: Partitioning syntax

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Partitioning syntax
Date: 2010-07-06 17:49:42
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Thu, Jun 17, 2010 at 9:34 PM, Takahiro Itagaki
<itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
>> This one, doesn't apply to head anymore... please update
> Thank you for reviewing my patch!
> I attached an updated patch set for partitioning syntax.

I've taken a little bit more of a look at this patch and I guess I'm
not too happy with the design.

1. It seems to me that the proposed design for pg_partition is poorly
thought out. In particular, I don't see how this would work if we
wanted to partition on multiple keys, which is a feature supported by
both Oracle and MySQL. It would also be nice to give at least some
thought to how we might handle partitioning by list with
subpartitioning by range or hash, or range partitioning with
subpartitioning by hash. We certainly don't need to do
subpartitioning in the first version of the patch, but I think we
should have a plan.

2. I am still of the view that the first version of this patch should
correctly handle routing of INSERT and COPY data to the correct
partition. But at a very minimum we need to have a plan for how we're
going to implement that in a follow-on patch. I think the way to do
this is to binary search a sorted array of partition keys (perhaps
upper bounds for range partitioning, and exact values for list
partitioning). When you find the correct key, then you find the index
of that key and look up that same index in a separate array of table
OIDs and insert there. While it's possible to construct such a
structure from the proposed catalog structure, it requires an index
scan. I'm wondering if it might be better to abandon the idea of
storing the partition values in pg_inherits and instead put
preconstructed arrays directly into pg_partition. That way, with a
single row fetch, you can get all the data you need. I'm not sure
this is better, though - other opinions?

3. For a first version of this patch, I would suggest that we only
allow partitioning by base columns, rather than expressions. When
someone goes to do a bulk load of data into the table, and we want to
do automatic tuple routing, we're going to have to evaluate the
partitioning expression(s) for every row. I'm just guessing here, but
I bet it's a lot cheaper to fetch an attribute by attnum than to
evaluate an arbitrary expression. So even if we add partitioning by
expression later, I don't think that the work to make a special case
for base columns will be wasted.

4. The dependency handling needs more thought. For example, if I do this:

create table names (id integer primary key, name varchar not null)
partition by range (id) (partition names_1 values less than 1000,
partition names_2 values less than 2000, partition names_3 values less
than 3000);

and then I drop names_2, the automatically generated constraint on
names_3 still says CHECK (2000 <= id AND id < 3000). And I can drop
those constraints off the individual tables, too, which doesn't seem
like it ought to be allowed. And if I do something like this:

create or replace function f(int) returns int as $$select $1$$
immutable language sql;
create table names (id integer primary key, name varchar not null)
partition by range (f(id)) (partition names_1 values less than 1000,
partition names_2 values less than 2000, partition names_3 values less
than 3000);
alter table names_1 drop constraint names_1_id_check;
alter table names_2 drop constraint names_2_id_check;
alter table names_3 drop constraint names_3_id_check;
drop function f(int);

...then I get:

ERROR: cannot drop function f(integer) because other objects depend on it
DETAIL: range partition depends on function f(integer)

...but there's no identification of which range partition depends on it.

The locking is also broken here. In session #1, start a transaction
and do CREATE PARTITION on an existing partitioned table. Then in
session #2, do DROP FUNCTION <some function> CASCADE in a manner that
leads to the range partition getting dropped. Then commit session #1.
Now you have a pg_inherits catalog with leftovers in inhvalues.

5. The use of the term "partition" is not very consistent. For
example, we use CREATE PARTITION to create a partition, but we use
DROP TABLE to get rid of it (there is no DROP PARTITION). I think
that the right syntax to use here is ALTER TABLE ... ADD/DROP
PARTITION; both Oracle and MySQL do it that way. And meanwhile
OCLASS_PARTITION means "the partitioning information associated with
the parent table", not "a partition of a parent table".

6. There's some kind of magic in here associated with indexes on the
parent table - it seems that matching indexes or primary keys are
automatically created on each child table. But there's no provision
for keeping them in sync. If I create a partitioned table with a
primary key, the key is inherited by all its current children. If I
then drop the primary key, it disappears from the parent but it still
exists on the children. Any new children created afterwards don't
have it, however. I'm not sure whether indices should propagate from
parent to child or not, but propagating whatever exists at the moment
of creation and then forgetting about it doesn't seem right.

7. I'm not convinced that it's a good idea to treat ALTER TABLE parent
ATTACH/DETACH PARTITION child as basically a synonym for ALTER TABLE
child [NO] INHERIT parent, but even if it is the current
implementation seems way too permissive (it also lacks comments and
adequate documentation). You can, for example, use ATTACH PARTITION
to add a new child and then NO INHERIT to detach it again; or you can
use INHERIT to attach a child even when the parent is partitioned. It
does however catch the case of trying to use ATTACH PARTITION to
attach a child to an unpartitioned parent.

Robert Haas
The Enterprise Postgres Company

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-07-06 18:38:48 Re: Re: [COMMITTERS] pgsql: Fix log_temp_files docs and comments to say bytes not kilobytes.
Previous Message Pavel Golub 2010-07-06 17:08:36 keepalive in libpq using