Re: [POC] hash partitioning

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] hash partitioning
Date: 2017-03-03 11:01:57
Message-ID: 20170303200157.d4e3e37b.nagata@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2 Mar 2017 18:33:42 +0530
amul sul <sulamul(at)gmail(dot)com> wrote:

Thank you for the patch. This is very interesting. I'm going to look
into your code and write a feedback later.

> On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
>
> > ​[....]​
> >
> > I Agree that it is unavoidable partitions number in modulo hashing,
> > > but we can do in other hashing technique. Have you had thought about
> > > Linear hashing[1] or Consistent hashing​[2]?​ This will allow us to
> > > add/drop
> > > partition with minimal row moment. ​
> >
> > Thank you for your information of hash technique. I'll see them
> > and try to allowing the number of partitions to be changed.
> >
> > ​
> Thanks for showing interest, I was also talking about this with Robert Haas
> and
> hacking on this, here is what we came up with this.
>
> If we want to introduce hash partitioning without syntax contort and minimal
> movement while changing hash partitions (ADD-DROP/ATTACH-DETACH operation),
> at start I thought we could pick up linear hashing, because of in both the
> hashing we might need to move approx tot_num_of_tuple/tot_num_of_partitions
> tuples at adding new partition and no row moment required at dropping
> partitioning.
>
> With further thinking and talking through the idea of using linear hashing
> with my team, we realized that has some problems specially during pg_dump
> and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of
> pg_dump which is used by pg_restore need to maintain the identity of the
> partitions. We can't rely on things like OID order which may be unstable, or
> name order which might not match the order in which partitions were added.
> So
> somehow the partition position would need to be specified explicitly.
>
> So later we came up with some syntax like this (just fyi, this doesn't add
> any new keywords):
>
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 4, remainder 0);
> create table foo2 partition of foo with (modulus 8, remainder 1); --
> legal, modulus doesn't need to match
> create table foo3 partition of foo with (modulus 8, remainder 4); --
> illegal, overlaps foo1
>
> Here we​ need to​ enforce a rule that every modulus must be a factor of the
> next
> larger modulus. So, for example, if you have a bunch of partitions that all
> have
> modulus 5, you can add a new​ ​partition with modulus 10 or a new partition
> with
> modulus 15, but you cannot add both a partition with modulus 10 and a
> partition
> with modulus 15, because 10 is not a factor of 15. However, you could
> simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you
> wished, because each modulus is a factor of the next larger one. You could
> also use modulus 10, modulus 20, and modulus 60. But you could not use
> modulus
> 10, modulus 15, and modulus 60, because while both of the smaller module are
> factors of 60, it is not true that each is a factor of the next.
>
> Other advantages with this rule are:
>
> 1. Dropping ​(or detaching) and adding (or attaching) ​a partition can never
> cause the rule to be violated.
>
> 2. We can easily build a tuple-routing data structure based on the largest
> modulus.
>
> For example: If the user has
> partition 1 with (modulus 2, remainder 1),
> partition 2 with (modulus 4, remainder 2),
> partition 3 with (modulus 8, remainder 0) and
> partition 4 with (modulus 8, remainder 4),
>
> then we can build the following tuple routing array in the relcache:
>
> == lookup table for hashvalue % 8 ==
> 0 => p3
> 1 => p1
> 2 => p2
> 3 => p1
> 4 => p4
> 5 => p1
> 6 => p2
> 7 => p1
>
> 3. It's also quite easy to test with a proposed new partition overlaps with
> any
> existing partition. Just build the mapping array and see if you ever end up
> trying to assign a partition to a slot that's already been assigned to some
> other partition.
>
> We can still work on the proposed syntax - and I am open for suggestions.
> One
> more thought is to use FOR VALUES HAVING like:
> CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder
> 1);
>
> But still more thoughts/inputs welcome here.
>
> Attached patch implements former syntax, here is quick demonstration:
>
> 1.CREATE :
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 2, remainder 1);
> create table foo2 partition of foo with (modulus 4, remainder 2);
> create table foo3 partition of foo with (modulus 8, remainder 0);
> create table foo4 partition of foo with (modulus 8, remainder 4);
>
> 2. Display parent table info:
> postgres=# \d+ foo
> Table "public.foo"
> Column | Type | Collation | Nullable | Default | Storage | Stats
> target | Description
> --------+---------+-----------+----------+---------+----------+--------------+-------------
> a | integer | | | | plain |
> |
> b | text | | | | extended |
> |
> Partition key: HASH (a)
> Partitions: foo1 WITH (modulus 2, remainder 1),
> foo2 WITH (modulus 4, remainder 2),
> foo3 WITH (modulus 8, remainder 0),
> foo4 WITH (modulus 8, remainder 4)
>
> 3. Display child table info:
> postgres=# \d+ foo1
> Table "public.foo1"
> Column | Type | Collation | Nullable | Default | Storage | Stats
> target | Description
> --------+---------+-----------+----------+---------+----------+--------------+-------------
> a | integer | | | | plain |
> |
> b | text | | | | extended |
> |
> Partition of: foo WITH (modulus 2, remainder 1)
>
> 4. INSERT:
> postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;
> INSERT 0 10
>
> postgres=# select tableoid::regclass as part, * from foo;
> part | a | b
> ------+----+-----
> foo1 | 3 | abc
> foo1 | 4 | abc
> foo1 | 7 | abc
> foo1 | 10 | abc
> foo2 | 1 | abc
> foo2 | 2 | abc
> foo2 | 9 | abc
> foo3 | 6 | abc
> foo4 | 5 | abc
> foo4 | 8 | abc
> (10 rows)
>
> TODOs.
> 1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.
> 2. Trim regression tests (if require).
> 3. Documentation
>
> Thoughts/Comments?

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2017-03-03 11:30:17 Re: [POC] hash partitioning
Previous Message Robert Haas 2017-03-03 10:27:05 Re: Proposal : Parallel Merge Join