Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Erik Rijkers <er(at)xs4all(dot)nl>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Declarative partitioning - another take
Date: 2016-12-08 15:25:08
Message-ID: CA+TgmobTxn2+0x96h5Le+GOK5kw3J37SRveNfzEdx9s5-Yd8vA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Wed, Dec 7, 2016 at 11:42 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>> Congrats to everyone working on this! This is a large step forward.
>
> Congratulations to all! It was a long way to this result.

Yes. The last effort in this area which I can remember was by Itagaki
Takahiro in 2010, so we've been waiting for this for more than 6
years. It's really good that Amit was able to put in the effort to
produce a committable patch, and I think he deserves all of our thanks
for getting that done - and NTT deserves our thanks for paying him to
do it.

Even though I know he put in a lot more work than I did, let me just
say: phew, even reviewing that was a ton of work.

Of course, this is the beginning, not the end. I've been thinking
about next steps -- here's an expanded list:

- more efficient plan-time partition pruning (constraint exclusion is too slow)
- run-time partition pruning
- partition-wise join (Ashutosh Bapat is already working on this)
- try to reduce lock levels
- hash partitioning
- the ability to create an index on the parent and have all of the
children inherit it; this should work something like constraint
inheritance. you could argue that this doesn't add any real new
capability but it's a huge usability feature.
- teaching autovacuum enough about inheritance hierarchies for it to
update the parent statistics when they get stale despite the lack of
any actual inserts/updates/deletes to the parent. this has been
pending for a long time, but it's only going to get more important
- row movement (aka avoiding the need for an ON UPDATE trigger on each
partition)
- insert (and eventually update) tuple routing for foreign partitions
- not scanning the parent
- fixing the insert routing so that we can skip tuple conversion where possible
- fleshing out the documentation

One thing I'm wondering is whether we can optimize away some of the
heavyweight locks. For example, if somebody does SELECT * FROM ptab
WHERE id = 1, they really shouldn't need to lock the entire
partitioning hierarchy, but right now they do. If the root knows
based on its own partitioning key that only one child is relevant, it
would be good to lock *only that child*. For this feature to be
competitive, it needs to scale to at least a few thousand partitions,
and locking thousands of objects instead of one or two is bound to be
slow. Similarly, you can imagine teaching COPY to lock partitions
only on demand; if no tuples are routed to a particular partition, we
don't need to lock it. There's a manageability component here, too:
not locking partitions unnecessarily makes ti easier to get DDL on
other partitions through. Alternatively, maybe we could rewrite the
lock manager to be hierarchical, so that you can take a single lock
that represents an AccessShareLock on all partitions and only need to
make one entry in the lock table to do it. That means that attempts
to lock individual partitions need to check not only for a lock on
that partition but also on anything further up in the hierarchy, but
that might be a good trade if it gives us O(1) locking on the parent.
And maybe we could also have a level of the hierarchy that represents
every-table-in-the-database, for the benefit of pg_dump. Of course,
rewriting the lock manager is a big project not for the faint of
heart, but I think if we don't it's going to be a scaling bottleneck.

We also need to consider other parts of the system that may not scale,
like pg_dump. For a long time, we've been sorta-kinda willing to fix
the worst of the scalability problems with pg_dump, but that's really
no longer an adequate response. People want 1000 partitions. Heck,
people want 1,000,000 partitions, but getting to where 1000 partitions
works well would help PostgreSQL a lot. Our oft-repeated line that
inheritance isn't designed for large numbers of inheritance children
is basically just telling people who have the use case where they need
that to go use some other product. Partitioning, like replication, is
not an optional feature for a world-class database. And, from a
technical point of view, I think we've now got an infrastructure that
really should be able to be scaled up considerably higher than what
we've been able to do in the past. When we were stuck with
inheritance + constraint exclusion, we could say "well, there's not
really any point because you'll hit these other limits anyway". But I
think now that's not really true. This patch eliminates one of the
core scalability problems in this area, and provides infrastructure
for attacking some of the others. I hope that people will step up and
do that. There's a huge opportunity here for PostgreSQL to become
relevant in use cases where it currently falters badly, and we should
try to take advantage of it. This patch is a big step by itself, but
if we ignore the potential to do more with this as the base we will be
leaving a lot of "win" on the table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2016-12-08 15:51:34 Re: Use EVP API pgcrypto encryption, dropping support for OpenSSL 0.9.6 and older
Previous Message Stephen Frost 2016-12-08 15:11:57 Re: pg_dump vs. TRANSFORMs