Comments on automatic DML routing and explicit partitioning subcommands

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kedar Potdar <kedar(dot)potdar(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Comments on automatic DML routing and explicit partitioning subcommands
Date: 2009-07-14 09:39:18
Message-ID: 1247564358.11347.1308.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

(submitting review slightly earlier than start of commitfest)


Thank you very much for the patch. Well done for getting to this stage.
There is definitely much support for your work.

My thoughts after initial review are fairly wide ranging. Overall, the
patch is not ready/close to commit in this commitfest. My estimate is
that it will take next 2 commitfests before we could commit it; that is
still comfortably within this release.


* I have one main comment that effects about 50% of the patch: We should
not be using triggers. Triggers are a useful external solution, but they
are not the best or even a desirable approach internally. Inserting data
using triggers creates a huge overhead in AfterTrigger events held in
memory. In my view this is totally unsuitable for use with VLDBs, which
is exactly the point of partitioning. There is nothing we can do to
reduce or aggregate these trigger events as is possible with RI trigger
checks (not yet done, but possible...).

It should be possible to avoid triggers altogether by having a function
that dynamically calculates the target partition relid from an input
tuple, somewhere around ExecInsert() and during DoCopy(). Thus, current
executor has minor changes and we do all required cleverness in a
partition function evaluation module. You should be able to rework much
of the code into that form. [bsearch()]

Dynamic evaluation would also help SQL such as in-bound FK checks, i.e.
FK checks against the partitioned table. That needs some other work
also, but the guts of it will be similar to both problems.

Avoiding triggers will also avoid the horrible looking stuff with
pg_dump. "Might see a few errors" is a phrase unlikely to result in
patch application, in my experience. :-)

Whether we accept that or not, there should be a clear focus on
measuring and reducing the time taken to route an inserted row through
to its target partition, rather than just automating it. There may be an
alternative that achieves better efficiencies. We should be measuring
timings with 100s-1000s of partitions, not just 2 or 3.

Dynamic partition evaluation will also be important in other places.
Partition evaluation

* I don't see any need or purpose for HASH partitioning. I suggest that
is added as a later patch once everything else is accepted, if there is
clear separate reason for it to exist. (I'm certain it has meaning for
your sponsors; we must differentiate between the needs of add-on
products and the needs of Postgres core). There isn't much point
discussing this issue until other parts are committed anyway and a
smaller patch will be more easily committed. Skip it for now, please.

* It appears that *every* update is handled as a delete plus a
re-insertion into the parent(s). Moving between partitions is usually an
indicator of a poor choice of partitioning scheme, so we should not be
de-optimising normal updates because of corner cases. I think we need to
handle/optimize the stays-in-same-partition and the moves-partition
cases differently. For me it would be an acceptable limitation to
disallow row movement, at least until we get the first part of this

* I think we should be making some clear assumptions that partitions
have identical row types to their parent and that we have only a single
parent for each partition. There seems like extra code to handle
additional cases.


* There is no explanatory or command syntax documentation. That prevents
people from understanding what you are attempting to deliver, apart from
reading the test case. Many people would probably have comments on how
partitioning works and syntax if they could see docs. (Please don't
refer me to earlier design docs - we need to see how this patch works;
external docs go stale very quickly. And no need to rehash the
description about the need for and benefits of partitioning - we already
have chapters on it in the pg docs.)

* The test case is nowhere near long enough for the number of additional
commands you are adding. I would expect to see about 10x as many test
cases to cover all the options, side cases and complexity. Multiple data
types, etc

* There are very few comments, and those that do exist are typically
one-liners. Look through other parts of the existing code and judge how
much commenting is necessary. Basically, lots of long explanatory
comments. Why this way? Why here? Why this API? Why not another way,
What assumptions, approximations have been made etc.. (I've found this
helps the author very much, not just the reviewer. Once extensive
comments have been received you will be on a later version and you start
to forget how parts work between all the different versions)

* The patch is >8000 lines long. I feel like it is trying to do too much
in one go and we should break it down.

* I suggest that you strip out all the stuff about complex partition
functions until we get the main items agreed, e.g. SPLIT PARTITION etc..
This will help you to concentrate on getting main items slick before we
move on.

* Please put all required files into one archive file, so we know which
files are which. Don't let the reviewer guess which files to apply.

* Changing the build system for some custom code is not likely to be
acceptable without good reason. AFAICS there is no good reason for the
way you've done it, and definitely no comments to explain that. There is
already an established way to add functions to the backend - use that.
I suspect we don't need to add any new external functions at all if we
avoid the trigger route as discussed above.

* It appears you don't have a unique index on pg_partition. We need a
constraint on it, plus we need an index to handle cases where we have
100s and 1000s of partitions. I think you should be doing this kind of
thing in your own testing. We don't want this to just pass the test
case, we want it to work with TB of data and multiple tables, each with
zillions of partitions.

* Not clear why we have pg_partition rather than changes to pg_inherits?
Why would we want another catalog table? You have two similar functions:
find_inheritance_parents() and find_inheritance_parent() - why both?

* Function names like parttype() or maxval() are not self explanatory,
please expand function names to help later readers of the code. We don't
use function names anywhere in Postgres that are both all lowercase and
where the words have no underscore to separate them. We use
function_names_like_this() or FunctionNamesLikeThis() but

* You mention that you've modified TRUNCATE etc to work with
partitioning. This was added in 8.4, so your patch no longer needs to do
anything and that part of it should be removed.

* I haven't tried to apply the patches, as yet. I feel there are major
changes required before we get to that level of confirmation. If we can
keep the initial changes fairly modest we can get it applied and still
add many more useful things in this release.

* The patch should be described as "Automatic DML routing and explicit
syntax for partitioning". There are many other tasks required to make
partitioning work. I also don't want anybody to think that we have "done
partitioning" with this patch even when complete; this patch attempts to
address syntax and DML routing. It doesn't address automatic partition
creation, list partitioning, sub-partitioning, various partitioning
optimisations and global indexing. Others may wish to work on those
issues during this dev cycle also and we want to encourage people to
submit patches on those also.

* I would advise against merging global indexes and partitioning
patches. IMHO global indexes are a tick-box feature and so a waste of
dev time. They are almost never used once the reality of their
ridiculous size takes hold; most DBAs laugh when they are suggested. And
the dev footprint on Postgres will be fairly large also. Start a new
thread if you wish to discuss them again - I don't.

* If we can prove that the partition value ranges are distinct then it
is possible to have a PrimaryKey without global indexes.

I'm sure there'll be much discussion around this. However detailed the
discussion gets, we're all very supportive of your efforts to

Best Regards,

Simon Riggs
PostgreSQL Training, Services and Support


Browse pgsql-hackers by date

  From Date Subject
Next Message Itagaki Takahiro 2009-07-14 09:47:01 Sampling profiler updated
Previous Message Heikki Linnakangas 2009-07-14 09:04:56 Index-only quals