Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-10 12:40:09
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks again for the prompt review!

On 2016/11/10 2:00, Robert Haas wrote:
> On Wed, Nov 9, 2016 at 6:14 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> But none of that is really a problem. Sure, the 'P' case will never
>>> arise in 9.6, but so what? I'd really like to not keep duplicating
>>> these increasingly-complex hunks of code if we can find some way to
>>> avoid that.
>> We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
>> getTables() sends to pre-10 servers, right? But I suppose we need not
>> call it in that particular SQL query in the first place.
> Oh, yeah, that's a problem; the query will error out against older
> servers. You could do something like:
> char *partkeydef;
> if (version <= 90600)
> partkeydef = "NULL";
> else
> partkeydef = "CASE WHEN c.relkind = 'P' THEN
> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";
> ...and the use %s to interpolate that into the query string.

Yeah, that's a way.

>>> I think that in most places were are referring to the "partitioning
>>> method" (with ing) but the "partition key" (without ing). Let's try to
>>> be consistent.
>> I'm inclined to switch to "partitioning method" and "partitioning key",
>> but do you mean just the documentation or throughout? Beside
>> documentation, I mean source code comments, error messages, etc. I have
>> assumed throughout.
> I think "partitioning key" is a bit awkward and actually prefer
> "partiton key". But "partition method" sounds funny so I would go
> with "partitioning method".

OK, "partition key" and "partitioning method" it is then. Source code
comments, error messages, variables call the latter (partitioning)
"strategy" though which hopefully is fine.

>>> I don't think it's OK to allow the partition to be added if it
>>> contains rows that might not be valid. We are generally vary wary
>>> about adding options that can cause data integrity failures and I
>>> think we shouldn't start here, either. On the other hand, it's also
>>> not desirable for adding a partition to take O(n) time in all cases.
>>> So what would be nice is if the new partition could self-certify that
>>> contains no problematic rows by having a constraint that matches the
>>> new partitioning constraint. Then we can skip the scan if that
>>> constraint is already present.
>> I agree that NO VALIDATE is undesirable and it would be better if we could
>> get rid of the same. I want to clarify one thing though: what does it
>> mean for the new partition to have a constraint that *matches* the new
>> partitioning constraint? Does it mean the new partition's constraint
>> *implies* the partitioning constraint? Or as certified by equal()?
> Implies would be better, but equal() might be tolerable.

I think a equal() -based method would fail to help in most cases. Consider
the following example:

create table foo1 (a int);
alter table foo add constraint check_a check (a < 10 and a >= 1);

create table foo (a int) partition by range (a);
alter table foo attach partition foo1 for values from (1) to (10);

The last command will internally generate the partition constraints that
is basically a list of implicitly AND'd expressions viz. a is not null, a
>= 1 and a < 10 which we wrap into a BoolExpr. It would not be equal()
with foo1's constraint even though they are basically the same constraint.
So, simple structural equality may prove to be less productive, IMHO.

It seems a *implies* -based solution would work much better, although a
bit slower for obvious reasons. I reckon slownsess is not a big issue in
this case. So I prototyped the same using predicate_implied_by() which
seems to work reasonably. Looks something like this:

skip_validate = false;
if (predicate_implied_by(partConstraint, existConstraint))
skip_validate = true;

Where partConstraint is 1-member list with the new partition constraint
and existConstraint is a list of the existing constraints of the table
being attached, derived from its TupleConstr.

>>> + /*
>>> + * Never put a
>>> null into the values array, flag
>>> + * instead for
>>> the code further down below where
>>> + * we
>>> construct the actual relcache struct.
>>> + */
>>> +
>>> found_null_partition = true;
>>> +
>>> null_partition_index = i;
>>> How about we elog(ERROR, ...) if found_null_partition is already set?
>> Makes sense. However, let me mention here that duplicates either within
>> one partition's list or across the partitions are not possible. That's
>> because in case of the former, we de-duplicate before storing the list
>> into the catalog and the latter would simply be an overlap error. Could
>> this be made an Assert() then?
> Well, I think that wouldn't be as good, because for example some
> misguided person might do a manual update of the catalog. It seems to
> me that if the system catalog contents are questionable, it's better
> to error out than to have completely undefined behavior. In other
> words, we probably want it checked in non-Assert builds. See similar
> cases where we have things like:
> elog(ERROR, "conpfeqop is not a 1-D Oid array");

Good point, agreed.

>>> + /*
>>> + * Is lower_val = upper_val?
>>> + */
>>> + if (lower_val && upper_val)
>>> So, that comment does not actually match that code. That if-test is
>>> just checking whether both bounds are finite. What I think you should
>>> be explaining here is that if lower_val and upper_val are both
>>> non-infinite, and if the happen to be equal, we want to emit an
>>> equality test rather than a >= test plus a <= test because $REASON.
>> OK, I have added some explanatory comments around this code.
> So, I was kind of assuming that the answer to "why are we doing this"
> was "for efficiency". It's true that val >= const AND const <= val
> can be simplified to val = const, but it wouldn't be necessary to do
> so for correctness. It just looks nicer and runs faster.

I guess I ended up with this code following along a bit different way of
thinking about it, but in the end what you're saying is true.

> By the way, if you want to keep the inclusive stuff, I think you
> probably need to consider this logic in light of the possibility that
> one bound might be exclusive. Maybe you're thinking that can't happen
> anyway because it would have been rejected earlier, but an elog(ERROR,
> ...) might still be appropriate just to guard against messed-up
> catalogs.

Yes, that makes sense. I guess you mean a case like the one shown below:

create table foo5 partition of foo for values start (3, 10) end (3, 10);
ERROR: cannot create range partition with empty range

I think the following would suffice as a guard (checked only if it turns
out that lower_val and upper_val are indeed equal):

if (i == key->partnatts - 1 && spec->lowerinc != spec->upperinc)
elog(ERROR, "invalid range bound specification");

>>> What happens if you try to attach a table as a partition of itself or
>>> one of its ancestors?
>> The latter fails with "already a partition" error.
>> The former case was not being handled at all which has now been fixed.
>> ATExecAddInherit() prevents that case as an instance of preventing the
>> circularity of inheritance. It says: ERROR: circular inheritance not
>> allowed. And then: DETAIL: "rel" is already a child of "rel".
>> Should ATExecAttachPartition() use the same trick and keep using the same
>> message(s)? The above detail message doesn't quite sound appropriate when
>> one tries to attach a table as partition of itself.
> Whichever way is easier to code seems fine. It's a pretty obscure
> case, so I don't think we need to add code just to get a very slightly
> better error message.

OK, so let's keep it same as with regular inheritance.

>>> + | FOR VALUES START '(' range_datum_list ')' lb_inc
>>> + END_P '('
>>> range_datum_list ')' ub_inc
>>> Just a random idea. Would for VALUES FROM ( ... ) TO ( ... ) be more
>>> idiomatic than START and END?
>> It would actually. Should I go ahead and change to FROM (...) TO (...)?
> +1!


>> Related to range partitioning, should we finalize on inclusive START/FROM
>> and exclusive END/TO preventing explicit specification of the inclusivity?
> I would be in favor of committing the initial patch set without that,
> and then considering the possibility of adding it later. If we
> include it in the initial patch set we are stuck with it.

OK, I have removed the syntactic ability to specify INCLUSIVE/EXCLUSIVE
with each of the range bounds.

I haven't changed any code (such as comparison functions) that manipulates
instances of PartitionRangeBound which has a flag called inclusive. I
didn't remove the flag, but is instead just set to (is_lower ? true :
false) when initializing from the parse node. Perhaps, there is some scope
for further simplifying that code, which you probably alluded to when you
proposed that we do this.

>> Attached updated patches take care of the above comments and few other
>> fixes. There are still a few items I have not addressed right away:
>> - Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the
>> new partition's constraints to skip the validation scan
>> - Remove the syntax to specify inclusivity of each of START and END bounds
>> of range partitions and instead assume inclusive START and exclusive END
> OK - what is the time frame for these changes?

I have implemented both of these in the attached patch. As mentioned
above, the logic to skip the validation scan using the new partition's
constraints is still kind of a prototype solution, but it seems to work so
far. Comments on the same would be very helpful.


Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-12.patch text/x-diff 114.3 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-12.patch text/x-diff 23.9 KB
0003-Catalog-and-DDL-for-partitions-12.patch text/x-diff 205.4 KB
0004-psql-and-pg_dump-support-for-partitions-12.patch text/x-diff 22.1 KB
0005-Teach-a-few-places-to-use-partition-check-quals-12.patch text/x-diff 30.6 KB
0006-Introduce-a-PartitionTreeNode-data-structure-12.patch text/x-diff 8.0 KB
0007-Tuple-routing-for-partitioned-tables-12.patch text/x-diff 43.7 KB
0008-Update-DDL-Partitioning-chapter-to-reflect-new-devel-12.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Emre Hasegeli 2016-11-10 12:41:19 Re: Floating point comparison inconsistencies of the geometric types
Previous Message Dean Rasheed 2016-11-10 12:17:50 Re: Improving RLS planning