Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-09 17:00:03
Message-ID: CA+Tgmob1bTucKsHcV_bfd6ym-CCYDJUZDUQmT77bZjiqPL8Saw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

In this latest patch set:

src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
+ *rdatum;

With all patches applied, "make check" fails with a bunch of diffs
that look like this:

Check constraints:
- "pt1chk2" CHECK (c2 <> ''::text)
"pt1chk3" CHECK (c2 <> ''::text)

On Wed, Nov 9, 2016 at 6:14 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Actually the sentence there is: The parenthesized list of columns or
> expressions forms the the <firstterm>partitioning key</firstterm> for the
> table

OK.

>> + /*
>> + * Strip any top-level COLLATE clause. This ensures that we treat
>> + * "x COLLATE y" and "(x COLLATE y)" alike.
>> + */
>>
>> But you don't, right? Unless I am confused, which is possible, the
>> latter COLLATE will be ignored, while the former one will set the
>> collation to be used in the context of partitioning comparisons.
>
> The code immediately following the comment does in fact strip the
> top-level COLLATE clause.
>
> while (IsA(expr, CollateExpr))
> expr = (Node *) ((CollateExpr *) expr)->arg;
>
> So that the following two specifications are equivalent which is the intent:
>
> create table p (a text) partition by range (a collate "en_US");
> vs.
> create table p (a text) partition by range ((a collate "en_US"));

I see. You're right.

>> Re-reviewing 0002:
>>
>> + if (fout->remoteVersion >= 100000)
>> + {
>> + PQExpBuffer acl_subquery = createPQExpBuffer();
>> + PQExpBuffer racl_subquery = createPQExpBuffer();
>> + PQExpBuffer initacl_subquery = createPQExpBuffer();
>> + PQExpBuffer initracl_subquery = createPQExpBuffer();
>> +
>> + PQExpBuffer attacl_subquery = createPQExpBuffer();
>> + PQExpBuffer attracl_subquery = createPQExpBuffer();
>> + PQExpBuffer attinitacl_subquery = createPQExpBuffer();
>> + PQExpBuffer attinitracl_subquery = createPQExpBuffer();
>>
>> It seems unnecessary to repeat all of this. The only differences
>> between the 10000 version and the 9600 version are:
>>
>> 60,61c60
>> < "AS changed_acl, "
>> < "CASE WHEN c.relkind = 'P' THEN
>> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
>> ---
>>> "AS changed_acl "
>> 73c72
>> < "WHERE c.relkind in ('%c', '%c', '%c', '%c',
>> '%c', '%c', '%c') "
>> ---
>>> "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') "
>> 87,88c86
>> < RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
>> < RELKIND_PARTITIONED_TABLE);
>> ---
>>> RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);
>>
>> But none of that is really a problem. Sure, the 'P' case will never
>> arise in 9.6, but so what? I'd really like to not keep duplicating
>> these increasingly-complex hunks of code if we can find some way to
>> avoid that.
>
> We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
> getTables() sends to pre-10 servers, right? But I suppose we need not
> call it in that particular SQL query in the first place.

Oh, yeah, that's a problem; the query will error out against older
servers. You could do something like:

char *partkeydef;
if (version <= 90600)
partkeydef = "NULL";
else
partkeydef = "CASE WHEN c.relkind = 'P' THEN
pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";

...and the use %s to interpolate that into the query string.

> How about we do it in the following manner in getSchemaData():
>
> if (g_verbose)
> write_msg(NULL, "reading partition key information for interesting
> tables\n");
> getTablePartitionKeyInfo(fout, tblinfo, numTables);
>
> I have implemented the same.

That might be OK too; I'll have to read it through carefully.

>> Re-reviewing 0003:
>>
>> + <para>
>> + If this table is a partition, one cannot perform <literal>DROP
>> NOT NULL</>
>> + on a column if it is marked not null in the parent table.
>> + </para>
>>
>> This would, presumably, also be true for inheritance. I think we
>> could just leave this out.
>
> Actually, it isn't true for the regular inheritance situation.
>
> create table parent (a int not null);
> create table child () inherits (parent);
> alter table child alter a drop not null; -- this works (bug?)

Hrm, OK. That doesn't satisfy MY idea of the principle of least
astonishment, but hey...

> But in the meantime, we can proceed with enforcing inheritance on NOT NULL
> constraint for *partitions*, because they only ever have one parent and
> hence do not need elaborate coninhcount based scheme, I think. Thoughts?

I agree.

>> + correspond to the partitioning method and partitioning key of the target
>>
>> I think that in most places were are referring to the "partitioning
>> method" (with ing) but the "partition key" (without ing). Let's try to
>> be consistent.
>
> I'm inclined to switch to "partitioning method" and "partitioning key",
> but do you mean just the documentation or throughout? Beside
> documentation, I mean source code comments, error messages, etc. I have
> assumed throughout.

I think "partitioning key" is a bit awkward and actually prefer
"partiton key". But "partition method" sounds funny so I would go
with "partitioning method".

>> + A full table scan is performed on the table being attached to check that
>> + existing rows in the table are within the specified partition bound.
>> + If it is known in advance that no partition bound violating rows are
>> + present in the table, the above scan can be skipped by specifying the
>> + <literal>NO VALIDATE</> option.
>> + Default behavior is to perform the scan, as if the
>> <literal>VALIDATE</literal>
>> + option were specified.
>>
>> I don't think it's OK to allow the partition to be added if it
>> contains rows that might not be valid. We are generally vary wary
>> about adding options that can cause data integrity failures and I
>> think we shouldn't start here, either. On the other hand, it's also
>> not desirable for adding a partition to take O(n) time in all cases.
>> So what would be nice is if the new partition could self-certify that
>> contains no problematic rows by having a constraint that matches the
>> new partitioning constraint. Then we can skip the scan if that
>> constraint is already present.
>
> I agree that NO VALIDATE is undesirable and it would be better if we could
> get rid of the same. I want to clarify one thing though: what does it
> mean for the new partition to have a constraint that *matches* the new
> partitioning constraint? Does it mean the new partition's constraint
> *implies* the partitioning constraint? Or as certified by equal()?

Implies would be better, but equal() might be tolerable.

>> + /*
>> + * Never put a
>> null into the values array, flag
>> + * instead for
>> the code further down below where
>> + * we
>> construct the actual relcache struct.
>> + */
>> +
>> found_null_partition = true;
>> +
>> null_partition_index = i;
>>
>> How about we elog(ERROR, ...) if found_null_partition is already set?
>
> Makes sense. However, let me mention here that duplicates either within
> one partition's list or across the partitions are not possible. That's
> because in case of the former, we de-duplicate before storing the list
> into the catalog and the latter would simply be an overlap error. Could
> this be made an Assert() then?

Well, I think that wouldn't be as good, because for example some
misguided person might do a manual update of the catalog. It seems to
me that if the system catalog contents are questionable, it's better
to error out than to have completely undefined behavior. In other
words, we probably want it checked in non-Assert builds. See similar
cases where we have things like:

elog(ERROR, "conpfeqop is not a 1-D Oid array");

>> + /*
>> + * Is lower_val = upper_val?
>> + */
>> + if (lower_val && upper_val)
>>
>> So, that comment does not actually match that code. That if-test is
>> just checking whether both bounds are finite. What I think you should
>> be explaining here is that if lower_val and upper_val are both
>> non-infinite, and if the happen to be equal, we want to emit an
>> equality test rather than a >= test plus a <= test because $REASON.
>
> OK, I have added some explanatory comments around this code.

So, I was kind of assuming that the answer to "why are we doing this"
was "for efficiency". It's true that val >= const AND const <= val
can be simplified to val = const, but it wouldn't be necessary to do
so for correctness. It just looks nicer and runs faster.

By the way, if you want to keep the inclusive stuff, I think you
probably need to consider this logic in light of the possibility that
one bound might be exclusive. Maybe you're thinking that can't happen
anyway because it would have been rejected earlier, but an elog(ERROR,
...) might still be appropriate just to guard against messed-up
catalogs.

> I forgot this instance of forced-recursion-if-partitioned, fixed. I was
> not quite sure what the error message would say; how about:
>
> ereport(ERROR,
> (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> errmsg("must truncate child tables too")));

Works fror me.

>> What happens if you try to attach a table as a partition of itself or
>> one of its ancestors?
>
> The latter fails with "already a partition" error.
>
> The former case was not being handled at all which has now been fixed.
> ATExecAddInherit() prevents that case as an instance of preventing the
> circularity of inheritance. It says: ERROR: circular inheritance not
> allowed. And then: DETAIL: "rel" is already a child of "rel".
>
> Should ATExecAttachPartition() use the same trick and keep using the same
> message(s)? The above detail message doesn't quite sound appropriate when
> one tries to attach a table as partition of itself.

Whichever way is easier to code seems fine. It's a pretty obscure
case, so I don't think we need to add code just to get a very slightly
better error message.

>> + | FOR VALUES START '(' range_datum_list ')' lb_inc
>> + END_P '('
>> range_datum_list ')' ub_inc
>>
>> Just a random idea. Would for VALUES FROM ( ... ) TO ( ... ) be more
>> idiomatic than START and END?
>
> It would actually. Should I go ahead and change to FROM (...) TO (...)?

+1!

> Related to range partitioning, should we finalize on inclusive START/FROM
> and exclusive END/TO preventing explicit specification of the inclusivity?

I would be in favor of committing the initial patch set without that,
and then considering the possibility of adding it later. If we
include it in the initial patch set we are stuck with it.

> Attached updated patches take care of the above comments and few other
> fixes. There are still a few items I have not addressed right away:
>
> - Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the
> new partition's constraints to skip the validation scan
> - Remove the syntax to specify inclusivity of each of START and END bounds
> of range partitions and instead assume inclusive START and exclusive END

OK - what is the time frame for these changes?

> Also, what used to be "[PATCH 5/9] Refactor optimizer's inheritance set
> expansion code" is no longer included in the set. I think we can go back
> to that topic later as I mentioned last week [2].

Great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-11-09 17:10:34 Re: Declarative partitioning - another take
Previous Message Yury Zhuravlev 2016-11-09 16:58:38 Re: WIP: About CMake v2