Quick Links

Re: Declarative partitioning - another take

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Declarative partitioning - another take
Date:	2016-11-18 01:18:45
Message-ID:	c6da6e1e-fc4b-d447-3334-127316d8f54f@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2016/11/18 1:43, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote wrote:
>> OK, I will share the performance results soon.
>
> Thanks.
>
>>> Also, in 0006:
>>>
>>> - I doubt that PartitionTreeNodeData's header comment will survive
>>> contact with pgindent.
>>
>> Fixed by adding "/* ----" at the top of the comment.
>
> OK. Hopefully you also tested that. In general, with a patch set
> this large, the more you can do to make it pgindent-clean, the better.
> Ideally none of your code would get changed by pgindent, or at least
> not the new files. But note that you will probably have have to
> update typedefs.list if you actually want to be able to run it without
> having it mangle things that involve your new structs.

OK, I was afraid that running pgindent might end up introducing unrelated
diffs in the patch, but I'm sure any code committed to HEAD would have
been through pgindent and my worry might be pointless after all. Also,
thanks for the tip about the typedefs.list. I will go try pgindent'ing
the patches.

>>> - The code in make_modifytable() to swap out the rte_array for a fake
>>> one looks like an unacceptable kludge. I don't know offhand what a
>>> better design would look like, but what you've got is really ugly.
>>
>> Agree that it looks horrible. The problem is we don't add partition
>> (child table) RTEs when planning an insert on the parent and FDW
>> partitions can't do without some planner handling - planForeignModify()
>> expects a valid PlannerInfo for deparsing target lists (basically, to be
>> able to use planner_rt_fetch()).
>>
>> Any design beside this kludge would perhaps mean that we will be adding
>> valid partition RTEs at some earlier planning stage much like what
>> expand_inherited_rtentry() does during non-insert queries. Perhaps, we
>> could have expand_inherited_rtentry() make an exception in partitioned
>> tables' case and create root->append_rel_list members even in the insert
>> case. We could then go over the append_rel_list in make_modifytable() to
>> get the valid RT index of foreign child tables to pass to
>> PlanForeignModify(). Using append_rel_list for insert planning is perhaps
>> equally ugly though. Thoughts?
>
> If it's only needed for foreign tables, how about for v1 we just throw
> an error and say errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg("cannot route inserted tuples to a foreign table") for now. We
> can come back and fix it later. Doing more inheritance expansion
> early is probably not a good idea because it likely sucks for
> performance, and that's especially unfortunate if it's only needed for
> foreign tables. Coming up with some new FDW API or some modification
> to the existing one is probably better, but I don't really want to get
> hung up on that right now.

OK, I agree with the decision to make tuple-routing a unsupported feature
in the first cut as far as foreign partitions are concerned. Once can
still insert data into partitions directly.

I am assuming that the error should be thrown all the way below
ExecInsert(), not in the planner. Which means I should revert any changes
I've made to the planner in this patch.

>>> - I don't understand how it can be right for get_partition_for_tuple()
>>> to be emitting an error that says "range partition key contains null".
>>> A defective partition key should be detected at partition creation
>>> time, not in the middle of tuple routing.
>>
>> That error is emitted when the partition key of the *input tuple* is found
>> to contain NULLs. Maybe we need to consider something like how btree
>> indexes treat NULLs - order NULLs either before or after all the non-NULL
>> values based on NULLS FIRST/LAST config. Thoughts?
>
> Well, I'd say my confusion suggests that the error message needs to be
> clearer about what the problem is. I think this is basically a case
> of there being no partition for the given row, so maybe just arrange
> to use that same message here ("no partition of relation \"%s\" found
> for row").

The reason NULLs in an input row are caught and rejected (with the current
message) before control reaches range_partition_for_tuple() is because
it's not clear to me whether the range bound comparison logic in
partition_rbound_datum_cmp() should be prepared to handle NULLs and what
the results of comparisons should look like. Currently, all it ever
expects to see in the input tuple's partition key is non-NULL datums.
Comparison proceeds as follows: if a range bound datum is a finite value,
we invoke the comparison proc or if it is infinite, we conclude that the
input tuple is > or < the bound in question based on whether the bound is
a lower or upper bound, respectively.

Or are you saying that get_tuple_for_partition() should simply return -1
(partition not found) in case of encountering a NULL in range partition
key to the caller instead of throwing error as is now? If the user sees
the message and decides to create a new range partition that *will* accept
such a row, how do they decide what its boundaries should be?

Thanks,
Amit

In response to

Re: Declarative partitioning - another take at 2016-11-17 16:43:14 from Robert Haas

Responses

Re: Declarative partitioning - another take at 2016-11-18 16:39:55 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alvaro Herrera	2016-11-18 01:38:42	Re: Mail thread references in commits
Previous Message	Tsunakawa, Takayuki	2016-11-18 01:09:46	Re: Patch: Implement failover on libpq connect level.