Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
Date: 2018-07-04 12:37:52
Message-ID: CAFjFpRevODMmZ8MF3oLsXr9mSH9-mWC_fxmPWTYCSqHVr_Zwqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 4, 2018 at 5:36 PM, Etsuro Fujita
<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2018/07/04 19:04), Ashutosh Bapat wrote:
>>
>> On Fri, Jun 29, 2018 at 6:21 PM, Etsuro Fujita
>> <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>
>>> (2018/06/22 22:54), Ashutosh Bapat wrote:
>>>>
>>>> + if (enable_partitionwise_join&&
>>>> rel->top_parent_is_partitioned)
>>>> + {
>>>> + build_childrel_tlist(root, rel, childrel, 1,&appinfo);
>>>> + }
>>>>
>>>> Why do we need rel->top_parent_is_partitioned? If a relation is
>>>> partitioned (if (rel->part_scheme), it's either the top parent or is
>>>> partition of some other partitioned table. In either case this
>>>> condition will be true.
>>>
>>>
>>>
>>> This would be needed to avoid unnecessarily applying build_childrel_tlist
>>> to
>>> child rels of a partitioned table for which we don't consider
>>> partitionwise
>>> join. Consider:
>>>
>>> postgres=# create table lpt (c1 int, c2 text) partition by list (c1);
>>> CREATE TABLE
>>> postgres=# create table lpt_p1 partition of lpt for values in (1);
>>> CREATE TABLE
>>> postgres=# create table lpt_p2 (c1 int check (c1 in (2)), c2 text);
>>> CREATE TABLE
>>> postgres=# create table test (c1 int, c2 text);
>>> CREATE TABLE
>>> postgres=# explain verbose select * from (select * from lpt union all
>>> select
>>> * from lpt_p2) ss(c1, c2) inner join test on (ss.c1 = test.c1);
>>
>>
>> --- plan clipped
>>
>>>
>>> In this example, the top parent is not a partitioned table and we don't
>>> need
>>> to consider partitionwise join for that, so we don't need to apply that
>>> to
>>> the child rel lpt_p1 of the partitioned table lpt (and the table lpt_p2).
>>
>>
>> FWIW, the test is not sufficient. In the above example, a simple
>> select * from lpt inner join test where lpt.c1 = test.c1 would not use
>> partition-wise join technique. Why not to avoid build_childrel_tlist()
>> in that case as well?
>
>
> I might misunderstand your words, but in the above example the patch doesn't
> apply build_childrel_tlist to lpt_p1 and lpt_p2. The reason for that is
> because we can avoid adjusting the tlists for the corresponding subplans at
> plan creation time so that whole-row Vars in the tlists are transformed into
> CREs. I think the overhead of the adjustment is not that big, but not zero,
> so it would be worth avoiding applying build_childrel_tlist to partitions if
> the top parent won't participate in a partitionwise-join at all.

I don't think that answers my question. When we join lpt with test,
your patch will apply build_childrel_tlist() to lpt_p1 and lpt_p2 even
when join between lpt and test is not going to use partition-wise
join. Why? As per your explanation, the condition "if
(enable_partitionwise_join && rel->top_parent_is_partitioned)" is
used to avoid applying build_childrel_tlist() when partition-wise join
won't be possible. But it's not covering all the cases.

>
>> Worst we can change the criteria to use
>> partition-wise join in future e.g. above case would use partition-wise
>> join by 1. merging lpt_p1 into corresponding partition of lpt so that
>> ss is partitioned and 2. repartitioning test or joining test with each
>> partition of lpt separately. In those cases the changes you are doing
>> here need to be reverted and put somewhere else. There's already a
>> patch to reverse the order of join and grouping out there. How would
>> this work with that.
>
>
> Interesting, but that seems like a matter of PG12+.

Yes, and I don't think it's a good idea to change this fix for PG12+ again.

>
>> In general I think set_append_rel_size() or build_simple_rel() is not
>> the place where we should decide whether the given relation will
>> participate in a PWJ. Also we should not selectively add a
>> ConvertRowtypeExpr on top of a whole-row Var of some a child relation.
>> We should do it for all the child rels or shouldn't do it at all.
>
>
> One thing I thought was to apply build_childrel_tlist for all the child rels
> whether its top parent is a partitioned table or not, but as I mentioned
> above, that would incur needless overhead for adjusting the tlists for the
> child rels that don't involve in a partitionwise join such as lpt_p1 and
> lpt_p2, which I think is not good.
>
>> An
>> in-between state will produce a hell lot of confusion for any further
>> optimization. Whenever we change the code around partition-wise
>> operations in future, we will have to check whether or not a given
>> child rel has its whole-row Var embedded in ConvertRowtypeExpr. As I
>> have mentioned earlier, I am also not comfortable with the targetlist
>> of child relations being type inconsistent with that of the parent,
>> which is a fundamental rule in inheritance. Worst keep them
>> inconsistent during path creation and make them consistent at the time
>> of creating plans. A child's whole-row Var has datatype of the child
>> where as that of parent has datatype of parent.
>
>
> I don't see any critical issue here. Could you elaborate a bit more on that
> point?

I think breaking a fundamental rule like this itself is critical. But
interestingly I am not able to find a case where it becomes a problem.
But may be I haven't tried enough. And the question is if it's not
required to have the targetlists type consistent, why even bother with
ConvertRowtypeExpr addition there? We can use your approach of adding
ConvertRowtypeExpr at the end in all the cases.

>
>> A ConvertRowtypeExpr
>> is used to fix this inconsistency. That's why I chose to use
>> pull_var_clause() as a place to fix the problem and not fix
>> ConvertRowtypeExpr in targetlist itself.
>
>
> I think the biggest issue in the original patch you proposed is that we
> spend extra cycles where partitioning is not involved, which is the biggest
> reason why I think the original patch isn't the right way to go.

When there are no partitions involved, there won't be any
ConvertRowtypeExprs there which means the function
is_converted_whole_row_reference() would just return from the first
line checking IsA() and nullness of node. pull_var_clause() is
structured so that it always incurs IsA() cost whether or not the
corresponding type of node is present in the query or not. E.g. even
if there are no aggregates in the query, we will still incur some
cycles checking IsA(Aggref, node) when that function is called. We
could restructure the condition as IsA(node, ConvertRowtypeExpr) &&
is_converted_whole_row_reference(node) to avoid even the extra cycles
spent in function call.

>
>> I am also not comfortable in just avoiding ConvertRowtypeExprs in
>> targetlist and leave them as is in the conditions and ECs etc. This
>> means searching a ConvertRowtypeExpr e.g. for creating pathkeys in
>> targetlist will fail at the time of path creation but will succeed at
>> the time of plan creation.
>>
>> This is turning more invasive that my approach of fixing it through PVC.
>
>
> Sorry, I don't understand this. Could you show me places where the problem
> happens?

I think breaking a fundamental rule like this itself is invasive.

>
>>>> + /* The child plan should be able to do projection */
>>>> + Assert(is_projection_capable_plan(subplan));
>>>> +
>>>> Again why? A MergeAppend's subplan could be a Sort plan, which will
>>>> not be projection capable.
>>>
>>>
>>>
>>> IIUC, since we are called here right after create_plan_recurse, the child
>>> plan would be a scan or join unless it's neither Append nor MergeAppend.
>>> So
>>> it should be projection-capable. Maybe I'm missing something, though.
>>
>>
>> That's not true. add_paths_to_append_rel() adds sort paths on scan if
>> necessary and creates merge append path.
>
>
> Really? I think create_merge_append_path called from
> generate_mergeappend_paths called from add_paths_to_append_rel uses a dummy
> sort path just to compute the cost, but I don't think
> create_merge_append_path (or generate_mergeappend_paths or
> add_paths_to_append_rel) insert a sort path to a scan (or join) path.

You are right. We add sort after adjust_subplan_tlist() has been
called. So, we are fine there.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2018-07-04 13:14:38 Re: Libpq support to connect to standby server as priority
Previous Message Pavan Deolasee 2018-07-04 12:08:13 Locking considerations of REINDEX