Re: NOT IN subquery optimization

From: "Li, Zheng" <zhelli(at)amazon(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NOT IN subquery optimization
Date: 2020-03-24 22:32:00
Message-ID: 3158C776-2098-4177-A087-C58FF1AB72EC@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

Thanks for the feedback.

* I find it entirely unacceptable to stick some planner temporary
fields into struct SubLink. If you need that storage you'll have
to find some other place to put it. But in point of fact I don't
think you need it; it doesn't look to me to be critical to generate
the subquery's plan any earlier than make_subplan would have done it.
Moreover, you should really strive to *not* do that, because it's
likely to get in the way of other future optimizations. As the
existing comment in make_subplan already suggests, we might want to
delay subplan planning even further than that in future.

The reason for calling make_subplan this early is that we want to
Call subplan_is_hashable(plan), to decide whether to proceed with the proposed
transformation. We try to stick with the fast hashed subplan when possible to avoid
potential performance degradation from the transformation which may restrict the
planner to choose Nested Loop Anti Join in order to handle Null properly,
the following is an example from subselect.out:
explain (costs false) select * from s where n not in (select u from l);
QUERY PLAN
-----------------------------------------------
Nested Loop Anti Join
InitPlan 1 (returns $0)
-> Seq Scan on l l_1
-> Seq Scan on s
Filter: ((n IS NOT NULL) OR (NOT $0))
-> Index Only Scan using l_u on l
Index Cond: (u = s.n)

However, if the subplan is not hashable, the above Nested Loop Anti Join is
actually faster.

* I'm also not too happy with the (undocumented) rearrangement of
reduce_outer_joins. There's a specific sequence of processing that
that's involved in, as documented at the top of prepjointree.c, and
I doubt that you can just randomly call it from other places and expect
good results. In particular, since JOIN alias var flattening won't have
happened yet when this code is being run from pull_up_sublinks, it's
unlikely that reduce_outer_joins will reliably get the same answers it
would get normally. I also wonder whether it's safe to make the
parsetree changes it makes earlier than normal, and whether it will be
problematic to run it twice on the same tree, and whether its rather
indirect connection to distribute_qual_to_rels is going to misbehave.

The rearrangement of reduce_outer_joins was to make the null test function
is_node_nonnullable() more accurate. Later we added strict predicates logic in
is_node_nonnullable(), so I think we can get rid of the rearrangement of
reduce_outer_joins now without losing accuracy.

* The proposed test additions seem to about triple the runtime of
subselect.sql. This seems excessive. I also wonder why it's necessary
for this test to build its own large tables; couldn't it re-use ones
that already exist in the regression database?

I added a lot of test cases. But yes, I can reuse the existing large table if
there is one that doesn't fit in 64KB work_mem.

* Not really sure that we need a new planner GUC for this, but if we
do, it needs to be documented.

The new GUC is just in case if anything goes wrong, the user can
easily turn it off.

Regards,
Zheng

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-24 23:08:15 Re: Ltree syntax improvement
Previous Message Andres Freund 2020-03-24 22:31:52 Re: WIP: WAL prefetch (another approach)