Making Vars outer-join aware

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Making Vars outer-join aware
Date: 2022-07-01 16:42:27
Message-ID: 830269.1656693747@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[ Before settling into commitfest mode, I wanted to put out a snapshot
of what I've been working on for the past few weeks. This is not
anywhere near committable, but I think people might be interested
in looking at it now anyway. ]

We've had many discussions (eg [1][2]) about the need to treat outer
joins more honestly in parsed queries, so that the planner's reasoning
about things like equivalence classes can stand on a firmer foundation.
The attached patch series makes a start at doing that, and carries the
idea as far as a working system in which all Vars are labeled as to
which outer joins can null them. I have not yet gotten to the fun part
of fixing or ripping out all the higher-level planner logic that could
now be simplified or removed entirely --- but as examples, I believe
that reconsider_outer_join_clause no longer does anything useful, and
a lot of the logic in deconstruct_jointree and distribute_qual_to_rels
could be simplified, and we shouldn't need the notion of
second-class-citizen EquivalenceClasses for "below outer join" cases.

Another thing that could be built on this infrastructure, but I've
not tackled it yet, is fixing the known semantic bugs in grouping sets
[3][4]. What I have in mind there is to invent a dummy RTE representing
the action of grouping, and use Vars that are marked as nullable by that
RTE to represent possibly-nullable grouping-set expressions.

The main thing here that differs from my previous ideas is that the
nulling-rel labeling is placed directly on Vars or PlaceHolderVars,
whereas I had been advocating to use some sort of wrapper node instead.
After several failed attempts I decided that it was too complicated
to separate the labeling from the Var itself. (I'll just mention one
weak spot in that idea: the entire API concept of replace_rte_variables
breaks down, because many of the callbacks using it need to manipulate
nulling-rel labeling along the way, which they can only do if they
see it on the Var they're passed.) Of course, the objection to doing it
like this is that it bloats struct Var, which is a mighty common node
type, even in cases where there's no outer join anywhere. However, on
a 64-bit machine struct Var would widen from 40 to 48 bytes, which is
basically free considering that palloc will round the allocation up to
64 bytes. There's a more valid consideration that the pg_node_tree
representation of a Var will get longer; but really, if you're worried
about that you should be designing a more compact storage representation
for node trees. There's also reason to fear that the distributed cost
of maintaining these extra Bitmapsets will pose a noticeable drag on
parsing or planning speed. However, I see little point in doing
performance measurements when we've not yet reaped any of the
foreseeable planner improvements.

Anyway, on to the patch series. I've broken it up a little bit
for review, but note I'm not claiming that the intermediate states
would compile or pass regression testing.

0000: Write some overview documentation in optimizer/README.
This might be worth reading even if you lack time to look at the code.
I went into some detail about Var semantics, and also added a discussion
of PlaceHolderVars, which Robert has rightfully complained are badly
underdocumented. (At one point I'd thought maybe we could get rid of
PlaceHolderVars, but now I see them as complementary to this work ---
indeed, arguably the reason for them to exist is so that a Var's
nullingrels markers are not lost when replacing it with a pulled-up
expression from a subquery.) The changes in the section about
EquivalenceClasses are pretty rough and speculative, since I've not
actually coded those changes yet.

0001: add infrastructure, namely add new fields to assorted data
structures and update places like backend/nodes/*.c. This is mostly
pretty boring, except for the commentary changes in *nodes.h.

0002: change the parser to correctly label emitted Vars with the
sets of outer joins that can null them, according to the query text
as-written. (That is, we don't account here for the possibility
of join strength reduction or anything like that.)

0003: fix the planner to cope, including adjusting nullingrel labeling
for join elimination, join strength reduction, join reordering, etc.
This is still WIP to some extent. In particular note all the XXX
comments in setrefs.c complaining about how we're not verifying that the
nullingrel states agree when matching upper-level Vars to lower-level
ones. This is partly due to setrefs.c not having readily-available info
about which outer joins are applied at which plan nodes (should we
expend the space to mark them?), and partly because I'm not sure
that we can enforce 100% consistency there anyway. Because of the
compromises involved in implementing outer-join identity 3 (see 0000),
there will be cases where an upper Var that "should" have a nullingrel
bit set will not. I don't know how to make a hole in the check that
will allow those cases without rendering such checking mostly useless.

Is there a way that we can do the identity-3 transformation without
being squishy about the nullability state of Vars in the moved qual?
I've not thought of one, but it's very annoying considering that the
whole point of this patch series is to not be squishy about that.
I guess the good news is that the squishiness only seems to be needed
during final transformation of the plan, where all we are losing is
the ability to detect bugs in earlier planner stages. All of the
decisions that actually count seem to work fine without compromises.

So far the patch series changes no regression test results, and I've
not added any new tests either. The next steps will probably have
visible effects in the form of improved plans for some test queries.

Anyway, even though this is far from done, I'm pretty pleased with
the results so far, so I thought I'd put it out for review by
anyone who cares to take a look. I'll add it to the September CF
in hopes that it might be more or less finished by then, and so
that the cfbot will check it out.

regards, tom lane

[1] https://www.postgresql.org/message-id/7771.1576452845%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/15848.1576515643%40sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/17071-24dc13fbfa29672d%40postgresql.org
[4] https://www.postgresql.org/message-id/CAMbWs48AtQTQGk37MSyDk_EAgDO3Y0iA_LzvuvGQ2uO_Wh2muw%40mail.gmail.com

Attachment Content-Type Size
v1-0000-add-overview-documentation.patch text/x-diff 19.1 KB
v1-0001-add-preliminary-infrastructure.patch text/x-diff 36.1 KB
v1-0002-label-Var-nullability-in-parser.patch text/x-diff 20.5 KB
v1-0003-cope-with-nullability-in-planner.patch text/x-diff 159.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2022-07-01 16:52:04 Re: Support logical replication of DDLs
Previous Message Bruce Momjian 2022-07-01 16:41:38 Re: First draft of the PG 15 release notes