Re: Key joins

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Arne Roland <arne(dot)roland(at)malkut(dot)net>, Joel Jacobson <joel(at)compiler(dot)org>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Anders Granlund <anders(dot)granlund(dot)0(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, Vik Fearing <vik(at)chouppes(dot)com>
Subject: Re: Key joins
Date: 2026-06-12 17:26:04
Message-ID: f30390c8-b4ff-4604-89b1-cb8f75d279c1@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/12/26 02:20, Arne Roland wrote:
> Hi Tomas,
>
> thank you for checking it out!
>
> On 2026-06-11 11:02 PM, Tomas Vondra wrote:
>> Hi,
>>
>> I took a quick look at the patch over the past couple days. I don't have
>> a perfect understanding of how it works, but let me share what I have so
>> far, before I get distracted by other stuff.
>>
>> I'm interested in this patch because there seems to be a possible
>> overlap with the overlap with the starjoin planning (in that maybe we
>> could try reusing some of the derived information for that).
>>
>> It's a mix of random thoughts, high/low level, important/superficial in
>> no particular order.
>>
>>
>> 1) It does not compile, ATExecSetRowSecurity seems to be missing
>> prev_rls or something like that. I simply commented this out, to get it
>> to compile.
>
> Sorry, I am confused. There is the self contained block of
>
>> @@ -18879,6 +18898,7 @@ ATExecSetRowSecurity(Relation rel, bool rls)
>>       Relation    pg_class;
>>       Oid            relid;
>>       HeapTuple    tuple;
>> +    bool        prev_rls;
>>         relid = RelationGetRelid(rel);
>>   @@ -18890,6 +18910,7 @@ ATExecSetRowSecurity(Relation rel, bool rls)
>>       if (!HeapTupleIsValid(tuple))
>>           elog(ERROR, "cache lookup failed for relation %u", relid);
>>   +    prev_rls = ((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity;
>>       ((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = rls;
>>       CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
>>   @@ -18898,6 +18919,21 @@ ATExecSetRowSecurity(Relation rel, bool rls)
>>         table_close(pg_class, RowExclusiveLock);
>>       heap_freetuple(tuple);
>> +
>> +    /*
>> +     * Revalidate stored key-join proofs that depend on this
>> relation.  The
>> +     * key-join base-fact computation refuses to expose facts for a
>> relation
>> +     * with row-level security enabled (parse_key_join.c), so flipping
>> +     * relrowsecurity can leave a stored proof unprovable.  Revalidate
>> +     * whenever the flag actually changes; for the off-to-on
>> transition the
>> +     * revalidation raises an error and aborts this DDL, while the
>> on-to-off
>> +     * transition is a no-op for proofs that were already valid.
>> +     */
>> +    if (rls != prev_rls)
>> +    {
>> +        CommandCounterIncrement();
>> +        RevalidateDependentKeyJoinObjectsOnRelation(relid);
>> +    }
>>   }
> Apart from that there shouldn't be any references to that. What
> definition was missing?

I guess git-am gets confused in some way, because cfbot has the same
issue with this:

https://github.com/postgresql-cfbot/postgresql/actions/runs/27379594748/job/80912606962

>> 2) fk_referenced_selected in find_key_join_match should be marked as
>> PG_USED_FOR_ASSERTS_ONLY, to fix compiler warning
> Thank you for spotting that.
>> 3) It'd be helpful if the commit message for 0001 explained why this
>> change is needed. More clearly than now, I mean. The initial message in
>> this thread points at another thread as the source of this, but that
>> thread is huge so how are reviewers expected to find the explanation?
>>
>> Don't explain just what the commit does, but why it's needed. Say, give
>> an example of how the old code fails.
>
> In short we need that to prevent objects our proof relies on to be
> dropped or altered below us. I'll try to work that into the commit message.

Yeah, understood. I guess the question is where else we should adopt
this new idiom. AFAIK this is pretty much a TOCTOU bug - don't we have a
similar issue elsewhere?

>> 4) I think there needs to be a README explaining how the feature works,
>> i.e. the design and trade offs. Alternatively, it could be explained in
>> a comment at the beginning of some .c file, but a README seems easier to
>> find. Without this, reviewers have to piece it from the sgml "user"
>> docs, and random comments all over the place.
>>
>>
>> 5) It might be helpful if the README defined a couple terms introduced
>> by the patch, but not really defined anywhere. I mean terms like: join
>> point, proof, proof graph, fact, surface, "relation visible from". I can
>> guess what each of these means, but maybe I got it wrong.
>>
>> Although, I now see "join point" was used in the docs before, so maybe
>> it's a well-known term?
>
> Thank you, I will try to come up with some better document explaining
> the concept. For now I can mainly point you only to https://keyjoin.org/
> #sec7.1, which gives a birds eye overview over how the implementation
> and it's artifacts work.
>
> Seeing your confusion, I think we should definitely reduce the amount of
> terminology used. For instance "proof graph" is shorthand for the slice
> of the catalog dependency graph contributed by key-join proofs
> (depending object → proof object), i.e. the deptype='k' subgraph of the
> dependency graph. I don't think we need to introduce this word as a new
> concept. Thank you, this input is very valuable to improve this.
>

I'm not against introducing new terminology, at least not when there's
no established terminology in the code already. But it would be helpful
to explain what the new terms mean, and how they map to the structs and
catalogs (so that people don't have to deduce them).

>> 6) Does it always have to be a "foreign key traversal"? Consider an
>> example like this:
>>
>>      create table dim (id serial primary key);
>>      create table f (id serial primary key, did int);
>>      select * from f left join dim for key (id) <- f(did);
>>
>> Currently this fails because of no matching foreign key constraint, but
>> isn't that pretty much the same thing as if the table actually had the
>> foreign key (at least considering the cardinality of the join - it won't
>> change it by adding/removing rows).
>>
>> I realize that'd contradict the "FOR KEY" part of this patch, but it's
>> also one of the things that might be beneficial for the starjoin
>> planning (in that we could maybe extend it to more cases). Although,
>> maybe we could check those constraints during planning, just like we
>> check the fkey_list.
> This is a very different feature avoiding far less bugs than the
> original key join. To give just once, consider
>
> SELECT *
> FROM customers cu
> LEFT JOIN FOR KEY orders (id) <- cu (id)
> WHERE customers.id = 122354;
>

I don't follow. What does this query illustrate?

>
> We decided, this (among other) error classes, was important enough, we
> pushed for stronger error guarantees, than just the uniqueness of the
> referenced side. I still am convinced, this more save way of writing
> queries is better for it's additional safety. I do think caring about
> the foreign key gives the better syntax.
>
> Do you think such constructions are very common for your customers?
>

I don't know, really. I was merely thinking about using this stuff for
the starjoin optimization patch. And that already handles joins on
foreign keys just fine, but extending to more queries might help.

> Our proof framework with proof facts could potentially be leveraged to
> proof such constructions too. We opted to keep this patch as minimal as
> possible, since it's already huge. In the end our conquest is to get to
> something, which is eventually commitable. This patch has a lot things
> to get wrong. But I think as a second step adding support of other
> schemas and handing this information of to the planer, should be a
> fairly straight forward thing to do. Although it would add a few lines
> on it's own, it's step, that I seems easier than landing a minimal
> version of this patch.
>

OK. I realize this probably goes against my suggestion to make the patch
smaller. I certainly don't insist on supporting it.

>> 7) The patch invents a new "FILTER" clause for joins:
>>
>>     [ FILTER (WHERE join_filter) ]
>>
>> I understand why it's done - there may be additional join conditions, on
>> top of the FK equality. But I think it'll be rather confusing. We
>> already have a "Join Filter", which is used for join clauses that happen
>> to not be used as "proper" join conditions (e.g. Hash/Merge Cond), and
>> has to be evaluated "after" the join itself. And now we'd have another
>> kind of "join filter" ...
>>
>> Is there a different that'd make this work without the FILTER? For
>> example, we might encode the "key join" information in the existing ON
>> (...) clause:
>>
>>      ON (KEY (oi.order_id) -> (o.id) AND ... other clauses ...)
>>
>> but it seems not as readable. Or maybe just use something else than
>> "FILTER"? Not sure.
> This word has the benefit of being already a keyword and it solves the
> problem rather clear cut. Do you think, there is something we could do
> to make this better?

Not at the moment, sorry.

>> 8) I don't understand this change in functioncmds:
>>
>>      /* Routine kind cannot change for an existing pg_proc OID. */
>>      Assert(procForm->prokind != PROKIND_AGGREGATE);
>>
>> The comment says we can't change OID, but the assert checks it's not an
>> aggregate functions. Isn't that misleading?
> Maybe it would be cleaner to mention additionally, that "aggregates were
> already rejected on the pre-lock tuple"? I actually think, that comment
> is indeed stating a helpful invariant to explain why we are concurrency
> save here.

I don't follow. The comment says we can't change OID of an existing
procedure. OK, sure. But the assert checks the procedure is not an
aggregate function. How is that comment an accurate description of the
purpose of the assert?

>> 9) DefineView now adjusts the query ID. Why is that needed? Isn't that a
>> bit weird?
> Sorry, does it? Could you point out where exactly? I might be too tired
> to see it right now.
>> 10) checkWellFormedRecursionWalker now recurses, but why is that needed?
>
> Didn't in already recurse before? Didn't check the blame, but I do think
> it's recursing for a long time.
>
> I just see an added line to check the new filter clause for subqueries
> or something to recurse into.
>

Oh, I see. My mistake, sorry.

>> 11) It's not clear to me why we need p_creating_stored_object, and it's
>> not explained anywhere. Maybe it's obvious, but not to me.
>
> Sorry, that's my oversight. We need to take stronger locks, if we store
> something materially, to prevent concurrent changes, while we store it.
>
> Where do you recon a better comment would be helpful? A longer comment
> directly in the struct definition?
>

Well, I guess it could be mentioned before the struct definition. And
then maybe also in the README explaining how this all fits together.

>> 12) I'm very skeptical anyone can meaningfully review parse_key_join.c.
>> It's 150KB with ~5200 lines (very dense, with pretty minimal comments).
>> That's a massive file. The chance of me declaring this committable is
>> about 0%, simply because I wouldn't believe I understand it.
>>
>> I think it needs to be broken up into smaller pieces, somehow. I don't
>> know how, but perhaps it's possible to extract a "minimal feature"
>> handling some limited subset of cases, and then gradually expand it?
>
> I think there is serious complexity involved in just getting the basic
> the architecture. That being said, I am very open to any idea of a
> stepping stone to this.
>
> It's something I thought a lot about during the last weeks. I am toying
> with the thought of ripping out all changes related to anything stored
> like views and sql functions as a minimal, minimal version. It's so bare
> bone, I'm not terribly exited about doing that, but maybe it's a
> necessary evil. Having thought about a lot of edge cases related to that
> storing, As I see it, we wouldn't save a lot of lines of code with this,
> but it makes reasoning about correctness easier. I can testament to the
> complexity in reasoning and correctness it adds. Do you have any other
> ideas for a stepping stone, that could reasonably land?
>
> I think this one of the most fundamental questions to be answered,
> before we optimize one of the paths. What is the first consumer of the
> surface fact framework and how can we stage a minimal commit around it.
> We need a committer to feel save to commit something with this
> framework. If you have something notably smaller than this, I am all ears.
>

I don't know how to split the patch, but I know that unless that happens
the patch is unlikely to move forward.

FWIW, it does not mean the parts have to be committed separately. It's
possible to review small pieces, and then squash them before commit to
get something "meaningful for users".

>> Furthermore, it seems a lot of that file is duplicate with code we
>> already have elsewhere. For example, couldn't it use a bunch of list
>> functions instead of writing a local version?
>>
>>    list_contains_equal_node -> list_member
>>    append_dependencies_unique -> list_concat_unique
>>    append_dependency_unique -> list_append_unique
>>    ...
>>
>> There may be more such cases, I haven't looked for them.
> Thanks for bringing it up. I will look into that, once I've properly
> gone through the other suggestions. I suspect there won't be many line
> to save, but I'll have a look.

OK, good.

>> 13) I think the main question is whether parse-analyze is the right
>> place to handle this. I don't know. I assume one of the reasons for
>> doing that is to get an error when defining a view, or when altering an
>> object - e.g. like when dropping a function used by a view. Which seems
>> reasonable, people would not like views silently broken by DDL.
>>
>> But it seems to be this also leads to a lot of code duplication, because
>> the parse analysis now has to "reimplement" a lot of the stuff already
>> done in later stages, after parse analyze. For example, we have the
>> root->fkey_list thing, but that's not available yet.
>>
>> There's probably more such information - like innerrel_is_unique,
>> rel_is_distinct_for, relation_has_unique_index_for etc.
>>
>> Maybe it's not worth it? What if we did this in the planner instead? How
>> much simpler would it get? My intuition is it'd get much smaller, but
>> maybe I'm wrong.
>>
>> It'd probably mean it's not necessary to expand views during parse,
>> which was one of the problems mentioned at the beginning of this thread.
>>
>> I suppose we'd need to keep additional information in the plan to know
>> which joins to check, etc.
>
> If we want to use this for correctness and for proofs, we can't do it
> plan time. We have DDL-writes to store proofs. In it's nature this is a
> compile time feature.
>

Is that actually true? If we check the correctness later in the planner,
we'd still get a failure. Why couldn't that verify all the proofs, just
like the parse-analyze? We may need to retain more information, ofc.

> Even if we don't include all of that in the first patch, this places
> very severe limitations like the DDL issue, which would be just tooo
> limiting to get to as an end place. I do think it's more reasonable how
> to cut this up into more digestible pieces.
>

I'm not sure what "DDL issue" is, sorry :-(

I may be missing something, but the main difference in behavior seems to
be for views - the current patch verifies the proofs when creating the
view / altering objects, and rejects those. And AFAICS after moving it
to the planner, that would not be the case anymore.

But for regular ad hoc queries there's not much difference, no? You'd
get a failure, no matter what.

> Even though the planner is messy, I feel more at home there than here in
> the parser. I could see a world were we clean up some of those out of
> the planner to make them available earlier some time in parsing. We
> could push those variables forward to the planner to avoid redoing the
> work. Each of these steps would need a careful performance check and I
> am not sure how the real stepping stones would look like.
>

I'm skeptical about moving this stuff to an earlier phase (so that the
parse can see it). You can try, but it probably requires moving a lot of
other stuff too, some of which may be expensive.

>> 14) If we wanted to use some of this for the starjoin planning stuff,
>> we'd need to propagate more information too, I think. But I'm not sure
>> about this - we already have the information about foreign keys, so if
>> key joins are tied to foreign keys, there would not be no new useful
>> information I suppose.
>>
>> Plus, I don't want to make that patch dependent on people using new
>> syntax. If that can give us *additional* information, that would be a
>> different thing.
>
> As I said above, if we figure out what information we want to propagate
> this will be helpful. We will be able to propagate this. And if we can
> get in a base patch with this architecture, adding this should be
> straight forward.
>
> One thing, that sounds fairly simple to derive from our proof, is the
> proof graph. This means in every node we denote, this node can be
> treated as cardinality preserving joined onto a different node. For
> planning we need something more global, because we need to consider
> something WHERE/HAVING, but the existence of such a clause shouldn't be
> complex to check for.
>

Understood.

>> [...]
>>
>> 16) I was wondering what performance impact this has, roughly. So I
>> created a simple 4-way join with fact + 3 dimensions:
>>
>>    create table dim1 (id serial primary key, val text);
>>    create table dim2 (id serial primary key, val text);
>>    create table dim3 (id serial primary key, val text);
>>    create table f (id serial primary key,
>>                    d1 int not null references dim1(id),
>>                    d2 int not null references dim2(id),
>>                    d3 int not null references dim3(id));
>>
>> and then measured throughput with 2 queries
>>
>>    select * from f
>>             join dim1 on (dim1.id = d1)
>>             join dim2 on (dim2.id = d2)
>>             join dim3 on (dim3.id = d3);
>>
>>    select * from f
>>             join dim1 for key (id) <- f (d1)
>>             join dim2 for key (id) <- f (d2)
>>             join dim3 for key (id) <- f (d3);
>>
>> which is the same query, except for the FOR KEY syntax. And I got this
>> (under explain, to only do the planning):
>>
>>                patched    master
>>    -----------------------------
>>       join       25251     24906
>>    keyjoin       17159
>>
>> So that's ~30% regression compared to master / regular joins. I also
>> tried with join_collapse_limit=1 to eliminate the join order planning:
>>
>>                patched    master
>>    -----------------------------
>>       join       31476     31252
>>    keyjoin       19353
>>
>> That's 40% regression. Of course, this is just explain - once there is
>> data in the table, and actual execution, the differences will be much
>> smaller. Still, it's not great.
>>
>> I haven't looked very closely, but based on some quick profiling it
>> seems ensure_key_join_surface_facts / compute_key_join_relation_facts is
>> doing expensive stuff like findNotNullConstraintAttnum or
>> get_index_constraint, both of which do systable scans. That should
>> probably go through syscache or something like that.
> I'm not too surprised about ensure_key_join_surface_facts holding most
> of the regression, since it's doing all the work. Since I spent no time
> or thought on optimizing runtime thus far, I think there a lot of long
> hanging fruits left. I'd prefer to sort out the architecture first, but
> I think, we should be able to improve the parsetime of this feature by a
> sizeable amount. I fully agree: Before committing this, we should hit
> some of the relevant functions for a quick win.

It's just a guess. I'm not sure fixing ensure_key_join_surface_facts
will make it much faster - it's possible, but I haven't tried. There are
probably more issues like this. But I'm also not surprised the patch has
this kind of issues, it's fine for an early WIP patch.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Tom Lane 2026-06-12 16:59:14 Modernizing pg_bsd_indent's error/warning reporting code