Quick Links

Re: Key joins

From:	Arne Roland <arne(dot)roland(at)malkut(dot)net>
To:	Tomas Vondra <tomas(at)vondra(dot)me>, Joel Jacobson <joel(at)compiler(dot)org>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Anders Granlund <anders(dot)granlund(dot)0(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, Vik Fearing <vik(at)chouppes(dot)com>
Subject:	Re: Key joins
Date:	2026-06-12 00:20:56
Message-ID:	ef689782-fdfd-4472-b33c-81f9a2213c49@malkut.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Tomas,

thank you for checking it out!

On 2026-06-11 11:02 PM, Tomas Vondra wrote:
> Hi,
>
> I took a quick look at the patch over the past couple days. I don't have
> a perfect understanding of how it works, but let me share what I have so
> far, before I get distracted by other stuff.
>
> I'm interested in this patch because there seems to be a possible
> overlap with the overlap with the starjoin planning (in that maybe we
> could try reusing some of the derived information for that).
>
> It's a mix of random thoughts, high/low level, important/superficial in
> no particular order.
>
>
> 1) It does not compile, ATExecSetRowSecurity seems to be missing
> prev_rls or something like that. I simply commented this out, to get it
> to compile.

Sorry, I am confused. There is the self contained block of

> @@ -18879,6 +18898,7 @@ ATExecSetRowSecurity(Relation rel, bool rls)
> Relation pg_class;
> Oid relid;
> HeapTuple tuple;
> + bool prev_rls;
>
> relid = RelationGetRelid(rel);
>
> @@ -18890,6 +18910,7 @@ ATExecSetRowSecurity(Relation rel, bool rls)
> if (!HeapTupleIsValid(tuple))
> elog(ERROR, "cache lookup failed for relation %u", relid);
>
> + prev_rls = ((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity;
> ((Form_pg_class) GETSTRUCT(tuple))->relrowsecurity = rls;
> CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
>
> @@ -18898,6 +18919,21 @@ ATExecSetRowSecurity(Relation rel, bool rls)
>
> table_close(pg_class, RowExclusiveLock);
> heap_freetuple(tuple);
> +
> + /*
> + * Revalidate stored key-join proofs that depend on this relation. The
> + * key-join base-fact computation refuses to expose facts for a relation
> + * with row-level security enabled (parse_key_join.c), so flipping
> + * relrowsecurity can leave a stored proof unprovable. Revalidate
> + * whenever the flag actually changes; for the off-to-on transition the
> + * revalidation raises an error and aborts this DDL, while the on-to-off
> + * transition is a no-op for proofs that were already valid.
> + */
> + if (rls != prev_rls)
> + {
> + CommandCounterIncrement();
> + RevalidateDependentKeyJoinObjectsOnRelation(relid);
> + }
> }
Apart from that there shouldn't be any references to that. What
definition was missing?
> 2) fk_referenced_selected in find_key_join_match should be marked as
> PG_USED_FOR_ASSERTS_ONLY, to fix compiler warning
Thank you for spotting that.
> 3) It'd be helpful if the commit message for 0001 explained why this
> change is needed. More clearly than now, I mean. The initial message in
> this thread points at another thread as the source of this, but that
> thread is huge so how are reviewers expected to find the explanation?
>
> Don't explain just what the commit does, but why it's needed. Say, give
> an example of how the old code fails.
In short we need that to prevent objects our proof relies on to be
dropped or altered below us. I'll try to work that into the commit message.
> 4) I think there needs to be a README explaining how the feature works,
> i.e. the design and trade offs. Alternatively, it could be explained in
> a comment at the beginning of some .c file, but a README seems easier to
> find. Without this, reviewers have to piece it from the sgml "user"
> docs, and random comments all over the place.
>
>
> 5) It might be helpful if the README defined a couple terms introduced
> by the patch, but not really defined anywhere. I mean terms like: join
> point, proof, proof graph, fact, surface, "relation visible from". I can
> guess what each of these means, but maybe I got it wrong.
>
> Although, I now see "join point" was used in the docs before, so maybe
> it's a well-known term?

Thank you, I will try to come up with some better document explaining
the concept. For now I can mainly point you only to
https://keyjoin.org/#sec7.1, which gives a birds eye overview over how
the implementation and it's artifacts work.

Seeing your confusion, I think we should definitely reduce the amount of
terminology used. For instance "proof graph" is shorthand for the slice
of the catalog dependency graph contributed by key-join proofs
(depending object → proof object), i.e. the deptype='k' subgraph of the
dependency graph. I don't think we need to introduce this word as a new
concept. Thank you, this input is very valuable to improve this.

> 6) Does it always have to be a "foreign key traversal"? Consider an
> example like this:
>
> create table dim (id serial primary key);
> create table f (id serial primary key, did int);
> select * from f left join dim for key (id) <- f(did);
>
> Currently this fails because of no matching foreign key constraint, but
> isn't that pretty much the same thing as if the table actually had the
> foreign key (at least considering the cardinality of the join - it won't
> change it by adding/removing rows).
>
> I realize that'd contradict the "FOR KEY" part of this patch, but it's
> also one of the things that might be beneficial for the starjoin
> planning (in that we could maybe extend it to more cases). Although,
> maybe we could check those constraints during planning, just like we
> check the fkey_list.
This is a very different feature avoiding far less bugs than the
original key join. To give just once, consider

SELECT *
FROM customers cu
LEFT JOIN FOR KEY orders (id) <- cu (id)
WHERE customers.id = 122354;

We decided, this (among other) error classes, was important enough, we
pushed for stronger error guarantees, than just the uniqueness of the
referenced side. I still am convinced, this more save way of writing
queries is better for it's additional safety. I do think caring about
the foreign key gives the better syntax.

Do you think such constructions are very common for your customers?

Our proof framework with proof facts could potentially be leveraged to
proof such constructions too. We opted to keep this patch as minimal as
possible, since it's already huge. In the end our conquest is to get to
something, which is eventually commitable. This patch has a lot things
to get wrong. But I think as a second step adding support of other
schemas and handing this information of to the planer, should be a
fairly straight forward thing to do. Although it would add a few lines
on it's own, it's step, that I seems easier than landing a minimal
version of this patch.

> 7) The patch invents a new "FILTER" clause for joins:
>
> [ FILTER (WHERE join_filter) ]
>
> I understand why it's done - there may be additional join conditions, on
> top of the FK equality. But I think it'll be rather confusing. We
> already have a "Join Filter", which is used for join clauses that happen
> to not be used as "proper" join conditions (e.g. Hash/Merge Cond), and
> has to be evaluated "after" the join itself. And now we'd have another
> kind of "join filter" ...
>
> Is there a different that'd make this work without the FILTER? For
> example, we might encode the "key join" information in the existing ON
> (...) clause:
>
> ON (KEY (oi.order_id) -> (o.id) AND ... other clauses ...)
>
> but it seems not as readable. Or maybe just use something else than
> "FILTER"? Not sure.
This word has the benefit of being already a keyword and it solves the
problem rather clear cut. Do you think, there is something we could do
to make this better?
> 8) I don't understand this change in functioncmds:
>
> /* Routine kind cannot change for an existing pg_proc OID. */
> Assert(procForm->prokind != PROKIND_AGGREGATE);
>
> The comment says we can't change OID, but the assert checks it's not an
> aggregate functions. Isn't that misleading?
Maybe it would be cleaner to mention additionally, that "aggregates were
already rejected on the pre-lock tuple"? I actually think, that comment
is indeed stating a helpful invariant to explain why we are concurrency
save here.
> 9) DefineView now adjusts the query ID. Why is that needed? Isn't that a
> bit weird?
Sorry, does it? Could you point out where exactly? I might be too tired
to see it right now.
> 10) checkWellFormedRecursionWalker now recurses, but why is that needed?

Didn't in already recurse before? Didn't check the blame, but I do think
it's recursing for a long time.

I just see an added line to check the new filter clause for subqueries
or something to recurse into.

> 11) It's not clear to me why we need p_creating_stored_object, and it's
> not explained anywhere. Maybe it's obvious, but not to me.

Sorry, that's my oversight. We need to take stronger locks, if we store
something materially, to prevent concurrent changes, while we store it.

Where do you recon a better comment would be helpful? A longer comment
directly in the struct definition?

> 12) I'm very skeptical anyone can meaningfully review parse_key_join.c.
> It's 150KB with ~5200 lines (very dense, with pretty minimal comments).
> That's a massive file. The chance of me declaring this committable is
> about 0%, simply because I wouldn't believe I understand it.
>
> I think it needs to be broken up into smaller pieces, somehow. I don't
> know how, but perhaps it's possible to extract a "minimal feature"
> handling some limited subset of cases, and then gradually expand it?

I think there is serious complexity involved in just getting the basic
the architecture. That being said, I am very open to any idea of a
stepping stone to this.

It's something I thought a lot about during the last weeks. I am toying
with the thought of ripping out all changes related to anything stored
like views and sql functions as a minimal, minimal version. It's so bare
bone, I'm not terribly exited about doing that, but maybe it's a
necessary evil. Having thought about a lot of edge cases related to that
storing, As I see it, we wouldn't save a lot of lines of code with this,
but it makes reasoning about correctness easier. I can testament to the
complexity in reasoning and correctness it adds. Do you have any other
ideas for a stepping stone, that could reasonably land?

I think this one of the most fundamental questions to be answered,
before we optimize one of the paths. What is the first consumer of the
surface fact framework and how can we stage a minimal commit around it.
We need a committer to feel save to commit something with this
framework. If you have something notably smaller than this, I am all ears.

> Furthermore, it seems a lot of that file is duplicate with code we
> already have elsewhere. For example, couldn't it use a bunch of list
> functions instead of writing a local version?
>
> list_contains_equal_node -> list_member
> append_dependencies_unique -> list_concat_unique
> append_dependency_unique -> list_append_unique
> ...
>
> There may be more such cases, I haven't looked for them.
Thanks for bringing it up. I will look into that, once I've properly
gone through the other suggestions. I suspect there won't be many line
to save, but I'll have a look.
> 13) I think the main question is whether parse-analyze is the right
> place to handle this. I don't know. I assume one of the reasons for
> doing that is to get an error when defining a view, or when altering an
> object - e.g. like when dropping a function used by a view. Which seems
> reasonable, people would not like views silently broken by DDL.
>
> But it seems to be this also leads to a lot of code duplication, because
> the parse analysis now has to "reimplement" a lot of the stuff already
> done in later stages, after parse analyze. For example, we have the
> root->fkey_list thing, but that's not available yet.
>
> There's probably more such information - like innerrel_is_unique,
> rel_is_distinct_for, relation_has_unique_index_for etc.
>
> Maybe it's not worth it? What if we did this in the planner instead? How
> much simpler would it get? My intuition is it'd get much smaller, but
> maybe I'm wrong.
>
> It'd probably mean it's not necessary to expand views during parse,
> which was one of the problems mentioned at the beginning of this thread.
>
> I suppose we'd need to keep additional information in the plan to know
> which joins to check, etc.

If we want to use this for correctness and for proofs, we can't do it
plan time. We have DDL-writes to store proofs. In it's nature this is a
compile time feature.

Even if we don't include all of that in the first patch, this places
very severe limitations like the DDL issue, which would be just tooo
limiting to get to as an end place. I do think it's more reasonable how
to cut this up into more digestible pieces.

Even though the planner is messy, I feel more at home there than here in
the parser. I could see a world were we clean up some of those out of
the planner to make them available earlier some time in parsing. We
could push those variables forward to the planner to avoid redoing the
work. Each of these steps would need a careful performance check and I
am not sure how the real stepping stones would look like.

> 14) If we wanted to use some of this for the starjoin planning stuff,
> we'd need to propagate more information too, I think. But I'm not sure
> about this - we already have the information about foreign keys, so if
> key joins are tied to foreign keys, there would not be no new useful
> information I suppose.
>
> Plus, I don't want to make that patch dependent on people using new
> syntax. If that can give us *additional* information, that would be a
> different thing.

As I said above, if we figure out what information we want to propagate
this will be helpful. We will be able to propagate this. And if we can
get in a base patch with this architecture, adding this should be
straight forward.

One thing, that sounds fairly simple to derive from our proof, is the
proof graph. This means in every node we denote, this node can be
treated as cardinality preserving joined onto a different node. For
planning we need something more global, because we need to consider
something WHERE/HAVING, but the existence of such a clause shouldn't be
complex to check for.

> [...]
>
> 16) I was wondering what performance impact this has, roughly. So I
> created a simple 4-way join with fact + 3 dimensions:
>
> create table dim1 (id serial primary key, val text);
> create table dim2 (id serial primary key, val text);
> create table dim3 (id serial primary key, val text);
> create table f (id serial primary key,
> d1 int not null references dim1(id),
> d2 int not null references dim2(id),
> d3 int not null references dim3(id));
>
> and then measured throughput with 2 queries
>
> select * from f
> join dim1 on (dim1.id = d1)
> join dim2 on (dim2.id = d2)
> join dim3 on (dim3.id = d3);
>
> select * from f
> join dim1 for key (id) <- f (d1)
> join dim2 for key (id) <- f (d2)
> join dim3 for key (id) <- f (d3);
>
> which is the same query, except for the FOR KEY syntax. And I got this
> (under explain, to only do the planning):
>
> patched master
> -----------------------------
> join 25251 24906
> keyjoin 17159
>
> So that's ~30% regression compared to master / regular joins. I also
> tried with join_collapse_limit=1 to eliminate the join order planning:
>
> patched master
> -----------------------------
> join 31476 31252
> keyjoin 19353
>
> That's 40% regression. Of course, this is just explain - once there is
> data in the table, and actual execution, the differences will be much
> smaller. Still, it's not great.
>
> I haven't looked very closely, but based on some quick profiling it
> seems ensure_key_join_surface_facts / compute_key_join_relation_facts is
> doing expensive stuff like findNotNullConstraintAttnum or
> get_index_constraint, both of which do systable scans. That should
> probably go through syscache or something like that.
I'm not too surprised about ensure_key_join_surface_facts holding most
of the regression, since it's doing all the work. Since I spent no time
or thought on optimizing runtime thus far, I think there a lot of long
hanging fruits left. I'd prefer to sort out the architecture first, but
I think, we should be able to improve the parsetime of this feature by a
sizeable amount. I fully agree: Before committing this, we should hit
some of the relevant functions for a quick win.
> regards
>
I will probably work my way through this incrementally. I hope to have a
version incorporating your vast feedback next week.

Regards
Arne

In response to

Re: Key joins at 2026-06-11 21:02:56 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2026-06-12 00:42:16	Re: amcheck: fix bug of missing corruption in allequalimage validation
Previous Message	Chao Li	2026-06-11 23:57:47	Re: [PATCH] REPLICA IDENTITY USING INDEX accepts column with invalid NOT NULL