Re: Idea: Avoid JOINs by using path expressions to follow FKs

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Julien Rouhaud" <rjuju123(at)gmail(dot)com>
Cc: "Isaac Morland" <isaac(dot)morland(at)gmail(dot)com>, "Vik Fearing" <vik(at)postgresfriends(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "PostgreSQL Developers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Idea: Avoid JOINs by using path expressions to follow FKs
Date: 2021-03-31 09:18:52
Message-ID: 0a559871-12b8-4e3a-9226-ab2a0aa94805@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 31, 2021, at 08:18, Julien Rouhaud wrote:
> On Wed, Mar 31, 2021 at 12:50:19AM +0200, Joel Jacobson wrote:
> > On Tue, Mar 30, 2021, at 22:01, Isaac Morland wrote:
> > > On Tue, 30 Mar 2021 at 15:33, Joel Jacobson <joel(at)compiler(dot)org <mailto:joel%40compiler.org>> wrote:
> > >>> Also, should the join be a left join, which would therefore return a NULL when there is no matching record? Or could we have a variation such as ->? to give a left join (NULL when no matching record) with -> using an inner join (record is not included in result when no matching record).
> > >>
> > >> Interesting idea, but I think we can keep it simple, and still support the case you mention:
> > >>
> > >> If we only have -> and you want to exclude records where the column is NULL (i.e. INNER JOIN),
> > >> I think we should just use the WHERE clause and filter on such condition.
> > >>
> > >
> > > Just to be clear, it will always be a left join? Agreed that getting the inner join behaviour can be done in the WHERE clause. I think this is a case where simple is good. As long as the left join case is supported I'm happy.
> >
> > Hmm, I guess, since technically, if all foreign key column(s) are declared as NOT NULL, we would know for sure such values exist, so a LEFT JOIN and INNER JOIN would always produce the same result.
> > I'm not sure if the query planner could produce different plans though, and if an INNER JOIN could be more efficient. If it matters, then I think we should generate an INNER JOIN for the "all column(s) NOT NULL" case.
>
> I'm not sure who is supposed to be the target for this proposal.
>
> As far as I understand this won't change the fact that users will still have to
> understand the "relational" part of RDBMS, understand what is a JOIN
> cardinality and everything that comes with it. So you think that people who
> are too lazy to learn the proper JOIN syntax will still bother to learn about
> relational algebra and understand what they're doing, and I'm very doubtful
> about that.
>
> You also think that writing a proper JOIN is complex, but somehow writing a
> proper WHERE clause to subtly change the query behavior is not a problem, or
> that if users want to use aggregate or anything more complex then they'll
> happily open the documentation and learn how to do that. In my experience what
> will happen is that instead users will keep using that limited subset of SQL
> features and build creative and incredibly inefficient systems to avoid using
> anything else and will then complain that postgres is too slow.

Thanks for interesting new insights and questions.

Traditional SQL JOINs reveals less information about the data model,
compared to this new proposed foreign key based syntax.

Traditional SQL JOINs => undirected graph can be inferred
Foreign key joins => directed graph can be inferred

When looking at a traditional join, you might be able to guess the direction,
based on the name of tables and columns, but you cannot know for sure without
looking at the table definitions.

I'm thinking the target is both expert as well as beginner users,
who prefer a more concise syntax and reduced cognitive load:

Imagine a company with two types of SQL users:
1) Tech core team, responsible for schema changes (DDL), such as adding new tables/columns
and adding proper foreign keys.
2) Normal users, responsible for writing SQL queries using the existing schema.

In such a scenario, (2) would use the foreign keys added by (1),
letting them focus on *what* to join and less on *how* to join,
all in line with the objectives of the declarative paradigm.

By using the foreign keys, it is guaranteed you cannot get an
accidental one-to-many join that would multiply the result set.

How many rows a certain big query with lots of joins returns
can be difficult to reason about, you need to carefully inspect each
table to understand what column(s) there are unique constraints on,
that cannot multiply the result set.

If using the -> notation, you would only need to manually
inspect the tables involved in the remaining JOINs;
since you could be confident all uses of -> cannot affect cardinality.

I think this would be a win also for an expert SQL consultant working
with a new complex data model never seen before.

>
> As an example just yesterday some user complained that it's not possible to
> write a trigger on a table that could intercept inserting a textual value on an
> integer field and replace it with the referenced value. And he rejected our
> suggested solution to replace the "INSERT INTO sometable VALUES..." with
> "INSERT INTO sometable SELECT ...". And no this proposal would not have
> changed anything because changing the python script doing the import to add
> some minimal SQL knowledge was apparently too problematic. Instead he will
> insert the data in a temporary table and dispatch everything on a per-row
> basis, using triggers. So here again the problem wasn't the syntax but having
> to deal with a relational rather than an imperative approach.

Sad but a bit funny story. I guess some people cannot learn from others mistake,
but insist on shooting themselves in the foot first.

I understand it must feel wasteful and hopeless trying to educate such users.
Maybe we could recycle the invested energy into such conversations,
by creating a wiki-page for each such anti-pattern, so that each new attempt
at explaining hopefully eventually leads to sufficient information for anyone
to understand why X is a bad idea.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message talk to ben 2021-03-31 09:39:39 Shared buffers advice for windows in the wiki
Previous Message Ajin Cherian 2021-03-31 09:09:58 Re: [PATCH] add concurrent_abort callback for output plugin