Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2020-08-28 15:33:14
Message-ID: CA+Tgmoa8=_mSVNMxy_xy2JcLH9MgVWi7DSAQJJN8F81pxFeXNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 19, 2020 at 6:58 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> On 2020-Aug-19, David Rowley wrote:
> > Andres' suggestion:
> > regression=# explain (analyze, costs off, timing off, summary off)
> > select count(*) from tenk1 t1 inner join tenk1 t2 on
> > t1.twenty=t2.unique1;
> > QUERY PLAN
> > ---------------------------------------------------------------------------------------
> > Aggregate (actual rows=1 loops=1)
> > -> Nested Loop (actual rows=10000 loops=1)
> > Cache Key: t1.twenty Hits: 9980 Misses: 20 Evictions: 0 Overflows: 0
> > -> Seq Scan on tenk1 t1 (actual rows=10000 loops=1)
> > -> Index Scan using tenk1_unique1 on tenk1 t2 (actual rows=1 loops=20)
> > Index Cond: (unique1 = t1.twenty)
> > (6 rows)
>
> I think it doesn't look terrible in the SubPlan case -- it kinda makes
> sense there -- but for nested loop it appears really strange.

I disagree. I don't know why anyone should find this confusing, except
that we're not used to seeing it. It seems to make a lot of sense that
if you are executing the same plan tree with different parameters, you
might want to cache results to avoid recomputation. So why wouldn't
nodes that do this include a cache?

This is not necessarily a vote for Andres's proposal. I don't know
whether it's technically better to include the caching in the Nested
Loop node or to make it a separate node, and I think we should do the
one that's better. Getting pushed into an inferior design because we
think the EXPLAIN output will be clearer does not make sense to me.

I think David's points elsewhere on the thread about ProjectSet and
Materialize nodes are interesting. It was never very clear to me why
ProjectSet was handled separately in every node, adding quite a bit of
complexity, and why Materialize was a separate node. Likewise, why are
Hash Join and Hash two separate nodes instead of just one? Why do we
not treat projection as a separate node type even when we're not
projecting a set? In general, I've never really understood why we
choose to include some functionality in other nodes and keep other
things separate. Is there even an organizing principle, or is it just
historical baggage?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-08-28 15:46:49 More aggressive vacuuming of temporary tables
Previous Message Gilles Darold 2020-08-28 15:21:22 Re: New default role- 'pg_read_all_data'