Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2020-08-18 09:42:25
Message-ID: CAApHDvpi6PT9wAQD6umRO81K3KZat9M7nh_YL7r9kGYwSZxOPA@mail.gmail.com
Lists: pgsql-hackers

On Tue, 11 Aug 2020 at 17:44, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2020-08-11 17:23:42 +1200, David Rowley wrote:
> > On Tue, 11 Aug 2020 at 12:21, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > On 2020-07-09 10:25:14 +1200, David Rowley wrote:
> > > > On Thu, 9 Jul 2020 at 04:53, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > > I'm not convinced it's a good idea to introduce a separate executor node
> > > > > for this. There's a fair bit of overhead in them, and they will only be
> > > > > below certain types of nodes afaict. It seems like it'd be better to
> > > > > pull the required calls into the nodes that do parametrized scans of
> > > > > subsidiary nodes. Have you considered that?
> > > >
> > > > I see 41 different node types mentioned in ExecReScan(). I don't
> > > > really think it would be reasonable to change all those.
> > >
> > > But that's because we dispatch ExecReScan mechanically down to every
> > > single executor node. That doesn't determine how many nodes would need
> > > to be modified to include explicit caching? What am I missing?
> > >
> > > Wouldn't we need roughly just nodeNestloop.c and nodeSubplan.c
> > > integration?
> >
> > hmm, I think you're right there about those two node types. I'm just
> > not sure you're right about overloading these node types to act as a
> > cache.
>
> I'm not 100% either, to be clear. I am just acutely aware that adding
> entire nodes is pretty expensive, and that there's, afaict, no need to
> have arbitrary (i.e. pointer to function) type callbacks to point to the
> cache.

Perhaps you're right, but I'm just not convinced of it. I feel
there's a certain air of magic in taking a node that has a good name
and a reputation for doing one thing and suddenly adding new
functionality to it that makes it behave massively differently.

A counterexample to your argument is that Materialize is a node type.
There's only a limited number of places where that node is used, and
one of them is on the inside of a non-parameterized nested loop. Your
argument for having Nested Loop do the caching would equally suggest
that Materialize should be part of Nested Loop rather than a node in
its own right. There are a few other places Materialize is used, e.g.
scrollable cursors, but by the same reasoning you could say that the
caching should be handled in ExecutePlan(). I just don't think it
should be, as I don't think Result Cache should be part of any other
node or code.
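
For reference, the sort of plan shape I mean looks roughly like this
(illustrative only; table names made up, costs trimmed):

    Nested Loop
      ->  Seq Scan on t1
      ->  Materialize
            ->  Seq Scan on t2

Nobody would suggest folding that materialisation step into the
Nested Loop node itself.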

Another problem I see with overloading nodeSubplan and nodeNestloop
is that we don't really document our executor nodes today, so unless
this patch starts a new standard for that, there's not exactly a good
place to mention that parameterized nested loops may now cache results
from the inner scan.

I do understand what you mean about the additional node overhead. I
saw that during my adventures with INNER JOIN removal a few years ago.
I hope that's mitigated by the fact that I've tried to code the
planner so that, for nested loops, it only uses a Result Cache node
when it thinks doing so will speed things up. That decision is of
course based on having good statistics, which might not always be the
case. I don't quite have that luxury with subplans, due to the lack of
knowledge of the outer plan when planning the subquery.
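
For the nested loop case, the plans the patch produces end up shaped
roughly like this (illustrative only; relation and index names made
up, other details trimmed):

    Nested Loop
      ->  Seq Scan on t1
      ->  Result Cache
            ->  Index Scan using t2_x_idx on t2
                  Index Cond: (x = t1.x)

i.e. the extra node only appears above a parameterized inner scan,
and only when the planner expects the repeated lookups to make it
worth the overhead.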

> > How would you inform users via EXPLAIN ANALYZE of how many
> > cache hits/misses occurred?
>
> Similar to how we display memory for sorting etc.

I was thinking more of how bizarre it would be to see Nested Loop and
SubPlan report cache statistics. It may appear quite magical to users
to see EXPLAIN ANALYZE suddenly mentioning cache hits under their
Nested Loop.
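
Something like this, say (completely made-up EXPLAIN ANALYZE output,
just to illustrate the point):

    Nested Loop (actual time=... rows=... loops=1)
      Cache Hits: 4980  Cache Misses: 20
      ->  Seq Scan on t1 (...)
      ->  Index Scan using t2_x_idx on t2 (...)

seems much harder to explain than the same counters appearing under a
node that is actually called Result Cache.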

> > What would you use to disable it for an
> > escape hatch for when the planner makes a bad choice about caching?
>
> Isn't that *easier* when embedding it into the node? There's no nice way
> to remove an intermediary executor node entirely, but it's trivial to
> have an if statement like
> if (node->cache && upsert_cache(node->cache, param))

What I meant was more that it might not make sense to keep the
enable_resultcache GUC if the caching were part of the existing nodes.
I think people are pretty used to the enable_* GUCs corresponding to
an executor node whose name roughly matches the name of the GUC. In
this case, without a Result Cache node, enable_resultcache wouldn't be
self-documenting. Perhaps we'd want two new GUCs instead:
enable_nestloop_caching and enable_subplan_caching. We don't currently
have any other enable_* GUCs that act as node modifiers. We did have
enable_hashagg_disk until a few weeks ago. Nobody seemed to like that,
but perhaps there were reasons for people not to like it other than
the fact that it was a node-modifier GUC.
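
Concretely, the escape hatch would go from the patch's current:

    SET enable_resultcache = off;

to something like (using the straw-man names above; nothing final):

    SET enable_nestloop_caching = off;
    SET enable_subplan_caching = off;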

I'm wondering if anyone else has any thoughts on this?

David
