Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2020-08-25 08:48:37
Message-ID: CAApHDvqJ2cx8dohhTCUcgxGpXygeqM0+QObAmo5rUssA4Q+Ezw@mail.gmail.com
Lists: pgsql-hackers

On Tue, 25 Aug 2020 at 08:26, Andres Freund <andres(at)anarazel(dot)de> wrote:
> While I'm against introducing a separate node for the caching, I'm *not*
> against displaying a different node type when caching is
> present. E.g. it'd be perfectly reasonable from my POV to have a 'Cached
> Nested Loop' join and a plain 'Nested Loop' node in the above node. I'd
> probably still want to display the 'Cache Key' similar to your example,
> but I don't see how it'd be better to display it with one more
> intermediary node.

...Well, this is difficult... For the record, in case anyone missed
it, I'm pretty set on being against doing any node overloading for
this. I think it's a pretty horrid modularity violation regardless of
what text appears in EXPLAIN. I think if we merge these nodes then we
may as well go further and merge in other simple nodes like LIMIT.
Then after a few iterations of that, we end up with a single node
in EXPLAIN whose behavior nobody can figure out. "Here Be Dragons",
as Tom said. That might seem like a bit of an exaggeration, but it is
important to understand that this would start us down that path, and
the more steps you take down that path, the harder it is to return
from it.

Let's look at nodeProjectSet.c, for example, where I recall you spent
quite a bit of time painfully extracting the scattered logic into a
single reusable node (69f4b9c85). I understand your motivation
was for JIT compilation and not to modularise the code, however, I
think the byproduct of that change of having all that code in one
executor node was a good change, and I'm certainly a fan of what it
allowed you to achieve with JIT. I really wouldn't like to put anyone
else in a position of having to extract out some complex logic that we
add to existing nodes in some future version of PostgreSQL. It might
seem quite innocent today, but add a few more years of development and
I'm sure things will get buried a little deeper.

I'm sure you know better than most that the day will come where we go
and start rewriting all of our executor node code to implement
something like batch execution. I'd imagine you'd agree that this job
would be easier if nodes were single-purpose, rather than overloaded
with a bunch of needless complexity that only Heath Robinson himself
could be proud of.

I find it bizarre that, on one hand, for non-parameterized nested
loops, we can have the inner scan become materialized with a
Materialize node (I don't recall complaints about that), yet on the
other hand, for parameterized nested loops, we'd build the caching
into the Nested Loop node itself.
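To illustrate the asymmetry, here are the two plan shapes sketched by
hand (these are not actual EXPLAIN output; the table and index names
are made up):

```
-- Non-parameterized nested loop: materialization is a separate node.
Nested Loop
  ->  Seq Scan on t1
  ->  Materialize
        ->  Seq Scan on t2

-- Parameterized nested loop with the proposed separate node:
Nested Loop
  ->  Seq Scan on t1
  ->  Result Cache
        Cache Key: t1.x
        ->  Index Scan using t2_x_idx on t2
              Index Cond: (x = t1.x)
```

In both cases the tuple-saving machinery sits below the join as its own
node rather than being folded into Nested Loop itself.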

For the other arguments: I'm also struggling a bit to understand the
argument that this makes EXPLAIN easier to read due to reduced nesting
depth. If that's the case, why don't we get rid of the Hash below a
Hash Join? It seems nobody has felt strongly enough about that to go
to the trouble of writing the patch. We could certainly do work to
reduce nesting depth in EXPLAIN, provided you don't care much about
what the plan actually does. One line should be ok if you're not
worried about what's happening to your tuples. Unfortunately, that
does not seem very useful, as the people who do look at EXPLAIN
generally want to know what the planner has decided to do and are
interested in what's happening to their tuples. Hiding away details
that can significantly impact the performance of the plan does not
seem like a great direction to be moving in.

Also, just in case anyone is misunderstanding Andres' argument here:
it's entirely based on the performance impact of having an additional
node. However, given the correct planner choice, there will never be
a gross slowdown due to having the extra node. The costing, the way
it is currently designed, will only choose to use a Result Cache if it
thinks it'll be cheaper to do so, and cheaper means having enough
cache hits for the caching overhead to be worthwhile. If we get a
good cache hit ratio, then the additional node overhead does not exist
during execution, since we don't look any further than the cache
during a cache hit. Only a cache miss requires pulling the tuples
through an additional node. Given perfect statistics (which of course
are impossible) and costs, we'll never slow down the execution of a
plan by having a separate Result Cache node. In reality, poor
statistics, e.g., massive n_distinct underestimations, could cause
slowdowns, but folding this into one node is not going to save us from
that. All that your design will save us from is that 12-nanosecond
per-tuple hop (measured on a 5-year-old laptop) to an additional node
during cache misses. That seems like a strange thing to optimise for,
given that the planner only chooses to use a Result Cache when there's
a good number of expected cache hits.
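To put a rough number on that argument, here is a toy back-of-the-envelope
model (not PostgreSQL's actual costing code; the function name and figures
are purely illustrative) showing that the separate node's per-tuple hop is
only paid on cache misses:

```python
# Toy model: a separate Result Cache node only adds its per-tuple hop
# cost on cache misses, since a cache hit never looks past the cache.
# The 12 ns figure is the per-tuple hop cost quoted above; everything
# else here is a made-up illustration, not real costing code.

HOP_NS = 12  # per-tuple cost of pulling a tuple through an extra node


def extra_node_overhead_ns(n_lookups: int, hit_ratio: float) -> float:
    """Total added cost of the separate node: only misses pay the hop."""
    misses = n_lookups * (1.0 - hit_ratio)
    return misses * HOP_NS


# With a 95% hit ratio over a million inner-side lookups, the separate
# node costs 50,000 misses * 12 ns, i.e. roughly 0.6 ms for the whole
# scan; with a 100% hit ratio it costs nothing at all.
print(extra_node_overhead_ns(1_000_000, 0.95))
print(extra_node_overhead_ns(1_000_000, 1.0))
```

The point being that at the hit ratios for which the planner would pick
a Result Cache at all, the hop overhead is lost in the noise.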

I understand that you've voiced your feelings about this, but what I
want to know is, how strongly do you feel about overloading the node?
Will you stand in my way if I want to push ahead with the separate
node? Will anyone else?

David
