Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Zhihong Yu <zyu(at)yugabyte(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2021-03-31 23:49:49
Message-ID: CAApHDvr=qNLbYV=P3tk-xQ9S25MvKZFekLeOKC4kH3heQ48=Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 31 Mar 2021 at 05:34, Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:
>
> Hi,
> In paraminfo_get_equal_hashops(),
>
> + /* Reject if there are any volatile functions */
> + if (contain_volatile_functions(expr))
> + {
>
> You can move the above code to just ahead of:
>
> + if (IsA(expr, Var))
> + var_relids = bms_make_singleton(((Var *) expr)->varno);
>
> This way, when we return early, var_relids doesn't need to be populated.

Thanks for having another look. I did a bit more work in that area
and removed that code. I dug a little deeper and I can't see any way
that a lateral_var on a rel can refer to anything inside the rel. It
looks like that code was just a bit over paranoid about that.

I also added some additional caching in RestrictInfo to cache the hash
equality operator to use for the result cache. This saves checking
this each time we consider a join during the join search. In many
cases we would have used the value cached in
RestrictInfo.hashjoinoperator, however, for non-equaliy joins, that
would have be set to InvalidOid. We can still use Result Cache for
non-equality joins.

I've now pushed the main patch.

There's a couple of things I'm not perfectly happy with:

1. The name. There's a discussion on [1] if anyone wants to talk about that.
2. For lateral joins, there's no place to cache the hash equality
operator. Maybe there's some rework to do to add the ability to check
things for those like we use RestrictInfo for regular joins.
3. No ability to cache n_distinct estimates. This must be repeated
each time we consider a join. RestrictInfo allows caching for this to
speed up clauselist_selectivity() for other join types.

There was no consensus reached on the name of the node. "Tuple Cache"
seems like the favourite so far, but there's not been a great deal of
input. At least not enough that I was motivated to rename everything.
People will perhaps have more time to consider names during beta.

Thank you to everyone who gave input and reviewed this patch. It would
be great to get feedback on the performance with real workloads. As
mentioned in the commit message, there is a danger that it causes
performance regressions when n_distinct estimates are significantly
underestimated.

I'm off to look at the buildfarm now.

David

[1] https://www.postgresql.org/message-id/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-04-01 00:34:40 Re: Issue with point_ops and NaN
Previous Message Zhihong Yu 2021-03-31 23:47:14 Re: using extended statistics to improve join estimates