Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2021-03-12 01:49:29
Message-ID: CAApHDvqLbRhwCQ7ZnYAroaf_b-oCa_8V9-HemuFqRzU_T62bsg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 23 Feb 2021 at 18:43, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I doubt it's that bad. We could cache such info in RestrictInfo
> for quals, or PathTarget for tlists, without much new notational
> overhead. That doesn't cover everything the planner deals with
> of course, but it would cover enough that you'd be chasing pretty
> small returns to worry about more.

This seems like a pretty good idea. So I coded it up.

The 0001 patch adds a has_volatile bool field to RestrictInfo and sets
it when building the RestrictInfo. I've also added has_volatile_expr
to PathTarget which is maintained when first building, then adding new
Exprs to the PathTarget. I've modified a series of existing calls to
contain_volatile_functions() to check these new fields first. This
seems pretty good even without the Result Cache patch as it saves a
few duplicate checks for volatile functions. For example, both
check_hashjoinable() and check_mergejoinable() call
contain_volatile_functions(). Now they just check the has_volatile
flag after just calling contain_volatile_functions() once per
RestrictInfo when the RestrictInfo is built.

I tested the performance of just 0001 against master and I did see the
overall planning and execution time of the join order benchmark query
29b go from taking 104.8 ms down to 103.7 ms.

For the Result Cache patch, I've coded it to make use of these new
fields instead of calling contain_volatile_functions().

I also noticed that I can use the pre-cached
RestrictInfo->hashjoinoperator field when it's set. This will be the
same operator as we'd be looking up using lookup_type_cache() anyway.

With Result Cache we can also cache the tuples from non-equality
joins, e.g ON t1.x > t2.y, but we still need to look for the hash
equality operator in that case. I had thoughts that it might be worth
adding an additional field to RestrictInfo for resultcacheoperator to
save having to look it up each time for when hashjoinoperator is not
set.

We must still call estimate_num_groups() once each time we create a
ResultCachePath. That's required in order to estimate the cache hits.
All other join operators only care about clauselist_selectivity(). The
selectivity estimates for those are likely to be cached in the
RestictInfo to save having to do it again next time. There's no
caching for estimate_num_groups(). I don't quite see any way to add
caching for this, however.

I've attached the updated patches.

It took v14 144.6 ms to plan and execute query 29b. It takes v15 128.5
ms. Master takes 104.8 ms (see attached graph). The caching has
improved the planning performance quite a bit. Thank you for the
suggestion.

David

Attachment Content-Type Size
v15-0001-Cache-PathTarget-and-RestrictInfo-s-volatility.patch text/plain 15.3 KB
image/png 34.4 KB
v15-0002-Allow-estimate_num_groups-to-pass-back-further-d.patch text/plain 9.1 KB
v15-0003-Allow-users-of-simplehash.h-to-perform-direct-de.patch text/plain 3.5 KB
v15-0004-Add-Result-Cache-executor-node.patch text/plain 148.7 KB
v15-0005-Remove-code-duplication-in-nodeResultCache.c.patch text/plain 5.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2021-03-12 01:52:28 Re: Procedures versus the "fastpath" API
Previous Message Kyotaro Horiguchi 2021-03-12 01:38:00 Re: shared-memory based stats collector