Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2020-05-20 12:56:34
Message-ID: CANP8+jLkh_g7Eu-=WoNP7P-Mi2Ye=Kt-5kceho80Xnd8xWrvfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 20 May 2020 at 12:44, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> Hackers,
>
> Over on [1], Heikki mentioned about the usefulness of caching results
> from parameterized subplans so that they could be used again for
> subsequent scans which have the same parameters as a previous scan.
> On [2], I mentioned that parameterized nested loop joins could see
> similar gains with such a cache. I suggested there that instead of
> adding code that only allows this to work for subplans, that instead,
> we add a new node type that can handle the caching for us. We can
> then just inject that node type in places where it seems beneficial.
>

Very cool

> I've attached a patch which implements this. The new node type is
> called "Result Cache". I'm not particularly wedded to keeping that
> name, but if I change it, I only want to do it once. I've got a few
> other names I mind, but I don't feel strongly or confident enough in
> them to go and do the renaming.
>
> How the caching works:
>
> First off, it's only good for plugging in on top of parameterized
> nodes that are rescanned with different parameters. The cache itself
> uses a hash table using the simplehash.h implementation. The memory
> consumption is limited to work_mem. The code maintains an LRU list and
> when we need to add new entries but don't have enough space to do so,
> we free off older items starting at the top of the LRU list. When we
> get a cache hit, we move that entry to the end of the LRU list so that
> it'll be the last to be evicted.
>
> When should we cache:
>
> For nested loop joins, the decision is made purely based on cost.

I thought the main reason to do this was the case when the nested loop
subplan was significantly underestimated and we realize during execution
that we should have built a hash table. So including this based on cost
alone seems to miss a trick.

> The patch does rely heavily on good ndistinct estimates.

Exactly. We know we seldom get those with many-way joins.

So +1 for adding this technique. My question is whether it should be added
as an optional facility of a parameterised sub plan, rather than an
always-needed full-strength node. That way the choice of whether to use it
can happen at execution time once we notice that we've been called too many
times.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
Mission Critical Databases

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vik Fearing 2020-05-20 13:04:20 Re: SEARCH and CYCLE clauses
Previous Message Atsushi Torikoshi 2020-05-20 12:56:04 Re: Is it useful to record whether plans are generic or custom?