Re: What to call an executor node which lazily caches tuples in a hash table?

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Zhihong Yu <zyu(at)yugabyte(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: What to call an executor node which lazily caches tuples in a hash table?
Date: 2021-03-31 01:43:05
Message-ID: CAKU4AWqiwE1r5B2b4XwNhP13eRfaE6SfvCL42cu3ZK8OCTNTaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 31, 2021 at 7:45 AM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:

> Hi,
> I was reading this part of the description:
>
> the Result Cache's
> hash table is much smaller than the hash join's due to result cache only
> caching useful values rather than all tuples from the inner side of the
> join.
>
> I think the word 'Result' should be part of the cache name considering the
> above.
>
> Cheers
>
> On Tue, Mar 30, 2021 at 4:30 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
>> Hackers,
>>
>> Over on [1] I've been working on adding a new type of executor node
>> which caches tuples in a hash table belonging to a given cache key.
>>
>> The current sole use of this node type is to go between a
>> parameterized nested loop and the inner node in order to cache
>> previously seen sets of parameters so that we can skip scanning the
>> inner scan for parameter values that we've already cached. The node
>> could also be used to cache results from correlated subqueries,
>> although that's not done yet.
>>
>> The cache limits itself to not use more than hash_mem by evicting the
>> least recently used entries whenever more space is needed for new
>> entries.
>>
>> Currently, in the patch, the node is named "Result Cache". That name
>> was not carefully thought out. I just needed to pick something when
>> writing the code.
>>
>> Here's an EXPLAIN output with the current name:
>>
>> postgres=# explain (costs off) select relkind,c from pg_class c1,
>> lateral (select count(*) c from pg_class c2 where c1.relkind =
>> c2.relkind) c2;
>> QUERY PLAN
>> ----------------------------------------------------
>> Nested Loop
>> -> Seq Scan on pg_class c1
>> -> Result Cache
>> Cache Key: c1.relkind
>> -> Aggregate
>> -> Seq Scan on pg_class c2
>> Filter: (c1.relkind = relkind)
>> (7 rows)
>>
>> I just got off a team call with Andres, Thomas and Melanie. During the
>> call I mentioned that I didn't like the name "Result Cache". Many name
>> suggestions followed:
>>
>> Here's a list of a few that were mentioned:
>>
>> Probe Cache
>> Tuple Cache
>> Keyed Materialize
>> Hash Materialize
>> Result Cache
>> Cache
>> Hash Cache
>> Lazy Hash
>> Reactive Hash
>> Parameterized Hash
>> Parameterized Cache
>> Keyed Inner Cache
>> MRU Cache
>> MRU Hash
>>
>> I was hoping to commit the final patch pretty soon, but thought I'd
>> have another go at seeing if we can get some consensus on a name
>> before doing that. Otherwise, I'd sort of assumed that we'd just reach
>> some consensus after everyone complained about the current name after
>> the feature is committed.
>>
>> My personal preference is "Lazy Hash", but I feel it might be better
>> to use the word "Reactive" instead of "Lazy".
>>
>> There was some previous discussion on the name in [2]. I suggested
>> some other names in [3]. Andy voted for "Tuple Cache" in [4]
>>
>> Votes? Other suggestions?
>>
>> (I've included all the people who have shown some previous interest in
>> naming this node.)
>>
>> David
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
>> [2]
>> https://www.postgresql.org/message-id/CA%2BTgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw%40mail.gmail.com
>> [3]
>> https://www.postgresql.org/message-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg%40mail.gmail.com
>> [4]
>> https://www.postgresql.org/message-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w%40mail.gmail.com
>>
>>
>>
I want to share some feelings about other keywords. Materialize are used
in
Materialize node in executor node, which would write data to disk when
memory
is not enough, and it is used in "Materialized View", where it stores all
the data to disk
This gives me some feeling that "Materialize" usually has something with
disk,
but our result cache node doesn't.

And I think DBA checks plans more than the PostgreSQL developer. So
some MRU might be too internal for them. As for developers, if they want to
know such details, they can just read the source code.

When naming it, we may also think about some non native English speakers,
so
some too advanced words may make them uncomfortable. Actually when I read
"Reactive", I googled to find what its meaning is. I knew reactive
programming, but I
do not truly understand "reactive hash". And Compared with HashJoin, Hash
may
mislead people the result may be spilled into disk as well. so I prefer
"Cache"
over "Hash".

At last, I still want to vote for "Tuple(s) Cache", which sounds simple
and enough.
I was thinking if we need to put "Lazy" in the node name since we do build
cache
lazily, then I found we didn't call "Materialize" as "Lazy Materialize",
so I think we
can keep consistent.

> I was hoping to commit the final patch pretty soon

Very glad to see it, thanks for the great feature.

--
Best Regards
Andy Fan (https://www.aliyun.com/)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2021-03-31 01:45:55 Re: row filtering for logical replication
Previous Message Michael Paquier 2021-03-31 01:43:00 Re: Refactor SSL test framework to support multiple TLS libraries