Quick Links

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date:	2021-01-28 05:43:50
Message-ID:	20210128054349.GB7450@telsasoft.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

@cfbot: rebased on 55dc86eca70b1dc18a79c141b3567efed910329d

On Tue, Dec 08, 2020 at 08:15:52PM +1300, David Rowley wrote:
> From cfbfb8187f4e8303fe3358b5c909533ee6629efe Mon Sep 17 00:00:00 2001
> From: "dgrowley(at)gmail(dot)com" <dgrowley(at)gmail(dot)com>
> Date: Thu, 2 Jul 2020 16:06:36 +1200
> Subject: [PATCH v12 1/5] Allow estimate_num_groups() to pass back further
> details about the estimation

> +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one
...
> +typedef struct EstimationInfo
> +{
> + int flags; /* Flags, as defined above to mark special
> + * properties of the estimation. */

Maybe it should be a bits32 ?
(Also, according to Michael, some people preferred 0x01 to 1<<0)

> + /* Ensure we didn't mess up the tracking somehow */
> + Assert(rcstate->mem_used >= 0);

I think these assertions aren't useful since the type is unsigned:
+ uint64 mem_used; /* bytes of memory used by cache */

> + hash_mem_bytes = get_hash_mem() * 1024L;

I think "result cache nodes" should be added here:

doc/src/sgml/config.sgml- <para>
doc/src/sgml/config.sgml- Hash-based operations are generally more sensitive to memory
doc/src/sgml/config.sgml- availability than equivalent sort-based operations. The
doc/src/sgml/config.sgml- memory available for hash tables is computed by multiplying
doc/src/sgml/config.sgml- <varname>work_mem</varname> by
doc/src/sgml/config.sgml: <varname>hash_mem_multiplier</varname>. This makes it
doc/src/sgml/config.sgml- possible for hash-based operations to use an amount of memory
doc/src/sgml/config.sgml- that exceeds the usual <varname>work_mem</varname> base
doc/src/sgml/config.sgml- amount.
doc/src/sgml/config.sgml- </para>

Language fixen follow:

> + * Initialize the hash table to empty.

as empty

> + * prepare_probe_slot
> + * Populate rcstate's probeslot with the values from the tuple stored
> + * in 'key'. If 'key' is NULL, then perform the population by evalulating

sp: evaluating

> From d9c3f2cab13ec26bbd8d1245be6304c506e1f878 Mon Sep 17 00:00:00 2001
> From: "dgrowley(at)gmail(dot)com" <dgrowley(at)gmail(dot)com>
> Date: Tue, 8 Dec 2020 17:54:04 +1300
> Subject: [PATCH v12 4/5] Remove code duplication in nodeResultCache.c

> + * cache_check_mem
> + * Check if we've allocate more than our memory budget and, if so, reduce

allocated

XXX: what patch???

> + * Set the number of bytes each cache entry should consume in the cache.
> + * To provide us with better estimations on how many cache entries we can
> + * store at once we make a call to the excutor here to ask it what memory

spell: executor
once COMMA

> + * inappropriate to do so. If we see that this has been done then we'll
done COMMA

> + * Since we've already estimated the maximum number of entries we can
> + * store at once and know the estimated number of distinct values we'll be
> + * called with, well take this opportunity to set the path's est_entries.

we'll

> + * This will ultimately determine the hash table size that the executor
> + * will use. If we leave this at zero the executor will just choose the
zero COMMA

> + * Set the total_cost accounting for the expected cache hit ratio. We
> + * also add on a cpu_operator_cost to account for a cache lookup. This
> + * will happen regardless of if it's a cache hit or not.

"whether it's a cache hit or not"

> + * Additionally we charge a cpu_tuple_cost to account for cache lookups,
> + * which we'll do regardless of if it was a cache hit or not.

same

> + * get_resultcache_path
> + * If possible,.make and return a Result Cache path atop of 'inner_path'.

dotmake

> +SET work_mem TO '64kB';
> +SET enable_mergejoin TO off;
> +-- Ensure we get some evitions. We're unable to validate the hits and misses

evictions

--
Justin

Attachment	Content-Type	Size
0001-Allow-estimate_num_groups-to-pass-back-further-detai.patch	text/x-diff	9.1 KB
0002-Allow-users-of-simplehash.h-to-perform-direct-deleti.patch	text/x-diff	3.5 KB
0003-Add-Result-Cache-executor-node.patch	text/x-diff	144.3 KB
0004-Remove-code-duplication-in-nodeResultCache.c.patch	text/x-diff	5.1 KB
0005-Use-a-Result-Cache-node-to-cache-results-from-subpla.patch	text/x-diff	27.2 KB

In response to

Re: Hybrid Hash/Nested Loop joins and caching results from subplans at 2020-12-08 07:15:52 from David Rowley

Responses

Re: Hybrid Hash/Nested Loop joins and caching results from subplans at 2021-01-29 05:41:52 from David Rowley

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2021-01-28 05:44:03	Re: logical replication worker accesses catalogs in error context callback
Previous Message	Mark Dilger	2021-01-28 05:05:02	Re: new heapcheck contrib module