Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2021-01-28 05:43:50
Message-ID: 20210128054349.GB7450@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

@cfbot: rebased on 55dc86eca70b1dc18a79c141b3567efed910329d

On Tue, Dec 08, 2020 at 08:15:52PM +1300, David Rowley wrote:
> From cfbfb8187f4e8303fe3358b5c909533ee6629efe Mon Sep 17 00:00:00 2001
> From: "dgrowley(at)gmail(dot)com" <dgrowley(at)gmail(dot)com>
> Date: Thu, 2 Jul 2020 16:06:36 +1200
> Subject: [PATCH v12 1/5] Allow estimate_num_groups() to pass back further
> details about the estimation

> +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one
...
> +typedef struct EstimationInfo
> +{
> + int flags; /* Flags, as defined above to mark special
> + * properties of the estimation. */

Maybe it should be a bits32 ?
(Also, according to Michael, some people preferred 0x01 to 1<<0)

> + /* Ensure we didn't mess up the tracking somehow */
> + Assert(rcstate->mem_used >= 0);

I think these assertions aren't useful since the type is unsigned:
+ uint64 mem_used; /* bytes of memory used by cache */

> + hash_mem_bytes = get_hash_mem() * 1024L;

I think "result cache nodes" should be added here:

doc/src/sgml/config.sgml- <para>
doc/src/sgml/config.sgml- Hash-based operations are generally more sensitive to memory
doc/src/sgml/config.sgml- availability than equivalent sort-based operations. The
doc/src/sgml/config.sgml- memory available for hash tables is computed by multiplying
doc/src/sgml/config.sgml- <varname>work_mem</varname> by
doc/src/sgml/config.sgml: <varname>hash_mem_multiplier</varname>. This makes it
doc/src/sgml/config.sgml- possible for hash-based operations to use an amount of memory
doc/src/sgml/config.sgml- that exceeds the usual <varname>work_mem</varname> base
doc/src/sgml/config.sgml- amount.
doc/src/sgml/config.sgml- </para>

Language fixen follow:

> + * Initialize the hash table to empty.

as empty

> + * prepare_probe_slot
> + * Populate rcstate's probeslot with the values from the tuple stored
> + * in 'key'. If 'key' is NULL, then perform the population by evalulating

sp: evaluating

> From d9c3f2cab13ec26bbd8d1245be6304c506e1f878 Mon Sep 17 00:00:00 2001
> From: "dgrowley(at)gmail(dot)com" <dgrowley(at)gmail(dot)com>
> Date: Tue, 8 Dec 2020 17:54:04 +1300
> Subject: [PATCH v12 4/5] Remove code duplication in nodeResultCache.c

> + * cache_check_mem
> + * Check if we've allocate more than our memory budget and, if so, reduce

allocated

XXX: what patch???

> + * Set the number of bytes each cache entry should consume in the cache.
> + * To provide us with better estimations on how many cache entries we can
> + * store at once we make a call to the excutor here to ask it what memory

spell: executor
once COMMA

> + * inappropriate to do so. If we see that this has been done then we'll
done COMMA

> + * Since we've already estimated the maximum number of entries we can
> + * store at once and know the estimated number of distinct values we'll be
> + * called with, well take this opportunity to set the path's est_entries.

we'll

> + * This will ultimately determine the hash table size that the executor
> + * will use. If we leave this at zero the executor will just choose the
zero COMMA

> + * Set the total_cost accounting for the expected cache hit ratio. We
> + * also add on a cpu_operator_cost to account for a cache lookup. This
> + * will happen regardless of if it's a cache hit or not.

"whether it's a cache hit or not"

> + * Additionally we charge a cpu_tuple_cost to account for cache lookups,
> + * which we'll do regardless of if it was a cache hit or not.

same

> + * get_resultcache_path
> + * If possible,.make and return a Result Cache path atop of 'inner_path'.

dotmake

> +SET work_mem TO '64kB';
> +SET enable_mergejoin TO off;
> +-- Ensure we get some evitions. We're unable to validate the hits and misses

evictions

--
Justin

Attachment Content-Type Size
0001-Allow-estimate_num_groups-to-pass-back-further-detai.patch text/x-diff 9.1 KB
0002-Allow-users-of-simplehash.h-to-perform-direct-deleti.patch text/x-diff 3.5 KB
0003-Add-Result-Cache-executor-node.patch text/x-diff 144.3 KB
0004-Remove-code-duplication-in-nodeResultCache.c.patch text/x-diff 5.1 KB
0005-Use-a-Result-Cache-node-to-cache-results-from-subpla.patch text/x-diff 27.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-01-28 05:44:03 Re: logical replication worker accesses catalogs in error context callback
Previous Message Mark Dilger 2021-01-28 05:05:02 Re: new heapcheck contrib module