From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Use BumpContext contexts for TupleHashTables' tablecxt
Date: 2025-10-29 22:51:42
Message-ID: CAApHDvpa=Ag++Fyhs8JURv4qUa8UeTkOGPJw1VPTTaz+qg7opg@mail.gmail.com
Lists: pgsql-hackers
On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> >> I can't help wonder if we can improve the memory context parameter
> >> names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
> >> remind myself that it's not for the bucket array, just the stuff we
> >> have the buckets point to. Would "hashedtuplecxt" be better?
>
> > I agree these names are not great. I think they might be leftovers
> > from a time when there actually was a complete hash-table structure
> > in that context.
>
> Looking closer, it seems like a lot of the confusion here arose when
> TupleHashTables were rebased onto simplehash.h. That patch seems to
> have paid about zero attention to updating comments that it'd
> falsified or at least made misleading. Here's a v2 that tries to
> clean up some of the mess.
Thanks for doing that cleanup. A good improvement. Just a few notes
from the review for your consideration:
1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
including hashtable */". How about adjusting that to "hashtable
data"? Is it worth mentioning there that we leave the destruction of
the hash table itself to the reset of the ExecutorState context?
(Apparently execGrouping.c doesn't expose any way to call
tuplehash_destroy() anyway.)
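Roughly what I had in mind, as an untested sketch (I'm guessing at
the field name in the patched nodeSetOp.c; swap in whatever v2
actually calls the tuples context):

void
ExecEndSetOp(SetOpState *node)
{
    /*
     * Free subsidiary stuff, including the hashtable data.  The hash
     * table itself is left for the reset of the ExecutorState context;
     * execGrouping.c doesn't expose tuplehash_destroy() anyway.
     */
    MemoryContextDelete(node->tuplesContext);
}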
2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.
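Roughly what I mean, as an untested sketch (the context name and the
choice of parent context are just placeholders, and
build_hash_table()'s existing contents are elided):

static void
build_hash_table(SetOpState *setopstate)
{
    /*
     * Create the context the hashed tuples will live in here, alongside
     * the rest of the hash table setup.
     */
    setopstate->tuplesContext =
        BumpContextCreate(CurrentMemoryContext,
                          "SetOp hashed tuples",
                          ALLOCSET_DEFAULT_SIZES);

    /* ... existing BuildTupleHashTable() call etc. unchanged ... */
}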
3) Not for this patch, but wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for Setops as I fixed in
Memoize in 57f59396b, which fixed a slow EXPLAIN in [1]. That was
mostly down to a bug that overestimated the number of buckets, but
there are legitimate reasons to want lots of buckets too... a big table
and lots of work_mem.
create table t (a int);
insert into t values(1);
alter table t alter column a set (n_distinct = -1); -- all values distinct
analyze t;
-- hack to fool the planner into thinking it's a big table.
update pg_class set reltuples = 1e9 where oid = 't'::regclass;
set work_mem = '4GB';
\timing on
explain (summary on) select a from t except select a from t;
 HashSetOp Except  (cost=25000002.00..27500002.00 rows=1000000000 width=4)
   Disabled: true
   ->  Seq Scan on t  (cost=0.00..10000001.00 rows=1000000000 width=4)
   ->  Seq Scan on t t_1  (cost=0.00..10000001.00 rows=1000000000 width=4)
 Planning Time: 0.091 ms            <-- not a long time
(5 rows)

Time: 1658.866 ms (00:01.659)       <-- a long time.
> I'm also proposing to move the MemoryContextReset of that context
> into ResetTupleHashTable instead of documenting that we expect the
> caller to do that. The old way just seems like a bizarre artifact.
>
> > Related to this, while I was chasing Jeff's complaint I realized that
> > the none-too-small simplehash table for this is getting made in the
> > query's ExecutorState. That's pretty awful from the standpoint of
> > being able to blame memory consumption on the hash node.
>
> I didn't do anything about this. I briefly considered having
> BuildTupleHashTable construct a per-hashtable context, but backed
> off after seeing that nodeAgg.c goes to some effort to use the same
> metacxt and tuplescxt for several hashtables. I'm not sure if that
> had measured performance benefits behind it, but it well might have.
> I now think that if we care, the right thing is to make the remaining
> callers construct extra memory contexts for their hash tables. That
> could be left for another day.
Reminds me of 590b045c3, though I think it's maybe not as bad here,
since the bucket array for the hash table is more likely to be a
large allocation, which aset.c will put on a dedicated AllocBlock
rather than sharing an AllocBlock with other chunks.
ResetTupleHashTable() also just results in an SH_RESET(), which
doesn't do any pfreeing work (just a memset()), i.e. it shouldn't
cause fragmentation due to many rescans. Before 590b045c3, we used
to pfree() all tuples at tuplestore_end(), which happens in a
Materialize node's rescan, and that resulted in bad fragmentation of
the ExecutorState context. Nothing like that is happening here, so
I'm not that worried about it, though I agree it isn't ideal.
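(For anyone following along: SH_RESET() boils down to roughly the
following. This is a paraphrase of simplehash.h's generated
tuplehash_reset(), going from memory, so check the header for the
real thing.)

static void
tuplehash_reset(tuplehash_hash *tb)
{
    /* wipe the bucket array in place; no per-entry pfree() happens */
    memset(tb->data, 0, sizeof(TupleHashEntryData) * tb->size);
    tb->members = 0;
}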
David
[1] https://postgr.es/m/CAFj8pRAMp%3DQsMi6sPQJ4W3hczoFJRvyXHJV3AZAZaMyTVM312Q%40mail.gmail.com