Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Date: 2020-12-14 18:59:03
Message-ID: 1013756.1607972343@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
>> A quick count of grep hits suggest that the large majority of
>> existing hash_create() calls use HASH_BLOBS, and there might be
>> only order-of-ten calls that would need to be touched if we
>> required an explicit HASH_STRING flag. So option #2 is seeming
>> kind of attractive. Maybe that together with an assertion that
>> string keys have to exceed 8 or 16 bytes would be enough protection.

> Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction. Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and insists
on keysize > 8 for string keys. In examining the now-easily-visible uses
of string keys, almost all of them are using NAMEDATALEN-sized keys, or
in a few places larger values. Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings. That seems pretty damfool
to me, so I'm inclined to change it to store binary OIDs instead;
those'd be a third the size (or probably a quarter the size after
alignment padding) and likely faster to hash or compare. But I
didn't do that here, since it's still more than 8. (I did whack
it upside the head to the extent of not storing its temporary
hash table in CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM? The default sizes that dynahash.c uses without that are
undocumented and basically useless. We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.
I added a memset call to a couple of callers that I was touching
in this patch, but I'm having second thoughts about that. Maybe
we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero. The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.

regards, tom lane

Attachment Content-Type Size
invent-HASH_STRINGS-flag-1.patch text/x-diff 12.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua Drake 2020-12-14 19:50:07 Optimizing the documentation
Previous Message Bossart, Nathan 2020-12-14 18:25:23 Re: archive status ".ready" files may be created too early