Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Date: 2020-12-13 20:06:08
Message-ID: 20201213200608.GA2309473@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> But what jumps out at me here is that this sort of error seems way
> too easy to make, and evidently way too hard to detect. What can we
> do to make it more obvious if one has incorrectly used or omitted
> HASH_BLOBS? Both directions of error might easily escape notice on
> little-endian hardware.
>
> I thought of a few ideas, all of which have drawbacks:
>
> 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> This seems to just move the problem somewhere else, besides which
> it'd require touching an awful lot of callers, and would silently
> break third-party callers.
>
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
> hazard of mindless-copy-and-paste, but I think it might make it
> a little more obvious. Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability. Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

> 3. Add some sort of heuristic restriction on keysize. A keysize
> that's only 4 or 8 bytes almost certainly is not a string.
> This doesn't give us much traction for larger keysizes, though.
>
> 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> in string_hash(). I think we could get away with that given that
> SQL disallows empty identifiers. However, it would only help to
> catch one direction of error (omitting HASH_BLOBS), and it would
> only help on big-endian hardware, which is getting harder to find.
> Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a given
hash table. (In contrast, what (3) asserts on is usually a compile-time
constant.) I would stop short of adding (4), though it could be okay.

> A quick count of grep hits suggest that the large majority of
> existing hash_create() calls use HASH_BLOBS, and there might be
> only order-of-ten calls that would need to be touched if we
> required an explicit HASH_STRING flag. So option #2 is seeming
> kind of attractive. Maybe that together with an assertion that
> string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction. Requiring more bytes trades specificity for sensitivity.

> A different angle we could think about is that the name "HASH_BLOBS"
> is kind of un-obvious. Maybe we should deprecate that spelling in
> favor of something like "HASH_BINARY".

With (2) in place, I wouldn't worry about renaming HASH_BLOBS. It's hard to
confuse with HASH_STRINGS or HASH_FUNCTION. If anything, HASH_BLOBS conveys
something more specific. HASH_FUNCTION cases see binary data, but that data
has structure that promotes it out of "blob" status.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-12-13 21:40:32 Re: Rethinking plpgsql's assignment implementation
Previous Message Zhihong Yu 2020-12-13 18:42:09 Re: query on smallint array column