HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Date: 2020-12-13 16:49:31
Message-ID: 590625.1607878171@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
>> The problem is xidhash using strcmp() to compare keys; it needs memcmp().

> Your analysis is correct.

Sorry for not having noticed this thread before. Noah's fix is
clearly correct, and I have no objection to the added test case.
But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect. What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS? Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious. Still requires touching a lot of calls.

3. Add some sort of heuristic restriction on keysize. A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash(). I think we could get away with that given that
SQL disallows empty identifiers. However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

There might be some more options. Also, some of these ideas
could be applied in combination.

A quick count of grep hits suggest that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag. So option #2 is seeming
kind of attractive. Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.

Also, this census now suggests to me that the opposite problem
(copy-and-paste HASH_BLOBS when you meant string keys) might be
a real hazard, since so many of the existing prototypes that you
might copy have HASH_BLOBS. I'm not sure if there's much to be
done for this case though. A small saving grace is that it seems
relatively likely that you'd notice a functional problem pretty
quickly with this type of mistake, since lookups would tend to
fail due to trailing garbage after your lookup string.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious. Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2020-12-13 16:49:58 Re: Add session statistics to pg_stat_database
Previous Message Andrey Borodin 2020-12-13 15:58:18 Re: Speeding up GIST index creation for tsvectors