Re: Reproducible collisions in jsonb_hash

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Valeriy Meleshkin <valeriy(at)meleshk(dot)in>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reproducible collisions in jsonb_hash
Date: 2022-05-12 14:25:07
Message-ID: CA+Tgmob_=6+KfesBKkdeORZDnwXdY_wrYroSrLEKXOV3rruH0A@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 12, 2022 at 9:57 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> > On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
> >> AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars and arrays (it doesn't inspect v.val.array.rawScalar)
>
> > It does look rather like a bug, but I'm unclear about the implications
> > of fixing it.
>
> Changing this hash algorithm would break existing hash indexes on jsonb
> columns. Maybe there aren't any, but even if so I can't get very excited
> about changing this. Hash algorithms always have collisions, and we have
> never made any promise that ours are cryptographically strong.
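
For concreteness, the collision Valeriy describes can be checked
directly from SQL, since jsonb_hash() is the hash opclass support
function (a sketch; I haven't re-verified the exact output here):

    SELECT jsonb_hash('"tom"'::jsonb)   AS scalar_hash,
           jsonb_hash('["tom"]'::jsonb) AS array_hash;
    -- per the report, both columns come out equal, because the
    -- iterator emits the same token stream for a raw scalar and
    -- for the one-element array wrapping it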

I might be missing something, but I don't know why "cryptographically
strong" is the relevant concept here. It seems like the question is
how likely it is that this would lead to queries having crappy
performance. In general we want hash functions to deliver different
values for different inputs so that we give ourselves the best
possible chance of spreading keys evenly across buckets. If for
example we hashed every possible JSON object to the same constant
value, everything we tried to do with this hash function would suck,
and the problem would be so severe that I think we'd have to fix it,
even though it would mean a compatibility break. Or even if we hashed
every JSON integer to the same value, that would be horrible, because
it's quite likely that you could have a column full of json objects
where ignoring the difference between one integer and another results
in a ton of duplicate hash values.
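
If you wanted to check whether a particular column is anywhere near
that bad, you could look at the bucket distribution directly; a rough
sketch, with "docs" and "payload" as made-up table and column names:

    SELECT jsonb_hash(payload) & 1023 AS bucket, count(*) AS n
    FROM docs
    GROUP BY bucket
    ORDER BY n DESC
    LIMIT 10;
    -- a top bucket that dwarfs the rest would be the kind of skew
    -- that leads to the crappy performance described above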

Here, that doesn't seem too likely. You could have a column that
contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth,
where they all get mapped onto the same bucket and you're sad. But
probably you don't. So I'd judge that while it was probably a mistake
to make the hash function work this way, it's not likely to cause
serious problems, and therefore we should probably leave it alone for
now, but add a comment so that if we ever break backward compatibility
for any other reason, we remember to fix this too.
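
(For anyone wondering what the compatibility break would look like in
practice: it's any existing index along these lines, again with
made-up names, that would need a REINDEX after a change to jsonb_hash:

    CREATE INDEX docs_payload_hash ON docs USING hash (payload);

Transient hash tables for joins and aggregation are rebuilt per query,
so it's the on-disk structures that matter.)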

IOW, I think I mostly agree with your conclusion, but perhaps not
entirely with the reasoning.

--
Robert Haas
EDB: http://www.enterprisedb.com
