Re: Why hash OIDs?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why hash OIDs?
Date: 2018-08-28 13:37:08
Message-ID: 18991.1535463428@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
>> Yeah, it would be a terrible idea as a general hash function for use
>> in contexts where the "avalanche effect" assumption is made about
>> information being spread out over the bits (the HJ batching code
>> wouldn't work for example). I was wondering specifically about the
>> limited case of hash tables that are used to look things up in caches.

> I don't understand why that'd be ok there? With a simple 1:1 hash
> function, which you seem to advocate, many hash-tables would be much
> fuller in the 1000-3000 (where pg_class, etc all reside) than in any
> other range. A lot of our tables are populated on-demand, so you'd
> often end up with most of the data in one or two buckets, and a larger
> number largely empty.

Hmm ... but what we actually care about, for dynahash tables, is just
the low order bits of the hash; the original range of the values
isn't interesting. Some desultory experimentation with queries like

select oid::int % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

offers some support for Thomas' idea: the hashoid results are
certainly not better distributed than the raw OIDs, and in some
cases noticeably worse.

I'd supposed that we needed to get the OIDs' high-order bits to
participate in the hash in order to get decent results, but
looking at these numbers makes me wonder.

OTOH, I'm also not convinced it's worth messing with. In principle
we could shave some cycles with a no-op hash function, but I've not
seen anything to make me think that hashoid() is a measurable
bottleneck.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-08-28 13:39:46 Re: typcache.c typos
Previous Message Michael Paquier 2018-08-28 13:36:07 Re: Fix help option of contrib/oid2name