Re: Hash index build performance tweak from sorting

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash index build performance tweak from sorting
Date: 2022-07-28 18:50:13
Message-ID: 3517464.1659034213@sss.pgh.pa.us

Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> writes:
> Thanks for the nudge. New version attached.

I also see a speed improvement from this, so pushed (after minor comment
editing). I notice, though, that if I feed it random data,

---
DROP TABLE IF EXISTS hash_speed;
CREATE UNLOGGED TABLE hash_speed (x integer);
INSERT INTO hash_speed SELECT random()*10000000 FROM
generate_series(1,10000000) x;
VACUUM hash_speed;
\timing on
CREATE INDEX ON hash_speed USING hash (x);
---

then the speed improvement is only about 5%, not the 7.5% I see
with your original test case. I don't have an explanation for
that; do you?

Also, it seems like we've left some money on the table by not
exploiting downstream the knowledge that this sorting happened.
During an index build, _hash_pgaddtup no longer needs to do
_hash_binsearch, and therefore doesn't need
_hash_get_indextuple_hashkey either: we could just always append the
new tuple at the end. Perhaps checking it against the last existing
tuple is worth the trouble as a bug guard, but for sure we don't
need the log2(N) comparisons that _hash_binsearch will do.
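
To make the idea concrete, here's a minimal self-contained sketch of
that build-time fast path. The names (FakePage, fake_addtup_sorted,
and so on) are illustrative stand-ins, not the real hashinsert.c
code; a plain uint32 array stands in for the index page:

---
/*
 * Sketch: when keys arrive presorted (as during an index build fed by
 * the new sorting step), append at the end of the page rather than
 * binary-searching for an insert position.  Illustrative names only.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_CAPACITY 16

typedef struct
{
	uint32_t	keys[PAGE_CAPACITY];
	int			nkeys;
} FakePage;

/* Search path: find the first offset with key >= target (lower bound). */
static int
fake_binsearch(const FakePage *page, uint32_t target)
{
	int			lo = 0,
				hi = page->nkeys;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (page->keys[mid] < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Build path: append in O(1), with a cheap last-key check as a bug guard. */
static void
fake_addtup_sorted(FakePage *page, uint32_t key)
{
	assert(page->nkeys < PAGE_CAPACITY);
	/* sorted input means the new key can never sort before the last one */
	assert(page->nkeys == 0 || page->keys[page->nkeys - 1] <= key);
	page->keys[page->nkeys++] = key;	/* no data motion, no comparisons */
}

int
main(void)
{
	FakePage	page = {.nkeys = 0};
	uint32_t	input[] = {3, 7, 7, 12, 40};	/* presorted hash values */

	for (int i = 0; i < 5; i++)
		fake_addtup_sorted(&page, input[i]);

	/* searches still use the binary search; prints 1 */
	printf("first offset with key >= 7: %d\n", fake_binsearch(&page, 7));
	return 0;
}
---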

Another point that I noticed is that it's not really desirable to
use the same _hash_binsearch logic for insertions and searches.
_hash_binsearch finds the first entry with hash >= target, which
is necessary for searches, but for insertions we'd really rather
find the first entry with hash > target. As things stand, to the
extent that there are duplicate hash values, we are still performing
unnecessary data motion within PageAddItem.
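
The distinction is just the comparison in the probe loop. Here's a
self-contained sketch, again with made-up names rather than the actual
_hash_binsearch code from hashutil.c:

---
/*
 * Sketch of the two binary-search variants over a sorted key array.
 * The only difference is < versus <= in the probe; inserting at the
 * upper bound means existing duplicates never get shifted right.
 */
#include <stdint.h>
#include <stdio.h>

/* Search path: first index with a[i] >= target ("lower bound"). */
static int
lower_bound(const uint32_t *a, int n, uint32_t target)
{
	int			lo = 0,
				hi = n;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (a[mid] < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Insert path: first index with a[i] > target ("upper bound"). */
static int
upper_bound(const uint32_t *a, int n, uint32_t target)
{
	int			lo = 0,
				hi = n;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (a[mid] <= target)	/* the only change: < becomes <= */
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

int
main(void)
{
	uint32_t	a[] = {5, 7, 7, 7, 9};

	/* a scan must start at the first duplicate ... */
	printf("search start for 7: %d\n", lower_bound(a, 5, 7));	/* 1 */
	/* ... but an insert should land after the last one */
	printf("insert slot for 7:  %d\n", upper_bound(a, 5, 7));	/* 4 */
	return 0;
}
---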

I've not looked into how messy these things would be to implement,
nor whether we get any noticeable speed gain thereby. But since
you've proven that cutting the PageAddItem data motion cost
yields visible savings, these things might be visible too.

At this point the cfbot will start to bleat that the patch of
record doesn't apply, so I'm going to mark the CF entry committed.
If anyone wants to produce a follow-on patch, please make a
new entry.

regards, tom lane
