Re: Hash index build performance tweak from sorting

From: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash index build performance tweak from sorting
Date: 2022-07-29 12:49:01
Message-ID: CANbhV-FmHF+rky-N4Lr+rfA4547KVrv6MjT-0jfZbyKmBHCXUw@mail.gmail.com
Lists: pgsql-hackers

On Thu, 28 Jul 2022 at 19:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> writes:
> > Thanks for the nudge. New version attached.
>
> I also see a speed improvement from this, so pushed (after minor comment
> editing).

Thanks

> I notice though that if I feed it random data,
>
> ---
> DROP TABLE IF EXISTS hash_speed;
> CREATE unlogged TABLE hash_speed (x integer);
> INSERT INTO hash_speed SELECT random()*10000000 FROM
> generate_series(1,10000000) x;
> vacuum hash_speed;
> \timing on
> CREATE INDEX ON hash_speed USING hash (x);
> ---
>
> then the speed improvement is only about 5% not the 7.5% I see
> with your original test case. I don't have an explanation
> for that, do you?

No, sorry, I don't. It could be an effect of the data distribution or a physical effect.

> Also, it seems like we've left some money on the table by not
> exploiting downstream the knowledge that this sorting happened.
> During an index build, it's no longer necessary for
> _hash_pgaddtup to do _hash_binsearch, and therefore also not
> _hash_get_indextuple_hashkey: we could just always append the new
> tuple at the end. Perhaps checking it against the last existing
> tuple is worth the trouble as a bug guard, but for sure we don't
> need the log2(N) comparisons that _hash_binsearch will do.

Hmm, I had that in an earlier version of the patch; I'm not sure why it
dropped out, since I wrote it last year. But then I have lots of future
WIP patches in the area of hash indexes.
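To illustrate the idea Tom describes, here is a minimal, hypothetical sketch (simplified stand-in types, not PostgreSQL's actual _hash_pgaddtup code): when index tuples arrive pre-sorted by hash value during a build, the insert path can skip the binary search and append at the end, keeping a cheap comparison against the last stored key as a bug guard.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical simplified stand-in for a hash index page. */
typedef struct
{
    uint32_t keys[16];  /* hash keys already on the "page", sorted */
    size_t   nkeys;
} FakePage;

/*
 * Append a key in O(1), relying on the caller feeding keys in
 * nondecreasing hash order (as a sorted index build does).
 * Returns 0 on success, -1 if the page is full.
 */
static int
append_sorted_key(FakePage *page, uint32_t hashkey)
{
    if (page->nkeys >= sizeof(page->keys) / sizeof(page->keys[0]))
        return -1;

    /* Bug guard: verify the sorted-input invariant against the
     * last existing key, instead of log2(N) comparisons. */
    assert(page->nkeys == 0 || page->keys[page->nkeys - 1] <= hashkey);

    page->keys[page->nkeys++] = hashkey;  /* append at the end */
    return 0;
}
```

The one trailing comparison preserves most of the safety of the binary search at a fraction of its cost, which is the trade-off Tom suggests above.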

> Another point that I noticed is that it's not really desirable to
> use the same _hash_binsearch logic for insertions and searches.
> _hash_binsearch finds the first entry with hash >= target, which
> is necessary for searches, but for insertions we'd really rather
> find the first entry with hash > target. As things stand, to
> the extent that there are duplicate hash values we are still
> performing unnecessary data motion within PageAddItem.

That thought is new to me; I will investigate.
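The distinction Tom draws is the classic lower-bound versus upper-bound binary search. A small illustrative sketch (generic helpers over a plain array, not PostgreSQL's _hash_binsearch itself): a search wants the first entry with hash >= target so it scans all duplicates, while an insertion would rather find the first entry with hash > target, placing a new duplicate after its equals so PageAddItem moves no existing data.

```c
#include <stddef.h>
#include <stdint.h>

/* First index i with keys[i] >= target: what a search needs,
 * so it visits every duplicate of target. */
static size_t
lower_bound(const uint32_t *keys, size_t n, uint32_t target)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First index i with keys[i] > target: the preferred insertion
 * point, landing a new duplicate after existing equal keys so no
 * data motion is needed for them. */
static size_t
upper_bound(const uint32_t *keys, size_t n, uint32_t target)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

With duplicate hash values {1, 5, 5, 5, 9}, inserting another 5 at the lower bound (index 1) shifts three entries, while inserting at the upper bound (index 4) shifts only one; that data motion inside PageAddItem is the waste Tom points at.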

> I've not looked into how messy these things would be to implement,
> nor whether we get any noticeable speed gain thereby. But since
> you've proven that cutting the PageAddItem data motion cost
> yields visible savings, these things might be visible too.

It's a clear follow-on thought, so I will pursue it. Thanks for the nudge.

> At this point the cfbot will start to bleat that the patch of
> record doesn't apply, so I'm going to mark the CF entry committed.
> If anyone wants to produce a follow-on patch, please make a
> new entry.

Will do. Thanks.

--
Simon Riggs http://www.EnterpriseDB.com/
