Quick Links

Hash index build performance tweak from sorting

From:	Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Hash index build performance tweak from sorting
Date:	2022-04-18 21:35:21
Message-ID:	CANbhV-FG-1ZNMBuwhUF7AxxJz3u5137dYL-o6hchK1V_dMw86g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hash index pages are stored in sorted order, but we don't prepare the
data correctly.

We sort the data as the first step of a hash index build, but we
forget to sort the data by hash as well as by hash bucket. This causes
the normal insert path to do extra pushups to put the data in the
correct sort order on each page, which wastes effort.

Adding this patch makes a CREATE INDEX about 8-9% faster, on an unlogged table.

Thoughts?

Aside:

I'm not very sure why tuplesort has private code in it dedicated to
hash indexes, but it does.

Even more strangely, the Tuplesortstate fixes the size of max_buckets
at tuplesort_begin() time rather than tuplesort_performsort(), forcing
us to estimate the number of tuples ahead of time rather than using
the exact number. Next trick would be to alter the APIs to allow exact
values to be used for sorting, which would allow page at a time
builds.

--
Simon Riggs http://www.EnterpriseDB.com/

Attachment	Content-Type	Size
hash_sort_by_hash.v1.patch	application/octet-stream	811 bytes
hashtest.sql	application/octet-stream	228 bytes

Responses

Re: Hash index build performance tweak from sorting at 2022-04-30 11:12:17 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Thomas Munro	2022-04-18 21:45:11	Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Previous Message	Simon Riggs	2022-04-18 21:05:57	Re: Dump/Restore of non-default PKs