Re: Hash index build performance tweak from sorting

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash index build performance tweak from sorting
Date: 2022-05-04 10:27:34
Message-ID: CAA4eK1JiCB85JPe-HT6hA9qEZe4HO1FDAJxJSAyVcptTg0wRHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 2, 2022 at 9:28 PM Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> wrote:
>
> On Sat, 30 Apr 2022 at 12:12, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Apr 19, 2022 at 3:05 AM Simon Riggs
> > <simon(dot)riggs(at)enterprisedb(dot)com> wrote:
> > >
> > > Hash index pages are stored in sorted order, but we don't prepare the
> > > data correctly.
> > >
> > > We sort the data as the first step of a hash index build, but we
> > > forget to sort the data by hash as well as by hash bucket.
> > >
> >
> > I was looking into the nearby comments (Fetch hash keys and mask off
> > bits we don't want to sort by.) and it sounds like we purposefully
> > don't want to sort by the hash key. I see that this comment was
> > originally introduced in the below commit:
> >
> > commit 4adc2f72a4ccd6e55e594aca837f09130a6af62b
> > Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> > Date: Mon Sep 15 18:43:41 2008 +0000
> >
> > Change hash indexes to store only the hash code rather than the
> > whole indexed
> > value.
> >
> > But even before that, we seem to mask off the bits before comparison.
> > Is it that we are doing so because we want to keep the order of hash
> > keys in a particular bucket so such masking was required?
>
> We need to sort by both hash bucket and hash value.
>
> Hash bucket id so we can identify the correct hash bucket to insert into.
>
> But then on each bucket/overflow page we store it sorted by hash value
> to make lookup faster, so inserts go faster if they are also sorted.
>

I also think so. So, we should go with this unless someone else sees
any flaw here.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-05-04 10:54:57 Add a new function and a document page to get/show all the server hooks
Previous Message Amit Kapila 2022-05-04 10:18:47 Re: Logical replication timeout problem