Re: Next Steps with Hash Indexes

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Sadhuprasad Patro <b(dot)sadhu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Next Steps with Hash Indexes
Date: 2021-10-05 11:24:01
Message-ID: CAFiTN-sSJUQfMm1reaNeot6MHuFEFPJ-CZF=AZUCfuP2f2yuRw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> wrote:
>
> On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b(dot)sadhu(at)gmail(dot)com> wrote:
> > > >
> > > > And to get the multi-column hash index selected, we may set
> > > > enable_hashjoin = off, to avoid any condition becoming a join
> > > > condition; I have seen similar behavior in other DBs as well...
> > >
> > > This may be related to Tom's point: if some of the quals are removed
> > > by optimization or converted to join quals, then even though the user
> > > has given quals on all the key columns, the index scan will not be
> > > selected, because we would be forcing that the hash index can only be
> > > chosen when it has quals on all the key attributes?
> > >
> > > I don't think suggesting enable_hashjoin = off is a solution,
> > >
> >
> > Yeah, this doesn't sound like a good idea. How about instead exploring
> > the idea where the hash (bucket assignment and search) is based on the
> > first index key and the other columns are stored as payload? I think
> > this might pose some difficulty for the follow-up patch to enable
> > unique indexes, because it will increase the chance of traversing more
> > buckets during uniqueness checks. If we see such problems, then I have
> > another idea to minimize the number of buckets that we need to lock
> > during a uniqueness check: lock chaining, as used during hash bucket
> > cleanup, where we never need to lock more than two buckets at a time.
>
> I have presented a simple, almost trivial, patch to allow multi-col
> hash indexes. It hashes the first column only, which can be a downside
> in *some* cases. If that is clearly documented, it would not cause
> many issues, IMHO. However, it does not have any optimization issues
> or complexities, which is surely a very good thing.
>
> Trying to involve *all* columns in the hash index is a secondary
> optimization. It requires subtle changes in optimizer code, as Tom
> points out. It also needs fine tuning to make the all-column approach
> beneficial in the additional cases without regressing what the
> "first column" approach already gives.
>
> I did consider both approaches and after this discussion I am still in
> favour of committing the very simple "first column" approach to
> multi-col hash indexes now.

But what about the other approach suggested by Tom: hash only the first
column to identify the bucket, but also store the hash values of the
other columns in the index tuple? With that, we don't need any optimizer
changes, and we can also avoid a lot of disk fetches, because after
finding the bucket we can match the stored hashes of the secondary
columns before fetching the heap tuple. I agree one downside of this
approach is that it will increase the index size.
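To make that concrete, here is a minimal standalone C sketch of what I
have in mind (the struct and function names are invented for
illustration; this is not the actual hash AM code):

/*
 * Sketch: bucket placement uses only the first key's hash, but the
 * combined hash of the remaining keys is kept in the index tuple so
 * non-matching entries can be skipped without a heap fetch.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 1024

typedef struct MultiColHashTuple
{
    uint32_t    first_key_hash;   /* hash of column 1: decides the bucket */
    uint32_t    rest_key_hash;    /* combined hash of columns 2..n: payload */
    uint64_t    heap_tid;         /* stand-in for the heap tuple pointer */
} MultiColHashTuple;

/* Bucket assignment looks only at the first column's hash. */
static uint32_t
bucket_for(uint32_t first_key_hash)
{
    return first_key_hash % NBUCKETS;
}

/*
 * Scan-time filter: inside the bucket chosen by the first key, compare
 * the stored payload hash before touching the heap.  A mismatch proves
 * the tuple cannot match; a match still needs a recheck against the
 * heap tuple, since hashes can collide.
 */
static bool
maybe_matches(const MultiColHashTuple *itup,
              uint32_t first_key_hash, uint32_t rest_key_hash)
{
    return itup->first_key_hash == first_key_hash &&
           itup->rest_key_hash == rest_key_hash;
}

int
main(void)
{
    MultiColHashTuple stored = {0xdeadbeefU, 0x12345678U, 42};

    /* Right first-column hash, wrong payload hash: rejected with no
     * heap access at all. */
    printf("bucket = %u, possible match = %d\n",
           (unsigned) bucket_for(stored.first_key_hash),
           (int) maybe_matches(&stored, 0xdeadbeefU, 0x0U));
    return 0;
}

The extra stored hash is exactly where the index-size overhead comes
from, but any payload-hash mismatch saves a heap fetch.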
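And on the lock-chaining idea Amit mentions above, a rough
hand-over-hand sketch (plain pthreads just to show the pattern, not the
hash AM's actual LWLock machinery):

/*
 * Hand-over-hand ("lock chaining") over a chain of buckets: the next
 * bucket is locked before the current one is released, so we never
 * hold more than two bucket locks at a time.
 */
#include <pthread.h>
#include <stdio.h>

#define NBUCKETS 8

static pthread_mutex_t bucket_lock[NBUCKETS];

static void
walk_bucket_chain(int first, int last)
{
    int cur = first;

    pthread_mutex_lock(&bucket_lock[cur]);
    while (cur < last)
    {
        int next = cur + 1;

        /* Acquire the next lock before dropping the current one, so a
         * concurrent cleanup or split cannot slip in between. */
        pthread_mutex_lock(&bucket_lock[next]);
        pthread_mutex_unlock(&bucket_lock[cur]);
        cur = next;
    }
    pthread_mutex_unlock(&bucket_lock[cur]);
}

int
main(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);

    walk_bucket_chain(0, NBUCKETS - 1);
    printf("walked the chain holding at most two locks at once\n");
    return 0;
}

That mirrors what hash bucket cleanup already does, so the uniqueness
check would not introduce a new locking pattern.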

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
