Re: Multicolumn hash indexes

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Tomasz Ostrowski <tometzky+pg(at)ato(dot)waw(dot)pl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multicolumn hash indexes
Date: 2017-09-27 21:52:24
Message-ID: 20170927215223.GC1251@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 27, 2017 at 11:57:23AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Wed, Sep 27, 2017 at 9:56 AM, Jesper Pedersen
> > <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> >> Maybe an initial proof-of-concept could store the hash of the first column
> >> (col1) plus the hash of all columns (col1, col2, col3) in the index, and see
> >> what requirements / design decisions would appear from that.
>
> > I thought about that sort of thing yesterday but it's not that simple.
> > The problem is that the hash code isn't just stored; it's used to
> > assign tuples to buckets. If you have two hash codes, you have to
> > pick one of the other to use for assigning the tuple to a bucket. And
> > then if you want to search using the other hash code, you have to
> > search all of the buckets, which will stink.
>
> If we follow GIST's lead that the leading column is "most important",
> the idea could be to require a search constraint on the first column,
> which produces the hash that determines the bucket assignment. Hashes
> for additional columns would just be payload data in the index entry.
> If you have search constraint(s) on low-order column(s), you can check
> for hash matches before visiting the heap, but they don't reduce how
> much of the index you have to search. Even btree works that way for
> many combinations of incomplete index constraints.

But it might be difficult to obtain sufficient selectivity with any
choice of first column for an otherwise reasonable schema design.

Presumably that is why the OP is interested in hash indexing multiple
columns, as otherwise a plain old b-tree would have sufficed.

It might be better to address the planner assumptions about key prefixes
instead, though I recognize that that might be difficult (or perhaps
even undesirable). The potential for slowing down the planner might be
too high.

I note that it's not possible today to create a hash index on an
expression -- if it were then the obvious workaround would be to create
a hash on a `row(<column_list>)` expression, then use a
`row(<column_list>) = row(<literal_or_parameter_list>)` WHERE clause in
queries, which presumably the planner could not decompose and optimize
or pessimize. If need be the planner could learn to treat
row(<literal_or_parameter_list>) as a barrier to such optimizations.

A present workaround would be to denormalize the column list one cares
about into a single record-type column that one could then hash. To use
this one would have to use a `record_column = row(<parameter_list>)`
WHERE clause. Or, if record types cannot be hashed, then use
row_to_json() and hash the JSON text.

Nico
--

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2017-09-27 22:59:06 Re: bgw_type (was Re: Why does logical replication launcher set application_name?)
Previous Message Tomasz Ostrowski 2017-09-27 21:52:19 Re: Multicolumn hash indexes