Re: Next Steps with Hash Indexes

From: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Sadhuprasad Patro <b(dot)sadhu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Next Steps with Hash Indexes
Date: 2021-10-13 10:43:44
Message-ID: CANbhV-HtPWD8qrc3FPxZw-UW6BPp7zwm_G05UALb3KkiEZ4RJA@mail.gmail.com
Lists: pgsql-hackers

On Tue, 5 Oct 2021 at 20:06, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:

> >>> I have presented a simple, almost trivial, patch to allow multi-col
> >>> hash indexes. It hashes the first column only, which can be a downside
> >>> in *some* cases. If that is clearly documented, it would not cause
> >>> many issues, IMHO. However, it does not have any optimization issues
> >>> or complexities, which is surely a very good thing.
> >>>
> >>> Trying to involve *all* columns in the hash index is a secondary
> >>> optimization. It requires subtle changes in optimizer code, as Tom
> >>> points out. It also needs fine tuning to make the all-column approach
> >>> beneficial for the additional cases without losing against what the
> >>> "first column" approach gives.
> >>>
> >>> I did consider both approaches and after this discussion I am still in
> >>> favour of committing the very simple "first column" approach to
> >>> multi-col hash indexes now.
> >>
> >> But what about the other approach suggested by Tom, where we hash
> >> only the first column to identify the bucket, but also store the
> >> hash values of the other columns? With that, we don't need changes
> >> in the optimizer, and we can also avoid a lot of disk fetches,
> >> because after finding the bucket we can match the secondary columns
> >> before fetching the heap tuple. I agree that one downside of this
> >> approach is that it will increase the index size.
> >
> > Identifying the bucket is the main part of a hash index's work, so
> > that part would be identical.
> >
> > Once we have identified the bucket, we sort the bucket page by hash,
> > so having an all-col hash would help de-duplicate multi-col hash
> > collisions, but not enough to be worth it, IMHO, given that storing an
> > extra 4 bytes per index tuple is a significant size increase which
> > would cause extra overflow pages etc. The same thought applies to
> > 8-byte hashes.
> >
>
> IMO it'd be nice to show some numbers to support the claims that storing
> the extra hashes and/or 8B hashes is not worth it ...

Using an 8-byte hash is possible, but it only becomes effective once
4-byte hash collisions get hard to manage. An 8-byte hash also makes
the index about 20% bigger, so it is not a good default.

Let's look at the distribution of values:

In a table with 100 million rows, with consecutive monotonic values
starting at 1:
  No collisions  - 98.8%
  1 collision    - 1.15%
  2+ collisions  - 0.009% (8979 values colliding)
  Max = 4

In a table with 1 billion rows (2^30), with consecutive monotonic
values starting at 1:
  No collisions  - 89.3%
  1 collision    - 9.8%
  2 collisions   - 0.837%
  3+ collisions  - 0.0573% (615523 values colliding)
  Max = 9

At 100 million rows, the collisions from a 4-byte hash are not
painful, but by a billion rows they are starting to become a problem,
and by 2 billion rows we have a noticeable issue (>18% collisions).
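
Distributions like these can be reproduced empirically; here is a
sketch using the built-in hashint8() function (which may differ from
the exact hash the index AM applies, and is an expensive query at
these row counts), counting how many extra keys share each hash value:

  SELECT extra_keys, count(*) AS hash_values
  FROM (SELECT hashint8(i::bigint) AS h, count(*) - 1 AS extra_keys
        FROM generate_series(1, 100000000) AS i
        GROUP BY h) AS s
  GROUP BY extra_keys
  ORDER BY extra_keys;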

Clearly, 8-byte hash values would be appropriate for tables larger
than this. However, we expect users to know about and to use
partitioning, with reasonable limits somewhere in the 100 million row
(with 100 byte rows, 10GB) to 1 billion row (with 100 byte rows,
100GB) range.

The change from 4-byte to 8-byte hashes seems simple, so I am not
against it on grounds of complexity. IMHO there is no use case for
8-byte hashes, since reasonable users would not have tables big enough
to care.

That is my reasoning, YMMV.

> I'm sure there are cases where it'd be a net loss, but imagine for
> example a case when the first column has a lot of duplicate values.
> Which is not all that unlikely - duplicates seem like one of the natural
> reasons why people want multi-column hash indexes. And those duplicates
> are quite expensive, due to having to access the heap. Being able to
> eliminate those extra accesses cheaply might be a clear win, even if it
> makes the index a bit larger (shorter hashes might be enough).

I agree, eliminating duplicates would be a good thing, if that is possible.

However, hashing on multiple columns doesn't eliminate collisions; we
can still get them from different combinations of values.

With a first-column hash, (1,1) and (1,2) collide. But with an
all-column hash, (1,2) and (2,1) can collide instead, e.g. if the
per-column hashes are combined with a commutative operation such as
XOR. So we can still end up with collisions, and which ones we get
depends on the data values and types. We can all come up with pessimal
use cases.
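
As a quick illustration of that order-insensitivity, assuming the
per-column hashes were combined with XOR (the combining function would
be an implementation choice):

  SELECT hashint8(1) # hashint8(2) = hashint8(2) # hashint8(1);
  -- returns true: the pair hashes the same in either order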

Perhaps it would be possible to specify a parameter that says how many
columns in the index are part of the hash? Not sure how easy that is.
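
Purely as a hypothetical sketch (the reloption name is invented here
for illustration; nothing like it exists today), it might look like:

  CREATE INDEX ON t USING hash (a, b, c) WITH (hash_key_columns = 1);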

If you have a situation with lots of duplicates, then use btrees
instead. We shouldn't have to make hash indexes work well for *all*
cases before we allow multiple columns for some cases. The user will
already get to compare btree-vs-hash before they use them in a
particular use case. The perfect should not be the enemy of the good.

Storing multiple hashes uses more space and is more complex. It
doesn't feel like a good solution to me, given that the purpose of an
index is not completeness but optimality.
Storing 2 4-byte hashes uses 20% more space than one 4-byte hash.
Storing 2 8-byte hashes uses 40% more space than one 4-byte hash.
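For scale: a hash index entry today costs roughly 20 bytes on the page
(8-byte index tuple header plus the 4-byte hash, MAXALIGNed, plus a
4-byte line pointer), so each extra 4 bytes of stored hash is on the
order of a 20% increase, before alignment effects.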

> > IMHO, multi-column hash collisions are a secondary issue, given that
> > we can already select the column order for an index and hash indexes
> > would only be used by explicit user choice. If there are some minor
> > sub-optimal aspects of using hash indexes, then btree was already the
> > default and a great choice for many cases.
> >
>
> But we can't select arbitrary column order, because the first column is
> used to select the bucket. Which means it's also about what columns are
> used by the queries. If the query is not using the first column, it
> can't use the index.

Neither approach works in that case, so it is moot: you cannot use a
first-column hash, nor an all-column hash, for a query that does not
constrain the first column.
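
For example, given a (hypothetical) multi-column hash index

  CREATE INDEX ON t USING hash (a, b);

a query such as

  SELECT * FROM t WHERE b = 1;

cannot use the index under either design, because the bucket cannot be
computed without a value for column a.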

--
Simon Riggs http://www.EnterpriseDB.com/
