Re: Next Steps with Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Sadhuprasad Patro <b(dot)sadhu(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Next Steps with Hash Indexes
Date: 2021-08-28 11:00:48
Message-ID: CAA4eK1KM+vFv9PBAkE9QKgMsuHKY-kzRdaaodSN3suDo8kYCYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 27, 2021 at 4:27 PM Sadhuprasad Patro <b(dot)sadhu(at)gmail(dot)com> wrote:
>
> IMHO, as discussed above, since other databases also have the
> limitation that if you create a multi-column hash index then the hash
> index can not be used until all the key columns are used in the search
> condition. So my point is that users might be using the hash index
> with this limitation and their use case might be that they want to
> gain the best performance when they use this particular case and they
> might not be looking for much flexibility like we provide in BTREE.
>
> For reference:
> https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
> https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15
>
> We already know that performance will be better with a single hash for
> multiple columns, but still I just wanted to check the performance
> difference in PG. This might help us to decide the approach we need to
> go for. With a quick POC of both the ideas, I have observed there is a
> major performance advantage with single combined hash for multi-key
> columns.
>
> Performance Test Details: (Used PGBENCH Tool)
> Initialize cmd: “./pgbench -i -s 100 -d postgres"
>
> postgres=# \d+ pgbench_accounts
>
> Table "public.pgbench_accounts"
>
> Column | Type | Collation | Nullable | Default | Storage
> | Compression | Stats target | Description
>
> ----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
>
> aid | integer | | not null | | plain
> | | |
> bid | integer | | | | plain
> | | |
> abalance | integer | | | | plain
> | | |
> filler | character(84) | | | | extended
> | | |
>
> Indexes:
> "pgbench_accounts_idx" hash (aid, bid)
> Access method: heap
> Options: fillfactor=100
>
> Test Command: “./pgbench -j 1 postgres -C -M prepared -S -T 300”
>
> Performance Test Results:
> Idea-1: Single Hash value for multiple key columns
> TPS = ~292
>
> Idea-2: Separate Hash values for each key column. But use only the
> first one to search the bucket. Other hash values are used as payload
> to get to the matching tuple before going to the heap.
> TPS = ~212
>
> Note: Here we got near to 25% better performance in a single combine
> hash approach with only TWO key columns.
>

That's a significant difference. Have you checked via perf or some
other way what causes this difference? I have seen that sometimes
single client performance with pgbench is not stable, so can you
please once check with 4 clients or so and possibly with a larger
dataset as well.

One more thing to consider is that it seems that the planner requires
a condition for the first column of an index before considering an
indexscan plan. See Tom's email [1] in this regard. I think it would
be better to see what kind of work is involved there if you want to
explore a single hash value for all columns idea.

[1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2021-08-28 11:29:08 Re: Gather performance analysis
Previous Message Gilles Darold 2021-08-28 09:57:28 Re: Schema variables - new implementation for Postgres 15