Re: Speed up transaction completion faster after many relations are accessed in a transaction

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Speed up transaction completion faster after many relations are accessed in a transaction
Date: 2021-06-20 13:56:13
Message-ID: CAApHDvoKqWRxw5nnUPZ8+mAJKHPOPxYGoY1gQdh0WeS4+biVhg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 14 Aug 2019 at 19:25, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> For now, I'm out of ideas. If anyone else feels like suggesting
> something or picking this up, feel free.

This is a pretty old thread, so we might need a recap:

# Recap

Basically, LockReleaseAll() becomes slow after a large number of locks
have all been held at once by a backend. The problem is that the
LockMethodLocalHash dynahash table must grow to store all the locks,
and when later transactions take only a few locks, LockReleaseAll() is
slow due to hash_seq_search() having to skip over the sparsely filled
buckets in the bloated hash table.

The following things were tried on this thread. Each one failed:

1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
Simply loop over the dlist in LockReleaseAll().
2) Try dropping and recreating the dynahash table if it becomes
bloated, using some heuristics to decide what "bloated" means and
whether recreating is worthwhile.

#1 failed due to concerns about increasing the size of LOCALLOCK to
store the dlist pointers. Performance regressions were seen too,
possibly due to the size increase or the additional overhead of
pushing onto the dlist.
#2 failed because it was difficult to agree on what the heuristics
would be, and we had no efficient way to determine the maximum number
of locks that a given transaction held at any one time; we only know
how many were still held at LockReleaseAll().

There were also some suggestions to fix dynahash's hash_seq_search()
slowness, and a suggestion to try using simplehash.h instead of
dynahash.c. Unfortunately, simplehash.h would suffer the same issue,
as it too would have to skip over empty buckets in a sparsely
populated hash table.

I'd like to revive this effort as I have a new idea on how to solve the problem.

# Background

Over in [1] I'm trying to improve the performance of smgropen() during
recovery. The slowness there comes from the dynahash table lookups to
find the correct SMgrRelation. On that thread I proposed to use
simplehash instead of dynahash because it's a good bit faster and
greatly reduces the hash lookup overhead during recovery. One problem
there is that the relcache keeps a pointer into the SMgrRelation
(RelationData.rd_smgr), and because simplehash moves entries around
during inserts and deletes, nothing can safely point to simplehash
entries; they're unstable. I fixed that on the other thread by having
the simplehash entry point to a palloc'd SMgrRelationData... My
problem is, I don't really like that idea, as it means we need to
palloc() and pfree() lots of little chunks of memory.

To fix the above, I really think we need a version of simplehash that
has stable pointers. Provided that implementation is faster than
dynahash, it will help solve the smgropen() slowness during recovery.

# A new hashtable implementation

I ended up thinking of this thread again because the stable-pointer
hash implementation that I wrote for [1] happens to be lightning fast
for hash table sequential scans, even if the table has become bloated.
The reason the seq scans are so fast is that the implementation loops
over the data arrays, which are tightly packed and store the actual
data rather than pointers to the data. The code does not need to loop
over the bucket array at all, so how large that array has become is
irrelevant to hash table seq scan performance.

The patch stores elements in "segments", each of which holds some
power-of-2 number of items. When we run out of space to store new
items in a segment, we just allocate another segment. When we remove
items from the table, new items reuse the first unused slot in the
first segment with free space. This helps to keep the used elements
tightly packed. Each segment keeps a bitmap of used items, which makes
scanning all used items very fast. If you flip the bits in the
used_item bitmap, you get a free list for that segment, so it's also
very fast to find a free slot when inserting a new item into the
table.
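To make that concrete, here is a minimal C sketch of the idea, under
my own assumptions about the layout (the struct and function names
below are illustrative, not the actual definitions from the attached
patch):

#include <stdint.h>

#define GH_ITEMS_PER_SEGMENT    256
#define GH_BITS_PER_WORD        64
#define GH_BITMAP_WORDS         (GH_ITEMS_PER_SEGMENT / GH_BITS_PER_WORD)

typedef struct MyElement        /* stand-in for the templated element type */
{
    uint32_t    key;
    uint32_t    value;
} MyElement;

typedef struct GHSegment
{
    uint32_t    nitems;                         /* live items in this segment */
    uint64_t    used_item[GH_BITMAP_WORDS];     /* which slots are occupied */
    MyElement   items[GH_ITEMS_PER_SEGMENT];    /* tightly packed element data */
} GHSegment;

/*
 * Visit every live element in one segment.  Only the bitmap words and the
 * packed items array are touched; the bucket array is never scanned, so how
 * large (or bloated) it has become doesn't affect the cost.
 */
static void
scan_segment(GHSegment *seg, void (*visit) (MyElement *))
{
    for (int w = 0; w < GH_BITMAP_WORDS; w++)
    {
        uint64_t    word = seg->used_item[w];

        while (word != 0)
        {
            int         bit = __builtin_ctzll(word);    /* lowest set bit */

            visit(&seg->items[w * GH_BITS_PER_WORD + bit]);
            word &= word - 1;   /* clear that bit */
        }
    }
}

Finding a free slot for an insert is the same kind of walk over the
inverted bitmap, which is what keeps the segments tightly packed
without much extra work.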

I've called the hash table implementation "generichash". It uses the
same preprocessor tricks as simplehash.h and borrows simplehash's
linear probing code. The bucket array just stores the hash value and a
uint32 index pointing at the segment item that stores the data. Since
segments store a power-of-2 number of items, we can easily derive both
the segment number and the item within the segment from the single
uint32 index value. The 'index' field just stores a special value when
the bucket is empty, so no additional field is needed for that. This
means the bucket type is just 8 bytes wide.
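Roughly, and again just as an illustrative sketch with made-up names
(GHBucket and GH_UNUSED_INDEX are mine, not the patch's), the bucket
and the index decoding look like this:

#include <stdint.h>

#define GH_ITEMS_PER_SEGMENT    256         /* a power of 2 */
#define GH_UNUSED_INDEX         UINT32_MAX  /* hypothetical "empty bucket" marker */

typedef struct GHBucket
{
    uint32_t    hash;       /* hash value of the stored key */
    uint32_t    index;      /* item index, or GH_UNUSED_INDEX when empty */
} GHBucket;                 /* 8 bytes total */

/* With a power-of-2 segment size, these reduce to a shift and a mask. */
static inline uint32_t
gh_segment_number(uint32_t index)
{
    return index / GH_ITEMS_PER_SEGMENT;
}

static inline uint32_t
gh_segment_slot(uint32_t index)
{
    return index % GH_ITEMS_PER_SEGMENT;
}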

One thing I will mention about the new hash table implementation is
that GH_ITEMS_PER_SEGMENT is set to 256 by default, which also makes
that the minimum size of the table. I could drop it to 64, but that's
still quite a bit more than dynahash's default table size of 16. I
think 16 is a bit on the low side and it might be worth making the
minimum 64 anyway; I'd just need to lower GH_ITEMS_PER_SEGMENT. The
current code does not allow it to go any lower, as I've done nothing
to allow partial bitmap words; they're 64 bits on a 64-bit machine.
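In other words, the only hard constraint is that the segment size
stays a whole number of 64-bit bitmap words, roughly (a simplified
sketch, not the patch's actual code):

#define GH_BITS_PER_WORD        64
#define GH_ITEMS_PER_SEGMENT    64  /* hypothetical lowered value */

_Static_assert(GH_ITEMS_PER_SEGMENT % GH_BITS_PER_WORD == 0,
               "segment size must be a whole number of bitmap words");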

I've not done much benchmarking between it and simplehash.h, but I
think in some cases it will be faster. Since the bucket type is just 8
bytes, moving buckets around during inserts and deletes will be
cheaper than with simplehash. Lookups are likely to be a bit slower
due to having to fetch the item within the segment, which is a few
pointer dereferences away.

A quick test shows an improvement when compared to dynahash.

# select count(pg_try_advisory_lock(99999,99999)) from generate_series(1,1000000);

Master:
Time: 190.257 ms
Time: 189.440 ms
Time: 188.768 ms

Patched:
Time: 182.779 ms
Time: 182.267 ms
Time: 186.902 ms

This is just hitting the local lock table. The advisory lock key is
the same each time, so after the first acquisition each call just
finds the existing entry in the local lock table. Also, it's a pretty
small gain, but I'm mostly trying to show that the new hash table is
not any slower than dynahash when probing for inserts.

The real wins come from what we're trying to solve in this thread --
the performance of LockReleaseAll().

Benchmarking results measuring the TPS of a simple select from an
empty table after another transaction has bloated the locallock table.

Master:
127544 tps
113971 tps
123255 tps
121615 tps

Patched:
170803 tps
167305 tps
165002 tps
171976 tps

About 38% faster.

The benchmark I used was:

t1.sql:
\set p 1
select a from t1 where a = :p

hp.sql:
select count(*) from hp

"hp" is a hash partitioned table with 10k partitions.

pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000 postgres

I'm using the query against the hp table to bloat the locallock table.
It's only executed about once in every 100,000 transactions. The tps
numbers above are the ones for t1.sql.

I've not quite looked into why yet, but hp.sql's performance improved
by 58%: it went from an average of 1.061377 on master to an average of
1.683616 with the patched version. I can't quite see where this gain
is coming from. It's pretty hard to get good, stable performance
results out of this AMD machine, so it might be related to that;
that's why I ran 20 threads, which seems slightly better. The machine
seems to have trouble waking up properly for a single thread.

It would be good if someone could repeat the tests to see if the gains
appear on other hardware.

Also, it would be good to hear what people think about solving the
problem this way.

Patch attached.

David

[1] https://www.postgresql.org/message-id/flat/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw(at)mail(dot)gmail(dot)com

Attachment Content-Type Size
v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patch application/octet-stream 51.9 KB
