Re: Speed up transaction completion faster after many relations are accessed in a transaction

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, "Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Speed up transaction completion faster after many relations are accessed in a transaction
Date: 2021-06-20 17:06:47
Message-ID: CALNJ-vQfySdYxLUoUbqg7-r+yhv3MgygOXWb+r_D8qQ7727+Fg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jun 20, 2021 at 6:56 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Wed, 14 Aug 2019 at 19:25, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
> wrote:
> > For now, I'm out of ideas. If anyone else feels like suggesting
> > something of picking this up, feel free.
>
> This is a pretty old thread, so we might need a recap:
>
> # Recap
>
> Basically LockReleaseAll() becomes slow after a large number of locks
> have all been held at once by a backend. The problem is that the
> LockMethodLocalHash dynahash table must grow to store all the locks
> and when later transactions only take a few locks, LockReleaseAll() is
> slow due to hash_seq_search() having to skip the sparsely filled
> buckets in the bloated hash table.
>
> The following things were tried on this thread. Each one failed:
>
> 1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
> Simply loop over the dlist in LockReleaseAll().
> 2) Try dropping and recreating the dynahash table if it becomes
> bloated using some heuristics to decide what "bloated" means and if
> recreating is worthwhile.
>
> #1 failed due to concerns with increasing the size of LOCALLOCK to
> store the dlist pointers. Performance regressions were seen too.
> Possibly due to size increase or additional overhead from pushing onto
> the dlist.
> #2 failed because it was difficult to agree on what the heuristics
> would be and we had no efficient way to determine the maximum number
> of locks that a given transaction held at any one time. We only know
> how many were still held at LockReleaseAll().
>
> There were also some suggestions to fix dynahash's hash_seq_search()
> slowness, and also a suggestion to try using simplehash.h instead of
> dynahash.c. Unfortunately simplehash.h would suffer the same issues as
> it too would have to skip over empty buckets in a sparsely populated
> hash table.
>
> I'd like to revive this effort as I have a new idea on how to solve the
> problem.
>
> # Background
>
> Over in [1] I'm trying to improve the performance of smgropen() during
> recovery. The slowness there comes from the dynahash table lookups to
> find the correct SMgrRelation. Over there I proposed to use simplehash
> instead of dynahash because it's quite a good bit faster and far
> lessens the hash lookup overhead during recovery. One problem on that
> thread is that relcache keeps a pointer into the SMgrRelation
> (RelationData.rd_smgr) and because simplehash moves things around
> during inserts and deletes, then we can't have anything point to
> simplehash entries, they're unstable. I fixed that over on the other
> thread by having the simplehash entry point to a palloced
> SMgrRelationData... My problem is, I don't really like that idea as it
> means we need to palloc() pfree() lots of little chunks of memory.
>
> To fix the above, I really think we need a version of simplehash that
> has stable pointers. Providing that implementation is faster than
> dynahash, then it will help solve the smgropen() slowness during
> recovery.
>
> # A new hashtable implementation
>
> I ended up thinking of this thread again because the implementation of
> the stable pointer hash that I ended up writing for [1] happens to be
> lightning fast for hash table sequential scans, even if the table has
> become bloated. The reason the seq scans are so fast is that the
> implementation loops over the data arrays, which are tightly packed
> and store the actual data rather than pointers to the data. The code
> does not need to loop over the bucket array for this at all, so how
> large that has become is irrelevant to hash table seq scan
> performance.
>
> The patch stores elements in "segments" which is set to some power of
> 2 value. When we run out of space to store new items in a segment, we
> just allocate another segment. When we remove items from the table,
> new items reuse the first unused item in the first segment with free
> space. This helps to keep the used elements tightly packed. A
> segment keeps a bitmap of used items so that means scanning all used
> items is very fast. If you flip the bits in the used_item bitmap,
> then you get a free list for that segment, so it's also very fast to
> find a free element when inserting a new item into the table.
>
> I've called the hash table implementation "generichash". It uses the
> same preprocessor tricks as simplehash.h does and borrows the same
> linear probing code that's used in simplehash. The bucket array just
> stores the hash value and a uint32 index into the segment item that
> stores the data. Since segments store a power of 2 items, we can
> easily address both the segment number and the item within the segment
> from the single uint32 index value. The 'index' field just stores a
> special value when the bucket is empty. No need to add another field
> for that. This means the bucket type is just 8 bytes wide.
>
> One thing I will mention about the new hash table implementation is
> that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means
> that's the minimum size for the table. I could drop this downto 64,
> but that's still quite a bit more than the default size of the
> dynahash table of 16. I think 16 is a bit on the low side and that it
> might be worth making this 64 anyway. I'd just need to lower
> GH_ITEMS_PER_SEGMENT down. The current code does not allow it to go
> lower as I've done nothing to allow partial bitmap words, they're
> 64-bits on a 64-bit machine.
>
> I've not done too much benchmarking between it and simplehash.h, but I
> think in some cases it will be faster. Since the bucket type is just 8
> bytes, moving stuff around during insert/deletes will be cheaper than
> with simplehash. Lookups are likely to be a bit slower due to having
> to lookup the item within the segment, which is a few pointer
> dereferences away.
>
> A quick test shows an improvement when compared to dynahash.
>
> # select count(pg_try_advisory_lock(99999,99999)) from
> generate_series(1,1000000);
>
> Master:
> Time: 190.257 ms
> Time: 189.440 ms
> Time: 188.768 ms
>
> Patched:
> Time: 182.779 ms
> Time: 182.267 ms
> Time: 186.902 ms
>
> This is just hitting the local lock table. The advisory lock key is
> the same each time, so it remains a lock check. Also, it's a pretty
> small gain, but I'm mostly trying to show that the new hash table is
> not any slower than dynahash for probing for inserts.
>
> The real wins come from what we're trying to solve in this thread --
> the performance of LockReleaseAll().
>
> Benchmarking results measuring the TPS of a simple select from an
> empty table after another transaction has bloated the locallock table.
>
> Master:
> 127544 tps
> 113971 tps
> 123255 tps
> 121615 tps
>
> Patched:
> 170803 tps
> 167305 tps
> 165002 tps
> 171976 tps
>
> About 38% faster.
>
> The benchmark I used was:
>
> t1.sql:
> \set p 1
> select a from t1 where a = :p
>
> hp.sql:
> select count(*) from hp
>
> "hp" is a hash partitioned table with 10k partitions.
>
> pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp(dot)sql(at)1 -f t1(dot)sql(at)100000
> postgres
>
> I'm using the query to the hp table to bloat the locallock table. It's
> only executed every 1 in 100,000 queries. The tps numbers above are
> the ones to run t1.sql
>
> I've not quite looked into why yet, but the hp.sql improved
> performance by 58%. It went from an average of 1.061377 in master to
> an average of 1.683616 in the patched version. I can't quite see where
> this gain is coming from. It's pretty hard to get good stable
> performance results out of this AMD machine, so it might be related to
> that. That's why I ran 20 threads. It seems slightly better. The
> machine seems to have trouble waking up properly for a single thread.
>
> It would be good if someone could repeat the tests to see if the gains
> appear on other hardware.
>
> Also, it would be good to hear what people think about solving the
> problem this way.
>
> Patch attached.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/flat/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw(at)mail(dot)gmail(dot)com

Hi,

+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done
so
+ * inside a GH_SEGMENT.

I think the second sentence can be written as (since done means stored, it
is redundant):

Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a
GH_SEGMENT.

+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT

into the segment -> into the segment number (in the code I see segidx but I
wonder if segment index may cause slight confusion).

+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used
item
+ * in the items array */

'A 1-bit' -> One bit (A and 1 mean the same)

+ uint32 first_free_segment;

Since the segment may not be totally free, maybe name the field
first_segment_with_free_slot

+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.

It seems the only difference from GH_NEXT_ONEBIT is in this line:

+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

If a 4th parameter is added to signify whether the flipping should be done,
these two functions can be unified.

+ * next insert will store in this segment. If it's already pointing to
an
+ * earlier segment, then leave it be.

The last sentence is confusing: first_free_segment cannot point to some
segment and earlier segment at the same time.
Maybe drop the last sentence.

Cheers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-06-20 17:19:31 Re: Race condition in InvalidateObsoleteReplicationSlots()
Previous Message Peter Geoghegan 2021-06-20 16:55:36 Re: Teaching users how they can get the most out of HOT in Postgres 14