From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: 'Tomas Vondra' <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: 'Robert Haas' <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2019-11-28 03:18:59
Message-ID: OSBPR01MB32072C1FB12EC977B9C430C4EF470@OSBPR01MB3207.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k(dot)jamison(at)fujitsu(dot)com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <tomas(dot)vondra(at)2ndquadrant(dot)com>
> >> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for workload that allocates and
> >> > invalidates lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads.
> >> Not only do you have the overhead of the hash table operations, but
> >> you also have locking overhead around that. A whole new set of
> >> LWLocks where you have to take and release one of them every time you
> >> allocate or invalidate a buffer seems likely to cause a pretty substantial
> contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this
> patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >since most of
> > of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now
> >
> >Thanks for the advice on benchmark test. Please refer below for test and
> results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers]
> >pgbench -i -s 3125 cachetest
> >pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123
> >latency average = 1.089 ms
> >tps = 14691.436343 (including connections establishing)
> >tps = 14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327
> >latency average = 1.084 ms
> >tps = 14753.814648 (including connections establishing)
> >tps = 14753.861589 (excluding connections establishing)
> >
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
> >
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
> # 16 clients
> pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
> # 1 client
> pgbench -n -S -c 1 -M prepared -T 60 test
>
> and average from 30 runs of each looks like this:
>
> # clients      master     patched          %
> ---------------------------------------------
>          1      29690       27833      93.7%
>         16     300935      283383      94.1%
>
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).

I updated the patch to reduce the lock contention on the new LWLock.
The partition counts are now tunable definitions in the code, and
instead of using only the rnode as the hash key, the modulo of the
block number is also factored into the partition mapping:
#define NUM_MAP_PARTITIONS_FOR_REL 128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL 4 /* block-level */
#define NUM_MAP_PARTITIONS \
(NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I ran the read-only benchmark again; the regression now sits at 3.10%
(down from v3's 6%).

Average of 10 runs, 16 clients
read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24 (including connections establishing)
tps = 199,189.54 (excluding connections establishing)

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52 (including connections establishing)
tps = 193,010.76 (excluding connections establishing)

I also checked the wait event statistics (non-impactful events omitted)
and got the results below. I reset the stats before running the pgbench
script, then captured them right after the run.

[Master]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 25116 | 49552452
IO | DataFileRead | 14467109 | 92113056
LWLock | buffer_mapping | 204618 | 1364779

[Patch V4]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 111393 | 68773946
IO | DataFileRead | 14186773 | 90399833
LWLock | buffer_mapping | 463844 | 4025198
LWLock | cached_buf_tranche_id | 83390 | 336080

Judging from these stats, the accumulated wait time on the
buffer_mapping LWLock is roughly 3x higher in the patched version, with
about 2.3x as many calls. I'd like to continue working on this patch
for the next commitfest and further reduce its impact on read-only
workloads.

Regards,
Kirk Jamison

Attachment Content-Type Size
v4-Optimize-dropping-of-relation-buffers-using-dlist.patch application/octet-stream 21.4 KB
