From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: 'Tomas Vondra' <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: 'Robert Haas' <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2019-11-28 03:18:59
Message-ID: OSBPR01MB32072C1FB12EC977B9C430C4EF470@OSBPR01MB3207.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k(dot)jamison(at)fujitsu(dot)com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <tomas(dot)vondra(at)2ndquadrant(dot)com>
> >> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for workload that allocates and
> >> > invalidates lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads.
> >> Not only do you have the overhead of the hash table operations, but
> >> you also have locking overhead around that. A whole new set of
> >> LWLocks where you have to take and release one of them every time you
> >> allocate or invalidate a buffer seems likely to cause a pretty substantial
> contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this
> patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >since most of
> > of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now
> >
> >Thanks for the advice on benchmark test. Please refer below for test and
> results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers]
> >pgbench -i -s 3125 cachetest
> >pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123
> >latency average = 1.089 ms
> >tps = 14691.436343 (including connections establishing)
> >tps = 14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327
> >latency average = 1.084 ms
> >tps = 14753.814648 (including connections establishing)
> >tps = 14753.861589 (excluding connections establishing)
> >
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
> >
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
> # 16 clients
> pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
> # 1 client
> pgbench -n -S -c 1 -M prepared -T 60 test
>
> and average from 30 runs of each looks like this:
>
> # clients      master     patched          %
> ---------------------------------------------
>          1      29690       27833      93.7%
>         16     300935      283383      94.1%
>
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).

I updated the patch to reduce the lock contention on the new LWLock.
The partition counts are now tunable definitions in the code, and
instead of using only the rnode as the hash key, the modulo of the
block number is also factored into the partition mapping:
#define NUM_MAP_PARTITIONS_FOR_REL 128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL 4 /* block-level */
#define NUM_MAP_PARTITIONS \
(NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I ran the read-only benchmark again; the regression now sits at 3.10%
(down from v3's 6%).

Average of 10 runs, 16 clients
read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24 (including connections establishing)
tps = 199,189.54 (excluding connections establishing)

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52 (including connections establishing)
tps = 193,010.76 (excluding connections establishing)

I also checked the wait event statistics (non-impactful events omitted)
and got the results below. I reset the stats before running the pgbench
script, then captured them right after the run.

[Master]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 25116 | 49552452
IO | DataFileRead | 14467109 | 92113056
LWLock | buffer_mapping | 204618 | 1364779

[Patch V4]
wait_event_type | wait_event | calls | microsec
-----------------+-----------------------+----------+----------
Client | ClientRead | 111393 | 68773946
IO | DataFileRead | 14186773 | 90399833
LWLock | buffer_mapping | 463844 | 4025198
LWLock | cached_buf_tranche_id | 83390 | 336080

Judging from these stats, the accumulated wait time on the
buffer_mapping LWLock is roughly 3x higher in the patched version, with
about 2.3x as many calls. I'd like to continue working on this patch
for the next commitfest and further reduce its impact on read-only
workloads.

Regards,
Kirk Jamison

Attachment Content-Type Size
v4-Optimize-dropping-of-relation-buffers-using-dlist.patch application/octet-stream 21.4 KB
