Re: Refactoring the checkpointer's fsync request queue

From: Shawn Debnath <sdn(at)amazon(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2019-02-20 23:27:40
Message-ID: 20190220232739.GA8280@f01898859afd.ant.amazon.com
Lists: pgsql-hackers

As promised, here's a patch that addresses the points discussed by
Andres and Thomas at FOSDEM. As a result of how we want the checkpointer
to track which files to fsync, the pending-ops table now integrates the
fork number and segment number into the hash key, eliminating the need
for the bitmapsets or vectors from the previous iterations. We
reconstruct the pathname from the RelFileNode, ForkNumber and
SegmentNumber and use PathNameOpenFile to get the file descriptor to use
for fsync.
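
Conceptually, the hash key now looks something like the sketch below.
The struct and field names here are illustrative only and may not match
the patch exactly; SegmentNumber stands for whatever integer type the
patch uses for segment numbers.

typedef struct PendingFsyncKey
{
	RelFileNode		rnode;		/* tablespace / database / relation */
	ForkNumber		forknum;	/* MAIN_FORKNUM, FSM_FORKNUM, ... */
	SegmentNumber	segno;		/* segment number within the fork */
} PendingFsyncKey;

Because the segment number is part of the key, a repeat request for the
same segment simply lands on the existing entry, so each physical file
only needs to be opened and synced once per cycle.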

Apart from that, this patch moves the machinery for requesting and
processing fsyncs out of md.c into smgr.c, allowing us to call
smgr-component-specific callbacks to retrieve metadata such as relation
and segment paths. Each smgr component can therefore keep the mapping
from relfilenodes, forks and segments to specific files to itself,
without exposing that knowledge to the generic smgr layer. The patch
redefines smgrsync() to behave more like smgrimmedsync(): if a regular
sync is required for a particular file, the request is either enqueued
locally or forwarded to the checkpointer. smgrimmedsync() retains the
existing behavior and fsyncs the file right away. The processing of
fsync requests has been moved from mdsync() to a new
ProcessFsyncRequests() function.
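
To illustrate the callback idea, here is a rough sketch of what a
per-smgr "give me the path for this sync request" hook could look like;
the hook name and exact signature are made up for illustration and are
not necessarily what the patch uses:

/*
 * Hypothetical smgr callback: turn a sync request back into a path
 * that the checkpointer can open with PathNameOpenFile() and fsync.
 */
typedef char *(*smgr_syncpath_fn) (RelFileNode rnode,
								   ForkNumber forknum,
								   SegmentNumber segno);

static char *
mdsyncpath(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
	char	   *path = relpathperm(rnode, forknum);

	if (segno > 0)
	{
		/* segments after the first carry a ".N" suffix */
		char	   *segpath = psprintf("%s.%u", path, segno);

		pfree(path);
		return segpath;
	}
	return path;
}

The generic sync machinery only ever calls through such a callback, so
md.c (and any future smgr implementation) keeps its file-naming scheme
to itself.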

Testing
-------

Checkpointer stats didn't cover what I wanted to verify, i.e., the time
spent dealing with the pending operations table. So I added temporary
instrumentation to get the numbers by timing the code in
ProcessFsyncRequests, which starts by absorbing fsync requests from the
checkpointer queue, then processes them, and finally issues syncs on the
files. Similarly, I added the same instrumentation to the mdsync code on
the master branch. The time to actually execute FileSync is irrelevant
for this patch.
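
The instrumentation itself is just coarse wall-clock timing around that
code path, roughly along these lines (a simplified sketch, not the
attached patches verbatim):

	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);

	ProcessFsyncRequests();		/* absorb queue, walk table, sync files */

	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	elog(LOG, "processed fsync requests in %.3f ms",
		 INSTR_TIME_GET_MILLISEC(duration));

The same timer brackets the equivalent code in mdsync() on master, so
the numbers below are directly comparable.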

I did two separate runs of 30 minutes each, both with scale=10,000 on
i3.8xlarge instances [1] and default parameters, to force frequent
checkpoints:

1. A single pgbench run with 1000 clients updating 4 tables; as a
result we get 4 relations, with their forks and several segments each,
being synced.

2. 10 parallel pgbench runs on 10 separate databases with 200 clients
each. This touches more relations and more segments, letting us better
compare against the bitmapset optimizations.

Results
--------

The important metric to look at is the total time spent absorbing and
processing the fsync requests, as that is what these changes revolve
around. The other metrics are included for posterity. The new code is
about 6% faster in total time taken to process the queue for the single
pgbench run. For the 10x parallel pgbench run, we see drops of up to 70%
with the patch.

It would be great if some other folks could verify this. The temporary
instrumentation patches, one for the master branch and one that applies
on top of the main patch, are attached. Enable log_checkpoints and then
use grep and cut to extract the numbers from the log file after the
runs.

[Requests Absorbed]

single pgbench run
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch        15144     144961    78628.84     76124     58619   24135.69
master       25728     138422    81455.04     80601     25728   21295.83

10 parallel pgbench runs
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch        45098     282158    155969.4    151603    153049   39990.91
master      191833     602512   416533.86    424946    191833   82014.48

[Files Synced]

single pgbench run
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch          153        166      158.11       158       159       1.86
master         154        166      158.29       159       159      10.29

10 parallel pgbench runs
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch         1540       1662     1556.42      1554      1552      11.12
master        1546       1546        1546      1559      1553      12.79

[Total Time in ProcessFsyncRequests/mdsync]

single pgbench run
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch          500    3833.51     2305.22      2239       500     510.08
master         806    4430.32     2458.77      2382       806     497.01

10 parallel pgbench runs
               Min        Max     Average    Median      Mode    Std Dev
--------  --------  ---------  ----------  --------  --------  ---------
patch          908       6927     3022.58      2863       908     939.09
master        4323      17858    10982.15     11154      4322    2760.47


[1] https://aws.amazon.com/ec2/instance-types/i3/

--
Shawn Debnath
Amazon Web Services (AWS)

Attachment Content-Type Size
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v9.patch text/plain 72.4 KB
mdsync-total-time-instrumentation.patch text/plain 1.5 KB
ProcessFsyncRequests-total-time-instrumentation.patch text/plain 1.6 KB
