Re: patch to allow disable of WAL recycling

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2018-08-27 01:59:34
Message-ID: CAEepm=2QXmF9xDmGDyMtoEeTEH6=jcf=b8--yLzdeVzBfVLVuA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> zfs (Linux)
> -----------
> On scale 200, there's pretty much no difference.

Speculation: It could be that the dnode and/or indirect blocks that point
to data blocks are falling out of memory in my test setup[1] but not in
yours. I don't know, but I guess those blocks compete with regular data
blocks in the ARC? If so it might come down to ARC size and the amount of
other data churning through it.
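
One rough way to test that theory on a ZFS-on-Linux box would be to watch how
much of the ARC is holding metadata versus file data while the benchmark runs.
A quick Python sketch along these lines should do it; the kstat path and field
names are just what I'd expect to find on ZoL and they vary between releases
(illumos would need kstat instead), so treat it as a starting point only:

#!/usr/bin/env python3
# Rough look at ARC metadata vs. data usage via the kstat file that
# ZFS-on-Linux exposes.  Field names differ between ZoL releases, so
# anything missing is simply skipped.

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def read_arcstats(path=ARCSTATS):
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            # data lines look like "<name> <type> <value>"
            if len(parts) == 3 and parts[2].isdigit():
                stats[parts[0]] = int(parts[2])
    return stats

def mib(n):
    return n / (1024 * 1024)

if __name__ == "__main__":
    s = read_arcstats()
    for key in ("size", "arc_meta_used", "arc_meta_limit",
                "data_size", "metadata_size", "dnode_size"):
        if key in s:
            print(f"{key:>15}: {mib(s[key]):10.1f} MiB")

If the metadata figure is bouncing off its limit while the test runs, that
would support the idea that indirect blocks are being evicted.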

Further speculation: Other filesystems have equivalent data structures,
but for example XFS jams that data into the inode itself in a compact
"extent list" format[2] if it can, to avoid the need for an external
btree. Hmm, I wonder if that format tends to be used for our segment
files. Since cached inodes are reclaimed in a different way than cached
data pages, I wonder if that makes them more sticky in the face of high
data churn rates (or I guess less, depending on your Linux
vfs_cache_pressure setting and number of active files). I suppose the
combination of those two things, sticky inodes with internalised extent
lists, might make it more likely that we can overwrite an old file without
having to fault anything in.
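
If someone wants to check whether XFS really keeps the segment files' extent
maps inline, one crude approach is to count extents per WAL segment -- a small
count per file suggests the inline extent list format rather than the external
btree. Here's a sketch using filefrag (from e2fsprogs; xfs_bmap would work
too), with the caveat that the output parsing is best-effort:

#!/usr/bin/env python3
# Count extents per WAL segment with filefrag.  Few extents per file
# means XFS can probably keep the extent list inside the inode instead
# of spilling out to a btree.

import re
import subprocess
import sys
from pathlib import Path

HEXDIGITS = set("0123456789ABCDEF")

def extent_count(path):
    out = subprocess.run(["filefrag", str(path)], capture_output=True,
                         text=True, check=True).stdout
    m = re.search(r"(\d+) extents? found", out)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    wal_dir = Path(sys.argv[1] if len(sys.argv) > 1 else "pg_wal")
    # WAL segment names are 24 hex characters
    segs = [p for p in sorted(wal_dir.iterdir())
            if len(p.name) == 24 and set(p.name) <= HEXDIGITS]
    counts = [c for c in (extent_count(p) for p in segs) if c is not None]
    if counts:
        print(f"{len(counts)} segments: min={min(counts)} "
              f"max={max(counts)} avg={sum(counts) / len(counts):.1f} extents")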

One big difference between your test rig and mine is that your Optane 900P
claims to do about half a million random IOPS. That is about half a
million more IOPS than my spinning disks. (Actually I used my 5400 RPM
steam-powered machine deliberately for that test: I disabled fsync so that
commit rate wouldn't be slowed down but cache misses would be obvious. I
guess Joyent's storage is somewhere between these two extremes...)
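
To take the hardware out of the picture entirely, the two code paths could
also be compared with a standalone microbenchmark: overwrite an existing 16MB
file in place (roughly what recycling amounts to) versus create and zero-fill
a brand new one. This is only a sketch, not the real XLogFileInit() path, and
the file names are made up; to see the cold-metadata effect you'd also have to
evict caches between passes, or use far more segments than fit in the ARC:

#!/usr/bin/env python3
# Crude comparison of the two WAL segment allocation strategies:
# overwriting an existing file in place (recycling) vs. creating and
# zero-filling a new one.

import os
import time

SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size
BLOCK = 8192
ZEROS = b"\0" * BLOCK

def overwrite_existing(path):
    # Like recycling: the file already exists, we just write over it.
    t0 = time.monotonic()
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(SEG_SIZE // BLOCK):
            os.write(fd, ZEROS)
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - t0

def create_new(path):
    # Like recycling turned off: allocate a brand new zero-filled file.
    t0 = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        for _ in range(SEG_SIZE // BLOCK):
            os.write(fd, ZEROS)
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - t0

if __name__ == "__main__":
    os.close(os.open("old_seg", os.O_WRONLY | os.O_CREAT, 0o600))
    overwrite_existing("old_seg")            # warm-up: fill it to full size
    print("overwrite:", overwrite_existing("old_seg"))
    if os.path.exists("new_seg"):
        os.unlink("new_seg")
    print("create   :", create_new("new_seg"))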

> On scale 2000, the
> throughput actually decreased a bit, by about 5% - from the chart it
> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
> for some reason.

Huh.

> I have no idea what happened at the largest scale (8000) - on master
> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
> minutes (but not fully). Without WAL reuse there's no such drop,
> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers). I'm not sure what to
> think about this, I wonder if it might be caused by almost filling the
> disk space, or something like that. I'm rerunning this with scale 600.

There are lots of reports of ZFS performance degrading when free space gets
below something like 20%.

[1]
https://www.postgresql.org/message-id/CAEepm%3D2pypg3nGgBDYyG0wuCH%2BxTWsAJddvJUGBNsDiyMhcXaQ%40mail.gmail.com
[2]
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure/tmp/en-US/html/Data_Extents.html

--
Thomas Munro
http://www.enterprisedb.com
