Re: rebased background worker reimplementation prototype

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: rebased background worker reimplementation prototype
Date: 2019-07-16 17:53:46
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:
> I've done a bit of benchmarking / testing on this, so let me report some
> basic results. I haven't done any significant code review, I've simply
> ran a bunch of pgbench runs on different systems with different scales.


> System #1
> ---------
> * CPU: Intel i5
> * RAM: 8GB
> * storage: 6 x SATA SSD RAID0 (Intel S3700)
> * autovacuum_analyze_scale_factor = 0.1
> * autovacuum_vacuum_cost_delay = 2
> * autovacuum_vacuum_cost_limit = 1000
> * autovacuum_vacuum_scale_factor = 0.01
> * bgwriter_delay = 100
> * bgwriter_lru_maxpages = 10000
> * checkpoint_timeout = 30min
> * max_wal_size = 64GB
> * shared_buffers = 1GB

What's the controller situation here? Can the full SATA3 bandwidth on
all of those drives be employed concurrently?

> System #2
> ---------
> * CPU: 2x Xeon E5-2620v5
> * RAM: 64GB
> * storage: 3 x 7.2k SATA RAID0, 1x NVMe
> * autovacuum_analyze_scale_factor = 0.1
> * autovacuum_vacuum_cost_delay = 2
> * autovacuum_vacuum_cost_limit = 1000
> * autovacuum_vacuum_scale_factor = 0.01
> * bgwriter_delay = 100
> * bgwriter_lru_maxpages = 10000
> * checkpoint_completion_target = 0.9
> * checkpoint_timeout = 15min
> * max_wal_size = 32GB
> * shared_buffers = 8GB

What type of NVMe disk is this? I'm mostly wondering whether it's fast
enough that there's no conceivable way that IO scheduling is going to
make a meaningful difference, given other bottlenecks in postgres.

In some preliminary benchmark runs I've seen fairly significant gains on
SATA and SAS SSDs, as well as spinning rust, but I've not yet
benchmarked on a decent NVMe SSD.

> For each config I've done tests with three scales - small (fits into
> shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
> Aside from the basic metrics (throughput etc.) I've also sampled data
> about 5% of transactions, to be able to look at latency stats.
> The tests were done on master and patched code (both in the 'legacy' and
> new mode).

> I haven't done any temporal analysis yet (i.e. I'm only looking at global
> summaries, not tps over time etc).

FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
pgbench stats. Especially being able to correlate the kernel's
'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
valuable. Sampling wait events over time also is worthwhile.

> When running on the 7.2k SATA RAID, the throughput improves with the
> medium scale - from ~340tps to ~439tps, which is a pretty significant
> jump. But on the large scale this disappears (in fact, it seems to be a
> bit lower than master/legacy cases). Of course, all this is just from a
> single run (although 4h, so noise should even out).

Any chance there's an order-of-test factor here? In my tests I found two
related issues very important: 1) the first few tests are slower,
because WAL segments don't yet exist. 2) Some poor bugger of a later
test will get hit with anti-wraparound vacuums, even if otherwise not

The fact that the master and "legacy" numbers differ significantly
e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
whether there's an effect like that. While there might be some small
performance difference due to different stats message sizes, and a few
additional branches, I don't see how it could be that noticable.

> I've also computed latency CDF (from the 5% sample) - I've attached this
> for the two interesting cases mentioned in the previous paragraph. This
> shows that with the medium scale the latencies move down (with the patch,
> both in the legacy and "new" modes), while on large scale the "new" mode
> moves a bit to the right to higher values).

Hm. I can't yet explain that.

> And finally, I've looked at buffer stats, i.e. number of buffers written
> in various ways (checkpoing, bgwriter, backends) etc. Interestingly
> enough, these numbers did not change very much - especially on the flash
> storage. Maybe that's expected, though.

Some of that is expected, e.g. because file extensions count as backend
writes, and are going to be roughly correlate with throughput, and not
much else. But they're more similar than I'd actually expect.

I do see a pretty big difference in the number of bgwriter written
backends in the "new" case for scale 10000, on the nvme?

For the SATA SSD case, I wonder if the throughput bottleneck is WAL
writes. I see much more noticable differences if I enable
wal_compression or disable full_page_writes, because otherwise the bulk
of the volume is WAL data. But even in that case, I see a latency
stddev reduction with the new bgwriter around checkpoints.

> The one case where it did change is the "medium" scale on SATA storage,
> where the throughput improved with the patch. But the change is kinda
> strange, because the number of buffers evicted by the bgwriter decreased
> (and instead it got evicted by the checkpointer). Which might explain the
> higher throughput, because checkpointer is probably more efficient.

Well, one problem with the current bgwriter implementation is that the
victim selection isn't good. Because it doesn't perform clock sweep, and
doesn't clean buffers with a usagecount, it'll often run until it finds
a dirty buffer that's pretty far ahead of the clock hand, and clean
those. But with a random test like pgbench it's somewhat likely that
those buffers will get re-dirtied before backends actually get to
reusing them (that's a problem with the new implementation too, the
window just is smaller). But I'm far from sure that that's the cause here.


Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thom Brown 2019-07-16 18:21:56 Re: SQL/JSON path issues/questions
Previous Message Andres Freund 2019-07-16 17:26:44 Re: Minimal logical decoding on standbys