
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: rebased background worker reimplementation prototype
Date: 2019-07-16 19:16:29
Message-ID: 20190716191629.al652qwkf7nkx537@development
Lists: pgsql-hackers

On Tue, Jul 16, 2019 at 10:53:46AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-07-12 15:47:02 +0200, Tomas Vondra wrote:
>> I've done a bit of benchmarking / testing on this, so let me report some
>> basic results. I haven't done any significant code review, I've simply
>> ran a bunch of pgbench runs on different systems with different scales.
>
>Thanks!
>
>
>> System #1
>> ---------
>> * CPU: Intel i5
>> * RAM: 8GB
>> * storage: 6 x SATA SSD RAID0 (Intel S3700)
>> * autovacuum_analyze_scale_factor = 0.1
>> * autovacuum_vacuum_cost_delay = 2
>> * autovacuum_vacuum_cost_limit = 1000
>> * autovacuum_vacuum_scale_factor = 0.01
>> * bgwriter_delay = 100
>> * bgwriter_lru_maxpages = 10000
>> * checkpoint_timeout = 30min
>> * max_wal_size = 64GB
>> * shared_buffers = 1GB
>
>What's the controller situation here? Can the full SATA3 bandwidth on
>all of those drives be employed concurrently?
>

There's just an on-board SATA controller, so it might be a bottleneck.

A single drive can do ~440 MB/s reads sequentially, and the whole RAID0
array (Linux sw raid) does ~1.6GB/s, so not exactly 6x that. But I don't
think we're generating that many writes during the test.
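
(Those are just rough sequential read numbers; something like the sketch
below is enough to reproduce them. The path is hypothetical, and the page
cache needs to be dropped first - or the file has to be much larger than
RAM - otherwise it just measures memory.)

import time

PATH = "/mnt/raid/bigfile"    # hypothetical: a large file on the RAID0 array
CHUNK = 8 * 1024 * 1024       # read in 8MB chunks

total = 0
start = time.monotonic()
with open(PATH, "rb", buffering=0) as f:       # unbuffered binary reads
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
elapsed = time.monotonic() - start
print("read %d MB in %.1f s => %.1f MB/s"
      % (total // (1024 * 1024), elapsed, total / elapsed / 1024 / 1024))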

>
>> System #2
>> ---------
>> * CPU: 2x Xeon E5-2620v5
>> * RAM: 64GB
>> * storage: 3 x 7.2k SATA RAID0, 1x NVMe
>> * autovacuum_analyze_scale_factor = 0.1
>> * autovacuum_vacuum_cost_delay = 2
>> * autovacuum_vacuum_cost_limit = 1000
>> * autovacuum_vacuum_scale_factor = 0.01
>> * bgwriter_delay = 100
>> * bgwriter_lru_maxpages = 10000
>> * checkpoint_completion_target = 0.9
>> * checkpoint_timeout = 15min
>> * max_wal_size = 32GB
>> * shared_buffers = 8GB
>
>What type of NVMe disk is this? I'm mostly wondering whether it's fast
>enough that there's no conceivable way that IO scheduling is going to
>make a meaningful difference, given other bottlenecks in postgres.
>
>In some preliminary benchmark runs I've seen fairly significant gains on
>SATA and SAS SSDs, as well as spinning rust, but I've not yet
>benchmarked on a decent NVMe SSD.
>

Intel Optane 900P 280GB (model SSDPED1D280GA) [1].

[1] https://ssd.userbenchmark.com/SpeedTest/315555/INTEL-SSDPED1D280GA

I think one of the main improvements in this generation of drives is
good performance with low queue depth. See for example [2].

[2] https://www.anandtech.com/show/12136/the-intel-optane-ssd-900p-480gb-review/5

Not sure if that plays a role here, but I've seen it affect prefetching
and similar things.
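
(To illustrate the low queue depth point - the interesting number for this
kind of drive is the latency of synchronous 4kB random reads, i.e. queue
depth 1. A trivial sketch, with a hypothetical path and assuming a cold
page cache, otherwise the numbers mean nothing:)

import os
import random
import time

PATH = "/mnt/nvme/testfile"   # hypothetical: a large file on the NVMe drive
BLOCK = 4096                  # 4kB reads
COUNT = 10000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
# random block-aligned offsets
offsets = [random.randrange(0, size - BLOCK) // BLOCK * BLOCK
           for _ in range(COUNT)]

start = time.monotonic()
for off in offsets:
    os.pread(fd, BLOCK, off)   # one synchronous read at a time = QD1
elapsed = time.monotonic() - start
os.close(fd)

print("QD1: %.0f IOPS, avg latency %.1f us"
      % (COUNT / elapsed, elapsed / COUNT * 1e6))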

>
>> For each config I've done tests with three scales - small (fits into
>> shared buffers), medium (fits into RAM) and large (at least 2x the RAM).
>> Aside from the basic metrics (throughput etc.) I've also sampled data
>> about 5% of transactions, to be able to look at latency stats.
>>
>> The tests were done on master and patched code (both in the 'legacy' and
>> new mode).
>
>
>
>> I haven't done any temporal analysis yet (i.e. I'm only looking at global
>> summaries, not tps over time etc).
>
>FWIW, I'm working on a tool that generates correlated graphs of OS, PG,
>pgbench stats. Especially being able to correlate the kernel's
>'Writeback' stats (grep Writeback: /proc/meminfo) and latency is very
>valuable. Sampling wait events over time also is worthwhile.
>

Good to know, although I don't think it's difficult to fetch the data
from sar and plot it. I might even already have ugly bash scripts doing
that, somewhere.
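
(For completeness, sampling the writeback counters Andres mentioned is also
trivial - a sketch that dumps a timestamped CSV which can then be joined
against the pgbench transaction log; the interval and fields are arbitrary:)

import time

FIELDS = ("Dirty", "Writeback")
INTERVAL = 1.0                 # sampling interval in seconds

print("timestamp," + ",".join(f + "_kB" for f in FIELDS))
while True:                    # stop with Ctrl-C
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                stats[key] = int(rest.split()[0])   # value is in kB
    print("%d,%s" % (time.time(),
                     ",".join(str(stats[f]) for f in FIELDS)), flush=True)
    time.sleep(INTERVAL)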

>
>> When running on the 7.2k SATA RAID, the throughput improves with the
>> medium scale - from ~340tps to ~439tps, which is a pretty significant
>> jump. But on the large scale this disappears (in fact, it seems to be a
>> bit lower than master/legacy cases). Of course, all this is just from a
>> single run (although 4h, so noise should even out).
>
>Any chance there's an order-of-test factor here? In my tests I found two
>related issues very important: 1) the first few tests are slower,
>because WAL segments don't yet exist. 2) Some poor bugger of a later
>test will get hit with anti-wraparound vacuums, even if otherwise not
>necessary.
>

Not sure - I'll check, but I find it unlikely. I need to repeat the
tests to get multiple runs anyway.
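
(The anti-wraparound part is at least easy to check after the fact, e.g.
by comparing datfrozenxid age to autovacuum_freeze_max_age after each run.
A sketch using psycopg2, with a made-up connection string:)

import psycopg2

conn = psycopg2.connect("dbname=postgres")   # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           current_setting('autovacuum_freeze_max_age')::int AS freeze_max
      FROM pg_database
     ORDER BY age(datfrozenxid) DESC
""")
for datname, xid_age, freeze_max in cur.fetchall():
    # anything close to 100% would have triggered anti-wraparound vacuums
    print("%-20s age = %10d (%.0f%% of autovacuum_freeze_max_age)"
          % (datname, xid_age, 100.0 * xid_age / freeze_max))
conn.close()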

>The fact that the master and "legacy" numbers differ significantly
>e.g. in the "xeon sata scale 1000" latency CDF does make me wonder
>whether there's an effect like that. While there might be some small
>performance difference due to different stats message sizes, and a few
>additional branches, I don't see how it could be that noticeable.
>

That's about the one case where things like anti-wraparound vacuums are
pretty much impossible, because the SATA storage is so slow ...

>
>> I've also computed latency CDF (from the 5% sample) - I've attached this
>> for the two interesting cases mentioned in the previous paragraph. This
>> shows that with the medium scale the latencies move down (with the patch,
>> both in the legacy and "new" modes), while on large scale the "new" mode
>moves a bit to the right (to higher values).
>
>Hm. I can't yet explain that.
>
>
>> And finally, I've looked at buffer stats, i.e. number of buffers written
>> in various ways (checkpoint, bgwriter, backends) etc. Interestingly
>> enough, these numbers did not change very much - especially on the flash
>> storage. Maybe that's expected, though.
>
>Some of that is expected, e.g. because file extensions count as backend
>writes, and are going to roughly correlate with throughput, and not
>much else. But they're more similar than I'd actually expect.
>
>I do see a pretty big difference in the number of bgwriter-written
>buffers in the "new" case for scale 10000, on the NVMe?
>

Right.
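
(For context, collecting those buffer stats boils down to snapshotting
pg_stat_bgwriter before and after each run and subtracting. A rough sketch,
again with a made-up DSN - run it once before the benchmark and redirect to
a file, then once after with that file as the argument:)

import json
import sys

import psycopg2

COLS = ("buffers_checkpoint", "buffers_clean", "maxwritten_clean",
        "buffers_backend", "buffers_backend_fsync", "buffers_alloc")

conn = psycopg2.connect("dbname=postgres")   # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT %s FROM pg_stat_bgwriter" % ", ".join(COLS))
snap = dict(zip(COLS, cur.fetchone()))
conn.close()

if len(sys.argv) > 1:
    # second invocation: print deltas against the saved "before" snapshot
    before = json.load(open(sys.argv[1]))
    for col in COLS:
        print("%-25s %12d" % (col, snap[col] - before[col]))
else:
    # first invocation: dump the snapshot (redirect to a file)
    json.dump(snap, sys.stdout)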

>For the SATA SSD case, I wonder if the throughput bottleneck is WAL
>writes. I see much more noticeable differences if I enable
>wal_compression or disable full_page_writes, because otherwise the bulk
>of the volume is WAL data. But even in that case, I see a latency
>stddev reduction with the new bgwriter around checkpoints.
>

I may try that during the next round of tests.

>
>> The one case where it did change is the "medium" scale on SATA storage,
>> where the throughput improved with the patch. But the change is kinda
>> strange, because the number of buffers evicted by the bgwriter decreased
>> (and instead it got evicted by the checkpointer). Which might explain the
>> higher throughput, because the checkpointer is probably more efficient.
>
>Well, one problem with the current bgwriter implementation is that the
>victim selection isn't good. Because it doesn't perform clock sweep, and
>doesn't clean buffers with a usagecount, it'll often run until it finds
>a dirty buffer that's pretty far ahead of the clock hand, and clean
>those. But with a random test like pgbench it's somewhat likely that
>those buffers will get re-dirtied before backends actually get to
>reusing them (that's a problem with the new implementation too, the
>window just is smaller). But I'm far from sure that that's the cause here.
>

OK.

Time for more tests, I guess.
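
(In case anyone wants to reproduce the latency CDFs from the sampled
transaction logs, a rough sketch - assuming the non-aggregated log format
from pgbench -l --sample-rate=0.05, where the third field is the
transaction latency in microseconds:)

import sys

latencies = []
for path in sys.argv[1:]:
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                latencies.append(int(fields[2]))   # latency in microseconds

if not latencies:
    sys.exit("no transactions found")

latencies.sort()
n = len(latencies)
print("latency_ms,cdf")
for i, lat in enumerate(latencies):
    print("%.3f,%.6f" % (lat / 1000.0, (i + 1) / n))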

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
