Re: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
Cc: Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-01-06 17:16:38
Message-ID: 3b16d44c-85cc-499d-9277-2cd052938228@enterprisedb.com
Lists: pgsql-hackers

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch                    1       16       32       64       96
----------------------------------------------------------------
master                 7291    87704   165310   150437   224186
ntt                    7912   106095   213206   212410   237819
simple-no-buffers      7654    96544   115416    95828   103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this (testing, benchmarking
etc.), but we probably want to keep the old one for setups without
PMEM.
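
To make the "simple-no-buffers" idea a bit more concrete, here is a rough sketch of writing a record straight into an mmap-ed segment file. This is not code from any of the patches. It uses plain POSIX mmap/memcpy/msync as a stand-in for the pmem_memcpy/pmem_persist calls, so it runs without PMEM hardware; SEG_SIZE and the seg_map/seg_write/seg_unmap helpers are made-up names for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, tiny "segment" size for the demo (real segments are 16MB+). */
#define SEG_SIZE ((size_t) 1024 * 1024)

/* Create/open the file, size it, and map it read-write. NULL on failure. */
static char *
seg_map(const char *path)
{
    int   fd = open(path, O_RDWR | O_CREAT, 0600);
    void *addr;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) SEG_SIZE) != 0)
    {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return (addr == MAP_FAILED) ? NULL : (char *) addr;
}

/*
 * Copy one record into the mapping and flush the pages it touched.
 * On real PMEM the memcpy + msync pair would be the pmem_* calls instead
 * (assumption: msync is only a portable substitute for this sketch).
 */
static int
seg_write(char *seg, size_t off, const void *rec, size_t len)
{
    size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
    size_t start = off - (off % pagesz);    /* msync needs a page-aligned start */

    assert(seg != NULL);
    if (len > SEG_SIZE || off > SEG_SIZE - len)
        return -1;              /* record would not fit into this segment */
    memcpy(seg + off, rec, len);
    return msync(seg + start, (off - start) + len, MS_SYNC);
}

/* Drop the mapping; this is the per-segment "unmap" step. */
static int
seg_unmap(char *seg)
{
    return munmap(seg, SEG_SIZE);
}
```

A caller would do seg_map() once per segment, call seg_write() for each record, and seg_unmap() when switching to the next segment. There is no intermediate buffer, so every backend copies directly into the mapped file.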

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate it.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.
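
For illustration, the kind of per-backend accounting described above could look roughly like this. It is a sketch, not the code from the patch: OpStats, now_ns, timed_memcpy, stats_due and stats_format are hypothetical names, and plain memcpy stands in for the PMEM copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* One counter set per operation type (map, unmap, memcpy, persist, ...). */
typedef struct OpStats
{
    uint64_t cnt;               /* number of calls */
    uint64_t time_ns;           /* accumulated wall-clock time, nanoseconds */
    uint64_t len;               /* accumulated bytes handled */
} OpStats;

/* Monotonic clock reading in nanoseconds. */
static uint64_t
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/* memcpy wrapper that accounts the call in *st. */
static void *
timed_memcpy(OpStats *st, void *dst, const void *src, size_t len)
{
    uint64_t start = now_ns();
    void    *ret = memcpy(dst, src, len);

    st->cnt++;
    st->time_ns += now_ns() - start;
    st->len += len;
    return ret;
}

/* True when it is time to log, i.e. every 'interval' operations. */
static int
stats_due(uint64_t total_ops, uint64_t interval)
{
    return total_ops > 0 && total_ops % interval == 0;
}

/* Format one line in the same "name cnt N time N len N" shape. */
static int
stats_format(const char *name, const OpStats *st, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%s cnt %llu time %llu len %llu",
                    name,
                    (unsigned long long) st->cnt,
                    (unsigned long long) st->time_ns,
                    (unsigned long long) st->len);
}
```

Each backend would keep its own OpStats instances, so no locking is needed, and would emit the formatted lines whenever stats_due() fires (every 1M ops in the runs here).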

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(
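
To put those counters in perspective, the per-call averages can be derived directly from the logged numbers. The helper below is only a small sketch of that arithmetic (OpSample and the function names are made up); it takes the counters at face value. For the memcpy line it works out to roughly 1.6 microseconds and about 15 bytes per call, i.e. an effective rate of only about 10 MB/s within this one backend.

```c
#include <stdint.h>

/* One "cnt / time / len" triple as logged in the stats above. */
typedef struct OpSample
{
    uint64_t cnt;               /* number of calls */
    uint64_t time_ns;           /* total time in nanoseconds */
    uint64_t len;               /* total bytes */
} OpSample;

/* Values copied from the "memcpy" and "persist" lines of the sample log. */
static const OpSample memcpy_sample = {985964ULL, 1550442272ULL, 15150499ULL};
static const OpSample persist_sample = {13836ULL, 10369617ULL, 16292182ULL};

/* Average time per call, in nanoseconds. */
static double
avg_ns_per_call(const OpSample *s)
{
    return s->cnt ? (double) s->time_ns / (double) s->cnt : 0.0;
}

/* Average bytes handled per call. */
static double
avg_bytes_per_call(const OpSample *s)
{
    return s->cnt ? (double) s->len / (double) s->cnt : 0.0;
}

/* Effective rate in MB/s (10^6 bytes per second of accumulated time). */
static double
mb_per_sec(const OpSample *s)
{
    if (s->time_ns == 0)
        return 0.0;
    return ((double) s->len / 1e6) / ((double) s->time_ns / 1e9);
}
```

The same helpers applied to the persist line give well under a microsecond per call, which is why the copy itself, not the flush, stands out in this profile.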

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch                    1       16       32       64       96
----------------------------------------------------------------
master                 7291    87704   165310   150437   224186
ntt                    7912   106095   213206   212410   237819
simple-no-buffers      7654    96544   115416    95828   103065
with-wal-buffers       7477    95454   181702   140167   214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how could the NTT patch be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like this:

branch                    1       16       32       64       96
----------------------------------------------------------------
master                 6635    88524   171106   163387   245307
ntt                    7909   106826   217364   223338   242042
simple-no-buffers      7871   101575   199403   188074   224716
with-wal-buffers       7643   101056   206911   223860   261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers patch is almost equal to the NTT
patch, which was using a 96GB file. I presume larger WAL segments would
get even closer, if we supported them.

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments and is useful for common instances (not everyone
wants to run with 1GB WAL segments)? Or do we need to adopt the design
with one large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

    The read-write asymmetry of PMem implies the necessity of avoiding
    writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I'll continue investigating this, but I'd welcome some feedback and
thoughts about this.

Attached are:

* patches.tgz - all three patches discussed here, rebased to master

* bench.tgz - benchmarking scripts / config files I used

* pmem.pdf - charts illustrating results between the patches, and also
showing the impact of the increased WAL segments

regards

[1]
https://www.postgresql.org/message-id/000001d5dff4%24995ed180%24cc1c7480%24%40hco.ntt.co.jp_1

[2] https://arxiv.org/pdf/2005.07658.pdf (Lessons learned from the early
performance evaluation of Intel Optane DC Persistent Memory in DBMS)

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment    Content-Type                    Size
bench.tgz     application/x-compressed-tar    10.0 KB
patches.tgz   application/x-compressed-tar    50.1 KB
PMEM.pdf      application/pdf                 302.8 KB