Re: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-03-01 20:40:27
Message-ID: 9beaac79-2375-8bfc-489b-eb62bd8d4020@enterprisedb.com
Lists: pgsql-hackers

Hi,

I've performed some additional benchmarking and testing on the patches
sent on 26/1 [1], and I'd like to share some interesting results.

I did the tests on two different machines, with slightly different
configurations. Both machines use the same CPU generation (with slightly
different frequencies), but a different OS (Ubuntu vs. RH), kernel (5.3
vs. 4.18) and so on. A more detailed description is in the attached PDF,
along with the PostgreSQL configuration.

The benchmark is fairly simple - pgbench with scale 500 (fits into
shared buffers) and 5000 (fits into RAM). The runs were just 1 minute
each, which is fairly short - that's intentional, however, because I've
done this with both full_page_writes=on and off to test how the patches
behave with many FPIs and with none. This models the extreme behaviors
at the beginning and at the end of a checkpoint.

This thread is rather confusing because there are far too many patches
with overlapping version numbers - even [1] contains two very different
patches. I'll refer to them as "NTT / buffer" (for the patch using one
large PMEM buffer) and "NTT / segments" (for the patch using regular WAL
segments).

The attached PDF shows all these results along with charts. The two
systems have somewhat different performance (throughput), but the
conclusions seem to be mostly the same, so I'll just talk about results
from one of the systems here (aka "System A").

Note: Those systems are hosted / provided by Intel SDP, and Intel is
interested in providing access to other devs interested in PMEM.

Furthermore, these patches seem to be very insensitive to WAL segment
size (unlike the experimental patches I shared some time ago), so I'll
only show results for one WAL segment size. (Obviously, the NTT / buffer
patch can't be sensitive to this by definition, as it's not using WAL
segments at all.)

Results
-------

For scale 500, the results (with full_page_writes=on) look like this:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              9411   58833  111453  181681  215552  234099
NTT / buffer       10837   77260  145251  222586  255651  264207
NTT / segments     11011   76892  145049  223078  255022  269737

So there is a fairly nice speedup - about 30%, which is consistent with
the results shared before. Moreover, the "NTT / segments" patch performs
about the same as the "NTT / buffer" patch, which is encouraging.

For scale 5000, the results look like this:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              7388   42020   64523   91877  102805  111389
NTT / buffer        8650   58018   96314  132440  139512  134228
NTT / segments      8614   57286   97173  138435  157595  157138

That's intriguing - the speedup is even higher, almost 40-60% with
enough clients (16-64). For me this is a bit surprising, because in this
case the data don't fit into shared_buffers, so extra time needs to be
spent copying data between RAM and shared_buffers and perhaps even doing
some writes. So my expectation was that this would increase the amount
of time spent outside the XLOG code, thus diminishing the speedup.

Now, let's look at results with full_page_writes=off. For scale 500 the
results are:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master             10476   67191  122191  198620  234381  251452
NTT / buffer       11119   79530  148580  229523  262142  275281
NTT / segments     11528   79004  148978  229714  259798  274753

and on scale 5000:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              8192   55870   98451  145097  172377  172907
NTT / buffer        9063   62659  110868  161352  173977  164359
NTT / segments      9277   63226  112307  166070  171997  158085

That is, the speedup with scale 500 drops to ~10%, and for scale 5000
it disappears almost entirely.

I'd have expected that without FPIs the patches would actually be more
effective - so this seems interesting. The conclusion, however, seems to
be that the lower the amount of FPIs in the WAL stream, the smaller the
speedup. Put differently - the patches are most effective right after a
checkpoint, and the benefit decreases as the checkpoint progresses. So
in a well-tuned system with significant distance between checkpoints,
the speedup seems to be fairly limited.

This is also consistent with the fact that for scale 5000 (with FPW=on)
the speedups are much more significant, simply because there are far
more pages (and thus FPIs). Also, after disabling FPWs the speedup
almost entirely disappears.

On the second system, the differences are even more significant (see the
PDF). I suspect this is due to the slightly different hardware config
(slower CPU, different PMEM capacity, etc.). The overall behavior and
conclusions are, however, the same, I think.

Of course, another question is how this will be affected by newer PMEM
versions with higher performance (e.g. the new generation of Intel PMEM
should be ~20% faster, from what I hear).

Issues & Questions
------------------

While testing the "NTT / segments" patch, I repeatedly managed to crash
the cluster with errors like this:

2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
logfile segment just before mapping; path "pg_wal/000000010000000700000030"
...
2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
logfile segment just before mapping; path "pg_wal/000000010000000700000030"
2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
open file "pg_wal/000000010000000700000030": No such file or directory

I do believe this is a thinko in the 0008 patch, which does XLogFileInit
in XLogFileMap. Notice there are multiple "creating logfile" messages
for the ..0030 segment, followed by the failure. AFAICS XLogFileMap may
be called from multiple backends, so they may call XLogFileInit
concurrently, likely triggering some sort of race condition. It's a
fairly rare issue, though - I've only seen it twice in ~20 runs.
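
To illustrate the kind of handling I'd expect (purely a sketch, not the
patch's actual code - the function name below is made up, and the real
thing would of course go through the usual XLogFileInit / fd.c machinery):
segment creation has to tolerate losing the race to another backend, e.g.

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    /*
     * Illustrative sketch only: open a WAL segment, creating it if it
     * does not exist yet, while tolerating a concurrent backend creating
     * it at the same time.  Returns a file descriptor, or -1 on error.
     */
    static int
    open_or_create_segment(const char *path)
    {
        int     fd = open(path, O_RDWR, 0);

        if (fd < 0 && errno == ENOENT)
        {
            /* Try to create the segment ourselves ... */
            fd = open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);

            /* ... but if another backend won the race, just use its file. */
            if (fd < 0 && errno == EEXIST)
                fd = open(path, O_RDWR, 0);
        }

        return fd;
    }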

The other question I have is about WALInsertLockUpdateInsertingAt. 0003
removes this function, but leaves behind some of the other bits working
with insert locks and insertingAt. But it does not explain how it works
without WaitXLogInsertionsToFinish() - how does it ensure that when we
commit something, all the preceding WAL is "complete" (i.e. written by
other backends etc.)?
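
For reference, this is the ordering upstream relies on - a simplified
sketch of the flush path in xlog.c, not the exact signatures (in
particular, flush_wal_up_to() and write_and_fsync_wal() are made-up
stand-ins for what XLogFlush / XLogWrite actually do):

    /*
     * Simplified sketch of upstream's flush path, for reference only.
     */
    static void
    flush_wal_up_to(XLogRecPtr upto)
    {
        /*
         * Wait until every in-progress insertion below 'upto' has
         * finished copying its record into the WAL buffers ...
         */
        upto = WaitXLogInsertionsToFinish(upto);

        /* ... and only then write/fsync WAL up to that point. */
        write_and_fsync_wal(upto);      /* stand-in for XLogWrite() */
    }

Without that wait (or something equivalent), a commit record could be
made durable while an earlier record from another backend is still being
copied into the (mapped) buffers.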

Conclusion
----------

I do think the "NTT / segments" patch is the most promising way forward.
It performs about as well as the "NTT / buffer" patch (and both perform
much better than the experimental patches I shared in January).

The "NTT / buffer" patch seems much more disruptive - it introduces one
large buffer for WAL, which makes various other tasks more complicated
(i.e. it needs additional complexity to handle WAL archival, etc.). Are
there some advantages of this patch (compared to the other patch)?

As for the "NTT / segments" patch, I wonder if we can just rework the
code like this (to use mmap etc.) or whether we need to support both
ways (file I/O and mmap). I don't have much experience with many other
platforms, but it seems quite possible that mmap won't work all that
well on some of them. So my assumption is we'll need to support both
file I/O and mmap to make any of this committable, but I may be wrong.
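
Just to sketch what I mean by supporting both (all names here are made
up for illustration, this is not a concrete proposal):

    #include <stddef.h>     /* size_t */
    #include <sys/types.h>  /* off_t */

    /*
     * Hedged sketch only: a minimal "WAL media" vtable that would let
     * the same xlog code run either on plain file I/O or on mmap'd
     * (PMEM) segments, selected at startup.
     */
    typedef struct WalMediaOps
    {
        /* open (and possibly map) one WAL segment, return opaque handle */
        void   *(*open_segment) (const char *path);

        /* copy 'len' bytes of WAL data into the segment at 'offset' */
        void    (*write) (void *handle, const void *buf, size_t len,
                          off_t offset);

        /* make a byte range durable: fdatasync() vs. pmem_persist() */
        void    (*flush) (void *handle, off_t offset, size_t len);

        /* unmap / close the segment */
        void    (*close_segment) (void *handle);
    } WalMediaOps;

One implementation would be backed by open/pwrite/fdatasync, the other
by mmap/memcpy/pmem_persist, chosen based on a GUC and on whether
libpmem is available at build time.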

[1]
https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment            Content-Type      Size
pmem-benchmarks.pdf   application/pdf   459.5 KB
