Re: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-03-05 08:08:46
Message-ID: CAOwnP3OfT=9FDgpP=syHwkMq3+GrGO-0kkiam2zjTeDpFaJd8g@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,

Thank you so much for your report. I have read it with great interest.

Your conclusion sounds reasonable to me. The patchset you call "NTT /
segments" performed as well as the "NTT / buffer" patchset. I had been
worried that calling mmap/munmap for each WAL segment file could add a
lot of overhead, but based on your performance tests the overhead looks
smaller than I thought. In addition, the "NTT / segments" patchset is
more compatible with the current PG and friendlier to DBAs, because it
keeps using WAL segment files and does not introduce any other new
WAL-related files.
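
Just for illustration, here is a rough sketch of the per-segment
mapping I have in mind (simplified, not the actual patch code; it
assumes Linux with the segment file on a DAX filesystem):

/*
 * Rough sketch only (not the actual patch code): map one WAL segment
 * file that lives on a PMEM-backed (DAX) filesystem.  With MAP_SYNC,
 * flushing CPU caches is enough to persist the stores, without an
 * extra fsync().  MAP_SHARED_VALIDATE and MAP_SYNC need Linux >= 4.15
 * and, depending on the libc, may have to come from <linux/mman.h>.
 * Error handling and the non-DAX fallback are omitted for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static char *
map_wal_segment(const char *path, size_t segsize)
{
    int     fd;
    char   *addr;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    addr = mmap(NULL, segsize, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    close(fd);                  /* the mapping survives close() */

    return (addr == MAP_FAILED) ? NULL : addr;
}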

I also think that supporting both file I/O and mmap is better than
supporting only mmap. I will continue my work on the "NTT / segments"
patchset so that it supports both ways.

In the following, I will answer the "Issues & Questions" you reported.

> While testing the "NTT / segments" patch, I repeatedly managed to crash the cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating logfile segment just before
> mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating logfile segment just before
> mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating logfile segment just before
> mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open file
> "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit in XLogFileMap. Notice there are multiple
> "creating logfile" messages with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
> called from multiple backends, so they may call XLogFileInit concurrently, likely triggering some sort of race
> condition. It's fairly rare issue, though - I've only seen it twice from ~20 runs.

Thank you for your report. I found that it is rather the 0009 patch
that has an issue, and that it can also cause WAL loss. I should have
set use_existent to true; otherwise InstallXLogFileSegment and
BasicOpenFile in XLogFileInit can be racy. I had wrongly assumed that
use_existent could be false because I am creating a brand-new file
with XLogFileInit.

I will fix the issue.
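
To illustrate the fix I have in mind (only a sketch, not the final
patch; the function name below is made up, and the XLogFileInit
signature is the one currently in xlog.c):

/*
 * Passing use_existent = true lets XLogFileInit reuse a segment that
 * another backend has created and installed concurrently, instead of
 * racing in InstallXLogFileSegment and then failing in BasicOpenFile.
 */
static int
XLogFileMapCreateIfNeeded(XLogSegNo segno)
{
    bool        use_existent = true;    /* reuse segment made by another backend */

    return XLogFileInit(segno, &use_existent, true /* use_lock */);
}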

> The other question I have is about WALInsertLockUpdateInsertingAt. 0003 removes this function, but leaves
> behind some of the other bits working with insert locks and insertingAt. But it does not explain how it works without
> WaitXLogInsertionsToFinish() - how does it ensure that when we commit something, all the preceding WAL is
> "complete" (i.e. written by other backends etc.)?

It waits for *all* the WALInsertLocks to be released, no matter whether
each of them precedes or follows the current insertion.
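
In other words, something along these lines (a simplified sketch of
the idea, not the exact patch code):

/*
 * Simplified sketch: taking and releasing every insertion lock
 * guarantees that any insertion that had started before us has
 * finished copying its record into the WAL buffers.
 */
static void
WaitForAllWALInsertions(void)
{
    for (int i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
    {
        LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
        LWLockRelease(&WALInsertLocks[i].l.lock);
    }
}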

That would have worked functionally, but on reflection it is not good
for performance, because XLogFileMap in GetXLogBuffer (from which
WaitXLogInsertionsToFinish was removed) can block: it can eventually
call write() in XLogFileInit.

I will restore the WALInsertLockUpdateInsertingAt function and related
code for mmap.
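
For reference, the function to be restored is roughly the following
(quoting from memory of xlog.c, so treat it as a sketch; the real one
also handles the case where a backend holds all insertion locks). Each
backend advertises how far it has already copied its record, so that
WaitXLogInsertionsToFinish() can compute the LSN up to which the WAL is
known to be complete without waiting for the lock itself to be
released.

static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
{
    LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
                    &WALInsertLocks[MyLockNo].l.insertingAt,
                    insertingAt);
}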

Best regards,
Takashi

On Tue, Mar 2, 2021 at 5:40 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> Hi,
>
> I've performed some additional benchmarking and testing on the patches
> sent on 26/1 [1], and I'd like to share some interesting results.
>
> I did the tests on two different machines, with slightly different
> configurations. Both machines use the same CPU generation with slightly
> different frequency, a different OS (Ubuntu vs. RH), kernel (5.3 vs.
> 4.18) and so on. A more detailed description is in the attached PDF,
> along with the PostgreSQL configuration.
>
> The benchmark is fairly simple - pgbench with scale 500 (fits into
> shared buffers) and 5000 (fits into RAM). The runs were just 1 minute
> each, which is fairly short - it's however intentional, because I've
> done this with both full_page_writes=on and off to test how this behaves
> with many FPIs and with none. This models extreme behaviors at the beginning
> and at the end of a checkpoint.
>
> This thread is rather confusing because there are far too many patches
> with overlapping version numbers - even [1] contains two very different
> patches. I'll refer to them as "NTT / buffer" (for the patch using one
> large PMEM buffer) and "NTT / segments" for the patch using regular WAL
> segments.
>
> The attached PDF shows all these results along with charts. The two
> systems have a bit different performance (throughput), but the conclusions
> seem to be mostly the same, so I'll just talk about results from one of
> the systems here (aka "System A").
>
> Note: Those systems are hosted / provided by Intel SDP, and Intel is
> interested in providing access to other devs interested in PMEM.
>
> Furthermore, these patches seem to be very insensitive to WAL segment
> size (unlike the experimental patches I shared some time ago), so I'll
> only show results for one WAL segment size. (Obviously, the NTT / buffer
> patch can't be sensitive to this by definition, as it's not using WAL
> segments at all.)
>
>
> Results
> -------
>
> For scale 500, the results (with full_page_writes=on) look like this:
>
> clients                1       8      16      32      48      64
> ----------------------------------------------------------------
> master              9411   58833  111453  181681  215552  234099
> NTT / buffer       10837   77260  145251  222586  255651  264207
> NTT / segments     11011   76892  145049  223078  255022  269737
>
> So there is a fairly nice speedup - about 30%, which is consistent with
> the results shared before. Moreover, the "NTT / segments" patch performs
> about the same as the "NTT / buffer" which is encouraging.
>
> For scale 5000, the results look like this:
>
> clients                1       8      16      32      48      64
> ----------------------------------------------------------------
> master              7388   42020   64523   91877  102805  111389
> NTT / buffer        8650   58018   96314  132440  139512  134228
> NTT / segments      8614   57286   97173  138435  157595  157138
>
> That's intriguing - the speedup is even higher, almost 40-60% with
> enough clients (16-64). For me this is a bit surprising, because in this
> case the data don't fit into shared_buffers, so extra time needs to be
> spent copying data between RAM and shared_buffers and perhaps even doing
> some writes. So my expectation was that this increases the amount of
> time spent outside XLOG code, thus diminishing the speedup.
>
> Now, let's look at results with full_page_writes=off. For scale 500 the
> results are:
>
> clients                1       8      16      32      48      64
> ----------------------------------------------------------------
> master             10476   67191  122191  198620  234381  251452
> NTT / buffer       11119   79530  148580  229523  262142  275281
> NTT / segments     11528   79004  148978  229714  259798  274753
>
> and on scale 5000:
>
> clients                1       8      16      32      48      64
> ----------------------------------------------------------------
> master              8192   55870   98451  145097  172377  172907
> NTT / buffer        9063   62659  110868  161352  173977  164359
> NTT / segments      9277   63226  112307  166070  171997  158085
>
> That is, the speedup with scale 500 drops to ~10%, and for scale 5000
> it disappears almost entirely.
>
> I'd have expected that without FPIs the patches would actually be more
> effective - so this seems interesting. The conclusion however seems to
> be that the lower the amount of FPIs in the WAL stream, the smaller the
> speedup. Or in a different way - it's most effective right after a
> checkpoint, and it decreases during the checkpoint. So in a well tuned
> system with significant distance between checkpoints, the speedup seems
> to be fairly limited.
>
> This is also consistent with the fact that for scale 5000 (with FPW=on)
> the speedups are much more significant, simply because there are far
> more pages (and thus FPIs). Also, after disabling FPWs the speedup
> almost entirely disappears.
>
> On the second system, the differences are even more significant (see the
> PDF). I suspect this is due to a slightly different hardware config with
> slower CPU / different PMEM capacity, etc. The overall behavior and
> conclusions are however the same, I think.
>
> Of course, another question is how this will be affected by newer PMEM
> versions with higher performance (e.g. the new generation of Intel PMEM
> should be ~20% faster, from what I hear).
>
>
> Issues & Questions
> ------------------
>
> While testing the "NTT / segments" patch, I repeatedly managed to crash
> the cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
> logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
> open file "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> in XLogFileMap. Notice there are multiple "creating logfile" messages
> with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
> may be called from multiple backends, so they may call XLogFileInit
> concurrently, likely triggering some sort of race condition. It's fairly
> rare issue, though - I've only seen it twice from ~20 runs.
>
>
> The other question I have is about WALInsertLockUpdateInsertingAt. 0003
> removes this function, but leaves behind some of the other bits working
> with insert locks and insertingAt. But it does not explain how it works
> without WaitXLogInsertionsToFinish() - how does it ensure that when we
> commit something, all the preceding WAL is "complete" (i.e. written by
> other backends etc.)?
>
>
> Conclusion
> ----------
>
> I do think the "NTT / segments" patch is the most promising way forward.
> It does perform about as well as the "NTT / buffer" patch (and both
> perform much better than the experimental patches I shared in January).
>
> The "NTT / buffer" patch seems much more disruptive - it introduces one
> large buffer for WAL, which makes various other tasks more complicated
> (i.e. it needs additional complexity to handle WAL archival, etc.). Are
> there some advantages of this patch (compared to the other patch)?
>
> As for the "NTT / segments" patch, I wonder if we can just rework the
> code like this (to use mmap etc.) or whether we need to support both
> ways (file I/O and mmap). I don't have much experience with many
> other platforms, but it seems quite possible that mmap won't work all
> that well on some of them. So my assumption is we'll need to support
> both file I/O and mmap to make any of this committable, but I may be wrong.
>
>
> [1]
> https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
