Re: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-01-27 08:28:25
Message-ID: CAOwnP3OSzRJr+k2zxX85xnVL0CyN4jErr1LMQaHCX3vqT3A_cw@mail.gmail.com
Lists: pgsql-hackers

Hi,

Now I have caught up with this thread. I see that many of you are
interested in performance profiling.

I'd like to share my slides from SNIA SDC 2020 [1]. In the slides, I
presented profiles focused on XLogInsert and XLogFlush (mainly the latter)
for my non-volatile WAL buffer patchset. I found that the time for XLogWrite
and for locking/unlocking WALWriteLock was eliminated by the patchset.
Instead, XLogInsert and WaitXLogInsertionsToFinish took more (or a little
more) time than before, because memcpy-ing to PMEM (Optane PMem) is slower
than to DRAM. For details, please see the slides.
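
As an aside, for anyone who wants to see the primitive in question: below is
a minimal sketch of a persistent copy via libpmem (PMDK), the kind of
operation behind the memcpy cost above. The path and sizes are made up for
the example (not taken from the patchset), and you'd compile with -lpmem:

    #include <libpmem.h>
    #include <string.h>

    int
    main(void)
    {
        size_t  mapped_len;
        int     is_pmem;
        char    record[8192];   /* stands in for one WAL page of data */

        memset(record, 0, sizeof(record));

        /* map a 16MB file on an (assumed) DAX-mounted filesystem */
        void *dest = pmem_map_file("/mnt/pmem0/wal_segment",
                                   16 * 1024 * 1024, PMEM_FILE_CREATE,
                                   0600, &mapped_len, &is_pmem);
        if (dest == NULL)
            return 1;

        /* copy + flush + drain in one call; on PMEM this costs more
         * than a plain memcpy to DRAM */
        pmem_memcpy_persist(dest, record, sizeof(record));

        pmem_unmap(dest, mapped_len);
        return 0;
    }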

Best regards,
Takashi

[1]
https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020

On Tue, Jan 26, 2021 at 18:50 Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com> wrote:

> Dear everyone, Tomas,
>
> First of all, the "v4" patchset for non-volatile WAL buffer attached to
> the previous mail is actually v5... Please read "v4" as "v5."
>
> Then, to Tomas:
> Thank you for the crash report you gave on Nov 27, 2020, regarding the
> msync patchset. I applied the latest msync patchset v3 attached to the
> previous mail onto master 411ae64 (as of Jan 18, 2021), then tested it, and
> I got no error when running pgbench -i -s 500. Please try it if necessary.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:52 Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com> wrote:
>
>> Dear everyone,
>>
>> Sorry, but I forgot to attach my patchsets... Please see the files
>> attached to this mail. Please also note that they contain some fixes.
>>
>> Best regards,
>> Takashi
>>
>>
>> On Tue, Jan 26, 2021 at 17:46 Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com> wrote:
>>
>>> Dear everyone,
>>>
>>> I'm sorry for the late reply. I have rebased my two patchsets onto the
>>> latest master 411ae64. The one prefixed with v4 is for the non-volatile
>>> WAL buffer; the other, prefixed with v3, is for msync.
>>>
>>> I will reply to your valuable feedback one by one within a few days.
>>> Please wait for a moment.
>>>
>>> Best regards,
>>> Takashi
>>>
>>>
>>> On Mon, Jan 25, 2021 at 11:56 Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>
>>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>> >
>>>> >
>>>> >
>>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
>>>> > > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>> > >>
>>>> > >> Hi,
>>>> > >>
>>>> > >> I think I've managed to get the 0002 patch [1] rebased to master and
>>>> > >> working (with help from Masahiko Sawada). It's not clear to me how it
>>>> > >> could have worked as submitted - my theory is that an incomplete patch
>>>> > >> was submitted by mistake, or something like that.
>>>> > >>
>>>> > >> Unfortunately, the benchmark results were kinda disappointing. For a
>>>> > >> pgbench on scale 500 (fits into shared buffers), an average of three
>>>> > >> 5-minute runs looks like this:
>>>> > >>
>>>> > >> branch                  1      16      32      64      96
>>>> > >> ----------------------------------------------------------
>>>> > >> master               7291   87704  165310  150437  224186
>>>> > >> ntt                  7912  106095  213206  212410  237819
>>>> > >> simple-no-buffers    7654   96544  115416   95828  103065
>>>> > >>
>>>> > >> NTT refers to the patch from September 10, pre-allocating a large WAL
>>>> > >> file on PMEM, and simple-no-buffers is the simpler patch simply
>>>> > >> removing the WAL buffers and writing directly to a mmap-ed WAL segment
>>>> > >> on PMEM.
>>>> > >>
>>>> > >> Note: The patch just replaces the old implementation with mmap. That's
>>>> > >> good enough for experiments like this - testing, benchmarking, etc. -
>>>> > >> but we probably want to keep the old implementation for setups without
>>>> > >> PMEM.
>>>> > >>
>>>> > >> Unfortunately, the results for this simple approach are pretty bad -
>>>> > >> not only compared to the "ntt" patch, but even to master. I'm not
>>>> > >> entirely sure what the root cause is, but I have a couple of
>>>> > >> hypotheses:
>>>> > >>
>>>> > >> 1) bug in the patch - That's clearly a possibility, although I've
>>>> > >> tried to eliminate it.
>>>> > >>
>>>> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster
>>>> > >> than NVMe storage, but still much slower than DRAM (both in terms of
>>>> > >> latency and bandwidth, see [2] for some data). It's not terrible, but
>>>> > >> the latency is maybe 2-3x higher - not a huge difference, but it may
>>>> > >> matter for WAL buffers?
>>>> > >>
>>>> > >> 3) PMEM does not handle parallel writes well - If you look at [2],
>>>> > >> Figure 4(b), you'll see that the throughput actually *drops* as the
>>>> > >> number of threads increases. That's pretty strange / annoying, because
>>>> > >> that's how we write into WAL buffers - each thread writes its own
>>>> > >> data, so parallelism is not something we can get rid of.
>>>> > >>
>>>> > >> I've added some simple profiling to measure the number of calls and
>>>> > >> the time spent for each operation (use -DXLOG_DEBUG_STATS to enable).
>>>> > >> It accumulates data for each backend, and logs the counts every 1M ops.
>>>> > >>
>>>> > >> Typical stats from a concurrent run look like this:
>>>> > >>
>>>> > >> xlog stats cnt 43000000
>>>> > >> map cnt 100 time 5448333 unmap cnt 100 time 3730963
>>>> > >> memcpy cnt 985964 time 1550442272 len 15150499
>>>> > >> memset cnt 0 time 0 len 0
>>>> > >> persist cnt 13836 time 10369617 len 16292182
>>>> > >>
>>>> > >> The times are in nanoseconds, so this says the backend did 100 mmap
>>>> > >> and unmap calls, taking ~10ms in total. There were ~14k pmem_persist
>>>> > >> calls, taking ~10ms in total. And the most time (~1.5s) was used by
>>>> > >> pmem_memcpy copying about 15MB of data. That's quite a lot :-(
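>>>> > >>
>>>> > >> For illustration, such a per-backend accumulator could look roughly
>>>> > >> like this (a sketch with made-up names, not the actual patch code;
>>>> > >> shown here for the memcpy case):
>>>> > >>
>>>> > >>     #include <libpmem.h>
>>>> > >>     #include <stdint.h>
>>>> > >>     #include <time.h>
>>>> > >>
>>>> > >>     /* hypothetical per-backend accumulator, one per operation type */
>>>> > >>     typedef struct xlog_op_stats
>>>> > >>     {
>>>> > >>         uint64_t    cnt;        /* number of calls */
>>>> > >>         uint64_t    time_ns;    /* total time in nanoseconds */
>>>> > >>         uint64_t    len;        /* total bytes processed */
>>>> > >>     } xlog_op_stats;
>>>> > >>
>>>> > >>     static xlog_op_stats memcpy_stats;
>>>> > >>
>>>> > >>     static void
>>>> > >>     timed_pmem_memcpy(void *dst, const void *src, size_t n)
>>>> > >>     {
>>>> > >>         struct timespec t0, t1;
>>>> > >>
>>>> > >>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>>> > >>         pmem_memcpy_nodrain(dst, src, n);   /* the timed operation */
>>>> > >>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>> > >>
>>>> > >>         memcpy_stats.cnt++;
>>>> > >>         memcpy_stats.len += n;
>>>> > >>         memcpy_stats.time_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL
>>>> > >>             + (t1.tv_nsec - t0.tv_nsec);
>>>> > >>     }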
>>>> > >
>>>> > > It might also be interesting to see how much time is spent in each
>>>> > > logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().
>>>> > >
>>>> >
>>>> > Yeah, we could extend it to that; that's a fairly mechanical thing. But
>>>> > maybe that would be visible in a regular perf profile. Also, I suppose
>>>> > most of the time will be used by the pmem calls, shown in the stats.
>>>> >
>>>> > >>
>>>> > >> My conclusion from this is that eliminating WAL buffers and writing
>>>> > >> WAL directly to PMEM (by memcpy to mmap-ed WAL segments) is probably
>>>> > >> not the right approach.
>>>> > >>
>>>> > >> I suppose we should keep WAL buffers, and then just write the data to
>>>> > >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch
>>>> > >> does, except that it allocates one huge file on PMEM and writes to
>>>> > >> that (instead of the traditional WAL segments).
>>>> > >>
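>>>> > >> In code form, the idea is roughly this (a sketch with assumed names,
>>>> > >> not the actual patch): records are still assembled in DRAM WAL
>>>> > >> buffers, and only the write-out path changes.
>>>> > >>
>>>> > >>     #include <libpmem.h>
>>>> > >>
>>>> > >>     static char *mapped_seg;    /* set up once via pmem_map_file() */
>>>> > >>
>>>> > >>     /* replaces the write()/fdatasync() pair in the write-out path */
>>>> > >>     static void
>>>> > >>     write_out(const char *wal_buffer, size_t seg_off, size_t nbytes)
>>>> > >>     {
>>>> > >>         /* copy from the DRAM buffer, then flush + drain */
>>>> > >>         pmem_memcpy_persist(mapped_seg + seg_off, wal_buffer, nbytes);
>>>> > >>     }
>>>> > >>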
>>>> > >> So I decided to try how it'd work with writing to regular WAL
>>>> > >> segments, mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch
>>>> > >> does that, and the results look a bit nicer:
>>>> > >>
>>>> > >> branch                  1      16      32      64      96
>>>> > >> ----------------------------------------------------------
>>>> > >> master               7291   87704  165310  150437  224186
>>>> > >> ntt                  7912  106095  213206  212410  237819
>>>> > >> simple-no-buffers    7654   96544  115416   95828  103065
>>>> > >> with-wal-buffers     7477   95454  181702  140167  214715
>>>> > >>
>>>> > >> So, much better than the version without WAL buffers, somewhat better
>>>> > >> than master (except for 64/96 clients), but still not as good as NTT.
>>>> > >>
>>>> > >> At this point I was wondering how the NTT patch could be faster when
>>>> > >> it's doing roughly the same thing. I'm sure there are some
>>>> > >> differences, but it seemed strange. The main difference seems to be
>>>> > >> that it only maps one large file, and only once. OTOH the alternative
>>>> > >> "simple" patch maps segments one by one, in each backend. Per the
>>>> > >> debug stats the map/unmap calls are fairly cheap, but maybe they
>>>> > >> interfere with the memcpy somehow.
>>>> > >>
>>>> > >
>>>> > > While looking at the two methods, NTT and simple-no-buffer, I realized
>>>> > > that in XLogFlush() the NTT patch flushes WAL (by pmem_flush() and
>>>> > > pmem_drain()) without acquiring WALWriteLock, whereas the
>>>> > > simple-no-buffer patch acquires WALWriteLock to do that
>>>> > > (pmem_persist()). I wonder if this also affected the performance
>>>> > > difference between those two methods, since WALWriteLock serializes
>>>> > > the operations. With PMEM, can multiple backends concurrently flush
>>>> > > the records if the memory regions do not overlap? If so, flushing
>>>> > > WAL without WALWriteLock would be a big benefit.
>>>> > >
>>>> >
>>>> > That's a very good question - it's quite possible the WALWriteLock is
>>>> > not really needed, because the processes are actually "writing" the WAL
>>>> > directly to PMEM. So it's a bit confusing, because the lock is only
>>>> > really concerned with making sure the WAL is flushed.
>>>> >
>>>> > And yes, multiple processes certainly can write to PMEM at the same
>>>> > time; in fact, I believe it's a requirement to get good throughput. My
>>>> > understanding is we need ~8 processes, at least that's what I heard from
>>>> > people with more PMEM experience.
>>>>
>>>> Thanks, that's good to know.
>>>>
>>>> >
>>>> > TBH I'm not convinced the code in the "simple-no-buffer" patch (coming
>>>> > from the 0002 patch) is actually correct. Essentially, consider that the
>>>> > backend needs to do a flush, but does not have a segment mapped. So it
>>>> > maps it and calls pmem_drain() on it.
>>>> >
>>>> > But does that actually flush anything? Does it properly flush changes
>>>> > done by other processes that may not have called pmem_drain() yet? I
>>>> > find this somewhat suspicious and I'd bet all processes that did write
>>>> > something have to call pmem_drain().
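>>>> >
>>>> > To make that concrete, a simplified sketch of the protocol I'd expect
>>>> > (made-up names, not the 0002 code): each writer flushes its own stores,
>>>> > and a drain by a process that wrote nothing covers nothing.
>>>> >
>>>> >     #include <libpmem.h>
>>>> >
>>>> >     static void
>>>> >     wal_write(char *mapped_wal, size_t off, const void *rec, size_t len)
>>>> >     {
>>>> >         /* copy + per-cacheline flush, but no store fence yet */
>>>> >         pmem_memcpy_nodrain(mapped_wal + off, rec, len);
>>>> >     }
>>>> >
>>>> >     static void
>>>> >     wal_flush(void)
>>>> >     {
>>>> >         /* fences only this process's earlier flushes; every process
>>>> >          * that called wal_write() must also call this */
>>>> >         pmem_drain();
>>>> >     }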
>>>>
>>>> Yeah, in terms of experiments, at least, it's good to find out that the
>>>> approach of mmapping each WAL segment does not perform well.
>>>>
>>>> >
>>>> >
>>>> > >> So I did an experiment by increasing the size of the WAL segments. I
>>>> > >> chose to try 512MB and 1024MB, and the results with 1GB look like
>>>> > >> this:
>>>> > >>
>>>> > >> branch                  1      16      32      64      96
>>>> > >> ----------------------------------------------------------
>>>> > >> master               6635   88524  171106  163387  245307
>>>> > >> ntt                  7909  106826  217364  223338  242042
>>>> > >> simple-no-buffers    7871  101575  199403  188074  224716
>>>> > >> with-wal-buffers     7643  101056  206911  223860  261712
>>>> > >>
>>>> > >> So yeah, there's a clear difference. It changes the values for
>>>> > >> "master" a bit, but both the "simple" patches (with and without WAL
>>>> > >> buffers) are much faster. The with-wal-buffers one is almost equal to
>>>> > >> the NTT patch, which was using a 96GB file. I presume larger WAL
>>>> > >> segments would get even closer, if we supported them.
>>>> > >>
>>>> > >> I'll continue investigating this, but my conclusion so far seems to be
>>>> > >> that we can't really replace WAL buffers with PMEM - that seems to
>>>> > >> perform much worse.
>>>> > >>
>>>> > >> The question is what to do about the segment size. Can we reduce the
>>>> > >> overhead of mmap-ing individual segments, so that this works even for
>>>> > >> smaller WAL segments, making it useful for common instances (not
>>>> > >> everyone wants to run with 1GB WAL)? Or do we need to adopt the
>>>> > >> design with a large file, mapped just once?
>>>> > >>
>>>> > >> Another question is whether it's even worth the extra complexity. On
>>>> > >> 16MB segments the difference between master and the NTT patch seems
>>>> > >> to be non-trivial, but increasing the WAL segment size kinda reduces
>>>> > >> that. So maybe just using file I/O on a PMEM DAX filesystem is good
>>>> > >> enough. Alternatively, maybe we could switch to libpmemblk, which
>>>> > >> should eliminate the filesystem overhead at least.
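>>>> > >>
>>>> > >> For reference, the libpmemblk route would look roughly like this (a
>>>> > >> hedged sketch; the pool path and sizes are made up, and you'd link
>>>> > >> with -lpmemblk):
>>>> > >>
>>>> > >>     #include <libpmemblk.h>
>>>> > >>
>>>> > >>     int
>>>> > >>     main(void)
>>>> > >>     {
>>>> > >>         char         block[8192] = {0};     /* one 8K "page" */
>>>> > >>         PMEMblkpool *pbp;
>>>> > >>
>>>> > >>         pbp = pmemblk_create("/mnt/pmem0/walpool", sizeof(block),
>>>> > >>                              128 * 1024 * 1024, 0600);
>>>> > >>         if (pbp == NULL)
>>>> > >>             return 1;
>>>> > >>
>>>> > >>         /* each block write is atomic, even across power failure */
>>>> > >>         if (pmemblk_write(pbp, block, 0) < 0)
>>>> > >>             return 1;
>>>> > >>
>>>> > >>         pmemblk_close(pbp);
>>>> > >>         return 0;
>>>> > >>     }
>>>> > >>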
>>>> > >
>>>> > > I think the performance improvement by the NTT patch with the 16MB WAL
>>>> > > segment, the most common WAL segment size, is very good (150437 vs.
>>>> > > 212410 with 64 clients). But maybe evaluating writing WAL segment
>>>> > > files on a PMEM DAX filesystem is also worthwhile, as you mentioned,
>>>> > > if we haven't done that yet.
>>>> > >
>>>> >
>>>> > Well, not sure. I think the question is still open whether it's actually
>>>> > safe to run on DAX, which does not have atomic writes of 512B sectors,
>>>> > and I think we rely on that e.g. for pg_control. But maybe for WAL
>>>> > that's not an issue.
>>>>
>>>> I think we can use the Block Translation Table (BTT) driver that
>>>> provides atomic sector updates.
>>>>
>>>> >
>>>> > > Also, I'm interested in why the throughput of the NTT patch saturated
>>>> > > at 32 clients, which is earlier than master's (96 clients). How many
>>>> > > CPU cores are there on the machine you used?
>>>> > >
>>>> >
>>>> > From what I know, this is somewhat expected for PMEM devices, for a
>>>> > bunch of reasons:
>>>> >
>>>> > 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
>>>> > it takes fewer processes to saturate it.
>>>> >
>>>> > 2) Internally, the PMEM has a 256B buffer for writes, used for combining
>>>> > etc. With too many processes sending writes, the workload starts to look
>>>> > more random, which is harmful for throughput.
>>>> >
>>>> > Combined, this means the performance starts dropping at a certain
>>>> > number of threads, and the optimal number of threads is rather low
>>>> > (something like 5-10). This is very different behavior compared to DRAM.
>>>>
>>>> Makes sense.
>>>>
>>>> >
>>>> > There's a nice overview and measurements in this paper:
>>>> >
>>>> > Building blocks for persistent memory / How to get the most out of your
>>>> > new memory?
>>>> > Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
>>>> > Kemper
>>>> >
>>>> > https://link.springer.com/article/10.1007/s00778-020-00622-9
>>>>
>>>> Thank you. I'll read it.
>>>>
>>>> >
>>>> >
>>>> > >> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's
>>>> > >> a huge read-write asymmetry (the writes being way slower), and their
>>>> > >> recommendation (in "Observation 3") is:
>>>> > >>
>>>> > >>     The read-write asymmetry of PMem implies the necessity of avoiding
>>>> > >>     writes as much as possible for PMem.
>>>> > >>
>>>> > >> So maybe we should not be trying to use PMEM for WAL, which is pretty
>>>> > >> write-heavy (and in most cases even write-only).
>>>> > >
>>>> > > I think using PMEM for WAL is cost-effective, but it leverages only
>>>> > > the low-latency (sequential) writes, not other abilities such as
>>>> > > fine-grained access and low-latency random writes. If we want to
>>>> > > exploit its full ability, we might need some drastic changes to the
>>>> > > logging protocol while considering storing data on PMEM.
>>>> > >
>>>> >
>>>> > True. I think it's worth investigating whether it's sensible to use
>>>> > PMEM for this purpose. It may turn out that replacing the DRAM WAL
>>>> > buffers with writes directly to PMEM is not economical, and aggregating
>>>> > data in a DRAM buffer is better :-(
>>>>
>>>> Yes. I think it might be interesting to do an analysis of the
>>>> bottlenecks of the NTT patch with perf etc. If the bottlenecks move to
>>>> other places once WALWriteLock is removed during flush, that's probably
>>>> a good sign for further performance improvements. IIRC WALWriteLock is
>>>> one of the main bottlenecks on OLTP workloads, although my memory might
>>>> already be out of date.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Masahiko Sawada
>>>> EDB: https://www.enterprisedb.com/
>>>>
>>>
>>>
>>> --
>>> Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
>>>
>>
>>
>> --
>> Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
>>
>
>
> --
> Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
>

--
Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
