Re: [PoC] Non-volatile WAL buffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-01-25 02:56:15
Message-ID: CAD21AoBkFfu-sjNHeUkeT1yDKAeo+scV1Ld0eGj8GQxn7QtM1A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>
> >> Hi,
> >>
> >> I think I've managed to get the 0002 patch [1] rebased to master and
> >> working (with help from Masahiko Sawada). It's not clear to me how it
> >> could have worked as submitted - my theory is that an incomplete patch
> >> was submitted by mistake, or something like that.
> >>
> >> Unfortunately, the benchmark results were kinda disappointing. For a
> >> pgbench on scale 500 (fits into shared buffers), an average of three
> >> 5-minute runs looks like this:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >>
> >> NTT refers to the patch from September 10, pre-allocating a large WAL
> >> file on PMEM, and simple-no-buffers is the simpler patch simply removing
> >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
> >>
> >> Note: The patch is just replacing the old implementation with mmap.
> >> That's good enough for experiments like this, but we probably want to
> >> keep the old one for setups without PMEM. But it's good enough for
> >> testing, benchmarking etc.
> >>
> >> Unfortunately, the results for this simple approach are pretty bad. Not
> >> only compared to the "ntt" patch, but even to master. I'm not entirely
> >> sure what's the root cause, but I have a couple hypotheses:
> >>
> >> 1) bug in the patch - That's clearly a possibility, although I've
> >> tried to eliminate this possibility.
> >>
> >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
> >> NVMe storage, but still much slower than DRAM (both in terms of latency
> >> and bandwidth, see [2] for some data). It's not terrible, but the
> >> latency is maybe 2-3x higher - not a huge difference, but may matter for
> >> WAL buffers?
> >>
> >> 3) PMEM does not handle parallel writes well - If you look at [2],
> >> Figure 4(b), you'll see that the throughput actually *drops* as the
> >> number of threads increases. That's pretty strange / annoying, because
> >> that's how we write into WAL buffers - each thread writes its own data,
> >> so parallelism is not something we can get rid of.
> >>
> >> I've added some simple profiling, to measure number of calls / time for
> >> each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
> >> for each backend, and logs the counts every 1M ops.
> >>
> >> Typical stats from a concurrent run look like this:
> >>
> >> xlog stats cnt 43000000
> >> map cnt 100 time 5448333 unmap cnt 100 time 3730963
> >> memcpy cnt 985964 time 1550442272 len 15150499
> >> memset cnt 0 time 0 len 0
> >> persist cnt 13836 time 10369617 len 16292182
> >>
> >> The times are in nanoseconds, so this says the backend did 100 mmap and
> >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
> >> taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
> >> copying about 15MB of data. That's quite a lot :-(
> >
> > It might also be interesting if we can see how much time is spent in
> > each logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().
> >
>
> Yeah, we could extend it to that, that's a fairly mechanical thing. But
> maybe that could be visible in a regular perf profile. Also, I suppose
> most of the time will be used by the pmem calls, shown in the stats.
>
> >>
> >> My conclusion from this is that eliminating WAL buffers and writing WAL
> >> directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
> >> right approach.
> >>
> >> I suppose we should keep WAL buffers, and then just write the data to
> >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
> >> except that it allocates one huge file on PMEM and writes to that
> >> (instead of the traditional WAL segments).
> >>
> >> So I decided to try how it'd work with writing to regular WAL segments,
> >> mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
> >> and the results look a bit nicer:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 7291    87704   165310   150437   224186
> >> ntt                    7912   106095   213206   212410   237819
> >> simple-no-buffers      7654    96544   115416    95828   103065
> >> with-wal-buffers       7477    95454   181702   140167   214715
> >>
> >> So, much better than the version without WAL buffers, somewhat better
> >> than master (except for 64/96 clients), but still not as good as NTT.
> >>
> >> At this point I was wondering how could the NTT patch be faster when
> >> it's doing roughly the same thing. I'm sure there are some differences,
> >> but it seemed strange. The main difference seems to be that it only maps
> >> one large file, and only once. OTOH the alternative "simple" patch maps
> >> segments one by one, in each backend. Per the debug stats the map/unmap
> >> calls are fairly cheap, but maybe it interferes with the memcpy somehow.
> >>
> >
> > While looking at the two methods: NTT and simple-no-buffer, I realized
> > that in XLogFlush(), NTT patch flushes (by pmem_flush() and
> > pmem_drain()) WAL without acquiring WALWriteLock whereas
> > simple-no-buffer patch acquires WALWriteLock to do that
> > (pmem_persist()). I wonder if this also affected the performance
> > differences between those two methods since WALWriteLock serializes
> > the operations. With PMEM, can multiple backends flush records
> > concurrently if their memory regions do not overlap? If so, flushing
> > WAL without WALWriteLock would be a big benefit.
> >
>
> That's a very good question - it's quite possible the WALWriteLock is
> not really needed, because the processes are actually "writing" the WAL
> directly to PMEM. So it's a bit confusing, because it's only really
> concerned about making sure it's flushed.
>
> And yes, multiple processes certainly can write to PMEM at the same
> time, in fact it's a requirement to get good throughput I believe. My
> understanding is we need ~8 processes, at least that's what I heard from
> people with more PMEM experience.

Thanks, that's good to know.

>
> TBH I'm not convinced the "simple-no-buffer" code (coming from the
> 0002 patch) is actually correct. Essentially, consider a backend that
> needs to do a flush, but does not have a segment mapped. So it maps it
> and calls pmem_drain() on it.
>
> But does that actually flush anything? Does it properly flush changes
> done by other processes that may not have called pmem_drain() yet? I
> find this somewhat suspicious and I'd bet all processes that did write
> something have to call pmem_drain().

Yeah, in terms of experiments, at least it's good to have found out that
the approach of mmap-ing each WAL segment does not perform well.
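
To put rough numbers on that, here is a minimal Python sketch comparing the two mmap-based patches at 16MB and 1GB segments. The TPS values are copied from the tables quoted in this message; the names SEG_16MB, SEG_1GB and speedup() are only illustrative. Note that the master numbers also shifted between the two runs, so the ratios are indicative rather than a clean measurement of mmap overhead.

```python
# TPS per client count, copied from the benchmark tables quoted in this
# message (pgbench scale 500, average of three 5-minute runs).
CLIENTS = [1, 16, 32, 64, 96]

SEG_16MB = {
    "simple-no-buffers": [7654, 96544, 115416, 95828, 103065],
    "with-wal-buffers": [7477, 95454, 181702, 140167, 214715],
}
SEG_1GB = {
    "simple-no-buffers": [7871, 101575, 199403, 188074, 224716],
    "with-wal-buffers": [7643, 101056, 206911, 223860, 261712],
}


def speedup(branch):
    """Ratio of 1GB-segment TPS to 16MB-segment TPS, per client count."""
    return [big / small for small, big in zip(SEG_16MB[branch], SEG_1GB[branch])]


if __name__ == "__main__":
    for branch in SEG_16MB:
        ratios = ", ".join(
            f"{c}: {r:.2f}x" for c, r in zip(CLIENTS, speedup(branch))
        )
        print(f"{branch:>18}  {ratios}")
```

For simple-no-buffers the ratio at 96 clients comes out a bit above 2x, which matches the impression that the per-segment mapping hurts most at high concurrency.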

>
>
> >> So I did an experiment by increasing the size of the WAL segments. I
> >> chose to try with 512MB and 1024MB, and the results with 1GB look like this:
> >>
> >> branch                    1       16       32       64       96
> >> ----------------------------------------------------------------
> >> master                 6635    88524   171106   163387   245307
> >> ntt                    7909   106826   217364   223338   242042
> >> simple-no-buffers      7871   101575   199403   188074   224716
> >> with-wal-buffers       7643   101056   206911   223860   261712
> >>
> >> So yeah, there's a clear difference. It changes the values for "master"
> >> a bit, but both the "simple" patches (with and without WAL buffers) are
> >> much faster. The with-wal-buffers is almost equal to the NTT patch,
> >> which was using a 96GB file. I presume larger WAL segments would get even
> >> closer, if we supported them.
> >>
> >> I'll continue investigating this, but my conclusion so far seems to be
> >> that we can't really replace WAL buffers with PMEM - that seems to
> >> perform much worse.
> >>
> >> The question is what to do about the segment size. Can we reduce the
> >> overhead of mmap-ing individual segments, so that this works even for
> >> smaller WAL segments, to make this useful for common instances (not
> >> everyone wants to run with 1GB WAL)? Or do we need to adopt the
> >> design with a large file, mapped just once?
> >>
> >> Another question is whether it's even worth the extra complexity. On
> >> 16MB segments the difference between master and NTT patch seems to be
> >> non-trivial, but increasing the WAL segment size kinda reduces that. So
> >> maybe just using file I/O on a PMEM DAX filesystem is good enough.
> >> Alternatively, maybe we could switch to libpmemblk, which should
> >> eliminate the filesystem overhead at least.
> >
> > I think the performance improvement from the NTT patch with the 16MB
> > WAL segment, the most common WAL segment size, is very good (150437
> > vs. 212410 with 64 clients). But evaluating writing WAL segment files
> > on a PMEM DAX filesystem may also be worthwhile, as you mentioned, if
> > we haven't done that yet.
> >
>
> Well, not sure. I think the question is still open whether it's actually
> safe to run on DAX, which does not have atomic writes of 512B sectors,
> and I think we rely on that e.g. for pg_control. But maybe for WAL that's
> not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

>
> > Also, I'm interested in why the throughput of NTT patch saturated at
> > 32 clients, which is earlier than the master's one (96 clients). How
> > many CPU cores are there on the machine you used?
> >
>
> From what I know, this is somewhat expected for PMEM devices, for a
> bunch of reasons:
>
> 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
> it takes fewer processes to saturate it.
>
> 2) Internally, the PMEM has a 256B buffer for writes, used for combining
> etc. With too many processes sending writes, the access pattern starts to
> look more random, which is harmful for throughput.
>
> When combined, this means the performance starts dropping at a certain
> number of threads, and the optimal number of threads is rather low
> (something like 5-10). This is very different behavior compared to DRAM.

Makes sense.
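
For reference, the saturation I was asking about can be quantified from the 16MB-segment numbers quoted at the top of this message. The following is just a minimal Python sketch; the TPS values are copied from that table, and the helper names gain_pct() and ratio() are made up for illustration.

```python
# TPS per client count for the 16MB-segment runs quoted in this message
# (pgbench scale 500, average of three 5-minute runs).
CLIENTS = [1, 16, 32, 64, 96]
TPS = {
    "master": [7291, 87704, 165310, 150437, 224186],
    "ntt": [7912, 106095, 213206, 212410, 237819],
}


def gain_pct(branch, lo_clients, hi_clients):
    """Percent change in TPS for a branch between two client counts."""
    lo = TPS[branch][CLIENTS.index(lo_clients)]
    hi = TPS[branch][CLIENTS.index(hi_clients)]
    return 100.0 * (hi - lo) / lo


def ratio(branch_a, branch_b, clients):
    """TPS of branch_a divided by TPS of branch_b at one client count."""
    i = CLIENTS.index(clients)
    return TPS[branch_a][i] / TPS[branch_b][i]


if __name__ == "__main__":
    for branch in TPS:
        print(f"{branch:>6}: 32 -> 96 clients {gain_pct(branch, 32, 96):+.1f}%")
    print(f"ntt / master at 64 clients: {ratio('ntt', 'master', 64):.2f}x")
```

Going from 32 to 96 clients, ntt gains roughly 12% while master gains roughly 36%, so the NTT patch is already close to its ceiling at 32 clients. That is consistent with a device that is saturated by a small number of writers.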

>
> There's a nice overview and measurements in this paper:
>
> Building blocks for persistent memory / How to get the most out of your
> new memory?
> Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
> Kemper
>
> https://link.springer.com/article/10.1007/s00778-020-00622-9

Thank you. I'll read it.

>
>
> >> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
> >> huge read-write asymmetry (the writes being way slower), and their
> >> recommendation (in "Observation 3") is:
> >>
> >> The read-write asymmetry of PMem implies the necessity of avoiding
> >> writes as much as possible for PMem.
> >>
> >> So maybe we should not be trying to use PMEM for WAL, which is pretty
> >> write-heavy (and in most cases even write-only).
> >
> > I think using PMEM for WAL is cost-effective, but it leverages only
> > the low-latency (sequential) writes, not other abilities such as
> > fine-grained access and low-latency random writes. If we want to
> > exploit all of its abilities we might need some drastic changes to the
> > logging protocol while considering storing data on PMEM.
> >
>
> True. I think investigating whether it's sensible to use PMEM for this
> purpose is worthwhile. It may turn out that replacing the DRAM WAL
> buffers with writes directly to PMEM is not economical, and aggregating
> data in a DRAM buffer is better :-(

Yes. I think it might be interesting to analyze the bottlenecks of the
NTT patch with perf etc. If the bottlenecks move to other places after
removing WALWriteLock during flush, that's probably a good sign for
further performance improvements. IIRC WALWriteLock is one of the main
bottlenecks on OLTP workloads, although my memory might already be out
of date.
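
As a very rough starting point, the per-backend debug stats quoted earlier in this message can already be broken down by share of time. This is only a minimal Python sketch: the counters are copied from that output, while the STATS layout and the breakdown() helper are hypothetical. It covers just the pmem-level operations, not lock waits or the XLog* functions.

```python
# Counters copied from the "xlog stats" output quoted in this message.
# Times are in nanoseconds, lengths in bytes. The memset line is left out
# because its count is 0 and a per-call average would divide by zero.
STATS = {
    "map": {"cnt": 100, "time_ns": 5448333, "len": 0},
    "unmap": {"cnt": 100, "time_ns": 3730963, "len": 0},
    "memcpy": {"cnt": 985964, "time_ns": 1550442272, "len": 15150499},
    "persist": {"cnt": 13836, "time_ns": 10369617, "len": 16292182},
}


def breakdown(stats):
    """Per-operation share of total time plus per-call averages."""
    total_ns = sum(op["time_ns"] for op in stats.values())
    rows = {}
    for name, op in stats.items():
        rows[name] = {
            "share_pct": 100.0 * op["time_ns"] / total_ns,
            "ns_per_call": op["time_ns"] / op["cnt"],
            "bytes_per_call": op["len"] / op["cnt"],
        }
    return rows


if __name__ == "__main__":
    for name, row in breakdown(STATS).items():
        print(
            f"{name:>8}: {row['share_pct']:5.1f}% of time, "
            f"{row['ns_per_call']:10.1f} ns/call, "
            f"{row['bytes_per_call']:8.1f} bytes/call"
        )
```

With these numbers memcpy accounts for almost 99% of the measured time, at about 1.5 microseconds per call for an average of about 15 bytes, so a perf profile of the copy path itself seems like the natural next step.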

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
