Re: [PoC] Non-volatile WAL buffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-02-13 03:18:36
Message-ID: CAD21AoB_FX=Ce_r1rZjFjKqFnW=FaZFh+CvGheh1y7BJMLfGjQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 1/25/21 3:56 AM, Masahiko Sawada wrote:
> >>
> >> ...
> >>
> >> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> >>> ...
> >>>
> >>> While looking at the two methods: NTT and simple-no-buffer, I realized
> >>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and
> >>> pmem_drain()) WAL without acquiring WALWriteLock whereas
> >>> simple-no-buffer patch acquires WALWriteLock to do that
> >>> (pmem_persist()). I wonder if this also affected the performance
> >>> differences between those two methods since WALWriteLock serializes
> >>> the operations. With PMEM, multiple backends can concurrently flush
> >>> the records if the memory region is not overlapped? If so, flushing
> >>> WAL without WALWriteLock would be a big benefit.
> >>>
> >>
> >> That's a very good question - it's quite possible the WALWriteLock is
> >> not really needed, because the processes are actually "writing" the WAL
> >> directly to PMEM. So it's a bit confusing, because it's only really
> >> concerned about making sure it's flushed.
> >>
> >> And yes, multiple processes certainly can write to PMEM at the same
> >> time, in fact it's a requirement to get good throughput I believe. My
> >> understanding is we need ~8 processes, at least that's what I heard from
> >> people with more PMEM experience.
> >
> > Thanks, that's good to know.
> >
> >>
> >> TBH I'm not convinced the code in the "simple-no-buffer" code (coming
> >> from the 0002 patch) is actually correct. Essentially, consider the
> >> backend needs to do a flush, but does not have a segment mapped. So it
> >> maps it and calls pmem_drain() on it.
> >>
> >> But does that actually flush anything? Does it properly flush changes
> >> done by other processes that may not have called pmem_drain() yet? I
> >> find this somewhat suspicious and I'd bet all processes that did write
> >> something have to call pmem_drain().
> >
> For the record, from what I learned / was told by engineers with PMEM
> experience, calling pmem_drain() should properly flush changes done by
> other processes. So it should be sufficient to do that in XLogFlush(),
> from a single process.
>
> My understanding is that we have about three challenges here:
>
> (a) we still need to track how far we flushed, so this needs to be
> protected by some lock anyway (although perhaps a much smaller section
> of the function)
>
> (b) pmem_drain() flushes all the changes, so it flushes even "future"
> part of the WAL after the requested LSN, which may negatively affect
> performance I guess. So I wonder if pmem_persist would be a better fit,
> as it allows specifying a range that should be persisted.
>
> (c) As mentioned before, PMEM behaves differently with concurrent
> access, i.e. it reaches peak throughput with relatively low number of
> threads writing data, and then the throughput drops quite quickly. I'm
> not sure if the same thing applies to pmem_drain() too - if it does, we
> may need something like we have for insertions, i.e. a handful of locks
> allowing limited number of concurrent inserts.

Thanks. That's a good summary.
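
Just to make (a) and (b) concrete for myself, here is a minimal sketch in C
(my own illustration, not code taken from either patch; names such as
NvwalMappedBase, NvwalMappedStartLSN, NvwalFlushedUpto and nvwal_flush_lck
are hypothetical) of a range-based flush that calls pmem_persist() on only
the requested LSN range and keeps just the bookkeeping of the flushed
position under a short spinlock. It ignores WAL page/segment boundaries for
simplicity:

    /*
     * Illustrative sketch only: all names below are hypothetical, not
     * taken from the NTT patch.
     */
    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/spin.h"
    #include <libpmem.h>

    static char        *NvwalMappedBase;      /* start of the mmap'ed NVWAL region */
    static XLogRecPtr   NvwalMappedStartLSN;  /* LSN corresponding to NvwalMappedBase */
    static XLogRecPtr   NvwalFlushedUpto;     /* how far WAL is known to be durable */
    static slock_t      nvwal_flush_lck;      /* protects NvwalFlushedUpto */

    static void
    NvwalFlushUpto(XLogRecPtr upto)
    {
        XLogRecPtr  start;

        /* (a) read the current flushed position under a short lock */
        SpinLockAcquire(&nvwal_flush_lck);
        start = NvwalFlushedUpto;
        SpinLockRelease(&nvwal_flush_lck);

        if (upto <= start)
            return;             /* someone already flushed past this point */

        /*
         * (b) flush only the requested range: pmem_persist() takes an address
         * and a length, unlike pmem_drain() which waits for all pending
         * flushes.  Cache flushing works on physical addresses, so this also
         * covers bytes written into this range by other backends.
         */
        pmem_persist(NvwalMappedBase + (start - NvwalMappedStartLSN),
                     (size_t) (upto - start));

        /* advance the shared flushed position, keeping it monotonic */
        SpinLockAcquire(&nvwal_flush_lck);
        if (NvwalFlushedUpto < upto)
            NvwalFlushedUpto = upto;
        SpinLockRelease(&nvwal_flush_lck);
    }

Whether concurrent pmem_persist() calls from many backends hit the same
throughput cliff as concurrent writes, per (c), is something we would still
have to measure.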

>
>
> > Yeah, in terms of experiments at least it's good to find out that the
> > approach of mmapping each WAL segment is not good for performance.
> >
> Right. The problem with small WAL segments seems to be that each mmap
> causes the TLB to be thrown away, which means a lot of expensive cache
> misses. As the mmap needs to be done by each backend writing WAL, this
> is particularly bad with small WAL segments. The NTT patch works around
> that by doing just a single mmap.
>
> I wonder if we could pre-allocate and mmap small segments, and keep them
> mapped and just rename the underlying files when recycling them. That'd
> keep the regular segment files, as expected by various tools, etc. The
> question is what would happen when we temporarily need more WAL, etc.
>
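
Just to illustrate that recycling idea, here is a hand-wavy sketch of my own
(not code from the patch; the structure and function names are made up): the
segment stays mapped and only the directory entry changes, so backends don't
pay for a fresh mmap (and the TLB flush that goes with it) per segment, while
external tools still see ordinary segment files:

    /* Hypothetical sketch: keep a WAL segment mmap'ed across recycling. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_SIZE    (16 * 1024 * 1024)  /* 16MB WAL segment */

    typedef struct MappedSegment
    {
        int     fd;
        char   *base;                       /* stays mapped across recycling */
        char    path[1024];
    } MappedSegment;

    static int
    KeepSegmentMapped(MappedSegment *seg, const char *path)
    {
        seg->fd = open(path, O_RDWR);
        if (seg->fd < 0)
            return -1;
        seg->base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, seg->fd, 0);
        if (seg->base == MAP_FAILED)
        {
            close(seg->fd);
            return -1;
        }
        snprintf(seg->path, sizeof(seg->path), "%s", path);
        return 0;
    }

    /*
     * "Recycle" the segment under its next WAL file name: the mapping and
     * the file descriptor are untouched, only the file is renamed, so the
     * existing mapping (and its TLB entries) survives.
     */
    static int
    RecycleSegment(MappedSegment *seg, const char *newpath)
    {
        if (rename(seg->path, newpath) != 0)
            return -1;
        snprintf(seg->path, sizeof(seg->path), "%s", newpath);
        return 0;
    }

This sketch doesn't answer the open question above about temporarily needing
more WAL than the pre-allocated, mapped set of segments.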
> >>>
> >>> ...
> >>>
> >>> I think the performance improvement by NTT patch with the 16MB WAL
> >>> segment, the most common WAL segment size, is very good (150437 vs.
> >>> 212410 with 64 clients). But maybe evaluating writing WAL segment
> >>> files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if we
> >>> don't do that yet.
> >>>
> >>
> >> Well, not sure. I think the question is still open whether it's actually
> >> safe to run on DAX, which does not have atomic writes of 512B sectors,
> >> and I think we rely on that e.g. for pg_control. But maybe for WAL that's
> >> not an issue.
> >
> > I think we can use the Block Translation Table (BTT) driver that
> > provides atomic sector updates.
> >
>
> But we have benchmarked that, see my message from 2020/11/26, which
> shows this table:
>
> clients   master/btt   master/dax      ntt   simple
> ----------------------------------------------------
>       1         5469         7402     7977     6746
>      16        48222        80869   107025    82343
>      32        73974       158189   214718   158348
>      64        85921       154540   225715   164248
>      96       150602       221159   237008   217253
>
> Clearly, BTT is quite expensive. Maybe there's a way to tune that at
> filesystem/kernel level, I haven't tried that.

I missed your mail. Yeah, BTT seems to be quite expensive.

>
> >>
> >>>> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
> >>>> huge read-write asymmetry (the writes being way slower), and their
> >>>> recommendation (in "Observation 3" is)
> >>>>
> >>>> The read-write asymmetry of PMem implies the necessity of avoiding
> >>>> writes as much as possible for PMem.
> >>>>
> >>>> So maybe we should not be trying to use PMEM for WAL, which is pretty
> >>>> write-heavy (and in most cases even write-only).
> >>>
> >>> I think using PMEM for WAL is cost-effective, but it leverages only the
> >>> low-latency (sequential) write, not other abilities such as fine-grained
> >>> access and low-latency random writes. If we want to exploit all of its
> >>> abilities we might need some drastic changes to the logging protocol
> >>> while considering storing data on PMEM.
> >>>
> >>
> >> True. I think it's worth investigating whether it's sensible to use PMEM for
> >> this purpose. It may turn out that replacing the DRAM WAL buffers with writes
> >> directly to PMEM is not economical, and aggregating data in a DRAM
> >> buffer is better :-(
> >
> > Yes. I think it might be interesting to do an analysis of the
> > bottlenecks of NTT patch by perf etc. If bottlenecks are moved to
> > other places by removing WALWriteLock during flush, it's probably a
> > good sign for further performance improvements. IIRC WALWriteLock is
> > one of the main bottlenecks on OLTP workload, although my memory might
> > already be out of date.
> >
>
> I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
> issue - the problem is that writing the WAL to persistent storage itself
> is expensive, and we're waiting for that.
>
> So it's not clear to me if removing the lock (and allowing multiple
> processes to do pmem_drain concurrently) can actually help, considering
> pmem_drain() should flush writes from other processes anyway.
>
> But as I said, that is just my theory - I might be entirely wrong, it'd
> be good to hack XLogFlush a bit and try it out.
>
>

I've done some performance benchmarks with master and the NTT v4
patch. Let me share the results.

pgbench setup:
* scale factor = 2000
* duration = 600 sec
* clients = 32, 64, 96

NVWAL setup:
* nvwal_size = 50GB
* max_wal_size = 50GB
* min_wal_size = 50GB

The whole database fits in shared_buffers, and the WAL segment file size is 16MB.

The results are:

 clients   master      NTT   master-unlogged
      32   113209    67107            154298
      64   144880    54289            178883
      96   151405    50562            180018

"master-unlogged" is the same setup as "master" except for using
unlogged tables (using --unlogged-tables pgbench option). The TPS
increased by about 20% compared to "master" case (i.g., logged table
case). The reason why I experimented unlogged table case as well is
that we can think these results as an ideal performance if we were
able to write WAL records in 0 sec. IOW, even if the PMEM patch would
significantly improve WAL logging performance, I think it could not
exceed this performance. But hope is that if we currently have a
performance bottle-neck in WAL logging (.e.g, locking and writing
WAL), removing or minimizing WAL logging would bring a chance to
further improve performance by eliminating the new-coming bottle-neck.

As we can see from the above results, the performance of the “ntt”
case was apparently not good in this evaluation. I've not reviewed the
patch in depth yet, but something might be wrong with the v4 patch, or
the PMEM configuration in my environment might be wrong.

Besides, I've checked the main wait events in each experiment using
pg_wait_sampling. Here are the top 5 wait events in the "master" case,
excluding wait events in the main functions of auxiliary processes:

 event_type |        event         |  sum
------------+----------------------+-------
 Client     | ClientRead           | 46902
 LWLock     | WALWrite             | 33405
 IPC        | ProcArrayGroupUpdate |  8855
 LWLock     | WALInsert            |  3215
 LWLock     | ProcArray            |  3022

We can see that waits on WALWrite lwlock acquisition happened many
times and were the primary wait event. On the other hand, in the
"master-unlogged" case, I got:

 event_type |        event         |  sum
------------+----------------------+-------
 Client     | ClientRead           | 59871
 IPC        | ProcArrayGroupUpdate | 17528
 LWLock     | ProcArray            |  4317
 LWLock     | XactSLRU             |  3705
 IPC        | XactGroupUpdate      |  3045

The LWLock wait events related to WAL logging disappeared.

The result of "ntt" case is:

 event_type |        event         |   sum
------------+----------------------+--------
 LWLock     | WALInsert            | 126487
 Client     | ClientRead           |  12173
 LWLock     | BufferContent        |   4480
 Lock       | transactionid        |   2017
 IPC        | ProcArrayGroupUpdate |    924

The wait event on WALWrite lwlock disappeared. Instead, there were
many wait events on WALInsert lwlock. I've not investigated this
result yet. This could be because the v4 patch acquires the WALInsert
lock more than necessary, or because writing WAL records to PMEM takes
more time than writing to DRAM, as Tomas mentioned before.

If the PMEM patch introduces a new WAL file (called the nvwal file in
the patch) and writes normal WAL segment files based on the nvwal
file, I think the nvwal file doesn't necessarily need to follow the
current WAL segment file format (i.e., sequential writes of 8kB
blocks). There may be a better algorithm for writing WAL records to
PMEM more efficiently, like the one proposed in this paper [1].

Finally, I realized while using the PMEM patch that with a large nvwal
file, the PostgreSQL server takes a long time to start since it
initializes the nvwal file. In my environment, the nvwal size is 50GB
and startup took 1 minute. This could lead to downtime in production.

[1] https://jianh.web.engr.illinois.edu/papers/jian-vldb15.pdf

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
