Re: [PoC] Non-volatile WAL buffer

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-02-25 03:28:01
Message-ID: CAD21AoBvBLEpgf5vdUoZtumGjmLsk-aQUZ04rAT+3eCTCZjMVA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 13, 2021 at 12:18 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >
> > On 1/25/21 3:56 AM, Masahiko Sawada wrote:
> > >>
> > >> ...
> > >>
> > >> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > >>> ...
> > >>>
> > >>> While looking at the two methods: NTT and simple-no-buffer, I realized
> > >>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and
> > >>> pmem_drain()) WAL without acquiring WALWriteLock whereas
> > >>> simple-no-buffer patch acquires WALWriteLock to do that
> > >>> (pmem_persist()). I wonder if this also affected the performance
> > >>> differences between those two methods since WALWriteLock serializes
> > >>> the operations. With PMEM, can multiple backends concurrently flush
> > >>> the records if the memory regions do not overlap? If so, flushing
> > >>> WAL without WALWriteLock would be a big benefit.
> > >>>
> > >>
> > >> That's a very good question - it's quite possible the WALWriteLock is
> > >> not really needed, because the processes are actually "writing" the WAL
> > >> directly to PMEM. So it's a bit confusing, because it's only really
> > >> concerned about making sure it's flushed.
> > >>
> > >> And yes, multiple processes certainly can write to PMEM at the same
> > >> time, in fact it's a requirement to get good throughput I believe. My
> > >> understanding is we need ~8 processes, at least that's what I heard from
> > >> people with more PMEM experience.
> > >
> > > Thanks, that's good to know.
> > >
> > >>
> > >> TBH I'm not convinced the code in the "simple-no-buffer" case (coming
> > >> from the 0002 patch) is actually correct. Essentially, consider the
> > >> backend needs to do a flush, but does not have a segment mapped. So it
> > >> maps it and calls pmem_drain() on it.
> > >>
> > >> But does that actually flush anything? Does it properly flush changes
> > >> done by other processes that may not have called pmem_drain() yet? I
> > >> find this somewhat suspicious and I'd bet all processes that did write
> > >> something have to call pmem_drain().
> > >
> > For the record, from what I learned / been told by engineers with PMEM
> > experience, calling pmem_drain() should properly flush changes done by
> > other processes. So it should be sufficient to do that in XLogFlush(),
> > from a single process.
> >
> > My understanding is that we have about three challenges here:
> >
> > (a) we still need to track how far we flushed, so this needs to be
> > protected by some lock anyway (although perhaps a much smaller section
> > of the function)
> >
> > (b) pmem_drain() flushes all the changes, so it flushes even "future"
> > part of the WAL after the requested LSN, which may negatively affect
> > performance I guess. So I wonder if pmem_persist would be a better fit,
> > as it allows specifying a range that should be persisted.
> >
> > (c) As mentioned before, PMEM behaves differently with concurrent
> > access, i.e. it reaches peak throughput with relatively low number of
> > threads writing data, and then the throughput drops quite quickly. I'm
> > not sure if the same thing applies to pmem_drain() too - if it does, we
> > may need something like we have for insertions, i.e. a handful of locks
> > allowing limited number of concurrent inserts.
>
> Thanks. That's a good summary.
>
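(As a side note, for anyone who wants to play with (b): below is a minimal,
self-contained libpmem sketch -- not taken from any of the patches, with a
made-up file path -- showing the difference between persisting just a range
with pmem_persist() and doing pmem_flush() plus a global pmem_drain().)

#include <libpmem.h>
#include <string.h>

int
main(void)
{
    size_t  mapped_len;
    int     is_pmem;
    char   *buf;

    /* Map a (made-up) 16MB segment-sized file on a PMEM-backed DAX fs. */
    buf = pmem_map_file("/mnt/pmem/wal_segment_test", 16 * 1024 * 1024,
                        PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (buf == NULL || !is_pmem)
        return 1;

    /* "Insert" some WAL: plain stores, nothing is durable yet. */
    memset(buf, 'x', 8192);

    /*
     * Persist only the requested range.  Per the libpmem docs,
     * pmem_persist(addr, len) is equivalent to pmem_flush(addr, len)
     * followed by pmem_drain(), so it does not force back "future" WAL
     * beyond the range -- which is what (b) above is getting at.
     */
    pmem_persist(buf, 8192);

    /*
     * Alternatively, flush ranges separately and drain once at the end.
     * Whether a single drain is enough when several processes wrote is
     * exactly the open question discussed above.
     */
    memset(buf + 8192, 'y', 8192);
    pmem_flush(buf + 8192, 8192);
    pmem_drain();

    pmem_unmap(buf, mapped_len);
    return 0;
}
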
> >
> >
> > > Yeah, in terms of experiments at least it's good to find out that the
> > > approach of mmapping each WAL segment is not good for performance.
> > >
> > Right. The problem with small WAL segments seems to be that each mmap
> > causes the TLB to be thrown away, which means a lot of expensive cache
> > misses. As the mmap needs to be done by each backend writing WAL, this
> > is particularly bad with small WAL segments. The NTT patch works around
> > that by doing just a single mmap.
> >
> > I wonder if we could pre-allocate and mmap small segments, and keep them
> > mapped and just rename the underlying files when recycling them. That'd
> > keep the regular segment files, as expected by various tools, etc. The
> > question is what would happen when we temporarily need more WAL, etc.
> >
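(For what it's worth, the "keep the segment mapped and just rename the
underlying file when recycling" idea should work at the POSIX level, since a
mapping refers to the file's inode rather than its path. A toy sketch, with
made-up paths and using plain POSIX calls rather than anything from the
patches:)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

int
main(void)
{
    int     fd;
    char   *seg;

    /* Pre-allocate and map one segment (paths are made up). */
    fd = open("/mnt/pmem/pg_wal/000000010000000000000001",
              O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SEG_SIZE) != 0)
        return 1;
    seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED)
        return 1;
    close(fd);                      /* the mapping survives the close */

    /*
     * "Recycle" the segment by renaming the underlying file.  The mapping
     * refers to the inode, not the path, so it stays valid and no new
     * mmap (and TLB invalidation) is needed.
     */
    if (rename("/mnt/pmem/pg_wal/000000010000000000000001",
               "/mnt/pmem/pg_wal/000000010000000000000005") != 0)
        return 1;

    memset(seg, 0, SEG_SIZE);       /* reuse the mapped region */
    msync(seg, SEG_SIZE, MS_SYNC);
    munmap(seg, SEG_SIZE);
    return 0;
}
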
> > >>>
> > >>> ...
> > >>>
> > >>> I think the performance improvement by the NTT patch with the 16MB WAL
> > >>> segment, the most common WAL segment size, is very good (150437 vs.
> > >>> 212410 with 64 clients). But maybe evaluating writing WAL segment
> > >>> files on a PMEM DAX filesystem is also worthwhile, as you mentioned,
> > >>> if we haven't done that yet.
> > >>>
> > >>
> > >> Well, not sure. I think the question is still open whether it's actually
> > >> safe to run on DAX, which does not have atomic writes of 512B sectors,
> > >> and I think we rely on that e.g. for pg_control. But maybe for WAL that's
> > >> not an issue.
> > >
> > > I think we can use the Block Translation Table (BTT) driver that
> > > provides atomic sector updates.
> > >
> >
> > But we have benchmarked that, see my message from 2020/11/26, which
> > shows this table:
> >
> >   clients  master/btt  master/dax      ntt   simple
> >  -----------------------------------------------------
> >         1        5469        7402     7977     6746
> >        16       48222       80869   107025    82343
> >        32       73974      158189   214718   158348
> >        64       85921      154540   225715   164248
> >        96      150602      221159   237008   217253
> >
> > Clearly, BTT is quite expensive. Maybe there's a way to tune that at
> > filesystem/kernel level, I haven't tried that.
>
> I missed your mail. Yeah, BTT seems to be quite expensive.
>
> >
> > >>
> > >>>> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
> > >>>> huge read-write asymmetry (the writes being way slower), and their
> > >>>> recommendation (in "Observation 3") is:
> > >>>>
> > >>>> The read-write asymmetry of PMem implies the necessity of avoiding
> > >>>> writes as much as possible for PMem.
> > >>>>
> > >>>> So maybe we should not be trying to use PMEM for WAL, which is pretty
> > >>>> write-heavy (and in most cases even write-only).
> > >>>
> > >>> I think using PMEM for WAL is cost-effective, but it leverages only the
> > >>> low-latency (sequential) writes, not other abilities such as
> > >>> fine-grained access and low-latency random writes. If we want to
> > >>> exploit its full ability we might need some drastic changes to the
> > >>> logging protocol while considering storing data on PMEM.
> > >>>
> > >>
> > >> True. I think it's worth investigating whether it's sensible to use PMEM
> > >> for this purpose. It may turn out that replacing the DRAM WAL buffers
> > >> with writes directly to PMEM is not economical, and aggregating data in
> > >> a DRAM buffer is better :-(
> > >
> > > Yes. I think it might be interesting to do an analysis of the
> > > bottlenecks of the NTT patch with perf etc. If bottlenecks are moved to
> > > other places by removing WALWriteLock during flush, it's probably a
> > > good sign for further performance improvements. IIRC WALWriteLock is
> > > one of the main bottlenecks on OLTP workload, although my memory might
> > > already be out of date.
> > >
> >
> > I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
> > issue - the problem is that writing the WAL to persistent storage itself
> > is expensive, and we're waiting on that.
> >
> > So it's not clear to me if removing the lock (and allowing multiple
> > processes to do pmem_drain concurrently) can actually help, considering
> > pmem_drain() should flush writes from other processes anyway.
> >
> > But as I said, that is just my theory - I might be entirely wrong, it'd
> > be good to hack XLogFlush a bit and try it out.
> >
> >
>
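(If someone wants to hack XLogFlush() along those lines, here is a very
rough, self-contained sketch of the idea: persist only the requested range
and advance a shared "flushed up to" position without a big lock. Nothing
below is from the patches -- flush_wal_range, flushed_upto and the path are
made up -- and it deliberately ignores the hard part, namely guaranteeing
that everything *before* the advanced position is durable, which is exactly
challenges (a) and (c) above.)

#include <libpmem.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared state: how far the WAL is known to be durable. */
static _Atomic uint64_t flushed_upto;

/*
 * Persist the caller's WAL range and advance the durable position without
 * holding a WALWriteLock-style lock across the whole flush.
 */
static void
flush_wal_range(char *wal_base, uint64_t start, uint64_t end)
{
    uint64_t    old = atomic_load(&flushed_upto);

    if (old >= end)
        return;                     /* someone already flushed past us */

    /* Persist only the requested range (see point (b) above). */
    pmem_persist(wal_base + start, end - start);

    /* Advance the durable position, never moving it backwards. */
    while (old < end &&
           !atomic_compare_exchange_weak(&flushed_upto, &old, end))
        ;
}

int
main(void)
{
    size_t  mapped_len;
    int     is_pmem;
    char   *wal;

    wal = pmem_map_file("/mnt/pmem/wal_area_test", 16 * 1024 * 1024,
                        PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (wal == NULL || !is_pmem)
        return 1;

    memset(wal, 'x', 512);          /* pretend a backend inserted a record */
    flush_wal_range(wal, 0, 512);

    pmem_unmap(wal, mapped_len);
    return 0;
}
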
> I've done some performance benchmarks with the master and NTT v4
> patch. Let me share the results.
>
> pgbench setup:
> * scale factor = 2000
> * duration = 600 sec
> * clients = 32, 64, 96
>
> NVWAL setup:
> * nvwal_size = 50GB
> * max_wal_size = 50GB
> * min_wal_size = 50GB
>
> The whole database fits in shared_buffers, and the WAL segment file size is 16MB.
>
> The results are:
>
>  clients  master     NTT  master-unlogged
>       32  113209   67107           154298
>       64  144880   54289           178883
>       96  151405   50562           180018
>
> "master-unlogged" is the same setup as "master" except for using
> unlogged tables (using --unlogged-tables pgbench option). The TPS
> increased by about 20% compared to "master" case (i.g., logged table
> case). The reason why I experimented unlogged table case as well is
> that we can think these results as an ideal performance if we were
> able to write WAL records in 0 sec. IOW, even if the PMEM patch would
> significantly improve WAL logging performance, I think it could not
> exceed this performance. But hope is that if we currently have a
> performance bottle-neck in WAL logging (.e.g, locking and writing
> WAL), removing or minimizing WAL logging would bring a chance to
> further improve performance by eliminating the new-coming bottle-neck.
>
> As we can see from the above results, the performance of the "ntt"
> case was apparently not good in this evaluation. I've not reviewed the
> patch in depth yet, but something might be wrong with the v4 patch, or
> the PMEM configuration I did on my environment might be wrong.

I've reconfigured PMEM and done the same benchmark. I got the
following results (changed only "ntt" case):

 clients  master     NTT  master-unlogged
      32  113209  144829           154298
      64  144880  164899           178883
      96  151405  166096           180018

I got much better performance with the "ntt" patch. I think it was
wrong that I had created a partition on the PMEM device (i.e., created
the filesystem on /dev/pmem1p1) in the last evaluation. Sorry for
confusing you, Menjo-san.

FWIW here are the top 5 wait events in the new "ntt" case:

 event_type |        event         | sum
------------+----------------------+------
 Client     | ClientRead           | 8462
 LWLock     | WALInsert            | 1049
 LWLock     | ProcArray            |  627
 IPC        | ProcArrayGroupUpdate |  481
 LWLock     | XactSLRU             |  247

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
