Re: [PoC] Non-volatile WAL buffer

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
Cc: Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2020-11-26 20:59:20
Lists: pgsql-hackers

On 26/11/2020 21:27, Tomas Vondra wrote:
> Hi,
>
> Here's the "simple patch" that I'm currently experimenting with. It
> essentially replaces open/close/write/fsync with pmem calls
> (map/unmap/memcpy/persist variants), and it's by no means committable.
> But it works well enough for experiments / measurements, etc.
>
> The numbers (5-minute pgbench runs on scale 500) look like this:
>
>  clients   master/btt   master/dax       ntt    simple
>  -----------------------------------------------------
>        1         5469         7402      7977      6746
>       16        48222        80869    107025     82343
>       32        73974       158189    214718    158348
>       64        85921       154540    225715    164248
>       96       150602       221159    237008    217253
>
> A chart illustrating these results is attached. The four columns are
> showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
> "ntt" is the patch submitted to this thread, and "simple" is the patch
> I've hacked together.
>
> As expected, the BTT case performs poorly (compared to the rest).
>
> The "master/dax" and "simple" perform about the same. There are some
> differences, but those may be attributed to noise. The NTT patch does
> outperform these cases by ~20-40% in some cases.
>
> The question is why. I recall suggestions this is due to page faults
> when writing data into the WAL, but I did experiment with various
> settings that I think should prevent that (e.g. disabling WAL reuse
> and/or disabling zeroing the segments) but that made no measurable
> difference.
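
For readers comparing the columns, the relative differences can be recomputed from the table above. This is only an illustrative sketch: the numbers are copied from the table (presumably pgbench tps; the unit is not stated), and the names `RESULTS`, `gain_pct` and `ntt_gain_over` are made up for this example.

```python
# Throughput figures copied from the quoted table, keyed by client count.
# The unit is assumed to be pgbench tps; the table itself does not say.
RESULTS = {
    1:  {"master/btt": 5469,   "master/dax": 7402,   "ntt": 7977,   "simple": 6746},
    16: {"master/btt": 48222,  "master/dax": 80869,  "ntt": 107025, "simple": 82343},
    32: {"master/btt": 73974,  "master/dax": 158189, "ntt": 214718, "simple": 158348},
    64: {"master/btt": 85921,  "master/dax": 154540, "ntt": 225715, "simple": 164248},
    96: {"master/btt": 150602, "master/dax": 221159, "ntt": 237008, "simple": 217253},
}


def gain_pct(new, base):
    """Return how much larger `new` is than `base`, in percent."""
    return (new / base - 1.0) * 100.0


def ntt_gain_over(column):
    """Per-client-count gain of the "ntt" column over another column."""
    return {
        clients: gain_pct(row["ntt"], row[column])
        for clients, row in RESULTS.items()
    }


if __name__ == "__main__":
    for column in ("master/dax", "simple"):
        gains = ntt_gain_over(column)
        formatted = ", ".join(f"{c}: {g:.1f}%" for c, g in gains.items())
        print(f"ntt vs {column}: {formatted}")
```

By this calculation ntt is ahead of master/dax by roughly 7% to 46% and ahead of simple by roughly 9% to 37%, with the smallest gaps at 1 and 96 clients.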

The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
Could you test that patch too, please? Using your nomenclature, that
patch skips wal_buffers and does:

clients -> wal segments (PMEM DAX)

He got good results with it on DAX, but without DAX it performed
worse. We then discussed why that might be, and the page fault
hypothesis came up.
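
To make the two write paths concrete, here is a rough sketch contrasting "write() then fsync()" with "map the segment and copy into it". It is not the code from either patch: it uses Python's standard `mmap` module, where `flush()` (msync underneath on Unix) stands in for the pmem persist calls mentioned above, and it runs on any Unix filesystem, DAX or not. `SEGMENT_SIZE`, `make_segment`, `write_via_syscalls` and `write_via_mmap` are hypothetical names chosen for the example.

```python
import mmap
import os

# Deliberately tiny, hypothetical segment size for the demo; real WAL
# segments are much larger.
SEGMENT_SIZE = 16 * 1024


def make_segment(path):
    """Create a zero-filled file of SEGMENT_SIZE bytes."""
    with open(path, "wb") as f:
        f.truncate(SEGMENT_SIZE)


def write_via_syscalls(path, offset, record):
    """Conventional path: open, pwrite, fsync, close."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, record, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def write_via_mmap(path, offset, record):
    """Mapped path: map the segment, copy the record in, flush.

    On a non-DAX filesystem the first store to each page takes a page
    fault, which is the cost discussed in this thread.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, SEGMENT_SIZE) as seg:
            seg[offset:offset + len(record)] = record  # plays the role of memcpy
            seg.flush()                                # plays the role of persist
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file; the difference is only in how they get there, which is why the choice between them is a performance question rather than a correctness one.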

I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
approach here. But because it's slower without DAX, we need to keep the
current code for non-DAX systems. Unfortunately it means that we need to
maintain both implementations, selectable with a GUC or some DAX
detection magic. The question then is whether the code complexity is
worth the performance gain on DAX-enabled systems.

Andres was not excited about mmapping the WAL segments, for
performance reasons. I'm not sure how much of his critique applies if we
keep supporting both methods and only use mmap() if so configured.

- Heikki
