Re: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2020-10-30 05:57:05
Message-ID: CAOwnP3PO8gntfraGRDqvF9DsVB1HyciwP=WZ9CJA5wE2Mm9rWg@mail.gmail.com
Lists: pgsql-hackers

Hi Heikki,

> I had a new look at this thread today, trying to figure out where we are.
> I'm a bit confused.
>
> One thing we have established: mmap()ing WAL files performs worse than
> the current method, if pg_wal is not on a persistent memory device. This
> is because the kernel faults in existing content of each page, even
> though we're overwriting everything.
Yes. In addition, once an OS page has been msync()ed, the next store into
that page triggers another page fault.

> That's unfortunate. I was hoping that mmap() would be a good option even
> without persistent memory hardware. I wish we could tell the kernel to
> zero the pages instead of reading them from the file. Maybe clear the
> file with ftruncate() before mmapping it?
The area extended by ftruncate() appears as if it were zero-filled [1].
Please note that it merely "appears as if." The data blocks might not
actually be allocated and zero-filled on the device, so pre-allocating files
should improve transaction performance. At least on Linux 5.7 with ext4,
storing into a mapped file that was just open(O_CREAT)ed and ftruncate()d
takes more time than storing into one that has already been physically
filled.
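
The two ways of preparing a file compared above can be sketched in C as
follows. This is a minimal illustration, not code from the patchset: the
function names (create_truncated, create_prefilled, touch_mapped_pages) are
hypothetical, and the sketch does not measure anything by itself. A caller
would time touch_mapped_pages() on each kind of file to observe the
difference.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the length with ftruncate() only.
 * The file reads as zeros, but data blocks may not be allocated yet. */
static int
create_truncated(const char *path, off_t size)
{
    assert(size > 0);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: write real zero bytes so that the filesystem
 * has to allocate data blocks before the file is mapped. */
static int
create_prefilled(const char *path, off_t size)
{
    char  zeros[8192];
    off_t done = 0;

    assert(size > 0);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    memset(zeros, 0, sizeof(zeros));
    while (done < size)
    {
        size_t chunk = sizeof(zeros);
        if ((off_t) chunk > size - done)
            chunk = (size_t) (size - done);
        ssize_t n = write(fd, zeros, chunk);
        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        done += n;
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the file, store one byte into every OS page, and msync() it.
 * Returns 0 on success, -1 on failure. */
static int
touch_mapped_pages(int fd, size_t size)
{
    size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
    char  *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (p == MAP_FAILED)
        return -1;
    for (size_t off = 0; off < size; off += pagesz)
        p[off] = 1;
    if (msync(p, size, MS_SYNC) != 0)
    {
        munmap(p, size);
        return -1;
    }
    return munmap(p, size);
}
```

Both helpers return a file descriptor of the same apparent size; only the
second one forces the blocks to exist on the device before the first store
through the mapping.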

> That should not be problem with a real persistent memory device, however
> (or when emulating it with DRAM). With DAX, the storage is memory-mapped
> directly and there is no page cache, and no pre-faulting.
Yes, with filesystem DAX, there is no page cache for file data. A page
fault still occurs, but once per 2 MiB DAX hugepage rather than once per
4 KiB page, so the overhead is lower. Such a DAX hugepage fault applies only
to DAX-mapped files and is different from an ordinary transparent hugepage
fault.

> Because of that, I'm baffled by what the
> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it correctly,
> it puts the WAL buffers in a separate file, which is stored on the NVRAM.
> Why? I realize that this is just a Proof of Concept, but I'm very much
> not interested in anything that requires the DBA to manage a second WAL
> location. Did you test the mmap() patches with persistent memory
> hardware? Did you compare that with the pmem patchset, on the same
> hardware? If there's a meaningful performance difference between the two,
> what's causing it?
Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

This patchset puts the buffers into a separate file, rather than into the
existing segment files in PGDATA/pg_wal, because doing so reduces the
overhead of system calls such as open(), mmap(), munmap(), and close(). It
open()s and mmap()s the file "nvwal_path" once and keeps that file mapped
while running. With the patchset that mmap()s the segment files, on the
other hand, a backend process has to munmap() and close() the currently
mapped file and then open() and mmap() the new one each time its insert
location crosses a segment boundary. This causes the performance difference
between the two.
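
The difference in system-call traffic can be illustrated with a small C
sketch. It is not taken from either patchset: the MappedFile struct and the
map_file, unmap_file, and switch_segment functions are hypothetical names,
and the sketch sizes each file with ftruncate() only to stay short (real
segment files would be pre-allocated).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped file. */
typedef struct MappedFile
{
    int    fd;
    char  *addr;
    size_t size;
} MappedFile;

/* open() + mmap(): done once for a single long-lived buffer file,
 * or once per segment in the per-segment approach. */
static int
map_file(MappedFile *m, const char *path, size_t size)
{
    void *p;

    assert(size > 0);
    m->addr = NULL;
    m->size = 0;
    m->fd = open(path, O_RDWR | O_CREAT, 0600);
    if (m->fd < 0)
        return -1;
    /* Assumption for the sketch: make sure the file covers the mapping. */
    if (ftruncate(m->fd, (off_t) size) != 0)
    {
        close(m->fd);
        m->fd = -1;
        return -1;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, m->fd, 0);
    if (p == MAP_FAILED)
    {
        close(m->fd);
        m->fd = -1;
        return -1;
    }
    m->addr = p;
    m->size = size;
    return 0;
}

/* munmap() + close(). */
static int
unmap_file(MappedFile *m)
{
    int rc = 0;

    if (m->addr != NULL && munmap(m->addr, m->size) != 0)
        rc = -1;
    if (m->fd >= 0 && close(m->fd) != 0)
        rc = -1;
    m->addr = NULL;
    m->size = 0;
    m->fd = -1;
    return rc;
}

/* Per-segment approach: every segment crossing pays for
 * munmap(), close(), open(), and mmap() again. */
static int
switch_segment(MappedFile *m, const char *next_path, size_t seg_size)
{
    if (unmap_file(m) != 0)
        return -1;
    return map_file(m, next_path, seg_size);
}
```

With a single buffer file, map_file() runs once at startup and
unmap_file() once at shutdown. With per-segment mapping, switch_segment()
runs in the insert path every time a segment boundary is crossed.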

Best regards,
Takashi

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

--
Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
