Re: [PoC] Non-volatile WAL buffer

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2020-01-27 18:54:38
Message-ID: CA+TgmoZWvm36GyYNDn3gksVAkuPrc86G9W4of8AgYR=SSU7Lmw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp> wrote:
> It sounds reasonable, but I'm sorry that I haven't tested such a program
> yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
> a little worried about the overhead of mmap()/munmap() for each WAL segment
> file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than write(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.
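
To put rough numbers on that comparison, here is a minimal sketch of
the call-count arithmetic. The function names and the cost model are
made up for illustration, not anything in PostgreSQL; the model counts
only system calls, assumes exactly one write() per block, and ignores
page-fault and write-back costs entirely.

```c
#include <assert.h>

/* Default sizes quoted in the message; both are build/init-time options. */
#define WAL_SEGMENT_BYTES (16UL * 1024 * 1024)  /* 16MB */
#define WAL_BLOCK_BYTES   (8UL * 1024)          /* 8kB  */

static_assert(WAL_SEGMENT_BYTES % WAL_BLOCK_BYTES == 0,
              "a segment must hold a whole number of blocks");

/* Number of write() calls needed to fill one segment, one block per call. */
unsigned long write_calls_per_segment(unsigned long segment_bytes,
                                      unsigned long block_bytes)
{
    return segment_bytes / block_bytes;
}

/*
 * Hypothetical break-even check: returns 1 if one mmap()/munmap() pair is
 * cheaper than all the write() calls it replaces for a default-sized
 * segment. Both costs are caller-supplied, in the same unit; nothing here
 * measures them.
 */
int mmap_wins(double mmap_pair_cost, double write_cost)
{
    double write_total = write_cost *
        (double) write_calls_per_segment(WAL_SEGMENT_BYTES, WAL_BLOCK_BYTES);

    return mmap_pair_cost < write_total;
}
```

With the default sizes, write_calls_per_segment() comes to 2048, so
under this simplified model the mmap()/munmap() pair could cost up to
2048 times as much as a single write() and still break even.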

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.
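
For concreteness, the two overwrite paths look roughly like the sketch
below. This is illustrative POSIX code, not PostgreSQL's actual WAL
writing code; BLOCK_BYTES and both function names are assumptions. The
comments mark where the fault-in concern would apply, but the code does
not measure or prove that the old contents are actually read.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_BYTES 8192    /* assumed block size (8kB) */

/*
 * Overwrite the start of a file block by block with pwrite(). Each call
 * hands the kernel a complete block of new contents, so it has no need to
 * look at the old contents first (as long as blocks cover whole pages).
 */
int overwrite_with_pwrite(int fd, const char *src, size_t len)
{
    for (size_t off = 0; off < len; off += BLOCK_BYTES) {
        size_t n = (len - off < BLOCK_BYTES) ? len - off : BLOCK_BYTES;

        if (pwrite(fd, src + off, n, (off_t) off) != (ssize_t) n) {
            perror("pwrite");
            return -1;
        }
    }
    return 0;
}

/*
 * Overwrite the same region through a shared mapping. The first store to
 * each mapped page takes a page fault; if that page is not already cached,
 * the kernel may have to read the old file contents before the store can
 * proceed, because it cannot know the whole page will be overwritten.
 */
int overwrite_with_mmap(int fd, const char *src, size_t len)
{
    assert(len > 0);    /* mmap() rejects zero-length mappings */

    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (dst == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    memcpy(dst, src, len);  /* stores fault pages in one at a time */

    if (msync(dst, len, MS_SYNC) != 0) {
        perror("msync");
        munmap(dst, len);
        return -1;
    }
    return munmap(dst, len);
}
```

Both functions leave the file with identical contents; the open
question is only what the kernel has to do along the way. The file must
already be at least len bytes long before overwrite_with_mmap() is
called, which holds for a recycled segment.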

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
