Re: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-01-29 09:02:05
Message-ID: CAOwnP3OnbzYUyC5QJCHDEGPsM8zx=CiVWzHgbxgCQKM8dTH9Qg@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,

Let me answer your questions. (Not all of them for now, sorry.)

> Do I understand correctly that the patch removes "regular" WAL buffers
> and instead writes the data into the non-volatile PMEM buffer, without
> writing that to the WAL segments at all (unless in archiving mode)?
> Firstly, I guess many (most?) instances will have to write the WAL
> segments anyway because of PITR/backups, so I'm not sure we can save much
> here.

Mostly yes. My "non-volatile WAL buffer" patchset removes the regular volatile
WAL buffers and introduces non-volatile ones. All WAL goes into the
non-volatile buffers and persists there, so no write-out from the buffers to
WAL segment files is required. However, in archiving mode or when the buffers
are full (described later), both the non-volatile buffers and the segment
files are used.

In archiving mode with my patchset, each time one segment (16MB by
default) is completed in the non-volatile buffers, that segment is written to
a segment file asynchronously (by XLogBackgroundFlush). It is then
archived by the existing archiving functionality.
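
As a rough illustration of "each time one segment is completed", the sketch
below counts how many segment boundaries the insert position crosses. It is
not code from the patchset: nvwal_completed_segments and WAL_SEG_SIZE are
hypothetical names, and an LSN is treated as a plain 64-bit byte position.

```c
#include <assert.h>
#include <stdint.h>

#define WAL_SEG_SIZE (16u * 1024 * 1024)    /* 16MB, the default segment size */

/*
 * Hypothetical helper: number of segments that became complete while the
 * insert position advanced from old_lsn to new_lsn.  Each such segment
 * would be a candidate for the asynchronous write to a segment file.
 */
uint64_t
nvwal_completed_segments(uint64_t old_lsn, uint64_t new_lsn)
{
    assert(new_lsn >= old_lsn);
    return new_lsn / WAL_SEG_SIZE - old_lsn / WAL_SEG_SIZE;
}
```

Advancing within one segment yields 0, while crossing a 16MB boundary
yields 1, which is the point where the asynchronous write would be triggered.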

> But more importantly - doesn't that mean the nvwal_size value is
> essentially a hard limit? With max_wal_size, it's a soft limit i.e. we're
> allowed to temporarily use more WAL when needed. But with a pre-allocated
> file, that's clearly not possible. So what would happen in those cases?

Yes, nvwal_size is a hard limit, and I see it's a major weak point of my
patchset.

When all non-volatile WAL buffers are filled, the oldest segment on the
buffers is written (by XLogWrite) to a regular WAL segment file, then those
buffers are cleared (by AdvanceXLInsertBuffer) for new records. All WAL
record insertions to the buffers block until that write and clear are
complete. Due to that, all write transactions also block.
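
The buffer-full behavior described above can be modeled roughly as a ring of
segment slots. This is a simplified sketch, not the patchset's code: NvWalRing,
NSLOTS and nvwal_start_segment are hypothetical, and the calls to XLogWrite
and AdvanceXLInsertBuffer are only represented by counter updates.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 4    /* hypothetical: nvwal_size divided by the segment size */

typedef struct NvWalRing
{
    uint64_t    oldest_seg;     /* oldest segment still held in the buffers */
    uint64_t    next_seg;       /* next segment to be started */
    uint64_t    flushed;        /* segments written out to segment files */
} NvWalRing;

/*
 * Start a new segment in the ring.  Returns 1 if the buffers were full and
 * the caller had to wait for the oldest segment to be written and cleared,
 * 0 otherwise.
 */
int
nvwal_start_segment(NvWalRing *ring)
{
    int         blocked = 0;

    assert(ring->next_seg >= ring->oldest_seg);
    if (ring->next_seg - ring->oldest_seg == NSLOTS)
    {
        ring->flushed++;        /* stands in for XLogWrite */
        ring->oldest_seg++;     /* stands in for AdvanceXLInsertBuffer */
        blocked = 1;            /* every inserter waits during this step */
    }
    ring->next_seg++;
    return blocked;
}
```

With four slots, the first four segments start without waiting; the fifth
finds the ring full and blocks until the oldest segment has been written out.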

To make matters worse, if a checkpoint eventually occurs in such a
buffer-full case, record insertions would block for a certain time at the
end of the checkpoint because a large number of the non-volatile buffers
will be cleared (see PreallocNonVolatileXlogBuffer). From a client's view, it
would look as if the postgres server freezes for a while.

Proper checkpointing would prevent such cases, but it could be hard to
control. When I reproduced Gang's case reported in this thread, such a
buffer-full condition and freeze occurred.

> Also, is it possible to change nvwal_size? I haven't tried, but I wonder
> what happens with the current contents of the file.

The value of nvwal_size must be equal to the actual size of the nvwal_path
file when postgres starts up. If it is not, postgres will panic at
MapNonVolatileXLogBuffer (see nv_xlog_buffer.c), and the WAL contents of
the file will remain as they were. So, if an admin accidentally changes the
nvwal_size value, they simply cannot start postgres.

The file may be extended or shrunk offline with the truncate(1) command, but
the WAL contents of the file must also be moved to the proper offset
because the insertion/recovery offset is calculated by modulo, that is,
the record's LSN % nvwal_size; otherwise we lose WAL. An offline tool for
such an operation might be required, but does not exist yet.
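
To show why a plain truncate(1) is not enough, here is a minimal sketch of
the modulo mapping. It is not code from the patchset: nvwal_offset and
nvwal_must_relocate are hypothetical names, the sizes are toy values, and an
LSN is treated as a plain 64-bit byte position.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of an LSN inside the buffer file. */
uint64_t
nvwal_offset(uint64_t lsn, uint64_t nvwal_size)
{
    assert(nvwal_size > 0);
    return lsn % nvwal_size;
}

/*
 * Returns 1 if a record at lsn would be looked up at a different offset
 * after the file is resized from old_size to new_size, i.e. if an offline
 * tool would have to move it.
 */
int
nvwal_must_relocate(uint64_t lsn, uint64_t old_size, uint64_t new_size)
{
    return nvwal_offset(lsn, old_size) != nvwal_offset(lsn, new_size);
}
```

For example, with a 64-byte file LSN 100 lives at offset 36, but after the
file grows to 128 bytes the same LSN is expected at offset 100, so recovery
would read the wrong bytes unless the contents are moved.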

> The way I understand the current design is that we're essentially
> switching from this architecture:
>
> clients -> wal buffers (DRAM) -> wal segments (storage)
>
> to this
>
> clients -> wal buffers (PMEM)
>
> (Assuming there we don't have to write segments because of archiving.)

Yes. Let me describe the current PostgreSQL design and how the patchsets
and other work discussed in this thread change it, AFAIU:

- Current PostgreSQL:
clients -[memcpy]-> buffers (DRAM) -[write]-> segments (disk)

- Patch "pmem-with-wal-buffers-master.patch" Tomas posted:
clients -[memcpy]-> buffers (DRAM) -[pmem_memcpy]-> mmap-ed segments (PMEM)

- My "non-volatile WAL buffer" patchset:
clients -[pmem_memcpy(*)]-> buffers (PMEM)

- My other patchset, mmap-ing segments as buffers:
clients -[pmem_memcpy(*)]-> mmap-ed segments as buffers (PMEM)

- "Non-volatile Memory Logging" at PGCon 2016 [1][2][3]:
clients -[memcpy]-> buffers (WC[4] DRAM as pseudo PMEM) -[async write]-> segments (disk)

(* or memcpy + pmem_flush)

And I'd say that our previous work "Introducing PMDK into PostgreSQL",
presented at PGCon 2018 [5], and its patchset (see [6] for the latest) are
based on the same idea as Tomas's patch above.

That's all for this mail. Please wait for the next one.

Best regards,
Takashi

[1] https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[2] https://github.com/meistervonperf/postgresql-NVM-logging
[3] https://github.com/meistervonperf/pseudo-pram
[4] https://www.kernel.org/doc/html/latest/x86/pat.html
[5] https://pgcon.org/2018/schedule/events/1154.en.html
[6] https://www.postgresql.org/message-id/CAOwnP3ONd9uXPXKoc5AAfnpCnCyOna1ru6sU=eY_4WfMjaKG9A@mail.gmail.com

--
Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
