Re: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2020-11-24 18:26:55
Message-ID: a60bfa39-dd59-eff1-7941-29264c558830@enterprisedb.com
Lists: pgsql-hackers

On 11/24/20 7:34 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
>> So I wonder if using PMEM for the WAL buffer is the right way forward.
>> AFAIK the WAL buffer is quite concurrent (multiple clients writing
>> data), which seems to contradict the PMEM vs. DRAM trade-offs.
>>
>> The design I've originally expected would look more like this
>>
>> clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)
>>
>> i.e. mostly what we have now, but instead of writing the WAL segments
>> "the usual way" we'd write them using mmap/memcpy, without fsync.
>>
>> I suppose that's what Heikki meant too, but I'm not sure.
>
> SQL Server probably does so. Please see the following page and the links in "Next steps" section. I'm saying "probably" because the document doesn't clearly state whether SQL Server memcpys data from DRAM log cache to non-volatile log cache only for transaction commits or for all log cache writes. I presume the former.
>
>
> Add persisted log buffer to a database
> https://docs.microsoft.com/en-us/sql/relational-databases/databases/add-persisted-log-buffer?view=sql-server-ver15
> --------------------------------------------------
> With non-volatile, tail of the log storage the pattern is
>
> memcpy to LC
> memcpy to NV LC
> Set status
> Return control to caller (commit is now valid)
> ...
>
> With this new functionality, we use a region of memory which is mapped to a file on a DAX volume to hold that buffer. Since the memory hosted by the DAX volume is already persistent, we have no need to perform a separate flush, and can immediately continue with processing the next operation. Data is flushed from this buffer to more traditional storage in the background.
> --------------------------------------------------
>

Interesting, thanks for the link. If I understand [1] correctly, they
essentially do this:

clients -> buffers (DRAM) -> buffers (PMEM) -> wal (storage)

that is, they insert the PMEM buffer between the LC (in DRAM) and
traditional (non-PMEM) storage, so that a commit does not need to do any
fsyncs etc.

It seems to imply the memcpy between DRAM and PMEM happens right when
writing the WAL, but I guess that's not strictly required - we might
just as well do that in the background, I think.

It's interesting that they only place the tail of the log on PMEM, i.e.
the PMEM buffer has limited size, and the rest of the log is not on
PMEM. It's a bit as if we inserted a PMEM buffer between our wal buffers
and the WAL segments, and kept the WAL segments on regular storage. That
could work, but I'd bet they did that because at that time the NV
devices were much smaller, and placing the whole log on PMEM was not
quite possible. So it might be unnecessarily complicated, considering
the PMEM device capacity is much higher now.
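
To make the "limited-size tail of the log" idea concrete, here is a rough
sketch of such an intermediate buffer. It is an illustration under assumptions,
not SQL Server's or PostgreSQL's actual code: TailBuffer, tail_append and
tail_drain are hypothetical names, the buffer is an ordinary array rather than
a DAX mapping, and the capacity is deliberately tiny. Appends fail once the
buffer is full, and a background step has to drain it to regular storage
before more log can be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAIL_SIZE 64  /* deliberately tiny capacity, for illustration only */

/* Hypothetical fixed-size "tail of the log" buffer. In a real design this
 * memory would live on PMEM; here it is a plain array. */
typedef struct {
    char     data[TAIL_SIZE];
    uint64_t write_pos;  /* total bytes ever appended */
    uint64_t drain_pos;  /* total bytes already moved to regular storage */
} TailBuffer;

/* Append a record. Returns 0 on success, or -1 when there is not enough
 * free space and the caller must wait for the buffer to be drained. */
static int tail_append(TailBuffer *tb, const char *rec, size_t len)
{
    assert(tb != NULL);
    size_t used = (size_t) (tb->write_pos - tb->drain_pos);

    if (len > TAIL_SIZE - used)
        return -1;
    for (size_t i = 0; i < len; i++)
        tb->data[(tb->write_pos + i) % TAIL_SIZE] = rec[i];
    tb->write_pos += len;
    return 0;
}

/* Background step: copy up to max pending bytes into out (standing in for
 * regular storage) and free that space. Returns the number of bytes moved. */
static size_t tail_drain(TailBuffer *tb, char *out, size_t max)
{
    size_t pending = (size_t) (tb->write_pos - tb->drain_pos);
    size_t n = pending < max ? pending : max;

    for (size_t i = 0; i < n; i++)
        out[i] = tb->data[(tb->drain_pos + i) % TAIL_SIZE];
    tb->drain_pos += n;
    return n;
}
```

The extra bookkeeping (two positions, wraparound, back-pressure when full) is
the complication that goes away if the whole log fits on PMEM.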

So I'd suggest we simply try this:

clients -> buffers (DRAM) -> wal segments (PMEM)

I plan to do some hacking and maybe hack together some simple tools to
benchmark various approaches.
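
As a starting point, a minimal sketch of the mmap/memcpy write path from the
last diagram might look like the following. It assumes plain POSIX calls only;
MappedSegment, segment_open, segment_append and segment_close are hypothetical
names, not existing PostgreSQL functions. On an actual DAX mapping the msync()
would presumably be replaced by CPU cache flushes (for example libpmem's
pmem_persist()); msync() is used here only so that the sketch runs on any
filesystem.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for one memory-mapped WAL segment. */
typedef struct {
    char  *base;  /* start of the mapping */
    size_t size;  /* segment size in bytes */
    size_t pos;   /* next write offset within the segment */
} MappedSegment;

/* Create (or open) a segment file of the given size and map it.
 * Returns 0 on success, -1 on failure. */
static int segment_open(MappedSegment *seg, const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    seg->base = p;
    seg->size = size;
    seg->pos = 0;
    return 0;
}

/* Copy len bytes from a DRAM buffer into the segment and flush them.
 * Returns 0 on success, -1 if the data does not fit or the flush fails. */
static int segment_append(MappedSegment *seg, const void *buf, size_t len)
{
    assert(seg->base != NULL);
    if (len > seg->size - seg->pos)
        return -1;
    memcpy(seg->base + seg->pos, buf, len);

    /* msync() wants a page-aligned address, so flush from the start of the
     * page containing the first byte written. */
    size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
    size_t start = seg->pos - (seg->pos % pagesz);
    if (msync(seg->base + start, seg->pos + len - start, MS_SYNC) != 0)
        return -1;
    seg->pos += len;
    return 0;
}

/* Unmap the segment. */
static void segment_close(MappedSegment *seg)
{
    munmap(seg->base, seg->size);
    seg->base = NULL;
}
```

A benchmark tool could then time segment_append() against the usual
write()+fsync() path for the same record sizes.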

regards

[1]
https://docs.microsoft.com/en-us/archive/blogs/bobsql/how-it-works-it-just-runs-faster-non-volatile-memory-sql-server-tail-of-log-caching-on-nvdimm

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
