Re: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-03-09 06:53:45
Message-ID: CAOwnP3MATB4wP59G-GONoCi3moaf+HvxE_JRtT-3TqY9YHS8sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,

> Hello Takashi-san,
>
> On 3/5/21 9:08 AM, Takashi Menjo wrote:
> > Hi Tomas,
> >
> > Thank you so much for your report. I have read it with great interest.
> >
> > Your conclusion sounds reasonable to me. My patchset you call "NTT /
> > segments" got as good performance as "NTT / buffer" patchset. I have
> > been worried that calling mmap/munmap for each WAL segment file could
> > have a lot of overhead. Based on your performance tests, however, the
> > overhead looks less than what I thought. In addition, "NTT / segments"
> > patchset is more compatible to the current PG and more friendly to
> > DBAs because that patchset uses WAL segment files and does not
> > introduce any other new WAL-related file.
> >
>
> I agree. I was actually a bit surprised it performs this well, mostly in
> line with the "NTT / buffer" patchset. I've seen significant issues with
> our simple experimental patches, which however went away with larger WAL
> segments. But the "NTT / segments" patch does not have that issue, so
> either our patches were doing something wrong, or perhaps there was some
> other issue (not sure why larger WAL segments would improve that).
>
> Do these results match your benchmarks? Or are you seeing significantly
> different behavior?

I made a performance test for "NTT / segments" and added its results
to my previous report [1], on the same conditions. The updated graph
is attached to this mail. Note that some legends are renamed: "Mapped
WAL file" to "NTT / simple", and "Non-volatile WAL buffer" to "NTT /
buffer."

The graph tells me that "NTT / segments" performs as well as "NTT /
buffer." This matches with the results you reported.

> Do you have any thoughts regarding the impact of full-page writes? So
> far all the benchmarks we did focused on small OLTP transactions on data
> sets that fit into RAM. The assumption was that that's the workload that
> would benefit from this, but maybe that's missing something important
> about workloads producing much larger WAL records? Say, workloads
> working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large
amount of WAL (in the number of records or the size per record, or
both of them). Based on the case Gang reported and I have tried to
reproduce in this thread [2][3], the current inserting and flushing
method can be unsuitable for such workloads. The case was for "NTT /
buffer," but I think it can be also applied to "NTT / segments."

> The other question is whether simply placing WAL on DAX (without any
> code changes) is safe. If it's not, then all the "speedups" are computed
> with respect to unsafe configuration and so are useless. And BTT should
> be used instead, which would of course produce very different results.

I think it's safe, thanks to the checksum in the header of WAL record
(xl_crc in struct XLogRecord). In DAX mode, user data (WAL record
here) is written to the PMEM device by a smaller unit (probably a byte
or a cache line) than the traditional 512-byte disk sector. So a
torn-write such that "some bytes in a sector persist, other bytes not"
can occur when crash. AFAICS, however, the checksum for WAL records
can also support such a torn-write case.

> > I also think that supporting both file I/O and mmap is better than
> > supporting only mmap. I will continue my work on "NTT / segments"
> > patchset to support both ways.
> >
>
> +1
>
> > In the following, I will answer "Issues & Questions" you reported.
> >
> >
> >> While testing the "NTT / segments" patch, I repeatedly managed to crash the cluster with errors like this:
> >>
> >> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating logfile segment just before
> >> mapping; path "pg_wal/00000001000000070000002F"
> >> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating logfile segment just before
> >> mapping; path "pg_wal/000000010000000700000030"
> >> ...
> >> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating logfile segment just before
> >> mapping; path "pg_wal/000000010000000700000030"
> >> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open file
> >> "pg_wal/000000010000000700000030": No such file or directory
> >>
> >> I do believe this is a thinko in the 0008 patch, which does XLogFileInit in XLogFileMap. Notice there are multiple
> >> "creating logfile" messages with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
> >> called from multiple backends, so they may call XLogFileInit concurrently, likely triggering some sort of race
> >> condition. It's fairly rare issue, though - I've only seen it twice from ~20 runs.
> >
> > Thank you for your report. I found that rather the patch 0009 has an
> > issue, and that will also cause WAL loss. I should have set
> > use_existent to true, or InstallXlogFileSegment and BasicOpenFile in
> > XLogFileInit can be racy. I have misunderstood that use_existent can
> > be true because I am creating a brand-new file with XLogFileInit.
> >
> > I will fix the issue.
> >
>
> OK, thanks for looking into this.
>
> >
> >> The other question I have is about WALInsertLockUpdateInsertingAt. 0003 removes this function, but leaves
> >> behind some of the other bits working with insert locks and insertingAt. But it does not explain how it works without
> >> WaitXLogInsertionsToFinish() - how does it ensure that when we commit something, all the preceding WAL is
> >> "complete" (i.e. written by other backends etc.)?
> >
> > To wait for *all* the WALInsertLocks to be released, no matter each of
> > them precedes or follows the current insertion.
> >
> > It would have worked functionally, but I rethink it is not good for
> > performance because XLogFileMap in GetXLogBuffer (where
> > WaitXLogInsertionsToFinish is removed) can block because it can
> > eventually call write() in XLogFileInit.
> >
> > I will restore the WALInsertLockUpdateInsertingAt function and related
> > code for mmap.
> >
>
> OK. I'm still not entirely sure I understand if the current version is
> correct, but I'll wait for the reworked version.
>
>
> kind regards

Best regards,
Takashi

[1] https://www.postgresql.org/message-id/CAOwnP3OFofOsFtmeikQcbMp0YWdJn0kVB4Ka_0tj+Urq7dtAzQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/BYAPR11MB344801FF81E9C92A081D3E10E6080@BYAPR11MB3448.namprd11.prod.outlook.com
[3] https://www.postgresql.org/message-id/CAOwnP3NHAbVFOfAawZPs5ezn57_7fcX%3DKaaQ5YMgirc9rNrijQ%40mail.gmail.com

--
Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>

Attachment Content-Type Size
image/png 44.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2021-03-09 06:57:05 Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Previous Message Amit Kapila 2021-03-09 06:40:36 Re: Make stream_prepare an optional callback