Re: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-01-27 16:41:36
Message-ID: 93100c91-66e9-faa6-704c-ac47634e1203@enterprisedb.com
Lists: pgsql-hackers

On 1/25/21 3:56 AM, Masahiko Sawada wrote:
>>
>> ...
>>
>> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> ...
>>>
>>> While looking at the two methods: NTT and simple-no-buffer, I realized
>>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and
>>> pmem_drain()) WAL without acquiring WALWriteLock whereas
>>> simple-no-buffer patch acquires WALWriteLock to do that
>>> (pmem_persist()). I wonder if this also affected the performance
>>> differences between those two methods since WALWriteLock serializes
>>> the operations. With PMEM, multiple backends can concurrently flush
>>> the records if the memory regions do not overlap? If so, flushing
>>> WAL without WALWriteLock would be a big benefit.
>>>
>>
>> That's a very good question - it's quite possible the WALWriteLock is
>> not really needed, because the processes are actually "writing" the WAL
>> directly to PMEM. So it's a bit confusing, because it's only really
>> concerned about making sure it's flushed.
>>
>> And yes, multiple processes certainly can write to PMEM at the same
>> time, in fact it's a requirement to get good throughput I believe. My
>> understanding is we need ~8 processes, at least that's what I heard from
>> people with more PMEM experience.
>
> Thanks, that's good to know.
>
>>
>> TBH I'm not convinced the "simple-no-buffer" code (coming from the
>> 0002 patch) is actually correct. Essentially, consider that the
>> backend needs to do a flush, but does not have a segment mapped. So it
>> maps it and calls pmem_drain() on it.
>>
>> But does that actually flush anything? Does it properly flush changes
>> done by other processes that may not have called pmem_drain() yet? I
>> find this somewhat suspicious and I'd bet all processes that did write
>> something have to call pmem_drain().
>
For the record, from what I've learned / been told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.
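
Just to make that concrete, here's a rough sketch of the pattern (not code
from either patch - CopyXLogRecToPmem / DrainXLogPmem and the idea of a
directly mapped WAL area are made up for illustration): each backend copies
its record with a non-temporal pmem_memcpy() and skips the drain, and
whoever ends up in XLogFlush() issues the single pmem_drain():

#include <libpmem.h>

/* each backend copies its record into the mapped WAL area; the
 * PMEM_F_MEM_NODRAIN flag skips the final drain/fence */
static void
CopyXLogRecToPmem(char *dest, const char *rec, size_t len)
{
    pmem_memcpy(dest, rec, len,
                PMEM_F_MEM_NONTEMPORAL | PMEM_F_MEM_NODRAIN);
}

/* the process doing XLogFlush() issues the single drain; per the
 * above, that's (reportedly) enough to make the earlier copies done
 * by other processes durable as well */
static void
DrainXLogPmem(void)
{
    pmem_drain();
}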

My understanding is that there are three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even the "future"
part of the WAL after the requested LSN, which may negatively affect
performance, I guess. So I wonder if pmem_persist() would be a better fit,
as it allows specifying a range that should be persisted (a rough sketch
combining (a) and (b) follows after this list).

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like what we have for insertions, i.e. a handful of
locks allowing a limited number of concurrent inserts.
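
The sketch for (a) + (b) might look like this - all names (PmemXLogFlushTo,
walpmem_*) are made up, and it assumes the relevant WAL range is mapped
contiguously; the point is just that the locked section shrinks to the
bookkeeping, and pmem_persist() only covers the requested range:

#include "postgres.h"
#include <libpmem.h>
#include "access/xlogdefs.h"
#include "storage/spin.h"

/* stand-ins for the real shared state */
extern char       *walpmem_base;          /* start of the mapped WAL area */
extern XLogRecPtr  walpmem_start_lsn;     /* LSN mapped at walpmem_base */
extern XLogRecPtr  walpmem_flushed_upto;  /* protected by walpmem_lock */
extern slock_t     walpmem_lock;

static void
PmemXLogFlushTo(XLogRecPtr upto)
{
    XLogRecPtr  from;

    /* (a) short critical section: just read the current flushed position */
    SpinLockAcquire(&walpmem_lock);
    from = walpmem_flushed_upto;
    SpinLockRelease(&walpmem_lock);

    if (upto <= from)
        return;                 /* already durable up to the requested LSN */

    /* (b) persist only the requested range, not any "future" WAL */
    pmem_persist(walpmem_base + (from - walpmem_start_lsn), upto - from);

    /* advance the shared position only after the data is durable */
    SpinLockAcquire(&walpmem_lock);
    if (upto > walpmem_flushed_upto)
        walpmem_flushed_upto = upto;
    SpinLockRelease(&walpmem_lock);
}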

> Yeah, in terms of experiments at least it's good to find out that the
> approach of mmapping each WAL segment is not good for performance.
>
Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep them
mapped and just rename the underlying files when recycling them. That'd
keep the regular segment files, as expected by various tools, etc. The
question is what would happen when we temporarily need more WAL, etc.
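
A quick standalone way to test that idea (paths and sizes made up): the
mapping returned by pmem_map_file() stays valid across a rename(2) of the
underlying file, so recycling a segment would not require a fresh mmap and
would not pay the TLB cost again:

#include <libpmem.h>
#include <stdio.h>

#define WAL_SEG_SIZE ((size_t) 16 * 1024 * 1024)

int
main(void)
{
    size_t  mapped_len;
    int     is_pmem;
    char   *seg;

    /* map (and create, if needed) a 16MB segment on a DAX filesystem */
    seg = pmem_map_file("pg_wal/000000010000000000000001", WAL_SEG_SIZE,
                        PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (seg == NULL)
        return 1;

    /* ... the segment gets filled and eventually becomes recyclable ... */

    /* renaming the underlying file does not invalidate the mapping, so
     * the segment could be reused under its new name without another mmap */
    rename("pg_wal/000000010000000000000001",
           "pg_wal/000000010000000000000009");

    pmem_unmap(seg, mapped_len);
    return 0;
}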

>>>
>>> ...
>>>
>>> I think the performance improvement by the NTT patch with the 16MB WAL
>>> segment, the most common WAL segment size, is very good (150437 vs.
>>> 212410 with 64 clients). But maybe evaluating writing WAL segment
>>> files on a PMEM DAX filesystem is also worthwhile, as you mentioned,
>>> if we don't do that yet.
>>>
>>
>> Well, not sure. I think the question is still open whether it's actually
>> safe to run on DAX, which does not have atomic writes of 512B sectors,
>> and I think we rely on that e.g. for pg_control. But maybe for WAL that's
>> not an issue.
>
> I think we can use the Block Translation Table (BTT) driver that
> provides atomic sector updates.
>

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

 clients   master/btt   master/dax       ntt    simple
--------------------------------------------------------
       1         5469         7402      7977      6746
      16        48222        80869    107025     82343
      32        73974       158189    214718    158348
      64        85921       154540    225715    164248
      96       150602       221159    237008    217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at the
filesystem/kernel level, but I haven't tried that.

>>
>>>> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
>>>> huge read-write asymmetry (the writes being way slower), and their
>>>> recommendation (in "Observation 3") is:
>>>>
>>>> The read-write asymmetry of PMem implies the necessity of avoiding
>>>> writes as much as possible for PMem.
>>>>
>>>> So maybe we should not be trying to use PMEM for WAL, which is pretty
>>>> write-heavy (and in most cases even write-only).
>>>
>>> I think using PMEM for WAL is cost-effective, but it leverages only the
>>> low-latency (sequential) write, not other abilities such as
>>> fine-grained access and low-latency random writes. If we want to
>>> exploit its full abilities we might need some drastic changes to the
>>> logging protocol while considering storing data on PMEM.
>>>
>>
>> True. I think it's worth investigating whether it's sensible to use PMEM
>> for this purpose. It may turn out that replacing the DRAM WAL buffers
>> with writes directly to PMEM is not economical, and aggregating data in
>> a DRAM buffer is better :-(
>
> Yes. I think it might be interesting to do an analysis of the
> bottlenecks of the NTT patch with perf etc. If bottlenecks are moved to
> other places by removing WALWriteLock during flush, it's probably a
> good sign for further performance improvements. IIRC WALWriteLock is
> one of the main bottlenecks on OLTP workloads, although my memory might
> already be out of date.
>

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage itself
is expensive, and we're waiting for that to complete.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

But as I said, that is just my theory - I might be entirely wrong; it'd
be good to hack XLogFlush a bit and try it out.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
