Re: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Takashi Menjo <takashi(dot)menjo(at)gmail(dot)com>, Takashi Menjo <takashi(dot)menjou(dot)vg(at)hco(dot)ntt(dot)co(dot)jp>, "Deng, Gang" <gang(dot)deng(at)intel(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Non-volatile WAL buffer
Date: 2021-02-19 03:25:33
Message-ID: 361cf3fd-40d8-366a-a50a-778a91ff52bb@enterprisedb.com
Lists: pgsql-hackers

On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:
> ...
>
> I have heard from several DBMS experts that the appearance of huge and
> cheap non-volatile memory could revolutionize database system
> architecture. If the whole database fits in non-volatile memory, then we
> do not need buffers, WAL, ...
>
> But although multi-terabyte NVM announcements were made by IBM several
> years ago, I do not know of any successful DBMS prototype with such a new
> architecture.
>
> I tried to understand why...
>
IMHO those predictions are a bit too optimistic, because they often
assume PMEM behaves mostly like DRAM, except for the extra persistence.
But that's not quite true - throughput with PMEM is much lower in
general, peak throughput is reached with only a few processes (and then
drops quickly), etc. Over the last few years, though, we have focused on
optimizing for exactly the opposite - systems with many CPU cores and
processes, because that's what maximizes DRAM throughput.

I'm not saying a revolution is not possible, but it'll probably require
quite significant rethinking of the whole architecture, and it may take
multiple PMEM generations until the performance improves enough to make
this economical. Some systems are probably more suitable for this (e.g.
Redis is doing most of the work in a single process, IIRC).

The other challenge of course is availability of the hardware - most
users run on whatever is widely available at cloud providers. And PMEM
is unlikely to get there very soon, I'd guess. Until that happens, the
pressure from these customers will be (naturally) fairly low. Perhaps
someone will develop hardware appliances for on-premise setups, as was
quite common in the past. Not sure.

> It was very interesting to me to read this thread, which actually
> started in 2016 with the "Non-volatile Memory Logging" presentation at PGCon.
> As far as I understand from Tomas's results, right now using PMEM for WAL
> doesn't provide a substantial increase in performance.
>

At the moment, I'd probably agree. It's quite possible the PoC patches
are missing some optimizations and the difference might be better, but
even then the performance increase seems fairly modest and limited to
certain workloads.

> But the main advantage of PMEM, from my point of view, is that it allows
> avoiding write-ahead logging altogether!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in unknown / corrupted state.

Not to mention that WAL is used for physical and logical replication
(and thus HA), and so on.

> Certainly we need to change our algorithms to make it possible. Speaking
> about Postgres, we have to rewrite all indexes + heap
> and throw away buffer manager + WAL.
>

The problem with removing the buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes in RAM and
then do one (much slower) write to persistent storage than to pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but fit into
PMEM. But that seems like a fairly rare case, and even then it may be
more efficient to redesign the schema to fit into RAM somehow (sharding,
partitioning, ...).

> What can be used instead of standard B-Tree?
> For example there is description of multiword-CAS approach:
>
>    http://justinlevandoski.org/papers/mwcas.pdf
>
> and BzTree implementation on top of it:
>
>    https://www.cc.gatech.edu/~jarulraj/papers/2018.bztree.vldb.pdf
>
> There is a free BzTree implementation on GitHub:
>
>     git(at)github(dot)com:sfu-dis/bztree.git
>
> I tried to adapt it for Postgres. It was not so easy because:
> 1. It is written in modern C++ (-std=c++14)
> 2. It supports multithreading, but not multiprocess access
>
> So I had to patch the code of this library instead of just using it:
>
>   git(at)github(dot)com:postgrespro/bztree.git
>
> I have not yet tested the most interesting case: access to PMEM through
> PMDK, as I do not have the hardware for such tests.
> But the first results also seem interesting: PMwCAS is a kind of
> lockless algorithm, and it shows much better scaling on a
> NUMA host compared with standard Postgres.
>
> I have done a simple parallel insertion test: multiple clients
> insert data with random keys.
> To make the comparison with vanilla Postgres more honest, I used an
> unlogged table:
>
> create unlogged table t(pk int, payload int);
> create index on t using bztree(pk);
>
> randinsert.sql:
> insert into t (payload,pk) values
> (generate_series(1,1000),random()*1000000000);
>
> pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres
>
> So each client inserts one million records.
> The target system has 160 virtual and 80 physical cores with 256GB of RAM.
> Results (TPS) are the following:
>
> N      nbtree    bztree
> 1         540       455
> 10        993      2237
> 100      1479      5025
>
> So bztree is more than 3 times faster with 100 clients.
> Just for comparison: the result for inserting into this table without
> an index is 10k TPS.
>

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.

> I am then going to try to play with PMEM.
> If the results are promising, then it is possible to think about
> a reimplementation of the heap and a WAL-less Postgres!
>
> I am sorry that my post has no direct relation to the topic of this
> thread (Non-volatile WAL buffer).
> It seems to me that it is better to use PMEM to eliminate WAL altogether
> instead of optimizing it.
> Certainly, I realize that WAL plays a very important role in Postgres:
> archiving and replication are based on WAL. So even if we can live
> without WAL, it is still not clear whether we really want to live
> without it.
>
> One more idea: the multiword CAS approach requires us to express changes
> as editing sequences.
> Such an editing sequence is actually a ready-made WAL record. So
> implementors of access methods do not have to do double work: update the
> data structure in memory and create the corresponding WAL records.
> Moreover, PMwCAS operations are atomic: we can replay or revert them in
> case of a fault. So there is no need for FPW (full page writes), which
> has a very noticeable impact on WAL size and database performance.
>

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
