Write lifetime hints for NVMe

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Write lifetime hints for NVMe
Date: 2018-01-27 13:20:38
Message-ID: CA+q6zcX_iz9ekV7MyO6xGH1LHHhiutmHY34n1VHNN3dLf_4C4Q@mail.gmail.com
Lists: pgsql-hackers

Hi,

From what I can see, support for write lifetime hints for NVMe multi-streaming
was merged into the Linux kernel some time ago [1]. In theory, it allows data
with similar expected lifetimes to be written together on the media so that it
can be erased together, which minimizes garbage collection and results in
reduced write amplification as well as more efficient flash utilization [2]. I
couldn't find any discussion of this on hackers, so I decided to experiment
with the feature a bit. My idea was to test a quite naive approach in which all
file descriptors related to temporary files are assigned
`RWH_WRITE_LIFE_SHORT`, and the rest `RWH_WRITE_LIFE_EXTREME`. The attached
patch is a dead simple POC without any infrastructure to enable or disable the
hints.
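
For illustration, here is a minimal sketch of how such a hint can be set from
C, assuming Linux 4.13 or later, where `fcntl()` accepts `F_SET_RW_HINT` with a
pointer to a 64-bit hint value. The helper names `lifetime_hint_for` and
`set_write_lifetime_hint` are hypothetical and are not taken from the attached
patch; the fallback constants are assumed to mirror `linux/fcntl.h` for older
libc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback definitions for older libc headers (values as in linux/fcntl.h). */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

/*
 * Hypothetical helper: choose a hint following the naive policy described
 * above (temporary files are short-lived, everything else is long-lived).
 */
uint64_t
lifetime_hint_for(bool is_temp_file)
{
    return is_temp_file ? RWH_WRITE_LIFE_SHORT : RWH_WRITE_LIFE_EXTREME;
}

/*
 * Hypothetical helper: attach a write lifetime hint to an open file.
 * Returns 0 on success, or -1 with errno set (e.g. EBADF for a bad
 * descriptor, EINVAL on kernels without support for this command).
 */
int
set_write_lifetime_hint(int fd, uint64_t hint)
{
    /* Sanity check: RWH_WRITE_LIFE_EXTREME is the largest defined hint. */
    assert(hint <= RWH_WRITE_LIFE_EXTREME);

    /* The kernel expects a pointer to a 64-bit value, not the value itself. */
    return fcntl(fd, F_SET_RW_HINT, &hint);
}
```

A caller would invoke something like
`set_write_lifetime_hint(fd, lifetime_hint_for(true))` right after opening a
temporary file. A failure here is probably best treated as non-fatal, since the
hint is only advisory.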

It turns out that it's possible to run benchmarks on some EC2 instance types
(e.g. c5) with a suitable kernel version, since they expose a volume as an
NVMe device:

```
# nvme list
Node          SN                    Model                       Namespace  Usage             Format       FW Rev
------------  --------------------  --------------------------  ---------  ----------------  -----------  ------
/dev/nvme0n1  vol01cdbc7ec86f17346  Amazon Elastic Block Store  1          0.00 B / 8.59 GB  512 B + 0 B  1.0
```

To get some baseline results, I ran several rounds of pgbench on these quite
modest instances (dedicated, EBS-optimized), with a slightly adjusted
`max_wal_size` and an otherwise default configuration:

$ pgbench -s 200 -i
$ pgbench -T 600 -c 2 -j 2

Analyzing the `strace` output, I can see that during this test there was a
significant number of operations on pg_stat_tmp and xlogtemp, so I assume that
write lifetime hints should have some effect.

As a result, I got a latency reduction of about 5-8% (but so far these numbers
are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?

* Maybe someone else has similar results?

* Any suggestions about what the best/worst case scenarios for using this kind
of hint might be?

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf

Attachment Content-Type Size
nvme_write_lifetime_poc.patch application/octet-stream 6.0 KB
