Re: pgcon unconference / impact of block size on performance

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pgcon unconference / impact of block size on performance
Date: 2022-06-06 15:00:56
Message-ID: 62160038-cf65-72a6-4738-343454d72e87@enterprisedb.com
Lists: pgsql-hackers


On 6/6/22 16:27, Jakub Wartak wrote:
> Hi Tomas,
>
>> Hi,
>>
>> At one of the pgcon unconference sessions a couple days ago, I presented a
>> bunch of benchmark results comparing performance with different data/WAL
>> block size. Most of the OLTP results showed significant gains (up to 50%) with
>> smaller (4k) data pages.
>
> Nice. I just saw this
> https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do
> you have any plans for publishing those other graphs too (e.g. WAL block
> size impact)?
>

Well, there are plenty of charts in the github repositories, including the
charts I think you're asking for:

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png

I admit the charts may not be documented very clearly :-(

>> This opened a long discussion about possible explanations - I claimed one of the
>> main factors is the adoption of flash storage, due to pretty fundamental
>> differences between HDD and SSD systems. But the discussion concluded with an
>> agreement to continue investigating this, so here's an attempt to support the
>> claim with some measurements/data.
>>
>> Let me present results of low-level fio benchmarks on a couple different HDD
>> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
>> and demonstrates inherent HDD/SSD differences.
>> All the SSD results show this behavior - the Optane and Samsung nicely show
>> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
>> worse.
>>
> [..]
> Can you share which Linux kernel version, filesystem, mount options
> and LVM setup (if any) you were using?
>

The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
xfs filesystems.

i5 uses LVM on the 6x SATA SSD devices, with this config:

bench ~ # mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Thu Feb 8 15:05:49 2018
Raid Level : raid0
Array Size : 586106880 (558.96 GiB 600.17 GB)
Raid Devices : 6
Total Devices : 6
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Thu Feb 8 15:05:49 2018
State : clean
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0

Chunk Size : 512K

Consistency Policy : none

UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host bench)
Events : 0.1

Number   Major   Minor   RaidDevice   State
   0       8       1         0        active sync   /dev/sda1
   1       8      17         1        active sync   /dev/sdb1
   2       8      33         2        active sync   /dev/sdc1
   3       8      49         3        active sync   /dev/sdd1
   4       8      65         4        active sync   /dev/sde1
   5       8      81         5        active sync   /dev/sdf1

bench ~ # mount | grep md0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

and the xeon just uses ext4 on the device directly:

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)

> I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @
> ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio
> only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only.
> Results are attached, w/o graphs.
>
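For anyone wanting to repeat a run like the one described above, here is a minimal sketch of how such a benchmark matrix might be generated. It only builds fio command lines, it does not run them. The block sizes, access patterns, queue depth, libaio engine, 64GB file and 1min runtime come from the quoted description and results; the file path, the job names and the use of --direct=1 are assumptions, not something taken from the actual script.

```python
import shlex

BLOCK_SIZES_KB = [1, 2, 4, 8, 16, 32]   # block sizes that appear in the quoted results
PATTERNS = ["randread", "randwrite"]    # access patterns that appear in the quoted results


def fio_command(pattern, block_kb, iodepth=128, filename="/mnt/nvme/fio.test"):
    """Build one fio command line as a string, without running it."""
    args = [
        "fio",
        f"--name={pattern}-{block_kb}k",  # hypothetical job name
        f"--filename={filename}",         # hypothetical test file path
        "--ioengine=libaio",
        "--direct=1",                     # assumption: bypass the page cache
        f"--rw={pattern}",
        f"--bs={block_kb}k",
        f"--iodepth={iodepth}",
        "--size=64G",
        "--runtime=60",
        "--time_based",
    ]
    return shlex.join(args)


# One command per (pattern, block size) combination.
commands = [fio_command(p, bs) for p in PATTERNS for bs in BLOCK_SIZES_KB]
for command in commands:
    print(command)
```

Printing the commands rather than executing them makes it easy to review the matrix (or pipe it into a shell) before touching the device.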
>> Now, compare this to the SSD. There are some differences between
>> the models, manufacturers, interface etc. but the impact of page
>> size on IOPS is pretty clear. On the Optane you can get +20-30% by
>> using 4K pages, on the Samsung it's even more, etc. This means that
>> workloads dominated by random I/O get significant benefit from
>> smaller pages.
>
> Yup, same here, reproduced, 1.42x faster on writes:
> [root(at)x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
> [root(at)x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
> 1k/1.txt: bw=24162KB/s, iops=24161,
> 2k/1.txt: bw=47164KB/s, iops=23582,
> 4k/1.txt: bw=280450KB/s, iops=70112, <<<
> 8k/1.txt: bw=393082KB/s, iops=49135,
> 16k/1.txt: bw=393103KB/s, iops=24568,
> 32k/1.txt: bw=393283KB/s, iops=12290,
>
> BTW it's interesting to compare to your Optane 900P result (same
> two high bars for IOPS @ 4kB and 8kB), but in my case it's even more
> important to select 4kB, so it behaves more like the Samsung 860 in
> your case.
>

Thanks. Interesting!
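For reference, the ratios in the quoted randwrite results can be recomputed with a few lines of Python. This is just a sketch that parses the lines as pasted above (the regular expression and the helper names are mine, and the 8k baseline is picked because it's the default page size):

```python
import re

# Lines copied from the randwrite results quoted above (libaio, queue depth 128).
RANDWRITE = """\
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112,
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,
"""

LINE_RE = re.compile(r"(\d+)k/\d+\.txt: bw=(\d+)KB/s, iops=(\d+),")


def parse_results(text):
    """Return {block_size_kb: (bandwidth_kb_per_s, iops)} for each matching line."""
    results = {}
    for match in LINE_RE.finditer(text):
        block_kb, bw, iops = (int(group) for group in match.groups())
        results[block_kb] = (bw, iops)
    return results


def iops_ratio(results, block_kb, baseline_kb=8):
    """IOPS at block_kb relative to IOPS at the baseline block size."""
    return results[block_kb][1] / results[baseline_kb][1]


results = parse_results(RANDWRITE)
for block_kb in sorted(results):
    print(f"{block_kb:>2}k: {iops_ratio(results, block_kb):.2f}x the 8k IOPS")
```

The 4k row is the one behind the "1.42x faster on writes" figure (70112 / 49135).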

> # 1.41x on randreads
> [root(at)x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
> [root(at)x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
> 1k/1.txt: bw=169938KB/s, iops=169937,
> 2k/1.txt: bw=376653KB/s, iops=188326,
> 4k/1.txt: bw=691529KB/s, iops=172882, <<<
> 8k/1.txt: bw=976916KB/s, iops=122114,
> 16k/1.txt: bw=990524KB/s, iops=61907,
> 32k/1.txt: bw=974318KB/s, iops=30447,
>
> I think that the above is just a demonstration of device bandwidth
> saturation: 32k*30k IOPS =~ 1GB/s random reads. Given that, the DB would
> be tuned @ 4kB for the app (OLTP), but then the once-in-a-while Parallel
> Seq Scan "critical reports" could only achieve 70% of what they could
> achieve on 8kB, correct? (I'm assuming most real systems are really
> OLTP but with some reporting/data exporting needs).
>

Right, that's roughly my thinking too. Also, OLAP queries often do a lot
of random I/O, due to index scans etc.
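To make the saturation arithmetic explicit, here is a small sketch that multiplies block size by IOPS for the quoted randread numbers. The IOPS values are copied from the results above; the function and variable names are just for illustration.

```python
# {block size in KB: IOPS}, copied from the randread results quoted above.
RANDREAD_IOPS = {1: 169937, 2: 188326, 4: 172882, 8: 122114, 16: 61907, 32: 30447}


def bandwidth_kb_per_s(block_kb, iops):
    """Bandwidth is simply request size times requests per second."""
    return block_kb * iops


bandwidth = {bs: bandwidth_kb_per_s(bs, iops) for bs, iops in RANDREAD_IOPS.items()}
peak = max(bandwidth.values())

for bs in sorted(bandwidth):
    share = bandwidth[bs] / peak
    print(f"{bs:>2}k: {bandwidth[bs]:>7} KB/s ({share:.0%} of the best result)")

# Ratio behind the "70% of 8kB" remark.
ratio_4k_vs_8k = bandwidth[4] / bandwidth[8]
print(f"4k reaches {ratio_4k_vs_8k:.0%} of the 8k bandwidth")
```

From 8k upwards the product stays at roughly the same level (close to 1GB/s), i.e. the device is bandwidth-bound there, while at 4k and below it's the IOPS that limit throughput.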

I also wonder how this is related to filesystem page size - in all the
benchmarks I did I used the default (4k), but maybe it'd behave
differently if the filesystem page matched the data page.

> One way or another it would be very nice to be able to select the
> tradeoff using initdb(1) without the need to recompile, which then
> begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
> DB page size, ...).
>
> Do you envision any plans for this, or are we still in need of
> gathering more info on exactly why this happens? (perf reports?)
>

Not sure I follow. Plans for what? Something that calibrates cost
parameters? That might be useful, but that's a rather separate issue
from what's discussed here - page size, which has to be chosen before
initdb (at least with how things work currently).

The other issue (e.g. with effective_io_concurrency) is that it very
much depends on the access pattern - random pages and sequential pages
will require very different e_i_c values. But again, that's something to
discuss in a separate thread (e.g. [1])

[1]: https://postgr.es/m/Yl92RVoXVfs+z2Yj@momjian.us

> Also, did you guys discuss any long-term future plans for the storage
> layer at that meeting, by any chance? If sticking to 4kB pages for DB
> page size/hardware sector size, wouldn't it also be possible to win by
> disabling FPWs in the longer run using uring (assuming O_DIRECT
> | O_ATOMIC one day?)
>
> I recall that Thomas M. was researching O_ATOMIC, I think he wrote
> some of that pretty nicely in [1]
>
> [1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

No, no such discussion - at least not in this unconference slot.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
