RE: pgcon unconference / impact of block size on performance

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: RE: pgcon unconference / impact of block size on performance
Date: 2022-06-06 14:27:06
Message-ID: AS8PR07MB8249D56663CE82B5826D64B2F6A29@AS8PR07MB8249.eurprd07.prod.outlook.com
Lists: pgsql-hackers

Hi Tomas,

> Hi,
>
> At on of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%) with
> smaller (4k) data pages.

Nice. I just saw this: https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference . Do you have any plans to publish the other graphs too (e.g. the WAL block size impact)?

> This opened a long discussion about possible explanations - I claimed one of the
> main factors is the adoption of flash storage, due to pretty fundamental
> differences between HDD and SSD systems. But the discussion concluded with an
> agreement to continue investigating this, so here's an attempt to support the
> claim with some measurements/data.
>
> Let me present results of low-level fio benchmarks on a couple different HDD
> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
> and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely show
> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
> worse.
>
[..]
Can you share which Linux kernel version, filesystem, mount options, and LVM setup (if any) you were using?
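In case it helps to compare, something like the following captures those details (the mountpoint below is just an example path, not necessarily where your data sits):

uname -r                                     # kernel version
findmnt -o SOURCE,FSTYPE,OPTIONS /mnt/nvme   # filesystem and mount options
lsblk                                        # device layout (plus lvs, if LVM were in use)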

I hastily tried your script on a 4 vCPU / 32 GB RAM / single ~900 GB NVMe device (AWS i3.xlarge), kernel 5.x, ext4 with default mount options, no LVM, libaio only. Deviations from your fio setup: runtime reduced to 1 min, a 64 GB file, and 1 iteration only. Results are attached, without graphs.
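For reference, a minimal sketch of the kind of fio invocation this boils down to per block size (direct=1 and the file/output paths are my local choices, not necessarily matching your script):

fio --name=randwrite --ioengine=libaio --iodepth=128 --direct=1 \
    --rw=randwrite --bs=4k --size=64G --runtime=60 --time_based \
    --filename=/mnt/nvme/fio.testfile --output=4k/1.txt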

> Now, compare this to the SSD. There are some differences between the models, manufacturers, interface etc. but the impact of page size on IOPS is pretty clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung it's even more, etc. This means that workloads dominated by random I/O get significant benefit from smaller pages.

Yup, same here, reproduced: 4k is 1.42x faster than 8k (in random-write IOPS):
[root(at)x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root(at)x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,
BTW it's interesting to compare with your Optane 900P result (the same two high IOPS bars at 4 kB and 8 kB), but in my case it's even more important to select 4 kB, so it behaves more like the Samsung 860 in your results.

# and 1.41x faster (4k vs 8k) on random reads
[root(at)x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root(at)x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32 kB * ~30k IOPS =~ 1 GB/s of random reads. Given that the DB would be tuned to 4 kB for the app (OLTP), the occasional Parallel Seq Scan for "critical reports" could then only achieve ~70% of what it would achieve on 8 kB, correct? (I'm assuming most real systems are really OLTP, but with some reporting/data-export needs.) One way or another it would be very nice to be able to select that tradeoff via initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...).
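A quick back-of-the-envelope check against the randread numbers above (just block size times IOPS from the fio output):

$ echo $((32 * 30447))    # 32 kB blocks * IOPS from the 32k run
974304                    # ~0.97 GB/s, in line with the bw=974318KB/s fio reports
$ echo $((16 * 61907))    # 16 kB run
990512                    # already at the same ~1 GB/s ceiling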

Do you envision any plans for this, or do we still need to gather more info on exactly why this happens (perf reports?)?

Also, did you guys by any chance discuss at that meeting any long-term plans for the storage layer? If we stuck to 4 kB for the DB page size and hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run, using io_uring (assuming O_DIRECT | O_ATOMIC one day)?
I recall that Thomas M. was researching O_ATOMIC; I think he wrote some of that up pretty nicely in [1].

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

Attachment Content-Type Size
libaio-2022-06-06.tgz application/x-compressed 53.5 KB
