From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-07 16:17:47
Message-ID: d43bdf2f-aed3-46c2-8cfe-49383abde3a1@vondra.me
Lists: pgsql-hackers
On 10/7/25 17:32, Andres Freund wrote:
> Hi,
>
> On 2025-10-07 16:23:36 +0200, Tomas Vondra wrote:
>> On 10/7/25 14:08, Tomas Vondra wrote:
>>> ...
>>>>>>>> I think doing this kind of measurement via normal SQL query processing is
>>>>>>>> almost always going to have too much other influences. I'd measure using fio
>>>>>>>> or such instead. It'd be interesting to see fio numbers for your disks...
>>>>>>>>
>>>>>>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>>>>>>> vs --rw randread
>>>>>>>>
>>>>>>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>>>>>>
>>>>>>>
>>>>>>> I can give it a try. But do we really want to strip "our" overhead with
>>>>>>> reading data?
>>>
>>> I got this on the two RAID devices (NVMe and SATA):
>>>
>>> NVMe: 83.5k / 15.8k
>>> SATA: 28.6k / 8.5k
>>>
>>> So the same ballpark / ratio as your test. Not surprising, really.
>>>
>>
>> FWIW I do see about this number in iostat. There's a 500M test running
>> right now, and iostat reports this:
>>
>> Device r/s rkB/s ... rareq-sz ... %util
>> md1 15273.10 143512.80 ... 9.40 ... 93.64
>>
>> So it's not like we're issuing far fewer I/Os than the SSD can handle.
>
> Not really related to this thread:
>
> IME iostat's utilization is pretty much useless for anything other than "is
> something happening at all", and even that is not reliable. I don't know the
> full reason for it, but I long learned to just discount it.
>
> I ran
> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=100 --ioengine pvsync --iodepth 1 --rate_iops=40000
>
> a few times in a row, while watching iostat. Sometimes utilization is 100%,
> sometimes it's 0.2%. Whereas if I run without rate limiting, utilization
> never goes above 71%, despite doing more iops.
>
>
> And then it gets completely useless if you use a deeper iodepth, because there's
> just not a good way to compute something like a utilization number once
> you take parallel IO processing into account.
>
> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=100 --ioengine io_uring --iodepth 1 --rw randread
> iodepth util iops
> 1 94% 9.3k
> 2 99.6% 18.4k
> 4 100% 35.9k
> 8 100% 68.0k
> 16 100% 123k
>
Yeah. Interpreting %util is hard, and the value on its own is borderline
useless. I only included it because it's the last thing on the line.
AFAIK the reason it doesn't say much is that it only tells you the
device is doing *something*, nothing about how close it is to its
bandwidth/throughput limit. It's very obvious on RAID storage, where you
can see 100% util on the md device while the members are used only at
25%. SSDs are similar internally, except that the members are not
visible.
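FWIW an easy way to see that is to watch the array and its members side
by side while one of the fio jobs above is running (the member names
here are just an example, substitute whatever md1 is actually built
from):

  iostat -x md1 nvme0n1 nvme1n1 1

With a random-read job running you get the pattern described above:
~100% util on md1 while the member devices report much lower numbers,
even though they are the ones doing the actual work.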
regards
--
Tomas Vondra