Re: Dynamic gathering the values for seq_page_cost/xxx_cost

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Dynamic gathering the values for seq_page_cost/xxx_cost
Date: 2020-09-21 03:41:09
Message-ID: CAKU4AWpSvKu+Oeo8rpDfCSPCzgSK2GB8MVm5cGwkHQ37eO8-Og@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas:
Thanks for checking.

On Fri, Sep 18, 2020 at 9:50 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> >I recently tried something in this direction and the result looks
> >promising based on my limited test.
> >
> >Since the unit of every xxx_cost is "seq_page_cost", how to detect
> >seq_page_cost is important. In the cost model, the IO cost of a seqscan is
> >rel->pages * seq_page_cost; it doesn't consider any cache (file system
> >cache or shared buffer cache), but it does assume the OS will prefetch the
> >IO. So to detect seq_page_cost, I enabled the prefetch but avoided the
> >file system cache. I tested this with: 1). dropping the cache on the file
> >system; 2). opening the test file without O_DIRECT so that the prefetch
> >can work.
> >
> >To detect the random page read, I read the file with pread at random
> >offsets. Since the random offsets may repeat during the test, dropping the
> >file system cache only at the beginning doesn't work, so I open the file
> >with the O_DIRECT option.
> >
> >I also measured the cost of reading a page from the file system cache;
> >during my test it is about 10% of a seq scan read.
> >
> >After I get the basic numbers about the hardware capability, I let the
> >user provide a cache hit ratio (this is a place where we can further
> >improve if this direction is right).
> >
> >Here is the test result on my hardware.
> >
> >fs_cache_lat = 0.832025us, seq_read_lat = 8.570290us, random_page_lat = 73.987732us
> >
> >cache hit ratio: 1.000000 random_page_cost 1.000000
> >cache hit ratio: 0.900000 random_page_cost 5.073692
> >cache hit ratio: 0.500000 random_page_cost 7.957589
> >cache hit ratio: 0.100000 random_page_cost 8.551591
> >cache hit ratio: 0.000000 random_page_cost 8.633049
> >
> >
> >Then I tested the suggested value with the 10GB TPCH workload. I compared
> >the plans with 2 different random_page_cost settings: 1). 4, the default
> >value; 2). 8.6, the value suggested for cache hit ratio = 0. With that, 11
> >out of the 22 queries generated a different plan. Finally I dropped the
> >cache (both file system cache and shared_buffers) before running each
> >query and ran those 11 queries under the 2 different settings. The
> >execution times (in ms) are below.
> >
> >
> >|     | random_page_cost=4 | random_page_cost=8.6 |
> >|-----+--------------------+----------------------|
> >| Q1  |           1425.964 |             1121.928 |
> >| Q2  |           2553.072 |             2567.450 |
> >| Q5  |           4397.514 |             1475.343 |
> >| Q6  |          12576.985 |             4622.503 |
> >| Q7  |           3459.777 |             2987.241 |
> >| Q8  |           8360.995 |             8415.311 |
> >| Q9  |           4661.842 |             2930.370 |
> >| Q11 |           4885.289 |             2348.541 |
> >| Q13 |           2610.937 |             1497.776 |
> >| Q20 |          13218.122 |            10985.738 |
> >| Q21 |            264.639 |              262.350 |
> >
> >
> >The attached main.c is the program I used to detect the random_page_cost.
> >result.tar.gz is the test result; you can run "git log" first to see the
> >differences in plans and execution stats.
> >
> >Any feedback is welcome. Thanks!
> >
>
> That seems pretty neat. What kind of hardware have you done these tests
> on?

The following is my hardware info.

I have 12 SSDs behind an MR9271-8i RAID controller, which has a 1GB buffer.
[1]

root# lshw -short -C disk
H/W path           Device     Class  Description
==============================================================
/0/100/2/0/2.0.0   /dev/sda   disk   2398GB MR9271-8i
/0/100/2/0/2.1.0   /dev/sdb   disk   5597GB MR9271-8i   <-- my data location

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL

Adapter #0

Memory Size : 1024MB
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
..
Current Cache Policy: WriteBack, ReadAheadNone, Direct, Write Cache OK if Bad BBU
...
Device Present
================
Virtual Drives : 2
Degraded : 0
Offline : 0
Physical Devices : 14
Disks : 12
Critical Disks : 0
Failed Disks : 0

root# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | egrep 'Media Type|Raw Size'
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Media Type: Solid State Device

CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 processors.
Memory: 251 GB
Linux: 3.10.0-327
fs: ext4. mount options: defaults,noatime,nodiratime,nodelalloc,barrier=0
Physical machine.

> It's probably worth testing on various other storage systems to see
> how that applies to those.
>

Yes, I can test more on new hardware once I get it; that is still in
progress. However, right now I can only get a physical machine with SSD or
a virtual machine with SSD; other storage types are hard for me to get.

> Have you tried existing I/O testing tools, e.g. fio? If your idea is to
> propose some built-in tool (similar to pg_test_fsync) then we probably
> should not rely on external tools, but I wonder if we're getting the
> same numbers.
>

Thanks for this hint. I found more interesting stuff during the comparison.

I defined the fio jobs as below.

random_page_cost.job:
[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[random_page_cost]
direct=1
readwrite=randread

Even though it is direct IO, the device cache still plays an important
part: it is filled while preparing the test data file. I invalidate the
device cache by writing a new dummy file. With that, the avg latency is
148 us.
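
Roughly, what this job (and my detection program) measures looks like the
sketch below. It is only a simplified illustration; the file name, block
size and iteration count are placeholders, and the real code is in the
attached main.c.

/*
 * Minimal sketch of an O_DIRECT random-read latency probe.  The file name,
 * block size and iteration count below are placeholders, not the values
 * used by the attached main.c.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ      8192
#define ITERATIONS  10000

int main(void)
{
    /* O_DIRECT bypasses the fs cache, but not the RAID/device cache. */
    int fd = open("/u01/yizhi/data/fio/testfile", O_RDONLY | O_DIRECT);
    struct stat st;
    struct timespec start, end;
    void *buf;
    long nblocks;
    double us;

    if (fd < 0) { perror("open"); return 1; }
    fstat(fd, &st);
    nblocks = st.st_size / BLCKSZ;

    /* O_DIRECT requires a suitably aligned buffer. */
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0) return 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
    {
        off_t offset = (off_t) (random() % nblocks) * BLCKSZ;

        if (pread(fd, buf, BLCKSZ, offset) != BLCKSZ)
        {
            perror("pread");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    us = (end.tv_sec - start.tv_sec) * 1e6 +
         (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("random_page_lat = %fus\n", us / ITERATIONS);

    close(fd);
    free(buf);
    return 0;
}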

seq.job

[global]
blocksize=8k
size=1Gi
filesize=1Gi
ioengine=sync
directory=/u01/yizhi/data/fio

[seq_page_cost]
buffered=1
readwrite=read

For seq read, we need buffered IO so the prefetch works, but we still need
to bypass the file system cache and the device cache. fio has no control
over these caches, so I did:

1). Run fio once to generate the test file.
2). Invalidate the device cache with dd if=/dev/zero of=a_dummy_file
bs=1048576 count=1024.
3). Drop the file system cache.
4). Run fio again.

The final avg latency is ~12 us.
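
The sequential side of my program is measured with plain buffered reads so
that the OS readahead can kick in; a simplified sketch (again, the file name
is a placeholder and the real code is in the attached main.c) is:

/*
 * Minimal sketch of the buffered sequential-read side.  Buffered IO is used
 * on purpose so that the OS readahead (prefetch) can kick in.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ  8192

int main(void)
{
    int fd = open("/u01/yizhi/data/fio/testfile", O_RDONLY);
    char buf[BLCKSZ];
    long nblocks = 0;
    struct timespec start, end;
    double us;

    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (read(fd, buf, BLCKSZ) == BLCKSZ)    /* sequential 8kB reads */
        nblocks++;
    clock_gettime(CLOCK_MONOTONIC, &end);

    us = (end.tv_sec - start.tv_sec) * 1e6 +
         (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("seq_read_lat = %fus over %ld blocks\n", us / nblocks, nblocks);

    close(fd);
    return 0;
}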

This is a 1.5x ~ 2x difference from my previous result (seq_read_lat =
8.570290us, random_page_lat = 73.987732us).

Here is a comparison after some changes to my detection program:

|                                                  | seq_read_lat (us) | random_read_lat (us) |
|--------------------------------------------------+-------------------+----------------------|
| FIO                                              |                12 |                  148 |
| Previous main.c                                  |               8.5 |                   74 |
| invalidate_device_cache before each testing      |                 9 |                  150 |
| prepare the test data file with O_DIRECT option  |                15 |                  150 |

In invalidate_device_cache, I just create another 1GB data file and read it
(see the invalidate_device_cache function); this is similar to the previous
fio setup.
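
For reference, such a helper could look roughly like the sketch below (my
simplified version, not the exact code in main.c; the dummy file path and
sizes are placeholders):

/*
 * Sketch of invalidating the device (RAID controller) cache by writing and
 * reading back a 1GB dummy file, so that whatever the test file left in the
 * controller cache gets evicted.  Path and sizes are illustrative only.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DUMMY_FILE  "/u01/yizhi/data/fio/cache_filler"
#define DUMMY_SIZE  (1024L * 1024 * 1024)    /* 1GB, >= controller cache */
#define CHUNK       (1024 * 1024)

void invalidate_device_cache(void)
{
    char *buf = malloc(CHUNK);
    int   fd;

    memset(buf, 'x', CHUNK);

    /* Write 1GB of new data and force it down to the device ... */
    fd = open(DUMMY_FILE, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    for (long written = 0; written < DUMMY_SIZE; written += CHUNK)
        write(fd, buf, CHUNK);
    fsync(fd);
    close(fd);

    /* ... then read it back, replacing the old controller cache content. */
    fd = open(DUMMY_FILE, O_RDONLY);
    while (read(fd, buf, CHUNK) > 0)
        ;
    close(fd);

    free(buf);
}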

"Prepare the test data file with O_DIRECT option" means the following: in
the past I prepared the test file with buffered IO, and before testing I
invalidated the device cache and the file system cache, but the file
prepared with buffered IO still got better performance, and I have no
explanation for that. Since I don't want any cache involved, I switched to
the O_DIRECT option at last; the seq_read_lat then changed from 9us to 15us.
I still can't explain the remaining 25% difference from the FIO result
(12us vs 9us).

In the end, the suggested random_page_cost doesn't change very much:

/u/y/g/fdirect> sudo ./main
fs_cache_lat = 0.569031us, seq_read_lat = 18.901749us, random_page_lat = 148.650589us

cache hit ratio: 1.000000 random_page_cost 1.000000
cache hit ratio: 0.900000 random_page_cost 6.401019
cache hit ratio: 0.500000 random_page_cost 7.663772
cache hit ratio: 0.100000 random_page_cost 7.841498
cache hit ratio: 0.000000 random_page_cost 7.864383

This result looks very different from the common advice of "use 1.1 ~ 1.5
for SSD".
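
For clarity, the suggested values above can be reproduced by taking the
ratio of the cache-adjusted random read latency to the cache-adjusted
sequential read latency. The sketch below is my reading of the printed
output; the authoritative calculation is in the attached main.c.

/*
 * Sketch of how the suggested random_page_cost above can be reproduced from
 * the three measured latencies and a user-supplied cache hit ratio.  This is
 * a reconstruction from the printed output, not necessarily the exact code
 * in main.c.
 */
#include <stdio.h>

static double
suggested_random_page_cost(double fs_cache_lat, double seq_read_lat,
                           double random_page_lat, double hit_ratio)
{
    /* Effective latencies, assuming a hit is served from the fs cache. */
    double seq_lat = hit_ratio * fs_cache_lat +
                     (1 - hit_ratio) * seq_read_lat;
    double rand_lat = hit_ratio * fs_cache_lat +
                      (1 - hit_ratio) * random_page_lat;

    /* random_page_cost is expressed in units of seq_page_cost. */
    return rand_lat / seq_lat;
}

int main(void)
{
    double ratios[] = {1.0, 0.9, 0.5, 0.1, 0.0};

    for (int i = 0; i < 5; i++)
        printf("cache hit ratio: %f random_page_cost %f\n", ratios[i],
               suggested_random_page_cost(0.569031, 18.901749,
                                          148.650589, ratios[i]));
    return 0;
}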

Attached is the modified detection program.

[1]
https://www.cdw.com/product/lsi-megaraid-sas-9271-8i-storage-controller-raid-sas-pcie-3.0-x8/4576538#PO

--
Best Regards
Andy Fan

Attachment Content-Type Size
main.c application/octet-stream 7.5 KB
