Re: automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: automatic analyze: readahead - add "IO read time" log message
Date: 2020-11-03 09:33:31
Message-ID: VI1PR0701MB696040EA07BC742E18313060F6110@VI1PR0701MB6960.eurprd07.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API changes required to find shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests". One thing that I've noticed is that needs to be only for automatic-analyze, but not for automatic-vacuum where apparently there is some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and I have repeated the exercise in more realistic
and strict environment that previous one to conclude that the current situation is preferable:

Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might is different layer and might have different storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical output of PgSQL12:
INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads [ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s

One thing not clear here is maybe in future worth measuring how striped LVs are being
affected by readaheads.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time spent goes into directly block I/O layer waiting.

54.67% 1.33% postgres postgres [.] FileRead
---FileRead
--53.33%--__pread_nocancel
--50.67%--entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_pread64
|--49.33%--vfs_read
| --48.00%--__vfs_read
| |--45.33%--generic_file_read_iter
| | |--42.67%--ondemand_readahead
| | | __do_page_cache_readahead
| | | |--25.33%--ext4_mpage_readpages
| | | | |--10.67%--submit_bio
| | | | | generic_make_request
| | | | | |--8.00%--blk_mq_make_request
| | | | | | |--4.00%--blk_mq_get_request
| | | | | | | |--1.33%--blk_mq_get_tag
| | | | | | | --1.33%--sched_clock
| | | | | | | xen_sched_clock
| | | | | | | pvclock_clocksource_read
| | | | | | |--1.33%--bio_integrity_prep
| | | | | | --1.33%--blk_account_io_start
| | | | | | part_round_stats
| | | | | | blk_mq_in_flight
| | | | | | blk_mq_queue_tag_busy_iter
| | | | | --2.67%--dm_make_request
| | | | | __split_and_process_bio
| | | | | __split_and_process_non_flush
| | | | | |--1.33%--__map_bio
| | | | | | generic_make_request
| | | | | | generic_make_request_checks
| | | | | | percpu_counter_add_batch
| | | | | --1.33%--bio_alloc_bioset
| | | | | mempool_alloc
| | | | | kmem_cache_alloc
| | | | |--6.67%--ext4_map_blocks
| | | | | |--4.00%--ext4_es_lookup_extent
| | | | | | --2.67%--_raw_read_lock
| | | | | --2.67%--__check_block_validity.constprop.81
| | | | | ext4_data_block_valid
| | | | --6.67%--add_to_page_cache_lru
| | | | |--4.00%--__add_to_page_cache_locked
| | | | | --1.33%--mem_cgroup_try_charge
| | | | | get_mem_cgroup_from_mm
| | | | --2.67%--__lru_cache_add
| | | | pagevec_lru_move_fn
| | | | __lock_text_start
| | | |--12.00%--blk_finish_plug
| | | | blk_flush_plug_list
| | | | blk_mq_flush_plug_list
| | | | |--10.67%--__blk_mq_delay_run_hw_queue
| | | | | __blk_mq_run_hw_queue
| | | | | blk_mq_sched_dispatch_requests
| | | | | --9.33%--blk_mq_dispatch_rq_list
| | | | | nvme_queue_rq
| | | | | --1.33%--blk_mq_start_request

>The VACUUM case is going to be complicated by what's in the visibility
>map. (..)

After observing the ANALYZE readahead behavior benefit I've abandoned
the case of testing much more advanced VACUUM processing, clearly Linux
read-ahead is beneficial in even simple cases.

>> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().

Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

The ANALYZE just seem fit to be natural candidate to use it. The only easy chance
of acceleration of stats gathering - at least to me and enduser point of view -
is to have more parallel autoanalyze workers running to drive more I/O concurrency
(by e.g. partitioning the data), both in readahead and non-readahead scenarios.
Which is a pity because 70-80% of such process sits idle. The readahead might read
10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
would be it could throw thousands of requests without readahead and achieve much
better efficiency probably</speculation>

I hope the previous simple patch goes into master and helps other people understand
the picture more easily.

-J.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-11-03 09:36:50 Re: Online checksums verification in the backend
Previous Message Pavel Borisov 2020-11-03 09:30:41 Re: Bogus documentation for bogus geometric operators