Re: automatic analyze: readahead - add "IO read time" log message

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: automatic analyze: readahead - add "IO read time" log message
Date: 2020-10-27 15:35:26
Message-ID: 20201027153526.GD16415@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings Jakub,

* Jakub Wartak (Jakub(dot)Wartak(at)tomtom(dot)com) wrote:
> > The analyze is doing more-or-less random i/o since it's skipping through
> > the table picking out select blocks, not doing regular sequential i/o.
> VS
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> >Not really sure what's interesting here, but it does look like we're
> >skipping through the table as expected.
>
> Yes, but not randomly in this case. I wanted to point out that this is incrementing block number, therefore I've included this debug output which might trigger readahead heuristics.

Sure, it's incrementing, but it's skipping- this is very similar to what
we do with Bitmap Heap Scans, and that makes it different from a typical
sequential scan.

> Perhaps this depends on how the table was built / vacuumed ? (in this case, pure-INSERT-only; I would expect the same in time series DBs and DWHs).

No, this is how ANALYZE behaves and hasn't got much to do with how the
table was built or vacuumed.

> > With all those 'readahead' calls it certainly makes one wonder if the
> > Linux kernel is reading more than just the block we're looking for
> > because it thinks we're doing a sequential read and will therefore want
> > the next few blocks when, in reality, we're going to skip past them,
> > meaning that any readahead the kernel is doing is likely just wasted
> > I/O.
>
> I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API changes required to find shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests". One thing that I've noticed is that needs to be only for automatic-analyze, but not for automatic-vacuum where apparently there is some boost due to readahead.

Interesting that you weren't seeing any benefit to disabling readahead.
Were you able to see where the time in the kernel was going when
readahead was turned off for the ANALYZE?

Note that you shouldn't need to make changes to smgr/md/fd to leverage
posix_fadvise- what you would do is use PrefetchBuffer(), see
BitmapPrefetch().

The VACUUM case is going to be complicated by what's in the visibility
map. A VACUUM that isn't able to skip any pages yet is certainly going
to benefit from the kernel's readahead, but a VACUUM that's able to skip
over pages likely wouldn't benefit as much.

The way to test this would look something like:

- COPY a bunch of data into a table
- make sure to commit that and any other ongoing transactions
- VACUUM FREEZE the table
- check the visibility map to make sure most of the pages are marked as
all-frozen after the VACUUM FREEZE
- randomly UPDATE the table, to really get the effect, maybe update 20%
of the pages while leaving the rest alone (and therefore 80% of the
table should still have the all-frozen bit set in the visibility map)
- *then* do a VACUUM on the table and see what happens with different
amounts of read-ahead (or, ideally, with posix_fadvise() being used to
let the kernel know what pages we're going to actually want).

> > That definitely seems like a useful thing to include and thanks for the
> > patch!  Please be sure to register it in the commitfest app:
> > https://commitfest.postgresql.org
>
> Thank you! Thread is now registered.

Great!

> > I would think that, ideally, we'd teach analyze.c to work in the same
> > way that bitmap heap scans do- that is, use posix_fadvise to let the
> > kernel know what pages we're going to want next instead of the kernel
> > guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> > lot of time poking, but it doesn't look like analyze.c tries to do any
> > prefetching today.  In a similar vein, I wonder if VACUUM should be
> > doing prefetching too today, at least when it's skipping through the
> > heap based on the visibility map and jumping over all-frozen pages.
>
> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;)

Well, the idea is that posix_fadvise() usage through PrefetchBuffer()
gets us some of that by letting the kernel know what we're going to ask
for next. AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().

> > Haven't looked too closely at this but in general +1 on the idea and
> > this approach looks pretty reasonable to me.  Only thing I can think of
> > off-hand is to check how it compares to other places where we report IO
> > read time and make sure that it looks similar.
>
> Ok, I've changed the output in 002 version to include "avg read rate" just like in the autovacuum case but still maintaining single line output, e.g:
> automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU: user: 0.28 s, system: 0.26 s, elapsed: 2.94 s

Ah, in this case I had been thinking actually about the other things
that track_io_timing controls, as in:

ExplainPropertyFloat("I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->blk_read_time),
3, es);
ExplainPropertyFloat("I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
3, es);

Seems we already aren't exactly consistent in how we report these
things, unfortunately.

That said, since this is ANALYZE, yeah, it probably makes more sense to
have it look more like what VACUUM reports than what you'd get from
EXPLAIN. As an interesting point also- what VACUUM reports isn't
through using track_io_timing, which makes one wonder if we should be
doing that here or not..

Thanks!

Stephen

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2020-10-27 15:37:02 Re: Parallel copy
Previous Message vignesh C 2020-10-27 15:33:22 Re: Parallel copy