Re: BitmapHeapScan streaming read user and prelim refactoring

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-29 01:12:48
Message-ID: CA+hUKGJtm_gkmW_h_02-Q9ZRcG3yOx2uzVqbCTfz7YPnTfs+DA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> I think there's some sort of bug, triggering this assert in heapam
>
> Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

Thanks for the repro. I can't seem to reproduce it (still trying) but
I assume this is with Melanie's v11 patch set which had
v11-0016-v10-Read-Stream-API.patch.

Would you mind removing that commit and instead applying the v13
stream_read.c patches[1]? v10 stream_read.c was a little confused
about random I/O combining, which I fixed with a small adjustment to
the conditions for the "if" statement right at the end of
read_stream_look_ahead(). Sorry about that. The fixed version, with
eic=4, with your test query using WHERE a < a, ends its scan with:

...
posix_fadvise(32,0x28aee000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0@4\M-5:\0\0\^D\0\M-x\^A"...,40960,0x28acc000) = 40960 (0xa000)
posix_fadvise(32,0x28af4000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\^XC\M-6:\0\0\^D\0\M-x"...,32768,0x28ad8000) = 32768 (0x8000)
posix_fadvise(32,0x28afc000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M-XQ\M-7:\0\0\^D\0\M-x"...,24576,0x28ae4000) = 24576 (0x6000)
posix_fadvise(32,0x28b02000,0x8000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M^@3\M-8:\0\0\^D\0\M-x"...,16384,0x28aee000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-`\M-:\M-8:\0\0\^D\0"...,16384,0x28af4000) = 16384 (0x4000)
pread(32,"\0\0\0\0po\M-9:\0\0\^D\0\M-x\^A"...,16384,0x28afc000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-P\M-v\M-9:\0\0\^D\0"...,32768,0x28b02000) = 32768 (0x8000)

In other words it's able to coalesce, but v10 was a bit b0rked in that
respect and wouldn't do as well. Then if you set
io_combine_limit = 1, it looks more like master, e.g. lots of little
reads, but not as many fadvises as master because of sequential
access:

...
posix_fadvise(32,0x28af4000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28ae8000) = 8192 (0x2000)                      |
pread(32,...,8192,0x28aee000) = 8192 (0x2000)                      |
posix_fadvise(32,0x28afc000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) ---+
pread(32,...,8192,0x28af0000) = 8192 (0x2000)                      | |
pread(32,...,8192,0x28af4000) = 8192 (0x2000) <--------------------+ |
posix_fadvise(32,0x28b02000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -----+
pread(32,...,8192,0x28af6000) = 8192 (0x2000)                        | |
pread(32,...,8192,0x28afc000) = 8192 (0x2000) <----------------------+ |
pread(32,...,8192,0x28afe000) = 8192 (0x2000) }-- no advice            |
pread(32,...,8192,0x28b02000) = 8192 (0x2000) <------------------------+
pread(32,...,8192,0x28b04000) = 8192 (0x2000) }
pread(32,...,8192,0x28b06000) = 8192 (0x2000) }-- no advice
pread(32,...,8192,0x28b08000) = 8192 (0x2000) }
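
To make the relationship between the two traces concrete, here is a
minimal sketch of the coalescing idea. It is not the stream_read.c
logic, just an illustration: the CombinedRead type and the
coalesce_reads() function are hypothetical names invented for this
example, and it assumes 8kB blocks (as the 8192-byte preads suggest)
and block offsets arriving in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed block size, matching the 8192-byte preads */

/* Hypothetical type for this sketch: one combined read. */
typedef struct CombinedRead
{
	uint64_t	offset;			/* byte offset of the first block */
	int			nblocks;		/* number of consecutive blocks covered */
} CombinedRead;

/*
 * Merge ascending block offsets into reads of at most io_combine_limit
 * blocks each.  Writes the reads to "out" (which must have room for n
 * entries) and returns how many were produced.
 */
size_t
coalesce_reads(const uint64_t *offsets, size_t n,
			   int io_combine_limit, CombinedRead *out)
{
	size_t		nreads = 0;

	assert(io_combine_limit >= 1);

	for (size_t i = 0; i < n; i++)
	{
		if (nreads > 0)
		{
			CombinedRead *last = &out[nreads - 1];

			/* Extend the previous read if this block directly follows it. */
			if (last->nblocks < io_combine_limit &&
				offsets[i] == last->offset + (uint64_t) last->nblocks * BLCKSZ)
			{
				last->nblocks++;
				continue;
			}
		}

		/* Otherwise start a new read at this block. */
		out[nreads].offset = offsets[i];
		out[nreads].nblocks = 1;
		nreads++;
	}

	return nreads;
}
```

Feeding it the eleven pread offsets from the io_combine_limit = 1 trace
above with a limit of 16 yields five reads, and the last four
(0x28aee000, 0x28af4000 and 0x28afc000 at 2 blocks each, then
0x28b02000 at 4 blocks) line up with the final four preads of the eic=4
trace. The first offset stays alone only because the excerpt starts in
the middle of a run.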

It becomes slightly less eager to start I/Os as soon as
io_combine_limit > 1, because when it has hit max_ios, if ... <thinks>
yeah, if the average number of blocks it can combine per I/O is bigger
than 4, an arbitrary number from:

max_pinned_buffers = Max(max_ios * 4, io_combine_limit);

... then it can run out of look-ahead window before it can reach
max_ios (aka eic), so that's a kind of arbitrary/bogus I/O depth
constraint, which is another way of saying what I was saying earlier:
maybe it just needs more distance. So let's see the average combined
I/O length in your test query... for me it works out to 27,169 bytes
(about 3.3 blocks of 8kB). But I think there must be times when it runs
out of window due to clustering. So you could also try increasing that
4->8 to see what happens to performance.
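
The window arithmetic above can be sketched roughly as follows. This is
a back-of-envelope model, not code from the patch: the "multiplier"
parameter and the reachable_ios() helper are hypothetical, introduced
only so the 4->8 experiment can be tried on paper, and the model
assumes every I/O pins the same average number of blocks.

```c
#include <assert.h>

#define BLCKSZ 8192				/* assumed block size */

static int
max_int(int a, int b)
{
	return a > b ? a : b;
}

/*
 * The look-ahead window size from the formula quoted above, with the
 * constant 4 turned into a parameter (hypothetical) for experimenting.
 */
int
max_pinned_buffers(int max_ios, int multiplier, int io_combine_limit)
{
	return max_int(max_ios * multiplier, io_combine_limit);
}

/*
 * Hypothetical helper: how many I/Os of avg_blocks_per_io blocks fit in
 * the window at once, capped at max_ios.  A result below max_ios means
 * the window, not max_ios, is what limits the I/O depth.
 */
int
reachable_ios(int max_ios, int multiplier, int io_combine_limit,
			  double avg_blocks_per_io)
{
	int			window;
	int			fit;

	assert(avg_blocks_per_io >= 1.0);

	window = max_pinned_buffers(max_ios, multiplier, io_combine_limit);
	fit = (int) (window / avg_blocks_per_io);

	return fit < max_ios ? fit : max_ios;
}
```

With max_ios = 4 and io_combine_limit = 16 blocks (the latter is an
assumed setting, not stated above), the window is 16 buffers. At the
measured average of 27,169 / 8192, roughly 3.3 blocks per I/O, all 4
I/Os fit; at an average of 5 blocks only 3 fit, which is the
window-limited case described above. Raising the multiplier to 8 gives
a 32-buffer window, and 4 I/Os fit again.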

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B5UofvseJWv6YqKmuc_%3Drguc7VqKcNEG1eawKh3MzHXQ%40mail.gmail.com
