Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2023-11-28 13:06:06
Message-ID: d629c93c-e048-4194-b66b-f253c8c2b635@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 28/11/2023 14:17, Thomas Munro wrote:
> On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> in streaming_read.h:
>>
>>> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>>> uintptr_t pgsr_private,
>>> void *per_io_private,
>>> BufferManagerRelation *bmr,
>>> ForkNumber *forkNum,
>>> BlockNumber *blockNum,
>>> ReadBufferMode *mode);
>>
>> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
>> each read. I see that you used that in the WAL replay prefetching, so I
>> guess that makes sense.
>
> In this version I have introduced an alternative simple callback.
> It's approximately what we had already tried out in an earlier version
> before I started streamifying recovery, but in this version you can
> choose, so recovery can opt for the wider callback.

Ok. Two APIs is a bit redundant, but because most callers would prefer
the simpler API, that's probably a good tradeoff.

> I've added some ramp-up logic. The idea is that after we streamify
> everything in sight, we don't want to penalise users that don't really
> need more than one or two blocks, but don't know that yet. Here is
> how the system calls look when you do pg_prewarm():
>
> pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
> pread64(32, ..., 16384, 8192) = 16384
> pread64(32, ..., 32768, 24576) = 32768
> pread64(32, ..., 65536, 57344) = 65536
> pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16
> blocks at a time
> pread64(32, ..., 131072, 253952) = 131072
> pread64(32, ..., 131072, 385024) = 131072

> I guess it could be done in quite a few different ways and I'm open to
> better ideas. This way inserts prefetching stalls but ramps up
> quickly and is soon out of the way. I wonder if we would want to make
> that a policy that a caller can disable, if you want to skip the
> ramp-up and go straight for the largest possible I/O size? Then I
> think we'd need a 'flags' argument to the streaming read constructor
> functions.

I think a 'flags' argument and a way to opt-out of the slow start would
make sense. pg_prewarm in particular knows that it will read the whole
relation.

> A small detour: While contemplating how this interacts with parallel
> sequential scan, which also has a notion of ramping up, I noticed
> another problem. One parallel seq scan process does this:
>
> fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 35127296) = 131072
> preadv(32, [...], 2, 35258368) = 131072
> fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 36175872) = 131072
> preadv(32, [...], 2, 36306944) = 131072
> ...
>
> We don't really want those fadvise() calls. We don't get them with
> parallelism disabled, because streaming_read.c is careful not to
> generate advice for sequential workloads based on ancient wisdom from
> this mailing list, re-confirmed on recent Linux: WILLNEED hints
> actually get in the way of Linux's own prefetching and slow you down,
> so we only want them for truly random access. But the logic can't see
> that another process is making holes in this process's sequence.

Hmm, aside from making the sequential pattern invisible to this process,
are we defeating Linux's logic too, just by performing the reads from
multiple processes? The processes might issue the reads to the kernel
out-of-order.

How bad is the slowdown when you issue WILLNEED hints on sequential access?

> The two obvious solutions are (1) pass in a flag at the start saying
> "I promise this is sequential even if it doesn't look like it, no
> hints please" and (2) invent "shared" (cross-process) streaming
> reads, and teach all the parallel seq scan processes to get their
> buffers from there.
>
> Idea (2) is interesting to think about but even if it is a useful idea
> (not sure) it is certainly overkill just to solve this little problem
> for now. So perhaps I should implement (1), which would be another
> reason to add a flags argument. It's not a perfect solution though
> because some more 'data driven' parallel scans (indexes, bitmaps, ...)
> have a similar problem that is less amenable to top-down kludgery.

(1) seems fine to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2023-11-28 13:26:52 Re: Missing docs on AT TIME ZONE precedence?
Previous Message Jingxian Li 2023-11-28 12:52:31 [PATCH] LockAcquireExtended improvement