Re: Trying out read streams in pgvector (an extension)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Trying out read streams in pgvector (an extension)
Date: 2025-11-12 10:47:16
Message-ID: CA+hUKGL7-Dx8KiUo=G91Y5tfFpwDUFFQJ6=9D8Gr1n=DZxGh+w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 12, 2025 at 9:04 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> 0002:
>
> + /* End-of-stream. */
> + buf = read_stream_next_buffer(stream, NULL);
> + Assert(buf == InvalidBuffer);
> + buf = read_stream_next_buffer(stream, NULL);
> + Assert(buf == InvalidBuffer);
>
> I noticed there are two 'read_stream_next_buffer()' and
> 'InvalidBuffer' checks. Does having both provide any additional
> validation? I tried removing one of them, and the test still passed.

I wanted to demonstrate that this is a state that the stream is stuck
in until you call _resume().

I suppose an alternative design would be that _next_buffer() returns
InvalidBuffer only once (= the block number callback returns
InvalidBlock once) and then automatically resumes (= it restores the
distance) and then you can call read_stream_next_buffer() again (= the
block number callback will be called again to fill the stream up with
new buffers before waiting for the first one to be ready to give to
you if it isn't already). That would have the advantage of not
requiring a new function at all and make the patch even shorter, but I
don't know, I guess I thought that would be a bit more fragile in some
way, less explicit. Hmm, would it actually be better if it worked
like that?

> Also, there is one thing I wanted to clarify about the
> 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an
> 'InvalidBuffer', then we can use 'read_stream_resume()' alone because
> we know that we already consumed all buffers in the stream. For the
> rest, we need to use 'read_stream_resume()' together with the
> 'read_stream_reset()', right?

For the rest, there would be no need to call read_stream_resume().

To recap the uses of read_stream_reset(): the original purpose was to
release any buffers (pins) that the stream is holding internally
because of look-ahead, and put it back to its original state, ready to
be reused. It is called (1) by read_stream_end() as an implementation
detail (eg if a LIMIT or anything else except ERROR/FATAL ends your
query early, we need to unpin buffers queued in the stream before we
pfree it), (2) explicitly by rescans, (3) in hypothetical code I
thought about that would want to stream blocks speculatively and then
change its mind after predicting incorrectly (I had a few patches like
that, abandoned for now), and then (4) in this case, by places that
temporarily ran out of block numbers, but will have some more again
soon and want to resume the stream.

It was already debatable whether heuristic state like lookahead
distance should survive acoss rescans, or in other words, whether the
expected I/O requirements of the previous scan are a useful prediction
of the requirements of the next scan, but the answer is clearer in
case (4), hence the desire to find a way to separate that use case
from the others.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Álvaro Herrera 2025-11-12 10:49:22 Re: [PATCH] Add pg_get_subscription_ddl() function
Previous Message Yugo Nagata 2025-11-12 10:46:40 Re: Make PQgetResult() not return NULL on out-of-memory error