| From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Trying out read streams in pgvector (an extension) |
| Date: | 2025-11-18 21:17:28 |
| Message-ID: | CAAKRu_ZGhnWZXOyEyZ2r47g-F7U8asMRA6U8YZw3h=2rR=m_hQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Nov 12, 2025 at 5:47 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> I suppose an alternative design would be that _next_buffer() returns
> InvalidBuffer only once (= the block number callback returns
> InvalidBlock once) and then automatically resumes (= it restores the
> distance) and then you can call read_stream_next_buffer() again (= the
> block number callback will be called again to fill the stream up with
> new buffers before waiting for the first one to be ready to give to
> you if it isn't already). That would have the advantage of not
> requiring a new function at all and make the patch even shorter, but I
> don't know, I guess I thought that would be a bit more fragile in some
> way, less explicit. Hmm, would it actually be better if it worked
> like that?
We discussed off-list and decided that changing existing functionality
in an unexpected way is undesirable. So, it is better we stick with
adding read_stream_resume. However, in talking about
read_stream_resume() further, Thomas and I also thought of potential
issues with it:
If read_stream_resume() is called before the read stream user callback
has ever returned InvalidBlockNumber,
1) The value of resume_distance will be the original value of distance
from read_stream_begin_relation(). You don't want to reset the
distance to that value.
2) There may be inflight or completed buffers that have yet to be
yielded which will be returned the next time read_stream_next_buffer()
is invoked. If the user resets the state the callback is using to
return blocks and expects the next invocation of
read_stream_next_buffer() to return buffers with those blocks, they
will be disappointed.
If we try to address this by requiring that stream->distance is 0 when
read_stream_resume() is called, that won't work because while it is
set to 0 when the callback returns InvalidBlockNumber, there may still
be unreturned buffers in the stream.
If the user wants to use read_stream_reset() to exhaust the stream
before calling read_stream_resume(), read_stream_reset() sets
stream->distance to 1 at the end, so read_stream_resume() couldn't
detect if reset() was correctly called first or if the distance is > 0
because the stream is still in progress.
To make sure 1) distance isn't reset to a resume_distance from
read_stream_begin_relation() and 2) unexpected buffers aren't returned
from the read stream, we could error out in read_stream_resume() if
pinned_buffers > 0. And in read_stream_reset(), we would save distance
in resume_distance before clearing distance. That would allow calling
read_stream_resume() either if you called read_stream_reset() or if
you exhausted the stream yourself. See rough attached patch for a
sketch of this.
It would be nicer if we could error out if read_stream_next_buffer()
didn't return InvalidBuffer, but we can't do that if we want to allow
calling read_stream_reset() followed by read_stream_resume() because
distance won't be 0.
I tried this with a modified pgvector with an hnsw read stream user
and it seemed to work correctly.
- Melanie
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-resume.patch | text/x-patch | 3.2 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2025-11-18 21:18:38 | Re: Compile error on the aarch64 platform: Missing asm/hwcap.h |
| Previous Message | Nico Williams | 2025-11-18 21:17:18 | Re: GUC thread-safety approaches |