| From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
|---|---|
| To: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Trying out read streams in pgvector (an extension) |
| Date: | 2025-11-11 21:21:50 |
| Message-ID: | CA+hUKGL-3mBtkA9RTbLFHuSS5cviuv0ko7nBhCg9KM7Q-GSEkw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Sep 6, 2024 at 4:28 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Here's a new version with a TODO tidied up. I also understood that we
> need to tweak the read_stream_reset() function, so that it doesn't
> forget its current readahead distance when it hops between HNSW nodes
> (which is something that comes up in several other potential use
> cases, including another one I am working on in core). Without this
> patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total)
> before it has to take a break to hop to a new page, and then it starts
> again at 1. Oops. With this patch, it is less forgetful, and reaches
> the full possible I/O concurrency of 16 (or whatever the minimum of
> HNSW's m parameter and effective_io_concurrency is for you).
I heard that the pgvector project is now trying to do this for real,
and (surprise!) running into this problem. It causes streamified HNSW
search to regress in performance on some queries when the overheads of
streaming are not outweighed by the (bogusly constrained) gains in
concurrency. We just don't generate enough concurrency to win. I
probably should have been more opinionated and just committed a
version of that distance-reset policy change, but I guess at the time
I wrote the above, streaming and AIO were a little too abstract to
attract reviews relating to hypothetical external projects.
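To make the pattern concrete, here is roughly the shape of code that runs
into this. It is a hypothetical sketch only, not pgvector's actual code:
NeighborSource and load_neighbors_of_next_node are made-up names, and the
point is just that the same stream is reused across nodes with
read_stream_reset() at every hop, which is where the ramped-up look-ahead
distance is currently thrown away.

```c
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

typedef struct NeighborSource
{
	BlockNumber *blocks;		/* neighbor block numbers of current node */
	int			nblocks;
	int			next;
} NeighborSource;

/* hypothetical helper: fills *src with the next candidate node's neighbors */
static bool load_neighbors_of_next_node(NeighborSource *src);

/* Feed the stream one neighbor block number per call. */
static BlockNumber
neighbor_block_cb(ReadStream *stream, void *callback_private_data,
				  void *per_buffer_data)
{
	NeighborSource *src = (NeighborSource *) callback_private_data;

	if (src->next < src->nblocks)
		return src->blocks[src->next++];
	return InvalidBlockNumber;	/* current node's neighbors are exhausted */
}

static void
search_layer(Relation index, NeighborSource *src)
{
	ReadStream *stream;
	Buffer		buf;

	stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, index,
										MAIN_FORKNUM, neighbor_block_cb,
										src, 0);

	while (load_neighbors_of_next_node(src))
	{
		/* Consume the buffers the stream has (hopefully) read ahead. */
		while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
		{
			/* ... examine neighbor tuples on this page ... */
			ReleaseBuffer(buf);
		}

		/*
		 * Hop to the next node.  In released v18 this also drops the
		 * look-ahead distance back to 1, so I/O concurrency has to ramp up
		 * from scratch for every node.
		 */
		read_stream_reset(stream);
	}

	read_stream_end(stream);
}
```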
We definitely want to fix that for v19, because it also affects the
streamified index scan project and doubtless many other things. I
wrote about that with patches[1] and will start a new thread soon with
a new collection of rebased heuristics improvements.
But for now, to fix pgvector's woes, I wonder if it might make sense
to call this a bug in v18, and back-patch the tiniest possible change.
Something like what I posted[2] in this thread almost two years ago.
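For reference, the gist of that change is to carry the ramped-up distance
across the reset instead of starting over at 1. This is only a sketch under
my assumptions about read_stream.c's private ReadStream fields, not the
actual patch from [2], and the real function does a bit more housekeeping
than shown:

```c
void
read_stream_reset(ReadStream *stream)
{
	int16		saved_distance = stream->distance;	/* remember ramp-up */
	Buffer		buffer;

	/* Stop looking ahead while we drain what is already queued. */
	stream->distance = 0;

	/* Unpin anything the caller didn't consume. */
	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
		ReleaseBuffer(buffer);

	Assert(stream->pinned_buffers == 0);

	/* Resume with the distance we had already earned, rather than 1. */
	stream->distance = Max(saved_distance, 1);
}
```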
I don't think it really affects any core code: we use
read_stream_reset() only in very minimal ways there (I could
elaborate), and it's quite arguable that the existing policy is wrong
for them too, but we'd need to confirm that and perhaps think about
other extensions that might be using it.
Better ideas?
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGL6hCd40Dh1AcFcoiw5zJXK2T1dRKO3oe8RkPExqA5zoQ%40mail.gmail.com#181a22a8be99ff561b7beae44986870c
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bx2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow%40mail.gmail.com#9aa6012713b473611ae46d8e2032586f