Re: vector search support

From: Giuseppe Broccolo <g(dot)broccolo(dot)7(at)gmail(dot)com>
To: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, mail(at)joeconway(dot)com
Subject: Re: vector search support
Date: 2023-05-29 13:18:03
Message-ID: CAFtuf8AzttX4Vzy5AZebNY_PxKzve_aWYf+0YUFfeKJ0xzYm_A@mail.gmail.com
Lists: pgsql-hackers

Hi Jonathan,

On 5/26/23 3:38 PM, Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:

> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> > We finally opted for ElasticSearch as search engine, considering that it
> > was providing what we needed:
> >
> > * support to store dense vectors
> > * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.
>

Yes, I was aware of this. It would be enough to define the required GiST
support functions (I was somewhat in the loop when PG14 presorting support
was being added to GiST indexing in PostGIS[1]). That would be really
helpful indeed. I only mentioned it because I know of other teams that use
ElasticSearch as a dense-vector store solely for this purpose.

> > An internal benchmark showed us that we were able to achieve the
> > expected performance, although we are still lacking some points:
> >
> > * clustering of vectors (this has to be done outside the search engine,
> > using DBScan for our use case)
>
> From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?
>

Nope, it really depends on the use case. We used DBScan above mainly because
it clusters without knowing a priori how many clusters the algorithm should
find, which is a required parameter for K-means. Depending on the use case,
DBScan may also perform better on noisy datasets (i.e. those containing
entries that do not really belong to any particular cluster). Noise in
vectors obtained from embedding models is quite normal, especially when the
embedding model is not properly tuned/trained.

In our use case, DBScan was more or less the best choice, without biasing
the expected clusters.

PostGIS also includes an implementation of DBScan for its geometries[2].
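
For illustration only, here is a minimal pure-Python sketch of the DBScan
idea discussed above: clusters are grown from dense regions, so the number
of clusters is an output rather than an input, and isolated entries are
reported as noise. This is not the PostGIS implementation, nor the code
behind the benchmark mentioned earlier; the names dbscan, eps, min_pts and
the sample points are assumptions made up for the example, and the
neighbour search is brute force rather than index-backed.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force range query (includes the point itself);
        # a real system would answer this with an index.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep growing
                queue.extend(nbrs)
    return labels

# Two dense groups and one outlier; no cluster count is passed in.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

K-means, by contrast, would need k=2 supplied up front and would assign
the outlier to one of the two clusters instead of flagging it.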

> > * concurrency in updating the ElasticSearch indexes storing the dense
> > vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.

Oh, that would save a lot of overhead when updating indexed vectors! This is
needed whenever embedding models are re-trained: vectors are re-generated
and the indexes need to be updated.

Regards,
Giuseppe.

[1] https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284
[2] https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117
