Re: vector search support

From: Giuseppe Broccolo <g(dot)broccolo(dot)7(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, jkatz(at)postgresql(dot)org, mail(at)joeconway(dot)com
Subject: Re: vector search support
Date: 2023-04-26 13:31:37
Message-ID: CAFtuf8CR6LKu0sVfOBgEKjPtRf6=n=QZSWyD_+yWkSnKYMWD-A@mail.gmail.com
Lists: pgsql-hackers

Hi Nathan,

I find the patches really interesting. Personally, as a Data/MLOps Engineer,
I'm involved in a project where we use embedding techniques to generate
vectors from documents, and use clustering and kNN searches to find similar
documents based on the spatial neighbourhood of the generated vectors.

We finally opted for ElasticSearch as our search engine, since it
provided what we needed:

* support for storing dense vectors
* support for kNN searches (the latest version of ElasticSearch allows this)

An internal benchmark showed us that we were able to achieve the expected
performance, although some things are still missing:

* clustering of vectors (this has to be done outside the search engine,
using DBScan for our use case)
* concurrency in updating the ElasticSearch indexes storing the dense
vectors

I found these patches really interesting, considering that they would solve
some of the open issues when storing dense vectors. Index support would help
a lot with searches, though.
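
To illustrate why index support matters: without an index, a kNN search has
to compare the query against every stored vector. Below is a minimal Python
sketch of that exhaustive search, using Euclidean distance. The names
`euclidean`, `knn_bruteforce` and the sample data are hypothetical and only
for illustration; they are not taken from the patches or from ElasticSearch.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_bruteforce(query, vectors, k):
    """Return the k (distance, id) pairs closest to `query`.

    Every stored vector is visited, so the cost grows linearly with the
    number of rows. This is the full scan an index would avoid.
    """
    return heapq.nsmallest(
        k, ((euclidean(query, vec), vec_id) for vec_id, vec in vectors.items())
    )


# Hypothetical toy data: document id -> embedding vector.
docs = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [3.0, 4.0],
}
nearest = knn_bruteforce([0.0, 0.0], docs, k=2)
```

With real embeddings of more than a thousand dimensions and millions of
rows, this per-query scan is what makes index support attractive.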

I'm not sure whether it is best to include this in PostgreSQL core, but it
would be fantastic to have it as an extension.

All the best,
Giuseppe.

On Sat, 22 Apr 2023, 01:07 Nathan Bossart, <nathandbossart(at)gmail(dot)com> wrote:

> Attached is a proof-of-concept/work-in-progress patch set that adds
> functions for "vectors" represented with one-dimensional float8 arrays.
> These functions may be used in a variety of applications, but I am
> proposing them with the AI/ML use-cases in mind. I am posting this early
> in the v17 cycle in hopes of gathering feedback prior to PGCon.
>
> With the accessibility of AI/ML tools such as large language models (LLMs),
> there has been a demand for storing and manipulating high-dimensional
> vectors in PostgreSQL, particularly around nearest-neighbor queries. Many
> of these vectors have more than 1500 dimensions. The cube extension [0]
> provides some of the distance functionality (e.g., taxicab, Euclidean, and
> Chebyshev), but it is missing some popular functions (e.g., cosine
> similarity, dot product), and it is limited to 100 dimensions. We could
> extend cube to support more dimensions, but this would require reworking
> its indexing code and filling in gaps between the cube data type and the
> array types. For some previous discussion about using the cube extension
> for this kind of data, see [1].
>
> float8[] is well-supported and allows for effectively unlimited dimensions
> of data. float8 matches the common output format of many AI embeddings,
> and it allows us or extensions to implement indexing methods around these
> functions. This patch set does not yet contain indexing support, but we
> are exploring using GiST or GIN for the use-cases in question. It might
> also be desirable to add support for other linear algebra operations (e.g.,
> operations on matrices). The attached patches likely only scratch the
> surface of the "vector search" use-case.
>
> The patch set is broken up as follows:
>
> * 0001 does some minor refactoring of dsqrt() in preparation for 0002.
> * 0002 adds several vector-related functions, including distance functions
> and a kmeans++ implementation.
> * 0003 adds support for optionally using the OpenBLAS library, which is an
> implementation of the Basic Linear Algebra Subprograms [2]
> specification. Basic testing with this library showed a small
> performance boost, although perhaps not enough to justify giving this
> patch serious consideration.
>
> Of course, there are many open questions. For example, should PostgreSQL
> support this stuff out-of-the-box in the first place? And should we
> introduce a vector data type or SQL domains for treating float8[] as
> vectors? IMHO these vector search use-cases are an exciting opportunity
> for the PostgreSQL project, so I am eager to hear what folks think.
>
> [0] https://www.postgresql.org/docs/current/cube.html
> [1] https://postgr.es/m/2271927.1593097400%40sss.pgh.pa.us
> [2] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>
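
For reference, the distance and similarity measures named in the quoted
message (taxicab, Euclidean, Chebyshev, cosine similarity, dot product) can
be sketched in a few lines of Python. This is only an illustration of the
definitions over plain lists of floats; the function names are hypothetical
and do not reflect the names or behaviour of the functions in the patch set.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def taxicab_distance(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev_distance(a, b):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical sample vectors.
u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]
```

This sketch assumes both vectors have the same number of dimensions; a real
implementation would have to check that and decide how to treat NULL or
empty arrays.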
