Re: vector search support

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Cc: mail(at)joeconway(dot)com
Subject: Re: vector search support
Date: 2023-05-26 14:24:21
Message-ID: 221cf48d-7f5d-120c-e227-2bebdde40ccb@postgresql.org
Lists: pgsql-hackers

Hi,

On 4/21/23 8:07 PM, Nathan Bossart wrote:
> Attached is a proof-of-concept/work-in-progress patch set that adds
> functions for "vectors" represented with one-dimensional float8 arrays.
> These functions may be used in a variety of applications, but I am
> proposing them with the AI/ML use-cases in mind. I am posting this early
> in the v17 cycle in hopes of gathering feedback prior to PGCon.

Thanks for proposing this. Looking forward to discussing more in person.
There's definitely demand to use PostgreSQL to store / search over
vector data, and I do think we need to improve upon this in core.

> With the accessibility of AI/ML tools such as large language models (LLMs),
> there has been a demand for storing and manipulating high-dimensional
> vectors in PostgreSQL, particularly around nearest-neighbor queries. Many
> of these vectors have more than 1500 dimensions.

1536 seems to be a popular dimensionality coming out of LLMs, but I've
been seeing much higher dimensionality too (8K, 16K, etc.). My hunch is
that at a practical level, apps are going to favor data sets / sources
that use a reduced dimensionality, but I wouldn't be shocked if we see
vectors of all sizes.

> The cube extension [0]
> provides some of the distance functionality (e.g., taxicab, Euclidean, and
> Chebyshev), but it is missing some popular functions (e.g., cosine
> similarity, dot product), and it is limited to 100 dimensions. We could
> extend cube to support more dimensions, but this would require reworking
> its indexing code and filling in gaps between the cube data type and the
> array types. For some previous discussion about using the cube extension
> for this kind of data, see [1].

I've stared at the cube code quite a bit over the past few months.
There are definitely some clever methods in it for handling searches
over what is now considered lower-dimensionality data, but I generally
agree we should add functionality that's specific to ARRAY types.

I'll start making specific comments on the patches below.

> float8[] is well-supported and allows for effectively unlimited dimensions
> of data. float8 matches the common output format of many AI embeddings,
> and it allows us or extensions to implement indexing methods around these
> functions. This patch set does not yet contain indexing support, but we
> are exploring using GiST or GIN for the use-cases in question. It might
> also be desirable to add support for other linear algebra operations (e.g.,
> operations on matrices). The attached patches likely only scratch the
> surface of the "vector search" use-case.
>
> The patch set is broken up as follows:
>
> * 0001 does some minor refactoring of dsqrt() in preparation for 0002.

This seems pretty benign and we may as well do it anyway, though we may
need to expand on it based on comments on the next patch. One question
on:

+static inline float8
+float8_sqrt(const float8 val)
+{
+ float8 result;
+
+ if (unlikely(val < 0))

Should this be:

if (unlikely(float8_lt(val, 0)))

Similarly:

+ if (unlikely(result == 0.0) && val != 0.0)

if (unlikely(float8_eq(result, 0.0)) && float8_ne(val, 0.0))

> * 0002 adds several vector-related functions, including distance functions
> and a kmeans++ implementation.

Nice. Generally I like this patch. The functions seem to match the most
commonly used vector distance functions I'm seeing, and the patch
includes a function that lets a user specify a constraint on an ARRAY
column to ensure it contains valid vectors.
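
To sketch what I mean using nothing but existing built-ins (the 1536 is
an arbitrary dimensionality; the patch's validation function would
presumably wrap up checks along these lines):

-- a minimal sketch using only existing built-ins; 1536 is arbitrary
CREATE TABLE items (
    embedding float8[]
        CHECK (array_ndims(embedding) = 1
               AND cardinality(embedding) = 1536
               AND array_position(embedding, NULL) IS NULL)
);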

While I think supporting float8 is useful, I've been seeing a mix of
data types in the different AI/ML vector embeddings, i.e. float4 /
float8. Additionally, it could be helpful to support integers as well,
particularly based on some of the dimensionality reduction techniques
I've seen. I think this holds doubly true for kmeans, which is often
used in those calculations.

I'd suggest ensuring these functions support the following (sketched below):

* float4, float8
* int2, int4, int8
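
i.e. an overload set along these lines (the function name here is just
illustrative, not necessarily what the patch uses):

-- illustrative signatures only
cosine_distance(float4[], float4[]) returns float8
cosine_distance(float8[], float8[]) returns float8
cosine_distance(int2[], int2[]) returns float8
cosine_distance(int4[], int4[]) returns float8
cosine_distance(int8[], int8[]) returns float8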

There's probably some nuance to how we document this too, given our
docs [1] use the names real / double precision and smallint, integer,
bigint.

(Separately, we mention the int2/int4/int8 aliases in [1], but not
float4/float8, which seems like a small addition we should make).

If you agree, I'd be happy to review more closely once that's implemented.

Other things:

* kmeans -- we're using kmeans++; should the function name reflect
that? Do you think we could end up with a different kmeans algorithm in
the future? Maybe we let the user choose the kmeans algorithm, whether
via the function name or an argument (with the default / only option
today being kmeans++)?
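
For illustration, one possible shape of the argument approach (this
signature is purely hypothetical, not something in the patch):

-- hypothetical signature; algorithm chosen via an optional argument
kmeans(points float8[][], k int, algorithm text DEFAULT 'kmeans++')

-- vecs below is a hypothetical float8[][] value/column
SELECT kmeans(vecs, 8);                        -- kmeans++ today
SELECT kmeans(vecs, 8, algorithm => 'lloyd');  -- hypothetical future option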

> * 0003 adds support for optionally using the OpenBLAS library, which is an
> implementation of the Basic Linear Algebra Subprograms [2]
> specification. Basic testing with this library showed a small
> performance boost, although perhaps not enough to justify giving this
> patch serious consideration.

It'd be good to see what else we could use OpenBLAS with. Maybe that's a
discussion for PGCon.

> Of course, there are many open questions. For example, should PostgreSQL
> support this stuff out-of-the-box in the first place?

Yes :) One can argue this belongs in an extension (and pgvector [2]
already does a lot here), but I think native support would be generally
helpful for users. It removes the friction of getting started.

There's also an interesting use-case downthread (I'll comment on it
there) that demonstrates why it's helpful to allow variability in
vector size within an ARRAY column, which is an argument for supporting
this on plain arrays.
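
For example, a plain float8[] column happily stores vectors of
different lengths side by side, which a fixed-dimension vector type
would forbid:

-- both rows are accepted; the column imposes no fixed dimensionality
CREATE TABLE embeddings (id int, vec float8[]);
INSERT INTO embeddings VALUES
    (1, '{0.1, 0.2, 0.3}'),        -- 3 dimensions
    (2, '{0.5, 0.6, 0.7, 0.8}');   -- 4 dimensions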

> And should we
> introduce a vector data type or SQL domains for treating float8[] as
> vectors? IMHO these vector search use-cases are an exciting opportunity
> for the PostgreSQL project, so I am eager to hear what folks think.

Having a vector type could give us some advantages in how we
store / search over the data. For example, we could perform validation
checks up front, normalize the vector, etc., so that any index
implementation has less work to do on that front. We may also be able
to give more options to tune how the vector is stored, e.g. perform
inversion on insert/update.
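
As a half-step, a SQL domain already gets part of the way there today,
though without the storage- or index-level advantages of a true type:

-- a minimal sketch: validation only, no storage/index benefits
CREATE DOMAIN vector AS float8[]
    CHECK (array_ndims(VALUE) = 1
           AND array_position(VALUE, NULL) IS NULL);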

Again, it's a fair argument that this can be done in an extension, but
historically we've seen reduced friction when we add support in core.
It'd also make building additional functionality easier, whether in core
or an extension.

Thanks,

Jonathan

[1] https://www.postgresql.org/docs/current/datatype-numeric.html
[2] https://github.com/pgvector/pgvector
