Re: Increasing IndexTupleData.t_info from uint16 to uint32

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Montana Low <montana(at)postgresml(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Increasing IndexTupleData.t_info from uint16 to uint32
Date: 2024-01-18 16:22:24
Message-ID: CAEze2WiqzsdHw24cxjVTtsd2ZbYtrUrXWh0GDzVyCpUpY5upCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 18 Jan 2024 at 13:41, Montana Low <montana(at)postgresml(dot)org> wrote:
>
> The overall trend in machine learning embedding sizes has been growing rapidly over the last few years from 128 up to 4K dimensions yielding additional value and quality improvements. It's not clear when this trend in growth will ease. The leading text embedding models generate now exceeds the index storage available in IndexTupleData.t_info.
>
> The current index tuple size is stored in 13 bits of IndexTupleData.t_info, which limits the max size of an index tuple to 2^13 = 8129 bytes. Vectors implemented by pgvector currently use a 32 bit float for elements, which limits vector size to 2K dimensions, which is no longer state of the art.
>
> I've attached a patch that increases IndexTupleData.t_info from 16bits to 32bits allowing for significantly larger index tuple sizes. I would guess this patch is not a complete implementation that allows for migration from previous versions, but it does compile and initdb succeeds. I'd be happy to continue work if the core team is receptive to an update in this area, and I'd appreciate any feedback the community has on the approach.

I'm not sure why this is needed.
Vector data indexing generally requires bespoke index methods, which
are not currently available in the core PostgreSQL repository, and
indexes are not at all required to utilize the IndexTupleData format
for their data tuples (one example of this being BRIN).
The only hard requirement in AMs which use Postgres' relfile format is
that they follow the Page layout and optionally the pd_linp/ItemId
array, which limit the size of Page tuples to 2^15-1 (see
ItemIdData.lp_len) and ~2^16-bytes
(PageHeaderData.pd_pagesize_version).

Next, the only non-internal use of IndexTuple is in IndexOnlyScans.
However, here the index may fill the scandesc->xs_hitup with a heap
tuple instead, which has a length stored in uint32, too. So, I don't
quite see why this would be required for all indexes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2024-01-18 16:38:58 Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Previous Message Aleksander Alekseev 2024-01-18 16:21:52 Re: UUID v7