plan_create_index_workers doesn't account for TOAST

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: plan_create_index_workers doesn't account for TOAST
Date: 2023-06-29 14:12:54
Message-ID: ad8a178f-bbe7-d89d-b407-2f0fede93144@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

plan_create_index_workers[1] does not consider the amount of tuples
existing in TOAST pages when determining the number of parallel workers
to use for a build. The estimation comes from estimate_rel_size[2],
which in this case, will just take the value from rel->rd_rel->relpages.

We probably don't notice this much with B-trees, given a B-tree is
typically used for data that does not require toasting. However, this
becomes more visible when working on custom index access methods that
implement their own parallel build strategy.

For example, pgvector[3] provides its own data types and index access
method for indexing vector data. Vectors can get quite large fairly
quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes
on disk, which quickly clears the default TOAST tuple threshold.

In a recent patch proposal to allow for building indexes in parallel[4],
I performed a few experiments on how many parallel workers would be
spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with
EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader
participation, but the leader is not considered in
plan_create_index_workers.

With EXTENDED, plan_create_index_workers recommended 2 workers. The
build time was ~2x faster than the serial build.

With PLAIN, plan_create_index_workers recommended 4 workers. The build
time was **~3X faster** than the serial build.

(I've been doing more detailed, less hand-waivy performance testing, but
I wanted to provide directional numbers here)

It seems like we're leaving some performance for columns with TOASTed
data that require indexing, so I wanted to propose allowing the pages in
TOASTed tables to be considered when we're trying to index a column with
TOASTed attributes.

Thanks,

Jonathan

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/plan/planner.c;hb=refs/heads/master#l6734
[2]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/util/plancat.c;hb=refs/heads/master#l1117
[3] https://github.com/pgvector/pgvector
[4] https://github.com/pgvector/pgvector/commits/parallel-index-build

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2023-06-29 15:15:05 Re: POC, WIP: OR-clause support for indexes
Previous Message Álvaro Herrera 2023-06-29 13:45:16 Re: Does a cancelled REINDEX CONCURRENTLY need to be messy?