REPACK enhancements

From: Antonin Houska <ah(at)cybertec(dot)at>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: REPACK enhancements
Date: 2026-06-16 12:53:02
Message-ID: 109367.1781614382@localhost
Lists: pgsql-hackers

This patch set is for the next development cycle. It tries to relax some
limitations of the REPACK (CONCURRENTLY) command.

1) REPACK (CONCURRENTLY) is not MVCC-safe.

I proposed a fix earlier (the last version containing the fix was [1], see
part 0005), but it was more of a prototype and I didn't even expect it to end
up in v19. Here (part 0008) I post an improved version of this feature.

In [1] I tried to build a historic snapshot for each scan of the new heap (in
order to find the tuple to be updated / deleted when replaying data changes
that took place in the old table during copying), but that turned out to be
quite expensive. The new version introduces a new type of snapshot which
relies on the fact that the new table does not contain tuples left behind by
aborted transactions. That eliminates the need for getting many snapshots from
snapbuild.c.

Another problem I had missed in [1] was that even TOAST needs to reuse the
existing XID. And once it does, we need to freeze not only the tuples of the
main table, but also the TOAST tuples.

One thing I'm still not sure about is how the original XID should be passed to
the AM. In the current version I try to avoid adding a new argument to the
Table AM callbacks: a new flag (TABLE_REUSE_XID) indicates that the XID should
be retrieved from the tuple passed in the slot. However, that imposes a
restriction on the slot type (not all slots can maintain information as
specific as tuple xmin/xmax). Moreover, table_tuple_delete() needs the 'xid'
argument anyway because it does not receive any tuple.

Ideally I'd like to avoid adding an 'xid' argument to all of
table_tuple_insert(), table_tuple_update() and table_tuple_delete() because
it'd be useless (and possibly confusing) for all users other than REPACK
(CONCURRENTLY), but the callbacks already have a 'cid' argument, so adding
'xid' might not be that bad.
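
The two ways of passing the original XID discussed above can be contrasted
with a small toy model. This is only a sketch and not PostgreSQL code: the
Toy* types and toy_* functions are hypothetical stand-ins, and only the idea
of a "reuse XID" flag comes from the text (TABLE_REUSE_XID). Variant A reads
the XID from the slot when the flag is set, so it needs a slot that can carry
xmin. Variant B takes the XID as an explicit argument, which is what a delete
needs since it receives no tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;        /* hypothetical stand-in for TransactionId */
#define TOY_INVALID_XID 0u

/* Hypothetical option bit modelling the TABLE_REUSE_XID flag. */
#define TOY_REUSE_XID 0x01

/* Toy slot; has_header == false models slot types that cannot keep xmin. */
typedef struct ToySlot
{
    bool    has_header;
    ToyXid  xmin;
} ToySlot;

/*
 * Variant A: the XID to stamp on an inserted tuple travels in the slot and
 * is selected by a flag. Without the flag, the current XID is used.
 */
ToyXid
toy_insert_xid_from_slot(const ToySlot *slot, int options, ToyXid current_xid)
{
    if ((options & TOY_REUSE_XID) == 0)
        return current_xid;

    /* Restriction of this variant: the slot must be able to carry xmin. */
    assert(slot != NULL && slot->has_header);
    return slot->xmin;
}

/*
 * Variant B: the XID is an explicit argument. A delete receives no tuple,
 * so this is the only way to hand it the original XID.
 */
ToyXid
toy_delete_xid(ToyXid xid, ToyXid current_xid)
{
    return (xid != TOY_INVALID_XID) ? xid : current_xid;
}
```

In this model, variant A keeps the function signature unchanged for ordinary
callers, while variant B works with any slot type at the cost of one more
argument.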

2) A single snapshot is used for the whole scan of the table.

Consequently, xmin of that snapshot prevents the xmin horizons for VACUUM from
advancing, and that affects the whole database.

Part 0004 of this set fixes that by using each snapshot only for a given
number of pages. (Currently the number is a GUC, repack_pages_per_snapshot,
which is good for evaluation; however, a constant should be sufficient for the
final version of the patch.)

An earlier version of this part (0006 in [2]) missed the fact that the
"concurrent data changes" need to be applied before the next range of blocks
is scanned (using a new snapshot), otherwise it might be impossible to build a
unique index on the new table, and a unique index is essential to replay the
changes (see more info in the commit message). Therefore we cannot use a
tuplestore for clustering.
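
The ordering constraint described above can be sketched as a loop. This is a
toy model under stated assumptions, not code from the patch: the function and
struct names are hypothetical, and taking a snapshot, copying pages and
applying decoded changes are represented only by counters. What happens after
the last range is not modelled.

```c
#include <assert.h>

/* Hypothetical counters describing what the toy loop did. */
typedef struct ToyRepackStats
{
    unsigned snapshots_taken;   /* one per range of pages */
    unsigned apply_rounds;      /* change-apply steps between ranges */
    unsigned pages_copied;
} ToyRepackStats;

/*
 * Copy total_pages pages, using each snapshot for at most
 * pages_per_snapshot pages (the role repack_pages_per_snapshot plays in the
 * text). Concurrent changes are applied before the next range is scanned.
 */
ToyRepackStats
toy_copy_with_multiple_snapshots(unsigned total_pages,
                                 unsigned pages_per_snapshot)
{
    ToyRepackStats st = {0, 0, 0};
    unsigned    start = 0;

    assert(pages_per_snapshot > 0);

    while (start < total_pages)
    {
        unsigned    remaining = total_pages - start;
        unsigned    chunk = (remaining < pages_per_snapshot)
            ? remaining : pages_per_snapshot;

        st.snapshots_taken++;       /* take a fresh snapshot for this range */
        st.pages_copied += chunk;   /* scan and copy pages start..start+chunk-1 */
        /* the snapshot would be released here, letting xmin advance */

        start += chunk;

        /* Apply concurrent changes before the next range is scanned. */
        if (start < total_pages)
            st.apply_rounds++;
    }

    return st;
}
```

For example, 10 pages with 4 pages per snapshot give three ranges (4, 4 and 2
pages), so three snapshots and two apply steps between them.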

The current version introduces an "auxiliary table", which is used instead of
the tuplestore and, unlike the tuplestore, does have the identity index.
Sorting is achieved by building the clustering index on this table and by
scanning that index. That may be less efficient than an explicit sort in some
cases, but I have no better idea right now. (On the other hand, if REPACK
(CONCURRENTLY) no longer gets in the way of VACUUM, the execution time is
probably less important than it was so far.)

3) A single transaction is used for the whole run.

Even if we use multiple snapshots, an XID has to be assigned to the backend
and that also blocks the progress of the xmin horizons for VACUUM. Moreover,
during its startup, the logical decoding system has to wait for all running
transactions to finish, and that includes transactions started by other
backends running REPACK (CONCURRENTLY). Thus only one backend in the whole
cluster can run REPACK (CONCURRENTLY) at a time.

I tried to fix that in PG 19 by introducing "database-specific replication
slots" (0d3dba38c777), but this approach had serious flaws and had to be
reverted (01a80f062146). Anyway, that would still allow only one REPACK
(CONCURRENTLY) per database.

In part 0006 here, REPACK (CONCURRENTLY) uses multiple transactions. Once
the new (transient) table is created, we commit the transaction (while the new
table is still empty) and start a new one so that there is no XID assigned
while we copy the data. Thanks to the MVCC-safety, we can preserve the XIDs
retrieved from the old table, so the transaction running REPACK (CONCURRENTLY)
does not get an XID assigned until the copying stage (typically the longest
one) is finished.

Summary:

* 0008 makes REPACK (CONCURRENTLY) MVCC-safe

* 0004, 0006 and 0008 together reduce the impact on VACUUM horizons and also
eliminate the contention on replication slots.

The other parts are preparations for these. For more details, please see the
commit messages and, of course, the code.

Feedback is appreciated.

[1] https://www.postgresql.org/message-id/178741.1743514291%40localhost
[2] https://www.postgresql.org/message-id/88003.1769511456%40localhost

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment Content-Type Size
v01-0001-Use-tuple-slot-to-pass-tuples-for-rewriting.patch text/x-diff 11.5 KB
v01-0002-Move-functions-to-repack.c.patch text/x-diff 8.4 KB
v01-0003-Introduce-RepackDest-structure.patch text/x-diff 18.4 KB
v01-0004-Use-multiple-snapshots-to-copy-the-data.patch text/plain 110.0 KB
v01-0005-Simplify-the-way-restrictions-are-imposed-on-index-f.patch text/x-diff 22.8 KB
v01-0006-Use-separate-transactions-for-catalog-changes.patch text/x-diff 42.4 KB
v01-0007-Decouple-updating-of-freezing-information-from-swap_.patch text/x-diff 8.1 KB
v01-0008-Make-REPACK-CONCURRENTLY-MVCC-safe.patch text/plain 115.9 KB
