Quick Links

Re: Adding REPACK [concurrently]

From:	Antonin Houska <ah(at)cybertec(dot)at>
To:	Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>
Cc:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Treat <rob(at)xzilla(dot)net>
Subject:	Re: Adding REPACK [concurrently]
Date:	2026-01-22 08:37:40
Message-ID:	74802.1769071060@localhost
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com> wrote:

> Some comments for 0006:
>
> > SnapBuildSnapshotForRepack(SnapBuild *builder)
> Does it also "replays" previously processed WAL to the position that snapshot is ready to use?
> I am afraid we may see some non-yet processed parts of WAL leading to duplicate insertion.

The changes present in WAL decoded prior the snapshot creation are not
replayed - these changes are visible to the snapshot. (This is not really
specific to the 0006 part.)

> > first_block
> What is the reason for that variable? It seems to be always the first block
> of relation.

Although scan usually starts at the first block, it does not have to,
especially due to synchronized sequential scans.

> Also, what if we have a huge amount of empty space at the start. In that case the first block will be the block of the first "filled" page. But
> insert may (and will) fill empty pages before first_block - out of the range.

Good catch! I think I used this "lazy initialization" because I couldn't find
the start block in TableScanDesc, and missed the problem that you describe
here.

> So, I think we should delete it and always use 0 instead.

No, we need to set first_block to heapScan->rs_startblock before the scan
starts.

> > tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
> With SnapshotAny we are going to check the same tuple multiple times. Better to let scan logic handle it (and change snapshots used by scan
> code).

The current API does not seem to support changing snapshot of an in-progress
scan and I don't want to change that. Plus note that the current
implementation of CLUSTER also uses SnapshotAny and then checks the visibility
separately. Finally, SnapshotAny is not really an expensive visibility check,
if it can be considered a visibility check at all.

> > if (blkno >= range_end)
> I don't think it is legal to switch a snapshot while holding the tuple. Nothing is protecting it from being pruned\reused.

Snapshot protects the table as whole from pruning live (or recently dead)
tuples, but when you have fetched a tuple, the containing buffer remains
pinned. The buffer pin itself makes prunning of the page impossible.

And especially with REPACK (CONCURRENTLY), page pruning is also restricted by
the replication slot's xmin. This is increased by calling
LogicalIncreaseXminForSlot() from the decoding worker, each time it has
created a new snapshot.

> > PopActiveSnapshot();
> > InvalidateCatalogSnapshot();
> I think it is a good idea to add here assert for MyProc->xmin and MyProc->xid to be invalid. To ensure we really allow the horizon to advance.

I've added it only for xmin. xid is valid because REPACK is executed in a
transaction. That reminds me that PROC_IN_VACUUM should be present in
MyProc->statusFlags. Fixed.

> > /* Set to request a snapshot. */
> > bool snapshot_requested;
> We know the end or region in advance, so it should be possible to filter before writing changes to file.

Yes, filtering before writing makes sense, I'll consider that.

> So, it is some kind of "this is the end of region, create new file and store everything before + create new snapshot for me".

The last se of changes does not have to be followed by a snapshot - that's the
purpose of snapshot_requested.

> > PopActiveSnapshot();
> Sometimes without InvalidateCatalogSnapshot().

[ It'd be a bit easier to find the code if you included hunk headers. ]

heapam_relation_copy_for_cluster() does not access catalogs after the first
invalidation. The following comment is related:

/*
* XXX It might be worth Assert(CatalogSnapshot == NULL) here,
* however that symbol is not external.
*/

> > PushActiveSnapshot(GetTransactionSnapshot());
> GetLatestSnapshot() feels better here.

What will then happen to code that uses GetActiveSnapshot() ?

> > * The individual builds might still be a problem, but that's a
> > * separate issue.
> Opening the index may create a catalog snapshot, so it needs to be invalidated after.

It'll be invalidated in the next iteration. The point of this invalidation is
to use one snapshot per index.

> > * TODO Can we somehow use the fact that the new heap is not yet
> > * visible to other transaction, and thus cannot be vacuumed?
> Snapshot resetting [0] may work here (without CIC, just as part of the scan + some code to ensure catalog snapshot is managed correctly).
> Also, to correctly build a unique index - some tech from [0] is required (building a unique index with multiple snapshots is a little bit tricky).
> Or we may implement some super lightweight way - just SnapshotAny without any visibility checks (just assume everything is ok since it
> copied from another relation with the same index set).

ok, I'll check your patch.

> > This approach introduces one limitation though: if the USING INDEX clause is
> > specified, an explicit sort is always used. Index scan wouldn't work because
> > it does not return the tuples sorted by CTID.
>
> Technically we may just use keys (if they are comparable) as a way to specify regions. Instead of number of pages to switch snapshot -
> number of tuples or time.
> But because we don't know the region end in advance - we have to keep all the changes in file and filter only while applying.

Good idea. Unfortunately it questions your proposal to filter the changes
before writing, as suggested above.

> > range_end = repack_blocks_per_snapshot;
> Should be repack_blocks_per_snapshot + ctx->first_block ?

Indeed, I also failed to avoid assuming first_block==0 :-)

> > * XXX It might be worth Assert(CatalogSnapshot == NULL)
> > * here, however that symbol is not external.
> As said above - better assert for MyProc->xmin/xid + add InvalidateCatalogSnapshot.

I proposed the Assert above, but still thinking about it.

> > Is REPACKED (CONCURRENTLY) is being run by this backend?
> REPACK and double "is"

Other comments accepted. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment	Content-Type	Size
v31-0001-Add-REPACK-command.patch	text/x-diff	144.0 KB
v31-0002-Refactor-index_concurrently_create_copy-for-use-with.patch	text/x-diff	8.7 KB
v31-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch	text/x-diff	5.6 KB
v31-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch	text/plain	146.8 KB
v31-0005-Use-background-worker-to-do-logical-decoding.patch	text/x-diff	65.2 KB
v31-0006-Use-multiple-snapshots-to-copy-the-data.patch	text/plain	71.5 KB

In response to

Re: Adding REPACK [concurrently] at 2026-01-18 21:52:00 from Mihail Nikalayeu

Responses

Re: Adding REPACK [concurrently] at 2026-01-22 11:30:00 from Mihail Nikalayeu

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ilyasov Ian	2026-01-22 08:37:56	RE: ReadRecentBuffer() doesn't scale well
Previous Message	Pavel Stehule	2026-01-22 08:26:28	proposal: parentid and naturalid for plpgsql nodes