| From: | James Locke <james(dot)locke(dot)uk(at)gmail(dot)com> |
|---|---|
| To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
| Cc: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Disabling Heap-Only Tuples |
| Date: | 2026-05-08 14:13:44 |
| Message-ID: | CAGEtbYUe=9bJdj9Pd4BY5RA-gC2hSoHo0BbfYeJ_t7z_0+z2vg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, May 8, 2026 at 2:00 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
wrote:
>
> Hello James,
>
> On 2026-May-08, James Locke wrote:
>
> > Attached is a POC to enable userland table compaction: A top-level
> > COMPACT command that performs the relocation directly in the server,
> > with a stripped-down heap_relocate primitive instead of full UPDATE,
> > and a built-in prune-and-truncate pass so it runs to a useful end
> > state in one command.
>
> How does this implementation handle the case of a seqscan in the middle
> of scanning the table, which has already skipped the destination page
> and not yet the page from where the table is to be removed? There needs
> to be a way to distinguish which of these to show (it must be exactly
> one), and you didn't mention this in your description.

It's the same invariant a cross-page UPDATE relies on, and heap_relocate
inherits it because the on-disk tuple state and the WAL record are identical
to those of a regular update.

heap_relocate sets the source's xmax and the new tuple's xmin to the same
xid (the relocator's, call it R), and both writes go through one
log_heap_update WAL record. So when HeapTupleSatisfiesMVCC asks "is this
visible" for either tuple, it ends up asking the same
XidInMVCCSnapshot(R, snap) question against the seqscan's snapshot: once for
the destination's xmin and once for the source's xmax. Same xid, same answer.

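To make that concrete, here is a toy model of the visibility check. It is a sketch, not PostgreSQL code: `Snapshot`, `HeapTuple`, `xid_in_snapshot`, and `tuple_visible` are hypothetical, simplified stand-ins for the snapshot struct, XidInMVCCSnapshot, and HeapTupleSatisfiesMVCC. It assumes every xid eventually commits and ignores hint bits, aborts, subtransactions, and command ids. The xid values (50 for the original inserter, 100 for R) are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    xmin: int          # every xid below this had finished at snapshot time
    xmax: int          # every xid at or above this had not started yet
    xip: frozenset     # xids in progress when the snapshot was taken


@dataclass(frozen=True)
class HeapTuple:
    xmin: int              # inserting xid
    xmax: Optional[int]    # deleting/updating xid, or None if never touched


def xid_in_snapshot(xid: int, snap: Snapshot) -> bool:
    """True if the snapshot treats xid as still running (its effects are hidden)."""
    if xid < snap.xmin:
        return False
    if xid >= snap.xmax:
        return True
    return xid in snap.xip


def tuple_visible(tup: HeapTuple, snap: Snapshot) -> bool:
    """Simplified MVCC check; assumes all xids commit eventually."""
    if xid_in_snapshot(tup.xmin, snap):
        return False                      # inserter not visible yet
    if tup.xmax is None:
        return True                       # never deleted or updated
    return xid_in_snapshot(tup.xmax, snap)  # deleter hidden => tuple still live


R = 100                                   # the relocator's xid (made up)
source = HeapTuple(xmin=50, xmax=R)       # old copy, stamped with xmax = R
dest = HeapTuple(xmin=R, xmax=None)       # new copy, stamped with xmin = R

snapshots = {
    "taken before R started": Snapshot(xmin=90, xmax=100, xip=frozenset()),
    "taken while R is running": Snapshot(xmin=90, xmax=105, xip=frozenset({R})),
    "taken after R committed": Snapshot(xmin=106, xmax=106, xip=frozenset()),
}

for label, snap in snapshots.items():
    seen = [tuple_visible(source, snap), tuple_visible(dest, snap)]
    # Both checks hinge on the same xid_in_snapshot(R, snap) answer,
    # so exactly one of the two copies is visible.
    assert sum(seen) == 1, label
    print(label, "->", "source" if seen[0] else "destination")
```

In this model the source is visible exactly when `xid_in_snapshot(R, snap)` is true and the destination exactly when it is false, which is the "same xid, same answer" argument in miniature.
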
Concretely, suppose the tuple is relocated from block 200 (source) to block 5
(destination). A seqscan reads block 5 first and sees no live tuple there,
either because the relocation hasn't happened yet, or because it has but R is
still in the snapshot's xip list, so the destination's xmin reads as
in-progress. Then COMPACT commits cluster-wide. The seqscan reaches block 200
still using the snapshot it took at scan start, which treats R the same way
it did at block 5; snapshots don't change mid-scan. So either both pages
treated R as committed (block 5 returned the row already, block 200 now sees
the source as dead) or both treated it as running (block 5 saw nothing,
block 200 returns the source). Exactly one.

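The block 5 / block 200 timeline can be replayed with a small simulation. Again this is only an illustrative sketch under simplifying assumptions: the heap is a dict of block number to tuples, a snapshot is just the frozen set of xids it considers committed, and `visible`, `seqscan`, and `relocate` are hypothetical helpers, not server functions.

```python
R = 100  # the relocator's xid (made up); xid 50 inserted the row originally


def visible(tup, committed):
    """tup = (xmin, xmax, payload); committed = xids the snapshot sees as committed."""
    xmin, xmax, _ = tup
    if xmin not in committed:
        return False
    return xmax is None or xmax not in committed


def relocate(heap):
    """Move the row from block 200 to block 5, stamping both copies with R."""
    _, _, payload = heap[200][0]
    heap[200][0] = (50, R, payload)     # source: xmax = R
    heap[5].append((R, None, payload))  # destination: xmin = R


def seqscan(heap, committed, before_block=None):
    """Scan blocks in order with one fixed snapshot; before_block lets a
    concurrent action run just before a given block is read."""
    rows = []
    for blkno in sorted(heap):
        if before_block:
            before_block(blkno, heap)
        rows.extend((blkno, t[2]) for t in heap[blkno] if visible(t, committed))
    return rows


def fresh_heap():
    return {5: [], 200: [(50, None, "row")]}


# Case 1: relocation happens after block 5 was read but before block 200.
heap = fresh_heap()
hook = lambda blkno, h: relocate(h) if blkno == 200 else None
case1 = seqscan(heap, committed=frozenset({50}), before_block=hook)

# Case 2: relocation already on disk, but R is still running for this snapshot.
heap = fresh_heap()
relocate(heap)
case2 = seqscan(heap, committed=frozenset({50}))

# Case 3: R committed before the snapshot was taken.
heap = fresh_heap()
relocate(heap)
case3 = seqscan(heap, committed=frozenset({50, R}))

for result in (case1, case2, case3):
    assert len(result) == 1  # the row is returned exactly once in every case
    print(result)
```

Cases 1 and 2 return the row from block 200 and case 3 returns it from block 5; in none of them does the scan see the row twice or not at all.
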
The page-level atomicity comes from log_heap_update registering both
buffers in one record and the modifications happening inside one
CRIT_SECTION with exclusive content locks on both pages; concurrent
share-locking readers can't see half-applied state.

James