| From: | James Locke <james(dot)locke(dot)uk(at)gmail(dot)com> |
|---|---|
| To: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
| Cc: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Disabling Heap-Only Tuples |
| Date: | 2026-05-08 12:25:50 |
| Message-ID: | CAGEtbYUFNMQ1XqX2VFNM-RZ_2AZaUmU970V20Guvx2L6soMw9Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, May 8, 2026 at 12:45 PM Matthias van de Meent <
boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Wed, 13 Mar 2024 at 14:27, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
wrote:
> >
> > On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> > > I think a minimal working approach could be to have the configuration
be based
> > > on the relation size vs space known to the FSM. If the target block
of an
> > > update is higher than ((relation_size - fsm_free_space) *
> > > new_reloption_or_guc), try finding the target block via the FSM, even
if
> > > there's space on the page.
> >
> > That sounds like a good way forward.
> >
> > The patch is in state "needs review", but it got review. I'll change
it to
> > "waiting for author".
>
> Then I'll withdraw this patch as I don't currently have (nor expect to
> have anytime soon) the bandwitdh or expertise to rewrite this patch to
> include a system that calculates the free space available in a
> relation.
>
> I've added a TODO item in the UPDATE section with a backlink to this
> thread so the discussion isn't lost.
Attached is a POC to enable userland table compaction: A top-level COMPACT
command that performs the relocation directly in the server, with a
stripped-down heap_relocate primitive instead of full UPDATE, and a
built-in prune-and-truncate pass so it runs to a useful end state in one
command. Rough comparison:
Approach Driver Disk overhead One shot
-------------------------------------------------------------
VACUUM FULL server ~live size yes
REPACK server ~live size yes
local_update_limit + UPDATE loop user ~1 page no
COMPACT (this POC) server ~1 page yes
Design summary:
- heap_relocate: low-level "move this tuple to that page" primitive,
sibling of heap_update. Skips HOT, modified-attr analysis, toast,
and replica-identity extraction (data is byte-identical).
Concurrency: handles key-share lockers and multixacts via the
same logic as heap_update (no-wait variant). Skips tuples held
under stronger locks or being updated by a live transaction; the
caller revisits them on a future run.
- XLH_UPDATE_RELOCATED: new flag on xl_heap_update. Replay treats
these as ordinary updates; logical decoding's DecodeUpdate
filters them out so subscribers see no phantom UPDATE events.
- lazy_compact_heap: new internal vacuum phase, walks pages
high-to-low, snapshots live tuples, calls heap_relocate with
FSM-chosen low-numbered targets, and inserts matching index
entries via index_insert(UNIQUE_CHECK_NO).
- COMPACT command: runs three vacuum() invocations per relation
(compact, then prune+truncate in a fresh xact, then optional
analyze). No AccessExclusiveLock except briefly during truncate.
How it engages with the concerns previously raised on this thread:
- Index bloat: confirmed and unavoidable -- every
relocation creates a new index entry, the old one stays until
the next ordinary vacuum reaps it. For an index-heavy table,
REINDEX CONCURRENTLY after COMPACT is the recommended remedy.
Documented in compact.sgml.
- Tuning difficulty: no parameter to tune. The user
runs COMPACT and the server figures out what to move where. At
the cost of being less flexible than local_update_limit (you
can't, e.g., compact only the top 50 GB of a 100 GB table, you
compact what the FSM and the high-water heuristic decide to
compact).
- Page selection: the compaction loop only places
tuples on pages strictly lower than the one it's draining, so
progress is monotonic and tail pages will be empty if compaction
succeeds (the truncation pass then reclaims them).
- Low-fillfactor interaction: not directly relevant
here. COMPACT doesn't change UPDATE behaviour generally, only
runs an explicit relocation loop. fillfactor is honoured when
picking targets.
This POC is intended as an end-to-end demonstration that the use case can
be served by a server-side automated command rather than a userland UPDATE
loop. All five patches build cleanly; make check-world
passes; isolation spec covers concurrent FOR UPDATE / UPDATE / FOR KEY
SHARE / multixact / REPEATABLE READ readers.
- Isn't: production-ready. Notable rough edges:
* The three-pass vacuum() structure inside ExecCompact is
correct but ugly. Folding the prune-and-truncate pass back
into a single vacuum() invocation that retries with a fresh
snapshot would be better. This requires letting vacuum_rel
commit and start a new transaction mid-flight; doable but
invasive.
* Index updates inside lazy_compact_heap call index_insert
directly with UNIQUE_CHECK_NO. An earlier attempt to use
ExecInsertIndexTuples crashed because the executor
scaffolding (es_snapshot, ECxt, range table) isn't fully
constructible from outside the executor. Worth a closer
look from someone who knows that area.
* contrib/pg_compact_test (in 0005) exposes the lower
primitives to SQL for development testing. Not really
appropriate for in-tree. Useful for review.
The two approaches are not exclusive. local_update_limit serves users
who want a knob and don't want a new command; COMPACT serves users
who want one-shot behaviour. Posting this so the current thread can
consider whether server-side automation is on the table at all,
before settling on the reloption design.
Patch series structure:
0001 -- New hio primitive, RelationGetSpecificBufferForTuple
0002 -- New heap primitive, heap_relocate (+ XLH_UPDATE_RELOCATED)
0003 -- New internal vacuum phase, lazy_compact_heap
0004 -- New COMPACT command (grammar, executor, docs)
0005 -- Tests (regression, isolation, contrib test)
Benchmark on a 24 MB / 100K-row workload (90% deleted at the head,
then VACUUMed):
Strategy Mean WAL Mean time
COMPACT 2.97 MB 51 ms
VACUUM FULL 2.39 MB 50 ms
REPACK 2.43 MB 43 ms
COMPACT writes ~24% more WAL than the rewrite-based strategies (one
cross-page heap update + index inserts per relocated tuple, vs. a
single bulk relation rewrite). Final size is identical for all
three. Peak extra disk: ~1 page for COMPACT, ~live-data-size for
the others.
James
| Attachment | Content-Type | Size |
|---|---|---|
| 0005-Add-tests-for-COMPACT.patch | text/x-patch | 33.6 KB |
| 0002-heap-Add-heap_relocate-primitive.patch | text/x-patch | 20.6 KB |
| 0003-vacuum-Add-lazy_compact_heap-an-internal-heap-compac.patch | text/x-patch | 14.1 KB |
| 0004-Add-COMPACT-command.patch | text/x-patch | 27.0 KB |
| 0001-heap-Add-RelationGetSpecificBufferForTuple.patch | text/x-patch | 6.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ayush Tiwari | 2026-05-08 12:41:08 | Disallow whole-row index references with virtual generated columns? |
| Previous Message | Amit Kapila | 2026-05-08 12:25:12 | Re: Adding REPACK [concurrently] |