Re: Global temporary tables

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Global temporary tables
Date: 2026-07-01 23:20:23
Message-ID: akWgt4tuP7e5Bc+B@ubby
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 26, 2026 at 08:37:07PM +0100, Dean Rasheed wrote:
> Thanks. Any reviews will be very helpful.

I drove Claude through a review. You can too, so there's nothing
special to it, but I'll go over it here, with a bit of my content added,
and hopefully y'all don't hate me for sharing LLM outputs.

1. You have an uninitialized have_gtrs in
src/backend/commands/vacuum.c:504.

2. TRUNCATE on a GTT can cause corruption.

Claude's description (I don't know how to evaluate this one myself):

RelationSetNewRelfilenumber writes session-local freeze xids into
shared pg_class for GTTs (relcache.c:4057-4076). Only relfilenode is
split via SetEffective_relfilenode. relpages=0, reltuples=-1,
relallvisible=0, relallfrozen=0, relfrozenxid=freezeXid,
relminmxid=minmulti, relpersistence are written directly into
classform and committed via CatalogTupleUpdate(pg_class, ...). Called
for GTT TRUNCATE (tablecmds.c:2301) and REINDEX (index.c:3908). For a
GTT, the new freezeXid reflects this session's freshly created
storage; clobbering shared pg_class.relfrozenxid with it can advance
the global horizon past what other sessions' local storage actually
has.

Exploring further Claude added:

The failure chain:

1. t=0: Session B opens gtt_foo, MyProc->tempfrozenxid = X0 (~old
RecentXmin). B never inserts anything; its local file has no rows
but is still bound to freezeXid X0.
2. t=1: Session A opens gtt_foo, inserts rows in xids X0…X1, commits.
A's local storage holds rows with xmin in that range. B's local
file is untouched.
3. t=2: Session A: TRUNCATE gtt_foo;. RelationSetNewRelfilenumber
runs. table_relation_set_new_filelocator returns freezeXid = X2
(~current). Code writes pg_class.relfrozenxid = X2 — no cap.
4. t=3: Any session VACUUMs any tables. vac_update_datfrozenxid walks
pg_class, sees gtt_foo.relfrozenxid = X2. If other pg_class rows
all have similar or newer values, newFrozenXid ≈ X2.
5. t=4: dbform->datfrozenxid = X2. vac_truncate_clog(X2, ...) at
:2019 truncates CLOG for everything older than X2.
6. t=5: Session B, whose local gtt_foo file was actually created back
at t=0 with freezeXid X0, has no problem itself (empty file). But
suppose C is any other backend also holding local storage with
xids older than X2 (any GTT, or a running long snapshot's stale
hint-bit-less pages) — reading such rows now does
TransactionIdDidCommit(x) → CLOG lookup → "ERROR: could not access
status of transaction x".
7. Worse: because the whole cluster's CLOG is truncated, this is not
just B's GTT. Any tuple anywhere the cluster still holds with an
xid older than X2 that lacks the HEAP_XMIN_COMMITTED hint bit
becomes unreadable. TOAST chains, index-only scan hints — all
vulnerable to torn interpretation.

Fix: the post-SetEffective_relfilenode block must update
temp_classform instead of classform when temp_tuple is valid (or skip
stats/freeze fields on classform entirely for GTT).

3. vac_get_min_tempfrozenxids() scans ProcGlobal->allProcCount without
ProcArrayLock (vacuum.c:1769-1799). Every comparable consumer
(vac_update_datfrozenxid, GetSnapshotData) acquires it.

Without this lock, a backend can exit mid-scan and a new backend
reuse the slot; you'll read databaseId/tempfrozenxid from
inconsistent slot generations. This can lead to corruption.

4. eoxact_usage_list_len / eoxact_usage_list_overflowed never reset at
end-of-xact (global_temp.c:1176-1179). Only the storage list counters
reset. Once a long-lived session overflows (>32 GTTs touched
cumulatively), every subsequent xact falls back to full hash_seq
scans. Correctness preserved, performance silently degrades.

5. MyProc->tempfrozenxid never decays. Set on first GTT access; only
moves forward on a session's VACUUM of its own GTT. A backend that
touches a GTT once and never vacuums holds back pg_class.relfrozenxid
advancement for all sessions for the connection's lifetime. No path
clears it on GlobalTempRelationDropped or when local usage_count
drops to zero.

I explored this one a bit because I was surprised that the frozenxid
in the catalog should be relevant to GTTs. If I understand correctly
this tempfrozenxid exists because while the GTT storage is
per-session, there is some coupling via the CLOG, but I don't
understand why _that_ has to be, and this is what Claude told me:

An alternative that would have avoided the coordination

You could have designed it so that GTT pg_class rows are skipped
from datfrozenxid computation (just like local temp), and each
backend's tempfrozenxid fed directly into vac_update_datfrozenxid
as an additional constraint. That would let the shared
pg_class.relfrozenxid be advisory/stale without a safety issue,
removing one leg of the coordination.

The current design chose to keep the shared row accurate — probably
because it's user-visible (SELECT relfrozenxid FROM pg_class) and
drives per-relation autovacuum urgency signals. That's the
tradeoff: an accurate shared row, in exchange for needing
cross-session horizon exchange. Same fundamental need as local
temp; different plumbing because the pg_class row is shared.

And, ah, well, I guess you could say that `relfrozenxid` in
`pg_class` is not meaningful for GTTs, or that you'll get your
session's GTT's relfrozenxid.

6. InsertPgTempIndexTuple mutates rel->rd_index->indisvalid in-place
(global_temp.c:885-892). Relcache fields are normally immutable
outside relcache.c. Works today but fragile.

7. AbortTransaction ordering — gtrs_dropped is list_freed
unconditionally in AtEOXact (:1178) even though the in-source comment
(:1077-1080) claims it'll be "repeated next commit". Mostly benign
(storage is reaped at exit) but the comment is wrong.

8. DeleteRelationTuple calls GlobalTempRelationDropped after
CatalogTupleDelete; GlobalTempRelationDropped then calls
get_rel_relkind(relid) (global_temp.c:1038). After a CCI, the
syscache will return '\0' and skip DeletePgTempIndexTuple. Pass
relkind explicitly instead of re-deriving it.

Nico
--

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2026-07-01 23:22:32 Re: Support EXCEPT for ALL SEQUENCES publications
Previous Message Etsuro Fujita 2026-07-01 22:57:51 Re: postgres_fdw: fix cumulative stats after imported foreign-table stats