| From: | Nico Williams <nico(at)cryptonector(dot)com> |
|---|---|
| To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Global temporary tables |
| Date: | 2026-07-01 23:20:23 |
| Message-ID: | akWgt4tuP7e5Bc+B@ubby |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, Jun 26, 2026 at 08:37:07PM +0100, Dean Rasheed wrote:
> Thanks. Any reviews will be very helpful.
I drove Claude through a review. You can too, so there's nothing
special to it, but I'll go over it here, with a bit of my content added,
and hopefully y'all don't hate me for sharing LLM outputs.
1. You have an uninitialized have_gtrs in
src/backend/commands/vacuum.c:504.
2. TRUNCATE on a GTT can cause corruption.
Claude's description (I don't know how to evaluate this one myself):
RelationSetNewRelfilenumber writes session-local freeze xids into
shared pg_class for GTTs (relcache.c:4057-4076). Only relfilenode is
split via SetEffective_relfilenode. relpages=0, reltuples=-1,
relallvisible=0, relallfrozen=0, relfrozenxid=freezeXid,
relminmxid=minmulti, relpersistence are written directly into
classform and committed via CatalogTupleUpdate(pg_class, ...). Called
for GTT TRUNCATE (tablecmds.c:2301) and REINDEX (index.c:3908). For a
GTT, the new freezeXid reflects this session's freshly created
storage; clobbering shared pg_class.relfrozenxid with it can advance
the global horizon past what other sessions' local storage actually
has.
Exploring further Claude added:
The failure chain:
1. t=0: Session B opens gtt_foo, MyProc->tempfrozenxid = X0 (~old
RecentXmin). B never inserts anything; its local file has no rows
but is still bound to freezeXid X0.
2. t=1: Session A opens gtt_foo, inserts rows in xids X0…X1, commits.
A's local storage holds rows with xmin in that range. B's local
file is untouched.
3. t=2: Session A: TRUNCATE gtt_foo;. RelationSetNewRelfilenumber
runs. table_relation_set_new_filelocator returns freezeXid = X2
(~current). Code writes pg_class.relfrozenxid = X2 — no cap.
4. t=3: Any session VACUUMs any tables. vac_update_datfrozenxid walks
pg_class, sees gtt_foo.relfrozenxid = X2. If other pg_class rows
all have similar or newer values, newFrozenXid ≈ X2.
5. t=4: dbform->datfrozenxid = X2. vac_truncate_clog(X2, ...) at
:2019 truncates CLOG for everything older than X2.
6. t=5: Session B, whose local gtt_foo file was actually created back
at t=0 with freezeXid X0, has no problem itself (empty file). But
suppose C is any other backend also holding local storage with
xids older than X2 (any GTT, or a running long snapshot's stale
hint-bit-less pages) — reading such rows now does
TransactionIdDidCommit(x) → CLOG lookup → "ERROR: could not access
status of transaction x".
7. Worse: because the whole cluster's CLOG is truncated, this is not
just B's GTT. Any tuple anywhere the cluster still holds with an
xid older than X2 that lacks the HEAP_XMIN_COMMITTED hint bit
becomes unreadable. TOAST chains, index-only scan hints — all
vulnerable to torn interpretation.
Fix: the post-SetEffective_relfilenode block must update
temp_classform instead of classform when temp_tuple is valid (or skip
stats/freeze fields on classform entirely for GTT).
3. vac_get_min_tempfrozenxids() scans ProcGlobal->allProcCount without
ProcArrayLock (vacuum.c:1769-1799). Every comparable consumer
(vac_update_datfrozenxid, GetSnapshotData) acquires it.
Without this lock, a backend can exit mid-scan and a new backend
reuse the slot; you'll read databaseId/tempfrozenxid from
inconsistent slot generations. This can lead to corruption.
4. eoxact_usage_list_len / eoxact_usage_list_overflowed never reset at
end-of-xact (global_temp.c:1176-1179). Only the storage list counters
reset. Once a long-lived session overflows (>32 GTTs touched
cumulatively), every subsequent xact falls back to full hash_seq
scans. Correctness preserved, performance silently degrades.
5. MyProc->tempfrozenxid never decays. Set on first GTT access; only
moves forward on a session's VACUUM of its own GTT. A backend that
touches a GTT once and never vacuums holds back pg_class.relfrozenxid
advancement for all sessions for the connection's lifetime. No path
clears it on GlobalTempRelationDropped or when local usage_count
drops to zero.
I explored this one a bit because I was surprised that the frozenxid
in the catalog should be relevant to GTTs. If I understand correctly
this tempfrozenxid exists because while the GTT storage is
per-session, there is some coupling via the CLOG, but I don't
understand why _that_ has to be, and this is what Claude told me:
An alternative that would have avoided the coordination
You could have designed it so that GTT pg_class rows are skipped
from datfrozenxid computation (just like local temp), and each
backend's tempfrozenxid fed directly into vac_update_datfrozenxid
as an additional constraint. That would let the shared
pg_class.relfrozenxid be advisory/stale without a safety issue,
removing one leg of the coordination.
The current design chose to keep the shared row accurate — probably
because it's user-visible (SELECT relfrozenxid FROM pg_class) and
drives per-relation autovacuum urgency signals. That's the
tradeoff: an accurate shared row, in exchange for needing
cross-session horizon exchange. Same fundamental need as local
temp; different plumbing because the pg_class row is shared.
And, ah, well, I guess you could say that `relfrozenxid` in
`pg_class` is not meaningful for GTTs, or that you'll get your
session's GTT's relfrozenxid.
6. InsertPgTempIndexTuple mutates rel->rd_index->indisvalid in-place
(global_temp.c:885-892). Relcache fields are normally immutable
outside relcache.c. Works today but fragile.
7. AbortTransaction ordering — gtrs_dropped is list_freed
unconditionally in AtEOXact (:1178) even though the in-source comment
(:1077-1080) claims it'll be "repeated next commit". Mostly benign
(storage is reaped at exit) but the comment is wrong.
8. DeleteRelationTuple calls GlobalTempRelationDropped after
CatalogTupleDelete; GlobalTempRelationDropped then calls
get_rel_relkind(relid) (global_temp.c:1038). After a CCI, the
syscache will return '\0' and skip DeletePgTempIndexTuple. Pass
relkind explicitly instead of re-deriving it.
Nico
--
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Peter Smith | 2026-07-01 23:22:32 | Re: Support EXCEPT for ALL SEQUENCES publications |
| Previous Message | Etsuro Fujita | 2026-07-01 22:57:51 | Re: postgres_fdw: fix cumulative stats after imported foreign-table stats |