A key barrier to migrations from trigger-based replication to WAL-based
replication is the lack of temporary tables under hot standby. I'd like to
close that gap; the changes needed will also reduce the master-side cost of
temporary table usage. Here is a high-level design for your advice and
comments. Much of this builds on ideas from past postings noted below.
Our temporary tables are cataloged and filled like permanent tables. This has
the major advantage of making most code operate on tables with minimal regard
for their relpersistence. It also brings disadvantages:
1. High catalog turnover in rapid create/drop workloads. Heavy temporary
table users often need to REINDEX relation-oriented catalogs. Hot standby
cannot assign OIDs or modify system catalogs at all.
2. Consumption of permanent XIDs for DML on the table. This increases COMMIT
cost on the master and is a non-starter under hot standby.
3. autovacuum cannot VACUUM temporary tables, but their relfrozenxid values
delay pg_clog truncation and can trigger a wraparound-prevention shutdown.
4. sinval traffic from every CREATE TEMP TABLE et al.
5. CREATE TEMP TABLE records AccessExclusiveLock in the global lock table.
6. We don't automatically drop temporary tables that existed at the point of a
crash, because they look much like permanent tables.
To resolve points 2 and 3, let's change the XID values stored in temporary
tables from regular TransactionId to LocalTransactionId. This entails an lxid
counterpart for clog.c. Functions like GetCurrentTransactionId() and
HeapTupleSatisfiesVisibility() will take a Relation argument to identify the
XID type. One open question is whether to add conditional logic to functions
like HeapTupleSatisfiesMVCC() or to have parallel implementations like
HeapTupleSatisfiesMVCCLocal(). I lean toward the latter, perhaps with the
help of some code generation. I don't think a counterpart for pg_subtrans
will be necessary; the backend knows its own XID tree, and the
TRANSACTION_STATUS_SUB_COMMITTED interlock is superfluous with only one
backend as reader and writer. I'm also thinking the local clog can live
strictly in memory; a session that retains a temporary table across 2B local
transactions can afford 512 MiB of RAM. With this change, VACUUM can ignore
relfrozenxid of temporary tables when calculating a new datfrozenxid. This
change can form an independent patch.
I do not see a clean behind-the-scenes fix for points 1, 4 and 5. We can
resolve those by adding a new variety of temporary table, one coincidentally
matching the SQL standard's notion of a temporary table. The developer will
declare it once, after which all sessions observe it as an initially-empty
table whose contents remain local to the session. Most relation catalog
entries, including all OIDs, are readily sharable among sessions. The
exceptions are relpages, reltuples, relallvisible, relfrozenxid, and
pg_statistic rows. I will handle the pg_class columns by introducing new
backend APIs abstracting them. Those APIs will consult the relcache for
permanent tables and a local-memory hash for temporary tables. For
statistics, add a new catalog pg_temp_statistic, an inheritance child of
pg_statistic and itself one of these new-variety temporary tables.
Past discussions have raised the issue of interaction between commands like
ALTER TABLE and sessions using the new-variety temporary table. As a first
cut, let's keep this simple and have ongoing use of the table block operations
requiring AccessExclusiveLock. Note that you can always just make a new
temporary table with a different name to deploy a change quickly. Implement
this with a heavyweight lock having a novel life cycle. When a session first
takes an ordinary relation lock on the table, acquire the longer-term lock and
schedule it for release on transaction abort. On TRUNCATE, schedule a release
on transaction commit. Of course, also release the lock at session end.
For new-variety temporary tables, change file names from "relfilenode[_fork]"
to "refilenode[_fork].pid.localnode". During crash recovery, delete all files
conforming to that pattern for refilenodes of known temporary tables. This
also lets backends share pg_class.relfilenode. The "localnode" field would
distinguish multiple generations of a table across VACUUM FULL, CLUSTER, and
TRUNCATE. We could build on this strategy to safely resolve point 6 for the
existing kind of temporary table, but that entails enough other details to
probably not mix it into the same patch.
A third patch will permit the following commands in read-only transactions,
where they will throw an error if the subject is not a temporary table:
COPY ... FROM
VACUUM (including VACUUM FULL)
CLUSTER (without USING clause)
I considered whether to instead separate the set of commands allowed in a
read-only transaction from the set allowed under hot standby. This proposal
is closer to the SQL standard, which explicitly allows INSERT, UPDATE and
DELETE on temporary tables during read-only transactions. Only the first five
commands are essential; support for the rest could wait for follow-on patches.
Concerning everyone's favorite topic, how to name the new type of table, I
liked Tom's proposal to make CREATE TEMP TABLE retain current behavior and
have CREATE GLOBAL TEMP TABLE and/or CREATE LOCAL TEMP TABLE request the new
SQL-standard variety. (I'd vote for using CREATE GLOBAL and retaining CREATE
LOCAL for future expansion.) As he mentions, to get there, we'd ideally start
by producing a warning instead of silently accepting GLOBAL as a noise word.
Should we put such a warning into 9.2?
How important is support for VACUUM on these tables under hot standby? The
alternative is to fail when a session retains a temporary table across 2B
local transactions. I do not currently see any challenges sufficient to
motivate not supporting VACUUM, but it might be a useful simplification to
keep in mind. What about ANALYZE support; how important is the ability to
collect statistics on temporary tables? Again, I tentatively expect to
support it regardless of the answer.
Key past threads:
pgsql-hackers by date
|Next:||From: Tom Lane||Date: 2012-04-25 04:06:59|
|Subject: Re: remove dead ports? |
|Previous:||From: Josh Berkus||Date: 2012-04-25 01:31:39|
|Subject: Re: 9.3: summary of corruption detection / checksums /