On Tue, Apr 24, 2012 at 11:55 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> A key barrier to migrations from trigger-based replication to WAL-based
> replication is the lack of temporary tables under hot standby. I'd like to
> close that gap; the changes needed will also reduce the master-side cost of
> temporary table usage. Here is a high-level design for your advice and
> comments. Much of this builds on ideas from past postings noted below.
> Our temporary tables are cataloged and filled like permanent tables. This has
> the major advantage of making most code operate on tables with minimal regard
> for their relpersistence. It also brings disadvantages:
> 1. High catalog turnover in rapid create/drop workloads. Heavy temporary
> table users often need to REINDEX relation-oriented catalogs. Hot standby
> cannot assign OIDs or modify system catalogs at all.
> 2. Consumption of permanent XIDs for DML on the table. This increases COMMIT
> cost on the master and is a non-starter under hot standby.
> 3. autovacuum cannot VACUUM temporary tables, but their relfrozenxid values
> delay pg_clog truncation and can trigger a wraparound-prevention shutdown.
> 4. sinval traffic from every CREATE TEMP TABLE et al.
> 5. CREATE TEMP TABLE records AccessExclusiveLock in the global lock table.
> 6. We don't automatically drop temporary tables that existed at the point of a
> crash, because they look much like permanent tables.
#6 is already fixed in 9.1. Temporary tables now have names like
tBACKENDID_RELFILENODE, and are cleaned up during start-up in exactly
the same way you're propose to do it further down.
> To resolve points 2 and 3, let's change the XID values stored in temporary
> tables from regular TransactionId to LocalTransactionId. This entails an lxid
> counterpart for clog.c. Functions like GetCurrentTransactionId() and
> HeapTupleSatisfiesVisibility() will take a Relation argument to identify the
> XID type. One open question is whether to add conditional logic to functions
> like HeapTupleSatisfiesMVCC() or to have parallel implementations like
> HeapTupleSatisfiesMVCCLocal(). I lean toward the latter, perhaps with the
> help of some code generation. I don't think a counterpart for pg_subtrans
> will be necessary; the backend knows its own XID tree, and the
> TRANSACTION_STATUS_SUB_COMMITTED interlock is superfluous with only one
> backend as reader and writer. I'm also thinking the local clog can live
> strictly in memory; a session that retains a temporary table across 2B local
> transactions can afford 512 MiB of RAM. With this change, VACUUM can ignore
> relfrozenxid of temporary tables when calculating a new datfrozenxid. This
> change can form an independent patch.
Agreed. If you can pull it off, this will be a nice improvement
regardless of what happens with the rest of this, and it makes sense
to do it as a separate patch. I don't yet have a strong opinion on
what to do with the HeapTupleSatisfies* functions, but I suspect
you're correct in thinking that separate functions are better. For
one thing, those functions are very much performance-critical, so
introducing extra branches is something to avoid.
Maybe this is a silly idea, but if you're thinking about creating a
local XID space and a global XID space, it might be a good idea to
also make allowance for an "unlogged" XID space - that is, an XID
space that is global to all backends but need not survive crashes.
This would potentially allow unlogged tables to be used in HS mode. I
would expect that you probably don't want to go as far as actually
trying to make this work as part of your current project, but maybe as
you're structuring the code it would be worth considering the
possibility that we'll eventually want >2 XID spaces, either for this
or other reasons.
> I do not see a clean behind-the-scenes fix for points 1, 4 and 5. We can
> resolve those by adding a new variety of temporary table, one coincidentally
> matching the SQL standard's notion of a temporary table. The developer will
> declare it once, after which all sessions observe it as an initially-empty
> table whose contents remain local to the session. Most relation catalog
> entries, including all OIDs, are readily sharable among sessions. The
> exceptions are relpages, reltuples, relallvisible, relfrozenxid, and
> pg_statistic rows. I will handle the pg_class columns by introducing new
> backend APIs abstracting them. Those APIs will consult the relcache for
> permanent tables and a local-memory hash for temporary tables. For
> statistics, add a new catalog pg_temp_statistic, an inheritance child of
> pg_statistic and itself one of these new-variety temporary tables.
With respect to problem #5, I've been wondering if we couldn't just
forget about taking AccessExclusiveLock when first creating a table
(temporary or permanent). Unless and until the transaction commits,
nobody can see the catalog entry anyway, so nobody else is going to
attempt to take a conflicting lock. So why bother? Maybe there's
some reason here that's not obvious to me. If not, it might be worth
doing on general principle independent of this project.
Problem #4 is a little stickier, but I think also solvable. Basically
all cross-backend access to temp tables ought to be prohibited anyway,
but it currently isn't, because there's at least one cross-backend
operation that we categorically need to support: DROP. Separating the
global and local XID spaces would help with that, but not really all
that much, since if a session manages to exit without cleaning up
after itself, we'd like someone to be able to do that later (and maybe
more aggressively than what we do now, which is to only clean things
up after 2B txns or so). Nevertheless it's currently possible for
backend A to build a relcache entry for backend B's temporary
relation, and it might be that if we prohibited that and maybe a few
other things we could see our way clear to removing most or all of the
sinval traffic here. Still, it may not really help that much without
a solution to problem #1.
On that note, I had a thought in the pat that it might be possible to
do solve problem #1 by using global temp tables as system catalogs -
that is, for each type of system catalog that relates to table
creation, you'd have a permanent catalog and a global temp catalog.
So if someone wants to create a temporary table of the existing
variety on the standby, you can make all the entries in the
global-temp version of pg_class, pg_attribute, etc. However, this
seems extremely difficult to manage in general - there's a lot of code
churn involved, and also possible temporary -> permanent dependencies;
for example, the temporary table might have a pg_attrdef entry that
needs to depend on a non-temporary pg_proc entry. That's tricky to
solve on the master and even trickier to solve in HS operation. So
I'm inclined to agree with you that it makes more sense to just aim to
support global temp tables in HS mode, and if we want to beat our head
against the brick wall of making regular temp tables work there
eventually, that can be a later project.
I would suggest just not worrying about the statistical stuff for the
first version of the patch. Assume that it's adequate to have one set
of statistics for all copies of the table, both pg_statistic entries
and the stats-related stuff in pg_class (relpages, reltuples,
relallvisible). It's not unreasonable to think that the table will be
used in broadly similiar ways across all backends, so in some cases
this might actually give better performance than what you're proposing
to do. If not, you can solve that problem in a future patch. This
project is hard enough without throwing that problem onto the pile,
and I really think that if you do throw it on the pile you're going to
be adding a lot of complexity and code churn that isn't really
necessary for a v1.
> Past discussions have raised the issue of interaction between commands like
> ALTER TABLE and sessions using the new-variety temporary table. As a first
> cut, let's keep this simple and have ongoing use of the table block operations
> requiring AccessExclusiveLock. Note that you can always just make a new
> temporary table with a different name to deploy a change quickly. Implement
> this with a heavyweight lock having a novel life cycle. When a session first
> takes an ordinary relation lock on the table, acquire the longer-term lock and
> schedule it for release on transaction abort. On TRUNCATE, schedule a release
> on transaction commit. Of course, also release the lock at session end.
I'm not sure I believe this will work, but maybe I'm just not understanding it.
> For new-variety temporary tables, change file names from "relfilenode[_fork]"
> to "refilenode[_fork].pid.localnode". During crash recovery, delete all files
> conforming to that pattern for refilenodes of known temporary tables. This
> also lets backends share pg_class.relfilenode. The "localnode" field would
> distinguish multiple generations of a table across VACUUM FULL, CLUSTER, and
> TRUNCATE. We could build on this strategy to safely resolve point 6 for the
> existing kind of temporary table, but that entails enough other details to
> probably not mix it into the same patch.
In lieu of including localnode in the filename, I would suggest that
for a global temp table, we set relfilenode = 0, and let each backend
assign one (and reassign new ones) whenever it feels like it. The
mapping from reloid -> relfilenode can be kept in backend-local
memory, or you can keep a <backend id, rel oid> -> relfilenode mapping
in a separate relation fork. The latter would have the additional
advantage of simplifying cleanup and might also be helpful in
detecting when the table is or is not in use by multiple backends.
> How important is support for VACUUM on these tables under hot standby? The
> alternative is to fail when a session retains a temporary table across 2B
> local transactions. I do not currently see any challenges sufficient to
> motivate not supporting VACUUM, but it might be a useful simplification to
> keep in mind. What about ANALYZE support; how important is the ability to
> collect statistics on temporary tables? Again, I tentatively expect to
> support it regardless of the answer.
I think it's probably pretty important to support VACUUM, because even
ignoring wraparound considerations, not vacuuming tends to cause
performance to suck. I think ANALYZE is less important for the
reasons stated above.
The Enterprise PostgreSQL Company
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2012-04-25 16:06:47|
|Subject: Re: 9.3: summary of corruption detection / checksums / CRCs discussion|
|Previous:||From: Merlin Moncure||Date: 2012-04-25 14:28:05|
|Subject: Re: 9.3: summary of corruption detection / checksums / CRCs discussion|