Somebody asked about this on Depesz's blog today, and I think it's
come up here before too, so I thought it might be worth my writing up
a few comments on this. I don't think I'm going to have time to work
on this any time soon, but if someone else wants to work up a patch,
I'm game to review. I think it'd clearly be a good feature.
Generally, to do this, it would be necessary to do the following
things (plus anything I'm forgetting):
1. Take an AccessExclusiveLock on the target table. You might think
that concurrent selects could be allowed, but I believe that's not the
case. Read on.
2. Verify that there are no foreign keys referencing other unlogged
tables, because if that were the case then after the change we'd have
a permanent table referencing an unlogged table, which would violate
referential integrity. (Note that unlogged referencing permanent is
OK, but permanent referencing unlogged is a no-no, so what matters
when upgrading is "outbound" foreign keys.)
3. Write out all shared buffers for the target table, and drop them.
This ensures that there are no buffers floating around for the target
relation that are marked BM_UNLOGGED, which would be a bad thing. Or
maybe it's possible to just clear the BM_UNLOGGED flag, instead of
dropping them. This is the step that makes me think we need an access
exclusive lock - otherwise, somebody else might read in a buffer and,
seeing that the relation is unlogged (which is true, since we haven't
committed yet), mark it BM_UNLOGGED.
4. fsync() any segments of the target relation - of any fork except
that init fork - that might have dirty pages not on disk.
5. Arrange for the appropriate file deletions at commit or abort, by
updating pendingDeletes. On commit, we want to delete the init fork
for the table and all its indexes. On abort, we want to delete
everything else, but only for pretend; that is, the abort record
should reflect the deletions since they'll need to happen on any
standbys, but we shouldn't actually perform them on the master since
we don't want to obliterate the contents of the table for no reason.
There's a subtle problem here I'm not quite sure how to deal with:
what happens if we *crash* without writing an abort record? It seems
like that could leave a stray file around on a standby, because the
current code only cleans things up on the standby at the start of
recovery; to make this bullet-proof, I suppose it'd need to repeat
that every time a crash happens on the master, but I don't know how to
do that. Note also that if wal_level is minimal, then we need only
worry about the commit case; the abort case can be a no-op.
6. If wal_level != minimal, XLOG every page of every fork except the
init fork, for both the table and the associated indexes. (Note that
this step also requires an AccessExclusiveLock rather than some weaker
lock, because of the arbitrary rule that only AccessExclusiveLocks are
sent to standbys. If we held only ShareRowExclusiveLock on the
master, for example, a Hot Standby backend might read the table while
it's only been half-copied.)
7. Update pg_class.relpersistence from 'u' to 'p', for both the table
and the associated indexes.
Going the other direction ought to be possible too, although it seems
somewhat less useful. For that, you'd need to flip around the check
in step #2 (i.e. check for a reference FROM a permanent table),
perform step #3, skip step #4, do step #5 backwards (create and log
init forks, arranging for them to be removed on abort - this too has
an issue with crashes that don't leave abort records behind); and
arrange for the rest of the forks to be removed on commit on any
standby without doing it on the master), skip step #6, and do step #7
The Enterprise PostgreSQL Company
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2011-01-05 03:16:26|
|Subject: Re: SQL/MED - core functionality|
|Previous:||From: Greg Smith||Date: 2011-01-05 02:27:10|
|Subject: Re: Re: new patch of MERGE (merge_204) & a question about