FYI, getting rid of VACUUM FULL in a .0 release does make more sense
than doing it in a .1 release.
Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > I'll do a little work towards step (1) just so we can take a more
> > informed view once you've had a better look at just what (2) involves.
> I spent a couple of hours reading code and believe that I've identified
> all the pain points for allowing relocation of system catalogs (ie,
> assigning them new relfilenodes during cluster-style VACUUM FULL,
> REINDEX, etc). It's definitely not a trivial project but it's not out
> of reach either --- I estimate I could get it done in two or three
> days of full-time effort. Given the potential benefits I think it's
> worth doing. Rough notes are attached --- comments?
> regards, tom lane
> Change pg_class.relfilenode to be 0 for all rels for which a map file should
> be used instead. Define RelationGetFilenode() to look to the physaddr info
> instead, and make all internal refs go to that instead of reading
> rd_rel->relfilenode directly. Define pg_relation_filenode(regclass) and fix
> external refs (oid2name, pg_dump) to use that instead.
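The lookup fallback described above can be sketched as follows. This is a minimal illustration, not the actual PostgreSQL API: the names (`relation_get_filenode`, the toy in-memory map) and the OID/filenode values are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t Oid;

typedef struct MapEntry { Oid relid; Oid filenode; } MapEntry;

/* Toy stand-in for the map file: (catalog OID, current filenode) pairs. */
static MapEntry map[] = { {1259, 16384}, {1249, 16400} };

/* A mapped catalog stores 0 in pg_class.relfilenode, so fall back to the
 * map in that case; ordinary relations use the catalog value directly. */
static Oid relation_get_filenode(Oid relid, Oid rd_rel_relfilenode)
{
    if (rd_rel_relfilenode != 0)
        return rd_rel_relfilenode;      /* ordinary relation */
    for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++)
        if (map[i].relid == relid)
            return map[i].filenode;     /* mapped catalog: consult the map */
    return 0;                           /* not found: caller must error out */
}
```

The point is that callers never read `rd_rel->relfilenode` directly; they always go through the accessor, which hides whether the value came from pg_class or the map.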
> In zeroth cut, just have relcache.c substitute the OID in place of reading
> a map file. Possibly it should always do that during bootstrap.
> How should bootstrap decide which rels to insert zero for, anyway?
> Shared definitely, pg_class, ... maybe that's enough? Or do we need
> it for all nailed local catalogs?
> local state contains
> * shared map list
> * this database's map list
> * list of local overrides to each on-disk map list
> At commit: if modified, write out new version; do this as late as possible
> before xact commit
> At abort: just discard local-override list
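The local-state scheme above (on-disk map plus a list of local overrides, discarded at abort) might look roughly like this. Everything here is an illustrative sketch under assumed names; the real structures would live in backend-local memory and the capacities are arbitrary.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t Oid;
typedef struct MapEntry { Oid relid; Oid filenode; } MapEntry;

/* On-disk map contents as last read, plus this transaction's pending
 * overrides.  Both are hypothetical stand-ins for the lists sketched above. */
static MapEntry disk_map[2] = { {1259, 16384}, {1249, 16400} };
static MapEntry overrides[2];
static size_t n_overrides = 0;

/* Record a pending relocation: visible locally, written out only at commit. */
static void map_update_local(Oid relid, Oid newnode)
{
    for (size_t i = 0; i < n_overrides; i++)
        if (overrides[i].relid == relid) { overrides[i].filenode = newnode; return; }
    if (n_overrides < sizeof(overrides) / sizeof(overrides[0]))
    {
        overrides[n_overrides].relid = relid;
        overrides[n_overrides].filenode = newnode;
        n_overrides++;
    }
}

/* Lookup consults local overrides first, then the on-disk map. */
static Oid map_lookup(Oid relid)
{
    for (size_t i = 0; i < n_overrides; i++)
        if (overrides[i].relid == relid) return overrides[i].filenode;
    for (size_t i = 0; i < sizeof(disk_map) / sizeof(disk_map[0]); i++)
        if (disk_map[i].relid == relid) return disk_map[i].filenode;
    return 0;
}

/* At commit the overrides would be merged and written out; at abort we
 * simply discard them, as the notes above prescribe. */
static void map_abort(void) { n_overrides = 0; }
```

Keeping overrides separate from the cached on-disk contents is what later lets a writer re-read somebody else's committed changes and merge its own on top before writing.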
> "Write" must include generating a WAL entry as well as sending a shared-inval
> message:
> * write temp file, fsync it
> * emit WAL record containing copy of new file contents, fsync it
> * atomically rename temp file into place
> * send sinval
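The file-level part of that recipe (temp file, fsync, atomic rename) is standard POSIX practice and can be sketched as below. The WAL record and sinval steps need server infrastructure and are omitted; the function name and paths are illustrative, not the real implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write new map contents durably: write a temp file, fsync it, then
 * atomically rename it over the old file so readers ever see only the
 * complete old or complete new version.  Finally fsync the directory so
 * the rename itself survives a crash. */
static int write_map_durably(const char *dir, const char *final_name,
                             const void *data, size_t len)
{
    char tmp[1024], dst[1024];
    snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, final_name);
    snprintf(dst, sizeof(dst), "%s/%s", dir, final_name);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    if (rename(tmp, dst) != 0)          /* atomic replacement on POSIX */
    {
        unlink(tmp);
        return -1;
    }

    int dfd = open(dir, O_RDONLY);      /* persist the directory entry too */
    if (dfd >= 0)
    {
        fsync(dfd);
        close(dfd);
    }
    return 0;
}
```

In the real sequence the WAL record containing the new file contents must be fsynced before the rename, so that replay can reconstruct the map if we crash in between.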
> During relation open, check for sinval before relying on current cached value
> of map contents
> Hm, what about concurrent updates of the map? Probably instantiate a new
> LWLock, or maybe better a heavyweight (HW) lock. So a write involves taking
> the lock, checking for updates, and finally writing --- which means that
> local modifications
> have to be stored in a way that allows re-reading somebody else's mods.
> Hence above data structure with separate storage of local modifications.
> We assume rel-level locking protects us from concurrent update attempts
> on the same map entry, but we don't want to forbid concurrent relocations
> of different catalogs.
> LWLock would work if we do an unconditional read of the file contents after
> getting lock rather than relying on sinval signaling, which seems a small
> price considering map updates should be infrequent. Without that, writers
> have to hold the lock till commit which rules out using LWLock.
> Hm ... do we want an LWLock per map file, or is one lock to rule them all
> sufficient? LWLock per database seems problematic. With an HW lock there
> wouldn't be a problem with that. HW lock would allow concurrent updates of
> the map files of different DBs, but is that worth the extra cycles?
> In a case where a transaction relocates pg_class (or another mapped catalog)
> and then makes updates in that catalog, all is well in the sense that the
> updates will be preserved iff the xact commits. HOWEVER, would we have
> problems during WAL replay? No, because the WAL entries will command updates
> to the catalog's new relfilenode, which will be interesting to anyone else if
> and only if the xact gets to commit. We would need to be careful about the
> case of relocating pg_class itself though --- make sure local relcache
> references the new pg_class before any such updates happen. Probably a
> CommandCounterIncrement (CCI) is sufficient.
> Another issue for clustering a catalog is that sinval catcache signalling
> could get confused, since that depends on catalog entry TIDs which would
> change --- we'd likely need to have relocation send an sinval signal saying
> "flush *everything* from that catalog". (relcache inval on the catalog itself
> doesn't do that.) This action could/should be transactional so no
> additional code is needed to propagate the notice to HS standbys.
> Once the updated map file is moved into place, the relocation is effectively
> committed even if we subsequently abort the transaction. We can make that
> window pretty narrow but not remove it completely. Risk factors:
> * abort would try to remove relation files created by xact, in particular
> the new version of the catalog. Oops. Can fix this by telling smgr to
> forget about removing those files before installing the new map file ---
> better to leak some disk space than destroy catalogs.
> * on abort, we'd not send sinval notices, leaving other backends at risk
> of using stale info (maybe our own too). We could fix this by sending
> the sinval notice BEFORE renaming into place (if we send it and then fail
> to rename, no real harm done, just useless cache flushes). This requires
> keeping a nontransactional-inval path in inval.c, but at least it's much
> more localized than before. No problem for HS since we have the WAL record
> for map update to drive issuing sinvals on the slave side. Note this
> means that readers need to take the mapfile lock in shared mode, since they
> could conceivably get the sinval notice before we complete the rename.
> For our own backend we need cache flushes at CCI and again at xact
> commit/abort. This could be handled by regular transactional inval
> path but we end up with a lot of redundant flushing. Maybe not worth
> worrying about though.
> Disallow catalog relocation inside subtransactions, to avoid having
> to handle subxact abort effects on the local-map-changes state.
> This could be implemented if desired, but doesn't seem worth it
> at least in first pass.
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
+ If your life is a hard drive, Christ can be your backup. +