Re: heap metapages

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: heap metapages
Date: 2012-05-21 19:15:43
Message-ID: CA+U5nMKmCtgyn9WqNVadMohwBLNGMRiSRLQ1cQqu9-bDrU31Kw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21 May 2012 13:56, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> At dinner on Friday night at PGCon, the end of the table that included
> Tom Lane, Stephen Frost, and myself got to talking about the idea of
> including some kind of metapage in every relation, including heap
> relations.  At least some index relations already have something like
> this (cf _bt_initmetapage, _hash_metapinit).  I believe that adding
> this for all relations, including heaps, would allow us to make
> improvements in several areas.

The only thing against these ideas is that you're putting the design
before the requirements, which always makes me nervous.

I very much like the idea of a common framework to support multiple
requirements. If we can view a couple of other designs as well it may
quickly become clear this is the right way. In any case, the topics
discussed here are important ones, so thanks for covering them.

What springs immediately to mind is why this would not be just another fork.

> 1. Tom was interested in the idea of trying to make the system catalog
> entries which describe the system catalogs themselves completely
> immutable, so that they can potentially be shared between databases.
> For example, we might have shared catalogs pg_class_shared and
> pg_attribute_shared, describing the structure of all the system
> catalogs; and then we might also have pg_class and pg_attribute within
> each database, describing the structure of tables which exist only
> within that database.  Right now, this is not possible, because values
> like relpages, reltuples, and relfrozenxid can vary from database to
> database.  However, if those values were stored in a metapage
> associated with the heap relation rather than in the system catalogs,
> then potentially we could make this work.  The most obvious benefit of
> this is that it would reduce the on-disk footprint of a new database,
> but there are other possible benefits as well.  For example, a process
> not bound to a database could read a shared catalog even if it weren't
> nailed, and if we ever implement a prefork system for backends, they'd
> be able to do more of their initialization steps before learning which
> database they were to target.

This is important. I like the idea of breaking down the barriers
between databases to allow it to be an option for one backend to
access tables in multiple databases. The current mechanism doesn't
actually prevent looking at data from other databases using internal
APIs, so full security doesn't exist. It's a very common user
requirement to wish to join tables stored in different databases,
which ought to be possible more cleanly with correct privileges.

> 2. I'm interested in having a cleaner way to associate
> non-transactional state with a relation.  This has come up a few
> times.  We currently handle this by having lazy VACUUM do in-place
> heap updates to replace values like relpages, reltuples, and
> relfrozenxid, but this feels like a kludge.  It's particularly scary
> to think about relying on this for anything critical given that
> non-inplace heap updates can be happening simultaneously, and the
> consequences of losing an update to relfrozenxid in particular are
> disastrous.  Plus, it requires hackery in pg_upgrade to preserve the
> value between the old and new clusters; we've already had to fix two
> data-destroying bugs in that logic.  There are several other things
> that we might want to do that have similar requirements.  For example,
> Pavan's idea of folding VACUUM's second heap pass into the next vacuum
> cycle requires a relation-wide piece of state which can probably be
> represented as a single bit, but putting that bit in pg_class would
> require the same sorts of hacks there that we already have for
> relfrozenxid, with similar consequences if it's not properly
> preserved.  Making unlogged tables logged or the other way around
> appears to require some piece of relation-level state *that can be
> accessed during recovery*, and pg_class is not going to work for that.
>  Page checksums have a similar requirement if the granularity for
> turning them on and off is anything less than the entire cluster.
> Whenever we decide to roll out a new page version, we'll want a place
> to record the oldest page version that might be present in a
> particular relation, so that we can easily check whether a cluster can
> be upgraded to a new release that has dropped support for an old page
> version.  Having a common framework for all of these things seems like
> it will probably be easier than solving each problem individually, and
> a metapage is a good place to store non-transactional state.

I thought there was a patch that put that info in a separate table 1:1
with pg_class.

Not very sure why a metapage is better than a catalog table. We would
still want a view that allows us to access that data as if it were a
catalog table.

> 3. Right now, a new table uses up a minimum of 3 inodes, even if it
> has no indexes: one for the main fork, one for the visibility map, and
> one for the free space map.  For people who have lots and lots of
> little tiny tables, this is quite inefficient.  The amount of
> information we'd have to store in a heap metapage would presumably not
> be very big, so we could potentially move the first, say, 1K of the
> visibility map into the heap metapage, meaning that tables less than
> 64MB would no longer require a separate visibility map fork.
> Something similar could possibly be done with the free-space map,
> though I am unsure of the details.  Right now, a relation containing
> just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3
> for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB
> and 1 inode would be very nice.  The case of a completely-empty
> relation is a bit annoying; that right now takes 1 inode and 0 blocks
> and I suspect we'd end up with 1 inode and 1 block, but I think it
> might still be a win overall.

Again, there are other ways to optimise the FSM for small tables.

> 4. Every once in a while, somebody's database ends up in pieces in
> lost+found.  We could make this a bit easier to recover from by
> including the database OID, relfilenode, and table OID in the
> metapage.  This wouldn't be perfect, since a relation over one GB
> would still only have one metapage, so additional relation segments
> would still be a problem.  But it would be still be a huge improvement
> over the status quo: some very large percentage of the work of putting
> everything back where it goes could probably be done by a Perl script
> that read all the metapages, and if you needed to know, say, which
> file contained pg_class, that would be a whole lot easier, too.

That sounds like the requirement that is driving this idea.

> Now, there are a couple of obvious problems here, the biggest of which
> is probably that we want to avoid breaking pg_upgrade.  I don't have a
> great solution to that problem.  The most realistic option I can think
> of at the moment is to fudge things so that existing features can
> continue to work even if the metapage isn't present.  Any table
> rewrite would add a metapage; if you want to use a new feature that
> requires a metapage, you have to rewrite the table first to get one.
> However, that's pretty unfortunate in terms of goal #1 and some parts
> of goal #2, because if you can't be certain of having the metapage
> present then you can't really store data there in lieu of pg_class;
> the best you'll be able to do is have it both places, at least until
> you're ready to deprecate upgrades from releases that don't contain
> metapage support.  Hopefully someone has a better idea...

You don't have to rewrite the table, you just need to update the rows
so they migrate to another block.

That seems easy enough, but still not sure why you wouldn't just use
another fork. Or another idea would be to have the first page have a
non-zero pd_special.

I know you were recording what was discussed as an initial starting
point. Looks like a good set of problems to solve.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2012-05-21 19:22:52 Re: Why is indexonlyscan so darned slow?
Previous Message Stephen Frost 2012-05-21 19:15:25 Re: heap metapages