Re: Avoiding adjacent checkpoint records

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Avoiding adjacent checkpoint records
Date: 2012-06-08 15:24:22
Message-ID: CA+TgmoZNqSbuJwYB8ZGtSf0qQFcDeXU+LKvLqxLczcM-OnZoFQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 7, 2012 at 9:25 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> The only risk of data loss is in the case where someone deletes their
> pg_xlog and who didn't take a backup in all that time, which is hardly
> recommended behaviour. We're at exactly the same risk of data loss if
> someone deletes their pg_clog. Too frequent checkpoints actually makes
> the data loss risk from deleted pg_clog greater, so the balance of
> data loss risk doesn't seem to have altered.

This doesn't match my experience. pg_xlog is often located on a
separate disk, which significantly increases the chances of something
bad happening to it, either through user error or because, uh, disks
sometimes fail. Now, granted, you can also lose your data directory
(including pg_clog) this way, but just because we lose data in that
situation doesn't mean we should be happy about also losing data when
pg_xlog goes down the toilet, especially when we can easily prevent it
by going back to the behavior we've had in every previous release.

Now, I have had customers lose pg_clog data, and it does suck, but
it's usually a safe bet that most of the missing transactions
committed, so you can pad out the missing files with 0x55, and
probably get your data back. On the other hand, it's impossible to
guess what any missing pg_xlog data might have been. Perhaps if the
data pages are on disk and only CLOG didn't get written you could
somehow figure out which bits you need to flip in CLOG to get your
data back, but that's pretty heavy brain surgery, and if autovacuum or
even just a HOT prune runs before you realize that you need to do it
then you're toast. OTOH, if the database has checkpointed,
pg_resetxlog is remarkably successful in letting you pick up the
pieces and go on with your life.
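
For anyone wondering what that padding actually involves, here's roughly
what it amounts to (a minimal sketch, not the exact script I send out;
it assumes the default 8kB block size and 32 pages per SLRU segment, so
256kB per pg_clog segment).  Each CLOG byte packs four 2-bit transaction
statuses, and 0x01 means committed, so a byte of 0x55 marks all four
transactions as committed:

#!/usr/bin/env python
# Recreate a missing pg_clog segment with every transaction marked
# committed.  Each CLOG byte holds four 2-bit statuses; 0x01 is
# "committed", so 0x55 (binary 01010101) commits all four.
# Assumes the defaults: BLCKSZ = 8192, 32 pages per SLRU segment.
import os, sys

BLCKSZ = 8192
PAGES_PER_SEGMENT = 32
SEGMENT_SIZE = BLCKSZ * PAGES_PER_SEGMENT       # 262144 bytes

def pad_segment(path):
    # Create the file if it's missing, or top up a truncated one.
    existing = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as f:
        f.write(b"\x55" * (SEGMENT_SIZE - existing))

if __name__ == "__main__":
    # e.g. pad_clog.py $PGDATA/pg_clog/0004
    for seg in sys.argv[1:]:
        pad_segment(seg)
        print("padded %s to %d bytes" % (seg, SEGMENT_SIZE))

You only run that, with the server stopped, against the segments the
server is actually complaining about; and of course any transaction that
really did abort will now look committed, which is why it's a way to
salvage data rather than a fix.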

All that having been said, it wouldn't be a stupid idea to have a
little more redundancy in our CLOG mechanism than we do right now.
Hint bits help, as does the predictability of the data, but it's still
awfully scary to have that much critical data packed into that
small a space. I'd love to see us checksum those pages, or store the
data in some redundant location that makes it unlikely we'll lose both
copies, or ship a utility that will scan all your heap pages and try
to find hint bits that reveal which transactions committed and which
ones aborted, or all of the above. But until then, I'd like to make
sure that we at least have the data on the disk instead of sitting
dirty in memory forever.
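
To make the hint-bit idea a bit more concrete, something along these
lines is the shape of utility I have in mind.  This is only a sketch
under a pile of assumptions (default 8kB blocks, little-endian hardware,
the usual heap tuple header layout; it ignores redirect/dead line
pointers, multixacts, and frozen tuples), but it shows how much the hint
bits alone can tell you:

#!/usr/bin/env python
# Sketch of a heap-page scanner: report which XIDs the hint bits
# already mark as committed or aborted.  Assumes BLCKSZ = 8192,
# little-endian byte order, and the usual heap tuple header layout.
import struct, sys

BLCKSZ = 8192
LP_NORMAL = 1

# t_infomask hint bits
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID   = 0x0200
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID   = 0x0800

def scan_page(page, committed, aborted):
    # pd_lower lives at offset 12 of the page header; the line pointer
    # array starts at offset 24, four bytes per entry.
    pd_lower = struct.unpack_from("<H", page, 12)[0]
    for i in range((pd_lower - 24) // 4):
        lp, = struct.unpack_from("<I", page, 24 + 4 * i)
        lp_off = lp & 0x7FFF
        lp_flags = (lp >> 15) & 0x3
        if lp_flags != LP_NORMAL:
            continue
        # Tuple header: t_xmin at +0, t_xmax at +4, t_infomask at +20.
        t_xmin, t_xmax = struct.unpack_from("<II", page, lp_off)
        t_infomask, = struct.unpack_from("<H", page, lp_off + 20)
        if t_infomask & HEAP_XMIN_COMMITTED:
            committed.add(t_xmin)
        elif t_infomask & HEAP_XMIN_INVALID:
            aborted.add(t_xmin)
        if t_xmax != 0:
            if t_infomask & HEAP_XMAX_COMMITTED:
                committed.add(t_xmax)
            elif t_infomask & HEAP_XMAX_INVALID:
                aborted.add(t_xmax)

if __name__ == "__main__":
    committed, aborted = set(), set()
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            while True:
                page = f.read(BLCKSZ)
                if len(page) < BLCKSZ:
                    break
                scan_page(page, committed, aborted)
    print("hinted committed: %d xids, hinted aborted: %d xids"
          % (len(committed), len(aborted)))

A real utility would have to handle all the cases that sketch punts on,
but even something that crude would let you rebuild a useful chunk of
CLOG from what's already sitting on disk.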

As a general thought about disaster recovery, my experience is that if
you can tell a customer to run a command (like pg_resetxlog), or - not
quite as good - if you can tell them to run some script that you email
them (like my pad-out-the-CLOG-with-0x55 script), then they're willing
to do that, and it usually works, and they're as happy as they're
going to be. But if you tell them that they have to send you all
their data files or let you log into the machine and poke around for
$X/hour * many hours, then they typically don't want to do that.
Sometimes it's legally or procedurally impossible for them; even if
not, it's cheaper to find some other way to cope with the situation,
so they do, but now - the way they view it - the database lost their
data. Even if the problem was entirely self-inflicted, like an
intentional deletion of pg_xlog, and even if they therefore understand
that it was entirely their own stupid fault that the data got eaten,
it's a bad experience. For that reason, I think we should be looking
for opportunities to increase the recoverability of the database in
every area. I'm sure that everyone on this list who works with
customers on a regular basis has had customers who lost pg_xlog, who
lost pg_clog (or portions thereof), who dropped their main table, who
lost the backing files for pg_class and/or pg_attribute, whose
database ended up in lost+found, who had a break in WAL, who had
individual blocks corrupted or unreadable within some important table,
who were missing TOAST chunks, who took a pg_basebackup and failed to
create recovery.conf, who had a corrupted index on a critical system
table, who had inconsistent system catalog contents. Some of these
problems are caused by bad hardware or bugs, but the most common cause
is user error. Regardless of the cause, the user wants to get as much
of their data back as possible as quickly and as easily and as
reliably as possible. To the extent that we can transform situations
that would have required consulting hours into situations
from which a semi-automated recovery is possible, or situations that
would have required many consulting hours into ones that require only
a few, that's a huge win. Of course, we shouldn't place that goal
above all else; and of course, this is only one small piece of that.
But it is a piece, and it has a tangible benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
