Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-25 20:08:39
Message-ID: CAHyXU0xr+PcufmcbJk5hvz9w+H5R2Sc65NJ8-B+MFGqqT98EkQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>> What if the subsequent dataloss was in fact a symptom of the first
>> outage? Is it in theory possible for data to appear visible but then be
>> eaten up as the transactions making the data visible get voided out by
>> some other mechanic? I had to pull a quick restart the first time and
>> everything looked ok -- or so I thought. What I think was actually
>> happening is that data started to slip into the void. It's like
>> randomly sys catalogs were dropping off. I bet other data was, too. I
>> can pull older backups and verify that. It's as if some creeping xmin
>> was snuffing everything out.
>
> Might be interesting to look at age(xmin) in a few different system
> catalogs. I think you can ignore entries with age = 2147483647;
> those should be frozen rows. But if you see entries with very large
> ages that are not that, it'd be suspicious.

Nothing really stands out.
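
For anyone repeating the check Tom describes, here is a rough Python sketch
of the filtering step. It is only an illustration: the query string, the
function name, and the 200-million cutoff (borrowed from the default
autovacuum_freeze_max_age) are assumptions of this sketch, not something
from the thread. Rows with age = 2147483647 are treated as frozen and
skipped, per Tom's note.

```python
# Sketch only: names and the threshold below are illustrative assumptions.

FROZEN_AGE = 2147483647          # value Tom says to ignore (frozen rows)
SUSPICIOUS_AGE = 200_000_000     # assumed cutoff, default autovacuum_freeze_max_age

# Example query to run per catalog (pg_class shown); feed its rows to split_by_age().
CATALOG_AGE_QUERY = "SELECT relname, age(xmin) FROM pg_class ORDER BY 2 DESC"


def split_by_age(rows, threshold=SUSPICIOUS_AGE):
    """Partition (name, age) rows into frozen, suspicious, and ordinary lists."""
    frozen, suspicious, ordinary = [], [], []
    for name, age in rows:
        if age == FROZEN_AGE:
            frozen.append((name, age))       # frozen: ignore
        elif age > threshold:
            suspicious.append((name, age))   # very large but not frozen
        else:
            ordinary.append((name, age))
    return frozen, suspicious, ordinary


if __name__ == "__main__":
    # Hypothetical rows, not taken from the affected cluster.
    sample = [("pg_class", 2147483647), ("pg_proc", 1200), ("pg_type", 1_500_000_000)]
    frozen, suspicious, ordinary = split_by_age(sample)
    for name, age in suspicious:
        print("suspicious:", name, age)
```

The same function can be pointed at the output of the query for other
catalogs (pg_attribute, pg_type, and so on) by swapping the table name.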

The damage did re-occur after a dump/restore -- I'm not sure about a
cluster-level rebuild. There were no problems previous to that. This
suggests that, if this theory holds, the damage would have to have been
below the database level -- perhaps in clog. For example, hint bits and
clog may not have agreed on commit or delete status. clog has plenty
of history leading past the problem barrier:
-rwx------ 1 postgres postgres 256K Jul 10 16:21 0000
-rwx------ 1 postgres postgres 256K Jul 21 12:39 0001
-rwx------ 1 postgres postgres 256K Jul 21 13:19 0002
-rwx------ 1 postgres postgres 256K Jul 21 13:59 0003
<snip>
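
As a side note on reading that listing: each 256K segment corresponds to a
fixed range of transaction IDs. The sketch below shows the arithmetic,
assuming a default build (8 kB blocks, 2 status bits per transaction,
32 pages per segment). The constant names mirror the macros in the
PostgreSQL source, but the function itself is just an illustration and
not anything that was run against this cluster.

```python
# Sketch only: assumes a default PostgreSQL build (BLCKSZ = 8192).

BLCKSZ = 8192                    # bytes per page
CLOG_BITS_PER_XACT = 2           # status bits stored per transaction
CLOG_XACTS_PER_BYTE = 4          # 8 bits / 2 bits
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE        # 32768
SLRU_PAGES_PER_SEGMENT = 32
SEGMENT_BYTES = BLCKSZ * SLRU_PAGES_PER_SEGMENT           # 262144 (256K)
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1048576


def clog_location(xid):
    """Return where the commit status bits for `xid` live in the clog files."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    return {
        "segment_file": "%04X" % segment,    # file name as in the listing
        "page_in_segment": page % SLRU_PAGES_PER_SEGMENT,
        "byte_in_page": (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_BYTE,
        "bit_shift": (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT,
    }


if __name__ == "__main__":
    # Hypothetical xid, chosen only to show the mapping.
    print(clog_location(3_000_000))
```

Under those assumptions each segment file covers 1,048,576 transaction IDs,
so the file an xid falls into is simply xid / 1048576 written in hex.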

Confirmation of problem re-occurrence will come in a few days. I'm
much more likely to believe in a 6+ sigma occurrence (storage, freak bug,
etc.) should it prove that the problem goes away post rebuild.

merlin
