Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date: 2012-09-17 06:44:42
Message-ID: 201209170844.43077.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On Monday, September 17, 2012 07:35:06 AM Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sep 15, 2012, at 11:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Right, but we do a shutdown checkpoint at the end of crash recovery.
(as noted somewhere else and tackled by Simon, a END_OF_RECOVERY didn't sync
those before)

> > Yes, but that only writes the buffers that are dirty. It doesn't fix the
> > lack of a BM_PERMANENT flag on a buffer that ought to have had one. So
> > that page can now get modified AGAIN, after recovery, and not
> > checkpointed.
>
> Ugh. Yeah, we need to fix this ASAP. I've notified pgsql-packagers to
> expect a release this week.
Btw, I played with this some more on Saturday and I think, while definitely a
bad bug, the actual consequences aren't as bad as at least I initially feared.

Fake relcache entries are currently set in 3 scenarios during recovery:
1. removal of ALL_VISIBLE in heapam.c
2. incomplete splits and incomplete deletions in nbtxlog.c
3. incomplete splits in ginxlog.c

As Jeff nicely showed its easy to corrupt the visibilitymap with this.
Fortunately in < 9.2 that doesn't have too bad consequences and will be fixed
by the next vacuum. In 9.2 that obviously can result in wrong query results but
will still be fixed by a later (probably full table) vacuum.

To hit 2) and 3) the server needs to have crashed (or a strange
recovery_target_* set) while doing a multilevel operation. I have only
cursorily looked at gin but it looks to me like in both, nbtree and gin, the
window during logging the multiple steps in a split/deletion is fairly short
and its not likely that we crashed exactly during that. Unless we haven't read
the complete multistep operation during recovery we won't ever create fake
relcache entries for those relations/buffers! And even if that edge case is hit
it seems somewhat likely that the few pages that are read with the fake entry
are still in the cache (the incomplete operation has to have been soon before)
and thus won't get the bad relpersistence flag set.

So I think while that bug had the possibility of being really bad we were
pretty lucky...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Simon Riggs 2012-09-17 08:44:39 Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Previous Message Tom Lane 2012-09-17 05:35:06 Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2012-09-17 07:03:28 Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Previous Message Tom Lane 2012-09-17 05:35:06 Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.