Re: [HACKERS] pgsql: Fix freezing of a dead HOT-updated tuple

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Wood, Dan" <hexpert(at)amazon(dot)com>, pgsql-committers(at)postgresql(dot)org, "Wong, Yi Wen" <yiwong(at)amazon(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] pgsql: Fix freezing of a dead HOT-updated tuple
Date: 2017-11-04 13:15:00
Message-ID: 20171104131500.w7enyo7olxk7ygca@alap3.anarazel.de

On 2017-11-03 12:36:59 -0700, Peter Geoghegan wrote:
> Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Here's that patch. I've stared at this some, and Robert did too. Robert
> > mentioned that the commit message might need some polish and I'm not
> > 100% sure about the error message texts yet.
>
> The commit message should probably say that the bug involves the
> resurrection of previously dead tuples, which is different to there
> being duplicates because a constraint is not enforced because HOT chains
> are broken (that's a separate, arguably less serious problem).

The reason for that is that I hadn't yet quite figured out how the bug I
described in the commit message (and the previously committed testcase)
would cause that resurrection. I was planning to diagnose / experiment
more with this and write an email if I couldn't come up with an
explanation. The committed test does *not* actually trigger the
resurrection.

The reason I couldn't quite figure out how the problem triggers is that
the xmax removing branch in FreezeMultiXactId() can only be reached if
the multi is from before the cutoff - which it can't have been if a
single vacuum execution were to trigger the bug, because that'd require
the multi to still be running for HTSV to falsely return RECENTLY_DEAD
rather than DEAD (and by definition a running multi can't be below the
cutoff).
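
To make that invariant concrete, here's a toy standalone sketch
(hypothetical, using plain integers instead of the real wraparound-aware
xid arithmetic): the cutoff is derived from the oldest xid that may
still be running, so nothing still running can precede it.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t ToyXid;

    static bool
    toy_precedes(ToyXid a, ToyXid b)
    {
        return a < b;           /* ignores real wraparound comparison */
    }

    int
    main(void)
    {
        ToyXid oldest_running = 1000;   /* oldest xid still running */
        ToyXid cutoff = oldest_running; /* cutoff can't be newer than this */

        /*
         * Every running xid r satisfies r >= oldest_running >= cutoff, so
         * toy_precedes(r, cutoff) is always false: a running (multi)xact
         * can never be below the cutoff.
         */
        for (ToyXid r = oldest_running; r < oldest_running + 3; r++)
            printf("xid %u running, below cutoff? %s\n",
                   (unsigned) r, toy_precedes(r, cutoff) ? "yes" : "no");
        return 0;
    }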

For the problem to occur I think vacuum has to be executed *twice*: The
first time through, HTSV mistakenly returns RECENTLY_DEAD, preventing
the tuple from being pruned. That triggers FreezeMultiXactId() to create
a *new* multi with dead members. At this point the database is already
in a bad state. Then, in a second vacuum, HTSV returns DEAD, but
    /*
     * Ordinarily, DEAD tuples would have been removed by
     * heap_page_prune(), but it's possible that the tuple
     * state changed since heap_page_prune() looked. In
     * particular an INSERT_IN_PROGRESS tuple could have
     * changed to DEAD if the inserter aborted. So this
     * cannot be considered an error condition.
     *
    ..
    if (HeapTupleIsHotUpdated(&tuple) ||
        HeapTupleIsHeapOnly(&tuple))
    {
        nkeep += 1;

prevents the tuple from being removed. If the multi xmax is now below
the xmin horizon, this triggers
    /*
     * If the xid is older than the cutoff, it has to have aborted,
     * otherwise the tuple would have gotten pruned away.
     */
    if (TransactionIdPrecedes(xid, cutoff_xid))
    {
        if (TransactionIdDidCommit(xid))
            elog(ERROR, "can't freeze committed xmax");
        *flags |= FRM_INVALIDATE_XMAX;
in FreezeMultiXactId(). Without the new elog, this then causes xmax to
be removed, reviving the tuple.
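
To spell that sequence out, here's a deliberately dumbed-down standalone
sketch of the two passes (toy names and toy state only - none of this is
the actual vacuum code):

    #include <stdio.h>
    #include <stdbool.h>

    typedef enum { TOY_RECENTLY_DEAD, TOY_DEAD } ToyHTSV;

    typedef struct
    {
        bool xmax_valid;            /* tuple still has a (multi)xact xmax */
        bool multi_members_dead;    /* all members of that multi are dead */
    } ToyTuple;

    /* pass 1 models the buggy HTSV answer, pass 2 the correct one */
    static ToyHTSV
    toy_htsv(int pass)
    {
        return (pass == 1) ? TOY_RECENTLY_DEAD : TOY_DEAD;
    }

    static void
    toy_vacuum(ToyTuple *tup, int pass)
    {
        if (toy_htsv(pass) == TOY_RECENTLY_DEAD)
        {
            /*
             * Not pruned; the freeze path replaces the multi with a *new*
             * one whose members are already dead - corruption step one.
             */
            tup->multi_members_dead = true;
            printf("pass %d: tuple kept, new multi with dead members\n", pass);
        }
        else
        {
            /*
             * DEAD but HOT-updated, so kept (nkeep); the freeze path then
             * sees a below-cutoff multi and invalidates xmax, reviving the
             * dead tuple - corruption step two.
             */
            tup->xmax_valid = false;
            printf("pass %d: xmax invalidated, tuple revived\n", pass);
        }
    }

    int
    main(void)
    {
        ToyTuple tup = {true, false};

        toy_vacuum(&tup, 1);
        toy_vacuum(&tup, 2);
        printf("dead tuple visible again: %s\n", tup.xmax_valid ? "no" : "yes");
        return 0;
    }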

The current testcase, and I think the descriptions in the relevant
threads, all actually fail to point out the precise way in which the bug
is triggered. That is evidenced e.g. by the fact that the
freeze-the-dead case appears not to cause any failures in !assertion
builds even if the bug is present.

The good news is that the error checks I added in the patch upthread
prevent all of this from happening, even though I'd not yet fully
understood the mechanics - it's imnsho pretty clear that we need to be
more paranoid around this in production builds. A bunch of users that
triggered largely "invisible" corruption (the first vacuum described
above) will possibly run into one of these elog()s, but that seems far
preferable to making the corruption a lot worse.

I think unfortunately the check + elog() in the
    HeapTupleIsHeapOnly(&tuple))
    {
        nkeep += 1;

        /*
         * If this were to happen for a tuple that actually
         * needed to be frozen, we'd be in trouble, because
         * it'd leave a tuple below the relation's xmin
         * horizon alive.
         */
        if (heap_tuple_needs_freeze(tuple.t_data, FreezeLimit,
                                    MultiXactCutoff, buf))
        {
            elog(ERROR, "encountered tuple from before xid cutoff, sleeping");
case needs to go, because it's too coarse - there very well could be
lockers or such that need to be removed, where it's fine to do so. The
checks in the actual freezing code ought to be sufficient, however.
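
As a toy illustration of that coarseness (hypothetical names, nothing
like the real member-walking code): a multi whose members are all
lockers can safely be invalidated, but a blanket "old multi => ERROR"
rule rejects it anyway, while only a committed updater below the cutoff
is genuine corruption.

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct
    {
        bool is_update;         /* an updater, not just a locker */
        bool committed;
    } ToyMember;

    /* coarse rule: any below-cutoff multi at all raises an error */
    static bool
    coarse_needs_error(int nmembers)
    {
        return nmembers > 0;
    }

    /*
     * precise rule: only a *committed updater* below the cutoff is real
     * corruption; pure lockers can simply be invalidated
     */
    static bool
    precise_needs_error(const ToyMember *members, int n)
    {
        for (int i = 0; i < n; i++)
            if (members[i].is_update && members[i].committed)
                return true;
        return false;
    }

    int
    main(void)
    {
        /* two lockers, no updater: fine to remove the multi */
        ToyMember lockers[] = {{false, true}, {false, false}};

        printf("coarse check: %s\n", coarse_needs_error(2) ? "ERROR" : "ok");
        printf("precise check: %s\n",
               precise_needs_error(lockers, 2) ? "ERROR" : "ok");
        return 0;
    }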

Greetings,

Andres Freund
