Re: Failed recovery with new faster 2PC code

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Failed recovery with new faster 2PC code
Date: 2017-04-18 12:12:43
Message-ID: CAB7nPqQ_RnV8QTYxtm7=hudY56jjtG1tbZgrFOuYF8AuDdFiZA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Yeh, this is better. Pushed.

I have been outraced on this one; the error is obvious once you see it ;)

Thanks for the investigation and the fix! I spent a couple of
hours reviewing the interactions between the shmem entries of 2PC
state data created at the beginning of recovery and all the lookups in
procarray.c and varsup.c, without noticing anything, by the way.

> The bug was that the loop set gxact to be the last entry in the array,
> causing the exit condition to fail and us then to remove the last
> gxact from memory even when it didn't match the xid, removing a valid
> entry too early. That then allowed xmin to move forwards, which causes
> autovac to remove pg_xact entries earlier than needed.
>
> Well done for finding that one, thanks for the patch.
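
For the archives, the failure mode described above can be sketched in
isolation. This is a simplified illustration only, not the actual
twophase.c code: the Gxact struct and the find_buggy()/find_fixed()
helpers are hypothetical names made up for this example. The point is
that a pointer assigned inside the loop is still non-NULL after the
loop when nothing matched, so testing it afterwards cannot detect a
miss.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a prepared-transaction entry. */
typedef struct
{
    unsigned int xid;
} Gxact;

/*
 * Buggy pattern: gxact is overwritten on every iteration, so when no
 * entry matches it is left pointing at the last array element. A
 * caller checking "gxact != NULL" would then remove the wrong entry.
 */
static Gxact *
find_buggy(Gxact **entries, int n, unsigned int xid)
{
    Gxact *gxact = NULL;

    for (int i = 0; i < n; i++)
    {
        gxact = entries[i];
        if (gxact->xid == xid)
            break;
    }
    return gxact;               /* non-NULL on a miss whenever n > 0 */
}

/*
 * Fixed pattern: only return an entry when its xid really matches,
 * and report a miss explicitly.
 */
static Gxact *
find_fixed(Gxact **entries, int n, unsigned int xid)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i]->xid == xid)
            return entries[i];
    }
    return NULL;                /* no match: nothing to remove */
}
```

With two entries and a lookup for an xid that is not present,
find_buggy() hands back the second entry while find_fixed() returns
NULL, which is the difference between dropping a valid entry too early
and leaving it alone.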

Running Jeff's test suite, I can confirm that there are no problems now.
--
Michael
