Failed recovery with new faster 2PC code

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Failed recovery with new faster 2PC code
Date: 2017-04-15 22:37:15
Message-ID: CAMkU=1xBP8cqdS5eK8APHL=X6RHMMM2vG5g+QamduuTsyCwv9g@mail.gmail.com
Lists: pgsql-hackers

Since the commit below, I get crash-recovery failures when using prepared
transactions.

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

After the induced crash, I get this failure in recovery:

FATAL: could not access status of transaction 334419347
DETAIL: Could not open file "pg_xact/013E": No such file or directory.
LOG: startup process (PID 60106) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

The earliest file that exists in pg_xact is 0176, so the requested segment
013E sorts well before anything still on disk.

Other examples:

FATAL: could not access status of transaction 121729737
DETAIL: Could not open file "pg_xact/0074": No such file or directory.
LOG: startup process (PID 23720) exited with exit code 1

FATAL: could not access status of transaction 181325554
DETAIL: Could not open file "pg_xact/00AC": No such file or directory.
LOG: startup process (PID 8375) exited with exit code 1
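
For reference, the file named in each DETAIL line can be derived from the transaction ID in the matching FATAL line. The sketch below assumes a default build (8 kB blocks, 2 status bits per transaction, 32 pages per SLRU segment); those constants and the helper names are assumptions for illustration and are not stated in the report.

```python
# Assumed constants for a default PostgreSQL build (not given in the report).
BLCKSZ = 8192                 # bytes per page
CLOG_XACTS_PER_BYTE = 4       # 2 status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32   # pages per pg_xact segment file

XACTS_PER_SEGMENT = BLCKSZ * CLOG_XACTS_PER_BYTE * SLRU_PAGES_PER_SEGMENT


def xact_segment_name(xid):
    """Return the pg_xact file name expected to hold the status of xid."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


def first_xid_in_segment(name):
    """Return the first transaction ID covered by a pg_xact segment file."""
    return int(name, 16) * XACTS_PER_SEGMENT


if __name__ == "__main__":
    # Transaction IDs taken from the three failures quoted above.
    for xid in (334419347, 121729737, 181325554):
        print(xid, "->", "pg_xact/" + xact_segment_name(xid))
    # Earliest segment reported as still present on disk.
    print("0176 starts at xid", first_xid_in_segment("0176"))
```

Under those assumptions the three transaction IDs map to 013E, 0074, and 00AC, matching the error messages, and the first failing transaction ID is lower than the first one covered by segment 0176.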

I experience this in about 1 out of 15 crash-recovery cycles on 8 CPUs.

The patch Pavan posted here did not make any difference:

https://www.postgresql.org/message-id/CABOikdMdhS4nYX7xHaF+m=P=q_zAJBCYsZ++VN26AZzDRf_xFA@mail.gmail.com

I've attached the test harness, which I think will look familiar to y'all.
It is the usual injection of torn-page-write crashes with consistency
checks after recovery, modified to include a very crude transaction
manager to make use of 2PC. The consistency checks make no difference
here, since the problem is that recovery itself fails.

Cheers,

Jeff

Attachment         Content-Type              Size
count.pl           application/octet-stream  11.2 KB
do.sh              application/x-sh          5.4 KB
crash_REL10.patch  application/octet-stream  12.8 KB
