PANIC in pg_commit_ts slru after crashes

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PANIC in pg_commit_ts slru after crashes
Date: 2017-04-14 19:23:10
Message-ID: CAMkU=1zMLnH_i1-PVQ-biZzvNx7VcuatriquEnh7HNk6K8Ss3Q@mail.gmail.com
Lists: pgsql-hackers

In the first statement executed after crash recovery, I sometimes get this
error:

PANIC: XX000: could not access status of transaction 207580505
DETAIL: Could not read from file "pg_commit_ts/1EF0" at offset 131072: Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

Other examples:

PANIC: XX000: could not access status of transaction 3483853232
DETAIL: Could not read from file "pg_commit_ts/20742" at offset 237568: Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)

PANIC: XX000: could not access status of transaction 802552883
DETAIL: Could not read from file "pg_commit_ts/779E" at offset 114688: Success.
LOCATION: SlruReportIOError, slru.c:918
STATEMENT: create temporary table aldjf (x serial)
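
As a cross-check on the file names and offsets in these messages, here is a
minimal sketch of how a transaction ID could map to a pg_commit_ts segment
file and byte offset. The constants are assumptions, not taken from this
report: an 8192-byte block size, a 10-byte commit timestamp entry, and 32
SLRU pages per segment. The function name commit_ts_location is hypothetical.
With those assumptions the arithmetic reproduces all three file/offset pairs
reported above.

```python
# Assumed constants (not stated in the report above).
BLCKSZ = 8192                 # default PostgreSQL block size
COMMIT_TS_ENTRY_SIZE = 10     # 8-byte timestamp + 2-byte origin id
SLRU_PAGES_PER_SEGMENT = 32   # pages per SLRU segment file
XACTS_PER_PAGE = BLCKSZ // COMMIT_TS_ENTRY_SIZE  # 819 entries per page


def commit_ts_location(xid):
    """Return (segment file name, byte offset of the page) for an xid."""
    page = xid // XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return "pg_commit_ts/%04X" % segment, offset


# The three transaction IDs from the PANIC messages.
for xid in (207580505, 3483853232, 802552883):
    filename, offset = commit_ts_location(xid)
    print(xid, filename, offset)
```

So in each case the page being read is the one that should hold the commit
timestamp of the transaction named in the PANIC.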

Since the reported errno is 0 ("Success"), I'm assuming the read itself
succeeded but returned fewer bytes than expected (zero bytes in the case I
saw after changing the code to log short reads).
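
To illustrate the kind of short read I mean, here is a small sketch that is
not part of the attached harness. It writes a two-page scratch file and then
reads a page that starts exactly at end-of-file. The read reports no OS
error, it just returns zero bytes, which would explain an error message
ending in "Success". The 8192-byte page size, the scratch file name, and the
function names are all assumptions made for the example.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed default PostgreSQL block size


def read_page(path, pageno):
    """Read one BLCKSZ-sized page; return the raw bytes (possibly short)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, pageno * BLCKSZ, os.SEEK_SET)
        return os.read(fd, BLCKSZ)
    finally:
        os.close(fd)


def short_read_demo():
    """Create a 2-page scratch file, then read page 1 and a missing page 2."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "segment")  # hypothetical scratch file
        with open(path, "wb") as f:
            f.write(b"\0" * (2 * BLCKSZ))
        present = read_page(path, 1)  # last page that exists
        missing = read_page(path, 2)  # starts exactly at end-of-file
    return len(present), len(missing)


if __name__ == "__main__":
    for label, nbytes in zip(("page 1", "page 2"), short_read_demo()):
        status = "ok" if nbytes == BLCKSZ else "short read, no OS error raised"
        print(f"{label}: {nbytes} bytes ({status})")
```

A caller that only compares the byte count against the page size and then
reports errno would see exactly this: a failure with no error code behind it.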

It then goes through recovery again and the problem does not immediately
re-occur if you attempt to connect again. I don't know why the file size
would have changed between attempts.

The problem bisects to the commit:

commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Date: Tue Apr 4 15:56:56 2017 -0400

Speedup 2PC recovery by skipping two phase state files in normal path

It is not obvious to me how that is relevant. My test doesn't use prepared
transactions (and leaves the GUC at zero), and this commit doesn't touch
slru.c.

I'm attaching the test harness. There is a patch which injects the
crash-faults and also allows xid fast-forward, a perl script that runs
until crash and assesses the consistency of the post-crash results, and a
shell script which sets up the database and then calls the perl script in a
loop. On an 8-CPU machine, it takes about an hour for the PANIC to occur.

The attached script bails out once it sees the PANIC (count.pl line 158). If
it didn't do that, it would proceed to connect again and retry, and that
works fine. There is no apparent permanent data corruption.

Any clues on how to investigate this further?

Cheers,

Jeff

Attachment                Content-Type              Size
count.pl                  application/octet-stream  8.8 KB
crash_REL10.patch         application/octet-stream  12.8 KB
do.sh                     application/x-sh          4.9 KB
slru_log_read_size.patch  application/octet-stream  1.2 KB
