Notify system doesn't recover from "No space" error

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Notify system doesn't recover from "No space" error
Date: 2012-02-09 16:18:21
Message-ID: 4F339D6D02000025000450F5@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On a development system which wasn't being monitored very closely
I think we've uncovered a bug in the listen/notify behavior. The
system was struggling with lack of disk space for about a week
before this surfaced, so it's probably not the most critical sort
of bug we could have, but it probably merits a fix.

We were getting all sorts of logging about the space problem, and
PostgreSQL soldiered bravely on, doing the best it could under such
adverse conditions. (That's pretty impressive, really.) A lot of
messages like these:

[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> ERROR:
could not extend file "base/16401/46507734": wrote only 4096 of 8192
bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 <cc cc 127.0.0.1(35321)> HINT:
Check free disk space.

[2012-02-05 01:25:40.643 CST] 5323 LOG: could not write temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device

[2012-02-05 01:26:38.971 CST] 15065 <cc cc 127.0.0.1(53258)> ERROR:
could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute
before our first message about pg_notify problems, the statistics
messages went from complaining about not being able to *write* the
file to also complaining about not being able to *close* the file --
like this:

[2012-02-05 01:26:41.159 CST] 5323 LOG: could not close temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG: could not write temporary
statistics file "pg_stat_tmp/pgstat.tmp": No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING: pgstat wait timeout

We got three errors in the log with exactly this DETAIL (file,
offset and message):

[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> ERROR:
could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)> DETAIL:
Could not write to file "pg_notify/03A5" at offset 188416: No space
left on device.
[2012-02-05 01:27:36.878 CST] 14892 <cc cc 127.0.0.1(35320)>
STATEMENT: COMMIT
[2012-02-05 01:27:36.882 CST] 14892 <cc cc 127.0.0.1(35320)> LOG:
disconnection: session time: 0:07:00.600 user=cc database=cc
host=127.0.0.1 port=35320

After that the message changed to a new offset and a message of
"Success." for the failure:

[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> ERROR:
could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)> DETAIL:
Could not read from file "pg_notify/03A5" at offset 253952:
Success.
[2012-02-05 01:30:36.952 CST] 16383 <cc cc 127.0.0.1(38931)>
STATEMENT: COMMIT
[2012-02-05 01:30:36.956 CST] 16383 <cc cc 127.0.0.1(38931)> LOG:
disconnection: session time: 0:01:59.537 user=cc database=cc
host=127.0.0.1 port=38931

There were 4591 messages over the course of five days with exactly
that filename and offset and a message of "Success.", all for
transaction 0. About midway through that time disk space was freed
and all other symptoms cleared. It really is pretty impressive that
PostgreSQL survived that and was able to resume normal operations in
all other respects once disk space was freed for it. Only this one
feature stayed in a bad state.

-Kevin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-09 17:02:29 index-only quals vs. security_barrier views
Previous Message Peter Geoghegan 2012-02-09 15:36:23 Re: Progress on fast path sorting, btree index creation time