Re: Notify system doesn't recover from "No space" error

From: Christoph Berg <cb(at)df7cb(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Notify system doesn't recover from "No space" error
Date: 2012-06-29 08:24:30
Message-ID: 20120629082430.GA905@msgid.df7cb.de
Lists: pgsql-hackers

[Resending as the original post didn't get through to the list]

Warming up an old thread here - we ran into the same problem.

Database is 9.1.4/x86_64 from Debian/testing. The client application
is bucardo hammering the database with NOTIFYs (including some
master-master replication conflicts, that might add to the parallel
NOTIFY load).

The problem is reproducible with the attached instructions (several
ENOSPC cycles might be required). When the filesystem is filled using
dd, the bucardo and psql processes will die with this error:

ERROR: 53100: could not access status of transaction 0
DETAIL: Could not write to file "pg_notify/0000" at offset 180224: No space left on device.
LOCATION: SlruReportIOError, slru.c:861

The line number might be different; sometimes the error is ENOENT,
sometimes even "Success".

Even after disk space is available again, subsequent "NOTIFY foobar"
calls will die, without any other clients connected:

ERROR: XX000: could not access status of transaction 0
DETAIL: Could not read from file "pg_notify/0000" at offset 245760: Success.
LOCATION: SlruReportIOError, slru.c:854

Here's a backtrace, caught at slru.c:430:

430 SlruReportIOError(ctl, pageno, xid);
(gdb) bt
#0 SimpleLruReadPage (ctl=ctl@entry=0xb192a0, pageno=30, write_ok=write_ok@entry=1 '\001', xid=xid@entry=0)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/slru.c:430
#1 0x0000000000520d2f in asyncQueueAddEntries (nextNotify=nextNotify@entry=0x29b60c8)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:1318
#2 0x000000000052187f in PreCommit_Notify ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/commands/async.c:869
#3 0x00000000004973d3 in CommitTransaction ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:1827
#4 0x0000000000497a8d in CommitTransactionCommand ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/access/transam/xact.c:2562
#5 0x0000000000649497 in finish_xact_command ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2452
#6 finish_xact_command ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:2441
#7 0x000000000064c875 in exec_simple_query (query_string=0x2a99d70 "notify foobar;")
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:1037
#8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x29b1df8, username=<optimized out>)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/tcop/postgres.c:3968
#9 0x000000000060e731 in BackendRun (port=0x2a14800)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3611
#10 BackendStartup (port=0x2a14800)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:3296
#11 ServerLoop ()
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1460
#12 0x000000000060f451 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x29b1170)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/postmaster/postmaster.c:1121
#13 0x0000000000464bc9 in main (argc=5, argv=0x29b1170)
at /home/martin/debian/psql/9.1/build-area/postgresql-9.1-9.1.4/build/../src/backend/main/main.c:199
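
As a cross-check, the byte offsets in the error messages can be related
to the pageno seen in the backtrace. The sketch below assumes the
default build settings, which are not stated in this report: an 8 kB
block size (BLCKSZ = 8192) and 32 pages per SLRU segment file. The
helper name slru_page_number is made up for illustration. Under those
assumptions, offset 245760 in pg_notify/0000 is page 30, which agrees
with pageno=30 in frame #0, and the failed write at offset 180224 was
to page 22.

```python
# Assumed PostgreSQL build defaults (not stated in the report).
BLCKSZ = 8192                 # bytes per page
SLRU_PAGES_PER_SEGMENT = 32   # pages per SLRU segment file


def slru_page_number(segment_name: str, offset: int) -> int:
    """Map an SLRU segment file name (hex) and a byte offset to a page number.

    Hypothetical helper for illustration; not part of PostgreSQL.
    """
    segno = int(segment_name, 16)
    return segno * SLRU_PAGES_PER_SEGMENT + offset // BLCKSZ


if __name__ == "__main__":
    print(slru_page_number("0000", 245760))  # failed read
    print(slru_page_number("0000", 180224))  # failed write (ENOSPC)
```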

Restarting the cluster seems to fix the condition in some cases, but
I've seen the error persist across restarts, or reappear after some
time even when the disk is not full. (That's also what the customer on
the live system is seeing.)

Christoph
--
cb(at)df7cb(dot)de | http://www.df7cb.de/

Attachment Content-Type Size
pg_notify_error.sh application/x-sh 3.1 KB
