Re: Notify system doesn't recover from "No space" error

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christoph Berg <christoph(dot)berg(at)credativ(dot)de>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Joachim Wieland <joe(at)mcknight(dot)de>
Subject: Re: Notify system doesn't recover from "No space" error
Date: 2012-06-29 01:35:51
Message-ID: 8308.1340933751@sss.pgh.pa.us
Lists: pgsql-hackers

Christoph Berg <christoph(dot)berg(at)credativ(dot)de> writes:
> Warming up an old thread here - we ran into the same problem.

Thanks for the example proving it is repeatable.

I dug into this a bit and found the problem. When a page of pg_notify
is filled, asyncQueueAddEntries() advances the global QUEUE_HEAD to
point to the next page before it calls SimpleLruZeroPage() to make that
page exist in slru.c. But SimpleLruZeroPage could fail --- even though
it doesn't attempt to write out the new page, it might need to dump
some existing dirty page from shared memory before it can allocate a
slot to the new page. If it does fail, then QUEUE_HEAD is pointing
at a page number that doesn't exist either on disk or in slru.c buffers.
Subsequent attempts to execute asyncQueueAddEntries() will fail because
SimpleLruReadPage() says "page? what page?". And there is no place else
in async.c that would try to create the missing page. This error will
persist until the server is restarted (thus resetting the state of
pg_notify), even if the underlying disk-full condition is cleared.
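
To make the failure mode concrete, here is a toy model of the ordering described above. It is only a sketch and not the async.c code: toy_zero_page(), toy_read_page(), buggy_fill_page(), and the disk_full flag are hypothetical stand-ins, and failures are modeled as a false return value, whereas the real functions report an error.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PAGES 8

/* Toy state, not the real async.c/slru.c structures. */
bool page_exists[MAX_PAGES] = { true };  /* only page 0 exists at start */
int  queue_head_page = 0;                /* plays the role of QUEUE_HEAD */
bool disk_full = false;                  /* simulated out-of-space condition */

/* Hypothetical stand-in for SimpleLruZeroPage(): creating a page can fail. */
static bool toy_zero_page(int page)
{
    if (disk_full)
        return false;
    page_exists[page] = true;
    return true;
}

/* Hypothetical stand-in for SimpleLruReadPage(): fails on a missing page. */
static bool toy_read_page(int page)
{
    return page_exists[page];
}

/*
 * Problematic ordering: the shared head is advanced before the new page
 * is known to exist. Models an insert that exactly fills the current page.
 */
bool buggy_fill_page(void)
{
    if (!toy_read_page(queue_head_page))
        return false;                    /* "page? what page?" */

    assert(queue_head_page + 1 < MAX_PAGES);  /* toy model does not wrap */
    queue_head_page++;                   /* shared state changed first */
    return toy_zero_page(queue_head_page);    /* may fail: head now dangles */
}
```

In this model, one call to buggy_fill_page() while disk_full is set leaves queue_head_page at 1 with no page 1 in existence. Every later call then fails at the read step, even after disk_full is cleared, because nothing else creates the missing page.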

The solution is to make sure we don't advance QUEUE_HEAD into the new
page until SimpleLruZeroPage has created it. There are a couple of ways
the code could be fixed, but what I chose to do in the attached proposed
patch is to make asyncQueueAddEntries work with a local queue_head
pointer, which is only copied back to shared memory on successful exit.
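
The local-pointer approach can be sketched with a toy model as follows. This is an illustration of the idea only, not the contents of the attached patch: toy_zero_page(), fixed_fill_page(), shared_head_page, and the disk_full flag are hypothetical names, and failure is modeled as a false return value rather than a reported error.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PAGES 8

/* Toy state, not the real async.c/slru.c structures. */
bool page_exists[MAX_PAGES] = { true };  /* only page 0 exists at start */
int  shared_head_page = 0;               /* plays the role of QUEUE_HEAD */
bool disk_full = false;                  /* simulated out-of-space condition */

/* Hypothetical stand-in for SimpleLruZeroPage(): creating a page can fail. */
static bool toy_zero_page(int page)
{
    if (disk_full)
        return false;
    page_exists[page] = true;
    return true;
}

/*
 * Safer ordering: advance a local copy of the head, create the page,
 * and copy the local value back to shared state only on success.
 * Models an insert that exactly fills the current page.
 */
bool fixed_fill_page(void)
{
    int local_head = shared_head_page;   /* work on a private copy */

    if (!page_exists[local_head])
        return false;

    assert(local_head + 1 < MAX_PAGES);  /* toy model does not wrap */
    local_head++;
    if (!toy_zero_page(local_head))
        return false;                    /* shared head is untouched */

    shared_head_page = local_head;       /* publish only after success */
    return true;
}
```

In this model, a call made while disk_full is set fails but leaves shared_head_page at 0, so a retry after the condition clears succeeds and moves the head to page 1.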

Comments, objections?

regards, tom lane

Attachment Content-Type Size
notify-disk-full-fix.patch text/x-patch 3.9 KB
