Re: emergency outage requiring database restart

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Oskari Saarenmaa <os(at)ohmu(dot)fi>, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-11-01 13:56:45
Message-ID: 17649.1478008605@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os(at)ohmu(dot)fi> wrote:
>> Your production system's postgres backends probably have a lot more open
>> files associated with them than the simple test case does. Since Postgres
>> likes to keep files open as long as possible and only closes them when you
>> need to free up fds to open new files, it's possible that your production
>> backends have almost all allowed fds used when you execute your pl/sh
>> function.
>>
>> If that's the case, the sqsh process that's executed may not have enough fds
>> to do what it wanted to do and because of busted error handling could end up
>> writing to fds that were opened by Postgres and point to $PGDATA files.

> Does that apply? the mechanics are a sqsh function that basically does:
> cat foo.sql | sqsh <args>
> pipe redirection opens a new process, right?

Yeah, but I doubt that either level of the shell would attempt to close
inherited file handles.

The real problem with Oskari's theory is that it requires not merely
busted, but positively brain-dead error handling in the shell and/or
sqsh, ie ignoring open() failures altogether. That seems kind of
unlikely. Still, I suspect he might be onto something --- there must
be some reason you can reproduce the issue in production and not in
your test bed, and number-of-open-files is as good a theory as I've
heard.

Maybe the issue is not with open() failures, but with the resulting
FD numbers being much larger than sqsh is expecting. It would be
weird to try to store an FD in something narrower than int, but
I could see a use of select() being unprepared for large FDs.
Still, it's hard to translate that idea into scribbling on the
wrong file...

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-11-01 14:05:45 Re: emergency outage requiring database restart
Previous Message Tom Lane 2016-11-01 13:46:40 Re: Improve output of BitmapAnd EXPLAIN ANALYZE