NFS vs. PostgreSQL on Solaris

From: "Thomas F(dot) O'Connell" <tf(at)o(dot)ptimized(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: NFS vs. PostgreSQL on Solaris
Date: 2007-04-26 21:21:30
Message-ID: FE428C19-6868-41F7-A018-FE1C991EF1A7@o.ptimized.com
Lists: pgsql-general

I've got an interesting case study that I don't fully understand
from a postgres perspective, and I'm hoping someone in the community
can help me understand what led to the outcome and whether it's
easily preventable.

The setup: PostgreSQL 8.2.3 on Solaris 10 x86

postgresql.conf:
shared_buffers = 200000
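
For scale: shared_buffers is measured in 8 kB pages in a default
build, so that setting alone pins roughly 1.5 GB of shared memory. A
quick back-of-the-envelope check in Python, assuming the stock
8192-byte block size:

    # rough shared_buffers footprint, assuming the default 8 kB page size
    shared_buffers = 200000   # pages, from postgresql.conf
    BLCKSZ = 8192             # bytes per page in a default build
    print("%.2f GiB" % (shared_buffers * BLCKSZ / (1024.0 ** 3)))  # ~1.53 GiB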

Until now, the box had been completely dedicated to postgres. A new
project required that it also share a directory over NFS. In
preparation for this, nfsd had been running for quite some time,
although today was the first day that production mount points on
other boxes were added. The first external mount point was created
this morning, and all seemed quiet. The second external mount point
was added this afternoon, and it caused postgres great consternation.

Within a few seconds of enabling the second external NFS mount
point, postgres began reporting out of memory errors. An INSERT
generated the following:

TopMemoryContext: 118832 total in 10 blocks; 17512 free (24 chunks);
101320 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks);
336 used
Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks);
6392 used
Operator class cache: 8192 total in 1 blocks; 4872 free (0 chunks);
3320 used
MessageContext: 262144 total in 6 blocks; 75208 free (5 chunks);
186936 used
smgr relation table: 8192 total in 1 blocks; 1776 free (0 chunks);
6416 used
TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0
chunks); 16 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
ExecutorState: 147648 total in 7 blocks; 59816 free (11 chunks);
87832 used
HashTableContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
HashBatchContext: 25174100 total in 13 blocks; 7754600 free (16
chunks); 17419500 used
TupleSort: 4448280 total in 10 blocks; 1888 free (25 chunks); 4446392
used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Unique: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
AggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 2856 free (0 chunks); 5336 used
CacheMemoryContext: 659000 total in 19 blocks; 140248 free (1
chunks); 518752 used

This was followed by a number of statements about individual
relations and, eventually, this:

MdSmgr: 8192 total in 1 blocks; 7120 free (0 chunks); 1072 used
LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
Timezones: 48616 total in 2 blocks; 5968 free (0 chunks); 42648 used
ErrorContext: 8192 total in 1 blocks; 8176 free (4 chunks); 16 used
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):ERROR: out of memory
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):DETAIL: Failed on
request of size 257.
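
For reference, here's a quick way to tally the "used" figures from a
context dump like the one above (a minimal sketch in Python, assuming
each context line ends in "<N> used" as shown):

    # sum the "used" bytes reported by a PostgreSQL memory-context dump on stdin
    import re, sys

    total = 0
    for line in sys.stdin:
        m = re.search(r'(\d+) used\s*$', line)
        if m:
            total += int(m.group(1))
    print("%.1f MB used across contexts" % (total / (1024.0 * 1024.0)))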

The out-of-memory errors recurred intermittently over the next
several minutes, eventually escalating to FATAL:

2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):FATAL: out of memory
2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):DETAIL: Failed on
request of size 1864.

The FATAL error was preceded by a memory-context report similar to
the ones that accompanied the original ERRORs. I don't know what
distinguishes the ERROR "out of memory" from the FATAL version.

Then, eventually, without any evidence of anything else on the box
misbehaving, we saw this:

2007-04-26 14:53:05 CDT 15175 :LOG: server process (PID 18543) was
terminated by signal 11
2007-04-26 14:53:05 CDT 15175 :LOG: terminating any other active
server processes

So my questions are:

1. What aspect of postgres' memory usage would create an "out of
memory" condition?
2. What is the difference between an ERROR and FATAL "out of memory"
message?
3. What would cause postgres to die from a signal 11?

I've also got a core file if that's necessary for further forensics.

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005
