Re: 9.1.3 backends getting stuck in 'startup'

From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 20:40:59
Message-ID: 4FBE9CDB.3030009@pgexperts.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 05/24/12 12:21, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
>> get stuck in 'startup' mode. By that I mean the set_ps_output mode. Postgres
>> is installed via Martin Pitt's packages.
> I took another look at your original report here, and noticed something
> that had escaped me the first time: the backtrace shows that what that
> backend is doing is loading critical relcache entries the hard way, that
> is, via seqscans on the system catalogs. This happens when the relcache
> cache file (pg_internal.init) is missing or stale. A backend that has
> to go through that will write a new pg_internal.init file when it's
> done; but if many sessions are connecting at the same time, they'll each
> have to do it until someone succeeds in creating a new cache file.
>
> Another piece of evidence is here:
>
>> #5 0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
>> forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
>> out>) at bufmgr.c:246
> It's possible that gdb is lying to us, but it sure looks like this read
> request is for block 201362 of pg_attribute (I know it's pg_attribute
> from the location of the heap_getnext call). That's 1.5Gb into the
> relation, and who knows if it's anywhere near the end yet.
>
> So in other words, we've got a boatload of backends all seqscanning a
> pretty darn large relation, and doing so multiple times. That's
> probably enough to explain why it took a long time to finish
> initializing; the locking business is likely a red herring.
>
> How big is pg_attribute normally in this database? I'm suspicious that
> what triggered this is some major bloating of pg_attribute (and maybe
> some of the other catalogs too). Once you had that, relcache cache file
> invalidations are a fairly routine occurrence --- I think an autovacuum
> on one of the system catalogs would trigger that, for example.
>
> regards, tom lane

Interesting! Fortunately, that data directory is still lying around even
though we initdb'd to get a working system.

So, the current working system's pg_attribute looks like:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1671 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
2439 MB

and on the old broken system:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1858 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1907 MB
(1 row)

So, interestingly, they're both quite large, but the old broken system is
quite a bit larger. Any other data points be helpful?

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2012-05-24 22:13:06 Re: 9.1.3 backends getting stuck in 'startup'
Previous Message Tom Lane 2012-05-24 19:36:39 Re: BUG #6664: Postgres server process does not come up in foreground and parent process is set to init ( PID 1 )