Re: Backends stalled in 'startup' state: index corruption

From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Sabino Mullane <greg(at)endpoint(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Backends stalled in 'startup' state: index corruption
Date: 2012-05-26 00:32:05
Message-ID: 9CD57B4F-B1AB-4F1B-BC7A-8C6A86B0C509@pgexperts.com
Lists: pgsql-hackers


On May 25, 2012, at 4:02 PM, Tom Lane wrote:

> Greg Sabino Mullane <greg(at)endpoint(dot)com> writes:
>>> Yeah, this is proof that what it was doing is the same as what we saw in
>>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>>> hard way via seqscans on the core catalogs. So the question to be
>>> answered is why that's suddenly a big performance bottleneck. It's not
>>> a cheap operation of course (that's why we cache the results ;-)) but
>>> it shouldn't take minutes either. And, because they are seqscans, it
>>> doesn't seem like messed-up indexes should matter.
>
>> FWIW, this appeared to be an all-or-nothing event: either every new backend
>> was suffering through this, or none were. They all seemed to clear up
>> at the same time as well.
>
> Mostly not surprising. They'd definitely all hit the missing init file
> at the same time, so the stalling would start consistently for all. And
> once any one process successfully created a new file, subsequent incoming
> sessions wouldn't stall. However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together. Unless there is some additional
> effect that's syncing them all. (I wonder for instance if the syncscan
> logic is kicking in here.)

In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so perhaps it's not the syncscan logic?

BTW, in our case, I thought I'd take the system down to single-user mode and reindex these. With the system indexes disabled, I immediately experienced the slow startup, so it certainly seems like an issue with seq scanning these catalogs.
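For the record, the single-user-mode reindex I was describing looks roughly like this (the data directory path and database name here are placeholders; -P tells the standalone backend to ignore the system indexes while reading the catalogs):

```
$ postgres --single -P -D /path/to/data mydb
backend> REINDEX SYSTEM mydb;
```

It was while running with -P, before the reindex completed, that I saw the same slow startup behavior.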

I'll see if I can reproduce that behavior by starting up with system indexes disabled. This probably won't happen until Tuesday, when we get that data directory moved to a test server.

In our customer's case, it would happen for a while, then stop happening for some time (presumably after the caching), then it would start up again (presumably after something invalidated the cache).

Switching from the master to the streaming replica made the situation better, but didn't make it go away.

Then a full initdb solved the problem. I bet a VACUUM FULL of pg_attribute would've done the trick, though.

> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table. The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog. Does that sound like something
> either of you might have done? You could check it by looking at the
> ctid columns of the pg_attribute rows for system catalogs, and seeing
> if any have large block numbers.
>

That definitely wasn't done by me, and I'm pretty sure the customer wouldn't have done it either.
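For anyone wanting to run the check Tom describes, a query along these lines should work (16384 is FirstNormalObjectId, so the filter restricts the check to system catalogs; ctid is the row's physical (block, offset) location, so large block numbers would indicate rows that have migrated deep into the table):

```sql
-- Show the highest-block pg_attribute rows belonging to system catalogs.
SELECT ctid, attrelid::regclass, attname
FROM pg_attribute
WHERE attrelid < 16384
ORDER BY ctid DESC
LIMIT 20;
```

On a healthy cluster these rows should all sit in low-numbered blocks near the start of the table.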

---
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/
