On Mon, Mar 30, 2009 at 1:50 PM, <dan(at)sidhe(dot)org> wrote:
> I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really
> strange, annoying transient problem with one particular query stalling.
> The symptom here is that when this query is made with X or more records in
> a temp table involved in the join (where X is constant when the problem
> manifests, but is different between manifestations) the query takes about
> 20 minutes. When the query is made with X-1 records it takes less than a
> second. It's just this one query -- for everything else the system's nice
> and snappy. The machine load's never above 9 (it's a 32 CPU box) and
> hasn't had less than 60G of free system memory on it.
> An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even
> more bizarre. Not only are the two query plans identical (save trivial
> differences because of the record count differences) but the EXPLAIN
> ANALYZE total runtimes are near-identical -- the fast case showed
> 259ms, the slow case 265ms.
> When the slow query was actually run, though, it took 20 minutes. There
> were no blocked back ends shown in pg_stat_activity, and the back end
> itself was definitely moving. I trussed the back end stuck running the
> slow query and it spent nearly all its time doing kread() and kwrite()
> calls. The DB log didn't show anything interesting in it. I checked to
> make sure the SQL statement that was EXPLAIN ANALYZEd was the one actually
> executed, and I pulled the client code into the debugger and
> single-stepped through just to make sure it was getting stuck on that one
> SQL statement and it wasn't the client doing something unexpected.
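(As an aside, for anyone following along: on 8.3 a lock-wait check against pg_stat_activity looks something like the sketch below -- procpid, waiting, and current_query are the 8.3-era column names. This is illustrative, not taken from the original report.)

```
-- Sketch: list backends currently blocked waiting on a lock (8.3 columns)
SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE waiting;
```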
> Just to be even more annoying, the problem goes away. Right now I can't
> trigger the problem. Last Friday it happened reliably feeding 58 records
> into this query. The week before it was 38 records. Today? Nothing, the
> system's full of snappy.
> At the moment I'm at a loss as to what's going wrong, and the fact that I
> can't duplicate it right now when I actually have time to look into the
> problem's damned annoying. What I'm looking for are some hints as to where
> to look the next time it does happen, or things I can flip on to catch
> more in the log. (The logging parameters are all set at their defaults)
> I'm going to try attaching with dbx to get a stack trace (I assume this is
> relatively safe?) but past that I'm kind of stumped.
log_min_duration_statement is a good place to start, but it sounds
like the query plan you're getting when you test it by hand isn't the
same one that's actually executing, so I'm not sure how far that's
going to get you. contrib/auto_explain sounds like it would be just
the thing, but unfortunately that's an 8.4 feature which hasn't been
released yet. I'm not sure whether it could be built and run against 8.3.
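In the meantime, turning on duration logging in postgresql.conf would at least confirm, the next time it happens, which statement is eating the 20 minutes. A sketch (the 1s threshold is just a guess at a sensible cutoff for your workload):

```
# postgresql.conf -- log any statement that runs longer than 1 second
log_min_duration_statement = 1000
# prefix log lines with timestamp and backend pid, to correlate with truss/dbx
log_line_prefix = '%t [%p] '
```

Both can be changed with a reload rather than a restart, so there's no downtime in flipping them on while you wait for the problem to recur.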