Re: hung backends stuck in spinlock heavy endless loop

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-14 15:30:23
Message-ID: 20150114153023.GS5245@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-01-14 09:22:45 -0600, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
> >> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> >> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> >> What are the autovac processes doing (according to pg_stat_activity)?
> >>
> >> > pid,running,waiting,query
> >> > 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.pg_class
> >
> > It'd be interesting to know whether that vacuum gets very frequent
> > semaphore wakeups. Could you strace it for a second or three?
>
> for 30 seconds+ it just looks like this:
> mmoncure(at)mernix2 ~ $ sudo strace -p 7105
> Process 7105 attached
> semop(5701638, {{4, -1, 0}}, 1

Ok. So that explains why it's not interruptible.

> all of other processes are yielding out of the spinlock, for example:
> select(0, NULL, NULL, NULL, {0, 1408}) = 0 (Timeout)

Note the above isn't the spinlock, it's the process's semaphore. It'll
only get set if the refcount ever indicates that nobody but autovac is
holding the lock.

> > How did this perform < 9.4?
> this is a new project. However, I can run it vs earlier version.
>
> Can you guess how many times these dynamic
> > statements are planned? How many different relations are accessed in the
> > dynamically planned queries?
>
> only once or twice, and only a couple of tables.

Hm. Odd. The first -g profile seemed to indicate a hell of a lot time
was spent in LWLockRelease() - indicating that there's actually
progress. Later profiles/backtraces were less clear.

If you gdb in, and type 'fin' a couple times, to wait till the function
finishes, is there actually any progress? I'm wondering whether it's
just many catalog accesses + contention, or some other
problem. Alternatively set a breakpoint on ScanPgRelation() or so and
see how often it's hit.

> I can send the code off-list if you guys think it'd help.

Might be interesting.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-01-14 15:30:55 Re: [PATCH] explain sortorder
Previous Message Timmer, Marius 2015-01-14 15:26:02 Re: [PATCH] explain sortorder