Re: postgresql hanging/stuck

From: Andrzej Pilacik <cypisek77(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: postgresql hanging/stuck
Date: 2017-03-15 18:11:09
Message-ID: CAJw8uJR1YpPnOA6AS1=OUm9=hNP7me=9XqrB4EtvG0oi1zc5Aw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2017-03-14 16:47:40 -0400, Andrzej Pilacik wrote:
> Since then we had 2 incidents where PostgreSQL connections went to about
> 300 and PostgreSQL became unresponsive and restarted itself.

Could you give a bit more details about how that "retarted itself"
happened? Any chance you saw something about a "stuck spinlock" in the
logs?

Sorry, misworded. In order to get that instance back online we have to
kill -9 postgresql and let it go through recovery.
Tried killing other connection and removing the connections with
"authenticatin" state without any luck.

> We have monitors setup to page us when connections are over 250.
> PostgreSQL should still have about 250 connections available.
>
> When this occurred, we saw a lot of connections in the authentication
state
> (stuck)
>
> *They all looked like: postgres: username databasename ipaddress (pid)
> authentication*

The symptoms suggest you're hitting quite massive lock contention. But
that's just a guess, not yet enough information to nail that down. If
that's the case, a update to 9.6 might resolve the issue, we've done two
significant scalability improvements that could affect your case.

Will go to 9.6 but not in the next month or so.

> During the incident (about 10-15min timeline)
>
> - CPUs are running high but the box is still very responsible. 60-70%
>
> - Memory allocations are ok, no paging. Nothing looks out of norm.

Have you checked that you're not running into issues with transparent
hugepages? The symptoms don't quite match, but it's still useful to make
sure. I don't know which kernel / distribution you're running, but a
search for "disable transparent hugepages $distro" should help.

We have these on and will look into removing these.

> - We have our normal scenario engine running with about 30-40 active
> connections + some 5-10 active reporting and processing connections at a
> time (lots of quick/short queries)

Is there a chance you could provide a profile using perf? See
https://wiki.postgresql.org/wiki/Profiling_with_perf

We are in the process of creating a couple of scripts to collect baseline
Linux data so we can run them for a couple of days and then run them again
during the incident.
We don't really have time to troubleshoot if the incident occurs during the
day.

Greetings,

Andres Freund

On Tue, Mar 14, 2017 at 4:55 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
>
> On 2017-03-14 16:47:40 -0400, Andrzej Pilacik wrote:
> > Since then we had 2 incidents where PostgreSQL connections went to about
> > 300 and PostgreSQL became unresponsive and restarted itself.
>
> Could you give a bit more details about how that "retarted itself"
> happened? Any chance you saw something about a "stuck spinlock" in the
> logs?
>
>
> > We have monitors setup to page us when connections are over 250.
> > PostgreSQL should still have about 250 connections available.
> >
> > When this occurred, we saw a lot of connections in the authentication
> state
> > (stuck)
> >
> > *They all looked like: postgres: username databasename ipaddress (pid)
> > authentication*
>
> The symptoms suggest you're hitting quite massive lock contention. But
> that's just a guess, not yet enough information to nail that down. If
> that's the case, a update to 9.6 might resolve the issue, we've done two
> significant scalability improvements that could affect your case.
>
>
> > During the incident (about 10-15min timeline)
> >
> > - CPUs are running high but the box is still very responsible. 60-70%
> >
> > - Memory allocations are ok, no paging. Nothing looks out of norm.
>
> Have you checked that you're not running into issues with transparent
> hugepages? The symptoms don't quite match, but it's still useful to make
> sure. I don't know which kernel / distribution you're running, but a
> search for "disable transparent hugepages $distro" should help.
>
>
> > - We have our normal scenario engine running with about 30-40 active
> > connections + some 5-10 active reporting and processing connections at a
> > time (lots of quick/short queries)
>
> Is there a chance you could provide a profile using perf? See
> https://wiki.postgresql.org/wiki/Profiling_with_perf
>
>
> Greetings,
>
> Andres Freund
>

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2017-03-16 01:52:57 Re: Backend crash on non-exclusive backup cancel
Previous Message Wiler Coelho Jr. 2017-03-15 17:17:43 Re: Error floating-point exception on postgresql installer