Re: Connections hang indefinitely while taking a LWLockTranche buffer_content lock.

From: Chris Hanks <christopher(dot)m(dot)hanks(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
Date: 2017-10-23 17:39:26
Message-ID: CAK7KUdAzTD53BDL89CesrCYc2hj5+mKmRLyB4yrxxhRPUNu0QQ@mail.gmail.com
Lists: pgsql-bugs

I'm not sure if you guys had noticed this already, but just in case:
those two hung connections weren't issuing their INSERTs at exactly the
same time - if you look at pg_stat_activity, they executed about a day
apart (since Heroku cycles the clients every 24 hours or so). And
before I restarted the first DB that experienced the problem, there
were ~30 backends built up over the course of a month. It seems like
once one INSERT sticks, every following INSERT just stacks up on top
of it, trying to take the same lock.
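
For what it's worth, something like this against pg_stat_activity
should show the pile-up (9.6 column names; the exact wait_event strings
are from memory, so treat them as approximate):

```
SELECT pid,
       query_start,
       state,
       wait_event_type,
       wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLockTranche'
  AND wait_event = 'buffer_content'
ORDER BY query_start;
```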

This isn't to say that a race condition didn't cause the issue in the
first place (it may have, given how difficult it is to reproduce), but
if there was a race, the other participant might have been a read using
the index or something like that, rather than a second INSERT. From
my knowledge of the usage patterns for the app, I believe that only a
single INSERT should have been taking place at that time - the INSERT
only happens when a worker process starts up, there was only one
worker process for the app, and the worker processes were only
restarted every 24 hours. Also, I'd expect this table to have been empty
when the INSERT took place (it's an unlogged table that tracks which
connections belong to clients that are ready to pull from a queue).
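
For context, here's a simplified sketch of that table, written from
memory - the real definition has more columns, and I'm including the
GIN index on queues since that's the index I mean:

```
-- Simplified from memory; the real table has additional columns.
CREATE UNLOGGED TABLE public.que_lockers (
  pid          integer NOT NULL PRIMARY KEY,
  worker_count integer NOT NULL,
  listening    boolean NOT NULL,
  queues       text[]  NOT NULL
);

-- The GIN index that the trigger's @> predicate would use.
CREATE INDEX que_lockers_queues_idx
  ON public.que_lockers USING gin (queues);
```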

Also, the only time the index would be used, to my knowledge, is in a
trigger on another table when a row is inserted. It tries to
pseudorandomly pick a client that's ready for work:

```
-- Inside the trigger: pick one locker "slot" based on the new row's id.
SELECT pid
INTO locker_pid
FROM (
  -- count = total number of slots (last row_number + 1)
  SELECT *, last_value(row_number) OVER () + 1 AS count
  FROM (
    -- number the shuffled slots 0..n-1
    SELECT *, row_number() OVER () - 1 AS row_number
    FROM (
      -- one row per (locker, worker) slot, in a pseudorandom but stable order
      SELECT *
      FROM public.que_lockers ql, generate_series(1, ql.worker_count) AS id
      WHERE listening AND queues @> ARRAY[NEW.queue]
      ORDER BY md5(pid::text || id::text)
    ) t1
  ) t2
) t3
WHERE NEW.id % count = row_number;
```
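
To spell out what it's doing: the cross join produces one row per
(locker, worker) slot, the ORDER BY md5(...) shuffles those slots
deterministically, and NEW.id modulo the slot count picks exactly one
of them. Here's a self-contained toy version of the same logic
(made-up pids and worker counts, and count(*) OVER () in place of the
last_value trick):

```
-- Toy, standalone version of the slot-picking logic; the values are made up.
WITH lockers(pid, worker_count) AS (
  VALUES (101, 2), (102, 3)                          -- two lockers, five slots total
), slots AS (
  SELECT pid, id,
         row_number() OVER () - 1 AS row_number,     -- slots numbered 0..n-1
         count(*) OVER () AS count                   -- n, the total slot count
  FROM (
    SELECT l.pid, s.id
    FROM lockers l, generate_series(1, l.worker_count) AS s(id)
    ORDER BY md5(l.pid::text || s.id::text)          -- deterministic shuffle
  ) shuffled
)
SELECT pid FROM slots WHERE 42 % count = row_number; -- 42 stands in for NEW.id
```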

Chris

On Mon, Oct 23, 2017 at 12:55 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Mon, Oct 23, 2017 at 9:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> This is probably enough info for us to identify the exact cause. But
>> I don't have time to look right this minute, and am not the hacker most
>> familiar with the GIN infrastructure anyway. Anyone want to look for
>> the bug?
>
> I don't know very much about GIN, but this does look interesting. I'll
> try to get to it later in the week; I have meetings over the next
> couple of days.
>
> --
> Peter Geoghegan
