Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.

From: Chris Hanks <christopher(dot)m(dot)hanks(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
Date: 2018-03-26 18:43:09
Message-ID: CAK7KUdD8ibDSnZY-65TEZMgvU6A79Znkq990hAUbREJgheqReg@mail.gmail.com
Lists: pgsql-bugs

Hey, I realize it's a bit late, but did anybody ever find a root cause
for this? I removed the relevant index and haven't had an issue since.

Thanks,
Chris

On Thu, Oct 26, 2017 at 7:42 PM, Chris Hanks
<christopher(dot)m(dot)hanks(at)gmail(dot)com> wrote:
> I wound up working around the issue by forking the database and
> removing that GIN index, and things have been fine with the new
> instance for the past two days.
>
> I previously had two Postgres instances with hung processes, one 9.6.1
> and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
> was having issues with its container anyway) but I've kept the 9.6.1.
> I'm happy to give permission for you to access it, just let me know
> who to talk to, or have them reach out to me. I have an open issue in
> their tracker under my work email, chris(at)pathgather(dot)com(dot)
>
> Thanks again!
>
> On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I managed to get a couple of hours this afternoon to look at this.
>>
>> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
>> <christopher(dot)m(dot)hanks(at)gmail(dot)com> wrote:
>>> I'm not sure if you guys had noticed this already, but just in case,
>>> those two hung connections weren't making the inserts at exactly the
>>> same time - if you look at pg_stat_activity they executed about a day
>>> apart (since Heroku cycles the clients every 24 hours or so). And
>>> before I restarted the first DB that experienced the problem, there
>>> were ~30 backends built up over the course of a month. It seems like
>>> when one INSERT sticks, every following INSERT just stacks up on top
>>> of it, trying to take out the same lock.
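
For anyone who wants to confirm this pile-up on a 9.6 instance: each stacked INSERT is visible in pg_stat_activity waiting on the buffer_content tranche. A minimal sketch of such a check, assuming 9.6's wait-event columns (on PostgreSQL 10 and later the same wait is reported under wait_event_type 'LWLock' rather than 'LWLockTranche'):

    SELECT pid,
           now() - query_start AS stuck_for,
           wait_event_type,
           wait_event,
           left(query, 60) AS query
    FROM   pg_stat_activity
    WHERE  wait_event_type = 'LWLockTranche'
      AND  wait_event = 'buffer_content'
    ORDER  BY query_start;

Backends sitting in that state for hours or days, all running the same INSERT, match the behaviour described above.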
>>
>> Right. In both backtraces, we see an inserter stuck trying to acquire
>> an exclusive buffer lock on the buffer containing block 0, the
>> metapage block (GIN_METAPAGE_BLKNO). There is probably some
>> session/backend that has acquired two buffer locks in an order that
>> is inconsistent with these inserters, which means you get an
>> undetectable deadlock. (The only alternative to that theory is that
>> some backend is sitting on the metapage buffer lock for some other
>> reason, but that seems much less likely.)
>>
>> The interesting question to my mind is: which backend is the one
>> acquiring buffer locks in an order incompatible with these inserters,
>> resulting in this undetectable deadlock? What is it doing? (I worked
>> for Heroku until quite recently; I may be able to open a back
>> channel, with Chris' permission.)
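
A note on why this deadlock goes undetected: buffer content locks are lightweight locks (LWLocks), and PostgreSQL's deadlock detector only examines heavyweight locks, so a lock-ordering cycle on two buffers hangs forever instead of being reported. The sketch below only illustrates that hazard in general; it is not claimed to be the actual code path at fault. It assumes backend-internal headers, and session_a()/session_b() are hypothetical functions operating on the same GIN index:

    #include "postgres.h"

    #include "access/gin_private.h"   /* GIN_METAPAGE_BLKNO */
    #include "storage/bufmgr.h"

    /*
     * Two code paths that lock the same pair of buffers in opposite
     * orders.  If they interleave between their LockBuffer() calls,
     * each waits forever on the buffer the other holds, and nothing is
     * ever reported, because buffer content locks are not covered by
     * the heavyweight-lock deadlock detector.
     */
    static void
    session_a(Relation ginindex, BlockNumber leafblk)
    {
        Buffer  meta = ReadBuffer(ginindex, GIN_METAPAGE_BLKNO);
        Buffer  leaf = ReadBuffer(ginindex, leafblk);

        LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);   /* metapage first ... */
        LockBuffer(leaf, BUFFER_LOCK_EXCLUSIVE);   /* ... then the leaf  */

        UnlockReleaseBuffer(leaf);
        UnlockReleaseBuffer(meta);
    }

    static void
    session_b(Relation ginindex, BlockNumber leafblk)
    {
        Buffer  meta = ReadBuffer(ginindex, GIN_METAPAGE_BLKNO);
        Buffer  leaf = ReadBuffer(ginindex, leafblk);

        LockBuffer(leaf, BUFFER_LOCK_EXCLUSIVE);   /* leaf first ...        */
        LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);   /* ... then the metapage */

        UnlockReleaseBuffer(meta);
        UnlockReleaseBuffer(leaf);
    }

That picture is consistent with the backtraces in this thread: an inserter blocked on the metapage buffer, with no error or deadlock report ever produced.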
>>
>> I remember expressing befuddlement about commit e95680832854c over a
>> year ago, and never satisfying myself that it was correct [1]. I'm no
>> expert on GIN, so I dropped it. It feels like that might be relevant
>> here, since you seem to be using this GIN index with a queue table.
>> That usage pattern is one where entire leaf pages in indexes tend to
>> be routinely deleted and later recycled by VACUUM, at least with
>> B-Trees [2]. In general, though, I think B-Tree (and presumably GIN)
>> page deletion is fairly rare, since the entire page must be empty
>> before it can happen.
>>
>> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
>> call to the ginInsertCleanup() insert path, while also adding a
>> LockPage() call to the VACUUM path. In case you missed it, those are
>> *heavyweight* page lock acquisitions, not buffer lock acquisitions,
>> which is pretty unconventional (I thought only hash still did that).
>> Frankly, the e2c79e14 fix seems kind of bolted on (though I don't
>> want to focus on that aspect right now).
>>
>> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
>> [2] https://brandur.org/postgres-queues
>> --
>> Peter Geoghegan
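
One practical note on the heavyweight page locks mentioned in the quoted message: unlike buffer content locks, the locks taken by ConditionalLockPage()/LockPage() do show up in pg_locks (locktype 'page'), so they can at least be observed from SQL while a pending-list cleanup or VACUUM is running. A sketch of such a check (page 0 of the index is the GIN metapage):

    SELECT l.pid, c.relname, l.page, l.mode, l.granted
    FROM   pg_locks AS l
    JOIN   pg_class AS c ON c.oid = l.relation
    WHERE  l.locktype = 'page'
    ORDER  BY c.relname, l.page;

This usually returns no rows, since page-level heavyweight locks are rare and normally held only briefly; a long-lived or ungranted entry on the affected index would be worth correlating with the hung backends.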
