Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.

From: Chris Hanks <christopher(dot)m(dot)hanks(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
Date: 2018-03-26 18:43:09
Message-ID: CAK7KUdD8ibDSnZY-65TEZMgvU6A79Znkq990hAUbREJgheqReg@mail.gmail.com
Lists: pgsql-bugs

Hey, I realize it's a bit late, but did anybody ever find a root cause
for this? I removed the relevant index and haven't had an issue since.

Thanks,
Chris

On Thu, Oct 26, 2017 at 7:42 PM, Chris Hanks
<christopher(dot)m(dot)hanks(at)gmail(dot)com> wrote:
> I wound up working around the issue by forking the database and
> removing that GIN index, and things have been fine with the new
> instance for the past two days.
>
> I previously had two Postgres instances with hung processes, one 9.6.1
> and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
> was having issues with its container anyway) but I've kept the 9.6.1.
> I'm happy to give permission for you to access it, just let me know
> who to talk to, or have them reach out to me. I have an open issue in
> their tracker under my work email, chris(at)pathgather(dot)com(dot)
>
> Thanks again!
>
> On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I managed to get a couple of hours this afternoon to look at this.
>>
>> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
>> <christopher(dot)m(dot)hanks(at)gmail(dot)com> wrote:
>>> I'm not sure if you guys had noticed this already, but just in case,
>>> those two hung connections weren't making the inserts at exactly the
>>> same time - if you look at pg_stat_activity they executed about a day
>>> apart (since Heroku cycles the clients every 24 hours or so). And
>>> before I restarted the first DB that experienced the problem, there
>>> were ~30 backends built up over the course of a month. It seems like
>>> when one INSERT sticks, every following INSERT just stacks up on top
>>> of it, trying to take out the same lock.
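
For anyone who wants to confirm this pile-up on a 9.6 instance: each stacked INSERT is visible in pg_stat_activity waiting on the buffer_content tranche. A minimal sketch of such a check, assuming 9.6's wait-event columns (on PostgreSQL 10 and later the same wait is reported under wait_event_type 'LWLock' rather than 'LWLockTranche'):

    SELECT pid,
           now() - query_start AS stuck_for,
           wait_event_type,
           wait_event,
           left(query, 60) AS query
    FROM   pg_stat_activity
    WHERE  wait_event_type = 'LWLockTranche'
      AND  wait_event = 'buffer_content'
    ORDER  BY query_start;

Backends sitting in that state for hours or days, all running the same INSERT, match the behaviour described above.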
>>
>> Right. In both backtraces, we see an inserter stuck trying to acquire
>> an exclusive buffer lock on the buffer containing block 0, the
>> metapage block (GIN_METAPAGE_BLKNO). There is probably some
>> session/backend that has acquired two buffer locks in an order that
>> is inconsistent with these inserters, which means you get an
>> undetectable deadlock. (The only alternative to that theory is that
>> some backend is sitting on the metapage buffer lock for some other
>> reason, but that seems much less likely.)
>>
>> The interesting question to my mind is: which backend is the one
>> acquiring buffer locks in an order incompatible with these inserters,
>> resulting in this undetectable deadlock? What is it doing? (I worked
>> for Heroku until quite recently; I may be able to open a back
>> channel, with Chris' permission.)
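
A note on why this deadlock goes undetected: buffer content locks are lightweight locks (LWLocks), and PostgreSQL's deadlock detector only examines heavyweight locks, so a lock-ordering cycle on two buffers hangs forever instead of being reported. The sketch below only illustrates that hazard in general; it is not claimed to be the actual code path at fault. It assumes backend-internal headers, and session_a()/session_b() are hypothetical functions operating on the same GIN index:

    #include "postgres.h"

    #include "access/gin_private.h"   /* GIN_METAPAGE_BLKNO */
    #include "storage/bufmgr.h"

    /*
     * Two code paths that lock the same pair of buffers in opposite
     * orders.  If they interleave between their LockBuffer() calls,
     * each waits forever on the buffer the other holds, and nothing is
     * ever reported, because buffer content locks are not covered by
     * the heavyweight-lock deadlock detector.
     */
    static void
    session_a(Relation ginindex, BlockNumber leafblk)
    {
        Buffer  meta = ReadBuffer(ginindex, GIN_METAPAGE_BLKNO);
        Buffer  leaf = ReadBuffer(ginindex, leafblk);

        LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);   /* metapage first ... */
        LockBuffer(leaf, BUFFER_LOCK_EXCLUSIVE);   /* ... then the leaf  */

        UnlockReleaseBuffer(leaf);
        UnlockReleaseBuffer(meta);
    }

    static void
    session_b(Relation ginindex, BlockNumber leafblk)
    {
        Buffer  meta = ReadBuffer(ginindex, GIN_METAPAGE_BLKNO);
        Buffer  leaf = ReadBuffer(ginindex, leafblk);

        LockBuffer(leaf, BUFFER_LOCK_EXCLUSIVE);   /* leaf first ...        */
        LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);   /* ... then the metapage */

        UnlockReleaseBuffer(meta);
        UnlockReleaseBuffer(leaf);
    }

That picture is consistent with the backtraces in this thread: an inserter blocked on the metapage buffer, with no error or deadlock report ever produced.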
>>
>> I remember expressing befuddlement about commit e95680832854c over a
>> year ago, and never satisfying myself that it was correct [1]. I'm no
>> expert on GIN, so I dropped it. It feels like that might be relevant
>> here, since you seem to be using this GIN index with a queue table.
>> That usage pattern is one where entire leaf pages in indexes tend to
>> be routinely deleted and later recycled by VACUUM, at least with
>> B-Trees [2]. In general, though, I think B-Tree (and presumably GIN)
>> page deletion is fairly rare, since the entire page must be empty
>> before it can happen.
>>
>> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
>> call to the ginInsertCleanup() insert path, while also adding a
>> LockPage() call to the VACUUM path. In case you missed it, those are
>> *heavyweight* page lock acquisitions, not buffer lock acquisitions,
>> which is pretty unconventional (I thought only hash still did that).
>> Frankly, the e2c79e14 fix seems kind of bolted on (though I don't
>> want to focus on that aspect right now).
>>
>> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
>> [2] https://brandur.org/postgres-queues
>> --
>> Peter Geoghegan
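
One practical note on the heavyweight page locks mentioned in the quoted message: unlike buffer content locks, the locks taken by ConditionalLockPage()/LockPage() do show up in pg_locks (locktype 'page'), so they can at least be observed from SQL while a pending-list cleanup or VACUUM is running. A sketch of such a check (page 0 of the index is the GIN metapage):

    SELECT l.pid, c.relname, l.page, l.mode, l.granted
    FROM   pg_locks AS l
    JOIN   pg_class AS c ON c.oid = l.relation
    WHERE  l.locktype = 'page'
    ORDER  BY c.relname, l.page;

This usually returns no rows, since page-level heavyweight locks are rare and normally held only briefly; a long-lived or ungranted entry on the affected index would be worth correlating with the hung backends.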
