Re: GIN data corruption bug(s) in 9.6devel

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN data corruption bug(s) in 9.6devel
Date: 2016-04-08 00:53:54
Message-ID: CAMkU=1wZXyXrVg9BhJwwhZZ-fZr_i94fvLDvBepn0ceKLFGMmw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> To summarize the behavior change:
>
>> In the released code, an inserting backend that violates the pending
>> list limit will try to clean the list, even if it is already being
>> cleaned. It won't accomplish anything useful, but will go through the
>> motions until eventually it runs into a page the lead cleaner has
>> deleted, at which point it realizes there is another cleaner and it
>> stops. This acts as a natural throttle to how fast insertions can
>> take place into an over-sized pending list.
>
> Right.
>
>> The proposed change removes that throttle, so that inserters will
>> immediately see there is already a cleaner and just go back about
>> their business. Due to that, unthrottled backends could add to the
>> pending list faster than the cleaner can clean it, leading to
>> unbounded growth in the pending list, which could leave a user
>> backend apparently unresponsive to the user, indefinitely. That is
>> scary to backpatch.
>
> It's scary to put into HEAD, too. What if we simply don't take
> that specific behavioral change? It doesn't seem like this is an
> essential part of fixing the bug as you described it. (Though I've
> not read the patch, so maybe I'm just missing the connection.)
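
To make the current behavior concrete, the flow is roughly the
following (pseudo-C with hypothetical helper names; the real logic
lives in ginfast.c's ginInsertCleanup):

    /*
     * Sketch of today's "helping" behavior: every backend that trips
     * the pending-list limit walks the list itself, moving tuples
     * into the main tree, until it reads a page the lead cleaner has
     * already deleted.
     */
    blkno = metadata->head;
    while (blkno != InvalidBlockNumber)
    {
        Buffer  buffer = ReadBuffer(index, blkno);
        Page    page = BufferGetPage(buffer);

        if (GinPageIsDeleted(page))
        {
            /* the lead cleaner got here first; stop helping */
            ReleaseBuffer(buffer);
            break;
        }

        move_tuples_to_main_tree(index, page);  /* hypothetical */
        blkno = GinPageGetOpaque(page)->rightlink;
        ReleaseBuffer(buffer);
    }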

There are only three fundamental options I see: the cleaner can
wait, "help", or move on.

"Helping" is what it does now and is dangerous.

Moving on gives the above-discussed unthrottling problem.

Waiting has two problems. The act of waiting will cause autovacuums
to be canceled, unless ugly hacks are deployed to prevent that. And
if we deploy those ugly hacks, then we have the problem that a user
backend will end up waiting on an autovacuum to finish the cleaning
while the autovacuum takes its sweet time due to
autovacuum_vacuum_cost_delay. (The "helping" model avoids this
problem because the user backend can just catch up with, and pass,
the I/O-throttled autovacuum process.)
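
For concreteness, the wait-vs-move-on choice can be phrased in terms
of the heavyweight page-lock machinery, something like this (a
sketch only, assuming a lock on the metapage is used to serialize
cleaners; clean_pending_list is a hypothetical stand-in):

    /* "move on": skip the cleanup if another cleaner holds the lock */
    if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
        return;     /* a cleaner is at work; resume inserting */

    /*
     * "wait" would instead block here with
     * LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock),
     * with the autovacuum-cancellation problem described above.
     */
    clean_pending_list(index);  /* hypothetical */
    UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);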

For completeness' sake, a fourth option would be to move on, but
only after inserting the tuple directly into the main index
structure (rather than the pending list), as would be done with
fastupdate off, once the pending list is already oversized. This is
my favorite, but there is no chance of it going into 9.6, much less
of being backpatched.
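
In sketch form (hypothetical names: pending_list_limit_pages is a
made-up threshold, and the two insert paths are stand-ins for the
real fastupdate machinery):

    /*
     * Fourth option: once the pending list is over the limit, insert
     * directly into the main entry tree, as if fastupdate were off,
     * instead of appending to the list and trying to clean it.
     */
    if (GinGetUseFastUpdate(index) &&
        metadata->nPendingPages < pending_list_limit_pages)
        append_to_pending_list(index, itup);    /* hypothetical */
    else
        insert_into_main_tree(index, itup);     /* hypothetical */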

Alvaro's recommendation, to let the cleaner off the hook once it
passes the page that was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from growing without bound. I
think that would probably be good enough, because the current
throttling behavior is purely accidental and doesn't *guarantee* a
limit on the size of the pending list.
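
That proposal might look roughly like this (hypothetical helpers
again; the real list traversal is in ginInsertCleanup):

    /*
     * Alvaro's variant: remember the tail of the pending list at the
     * moment cleanup starts, and stop once that page has been
     * processed, so concurrent inserts can't pin the cleaner down.
     */
    BlockNumber stopAt = metadata->tail;    /* tail when we started */
    BlockNumber blkno  = metadata->head;

    for (;;)
    {
        clean_one_pending_page(index, blkno);   /* hypothetical */
        if (blkno == stopAt)
            break;      /* reached the old tail; newer pages can wait */
        blkno = next_pending_page(index, blkno);    /* hypothetical */
    }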

Cheers,

Jeff
