Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted
Date: 2017-11-25 05:14:00
Message-ID: CAH2-Wz=GTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aMGEKLKv95csQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 24, 2017 at 6:09 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> We see this for the entry tree (no deletion is possible in the first
>> place), and we also see it for the posting tree (the dance with
>> inserters having a pin on the root, and so on). Not mentioning why
>> pending list recycling is safe in terms of locking protocols seems
>> like an omission that should be addressed. I'm no GIN expert, but I
>> would expect this because the pending list seems like a kind of
>> extension of the posting tree.
>
> The pending list follows the same rule as the right-links on the posting
> trees:
>
> "To prevent a backend from reaching a deleted page via a right-link, when
> following a right-link the lock on the previous page is not released until
> the lock on next page has been acquired."
>
> This lock-chaining rule is also followed when stepping from the meta page to
> the first page in the pending list.
>
> (Yes, neither of these are mentioned in the README).

But, as I said, we lock-chain with shared buffer locks in
ginInsertCleanup(). Not exclusive buffer locks. (We also take the HW
lock on the metapage so that ginInsertCleanup() callers block each
other out).

With B-Tree index structures that don't use Lehman & Yao's technique,
in general, we must do lock coupling. There is, in general, no point
in lock-chaining at all if there is no exclusive buffer locker
(writer) involved at some point. Writers must lock with exclusive
buffer locks from the beginning of the chain (e.g. root) for
traditional coupling to be correct. However, the only exclusive buffer
lock in ginInsertCleanup() is taken on the metapage at quite a late
point in cleanup, just before insertion of pending items is performed.
Crucially, this comes after we've decided what the head and tail of
the pending list are.

The metapage HW lock is sufficient for writers to block writers, but
AFAICT it is not sufficient for writers to block readers.

> Since pending lists don't have downlinks, the dance with the super-exclusive
> locks is not needed.

Isn't the pending list's head block number in the metapage very much
like a downlink? I think I would be convinced by what you say here if
we held an exclusive buffer lock on the metapage from the beginning
within ginInsertCleanup() (we'd then remove the LockPage() HW lock
calls entirely). We don't, though. I'm not proposing that we actually
change the code in that way. My point is that that looks like it would
make the code correct from the angle I'm looking at it, which may make
my thinking clearer to you.

> I think what is true for deleted pages should also be sufficient for
> deleted-and-freed-and-recycled pages. Since you are only allowed to follow
> a link from a locked buffer and not from private memory, you can't follow a
> stale link as long as the links are updated before the page is deleted and
> then recycled.
>
> Is this convincing?

In general, yes, but the devil is in the details. I very much like the
"pending list is analogous to posting list" framing. i just doubt that
we get all the details right in practice.

> Also missing is an argument for the deadlock-freeness. I think what is
> needed for that is that you never lock a page earlier in the pending list,
> nor the metapage, while holding a lock on a pending list page. (Which as
> far as I can tell is true, but I don't know how to exhaustively verify it)

Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.

I also suggest that you add a random, artificial sleep within
ginInsertCleanup(), like this:

/*
* Read and lock head of pending list
*/
blkno = metadata->head;
// Random sleep goes here
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, GIN_SHARE);
page = BufferGetPage(buffer);

If there is a race here, then this should make it more likely to hit.

I should remind you that deadlocks might be caused by recycling races.
While we should explain why pending list cleaning is deadlock-free, as
you suggest, reports of deadlocks might not be due to a flaw in
locking as such.

>> As I've said, it feels slightly off to me that a user backend that is
>> performing opportunistic cleanup during insertion can call
>> RecordFreeIndexPage() in GIN.
>
>
> I don't know how to address that, because I don't know why it seems off. No
> one else does this because no one else generates large number of free pages
> during insertions.

My point is that it seems like having little distinction between
deletion and recycling is problematic. If you don't have very strong
locking to block out all possible index scans (locking at some central
choke point), then you risk concurrent recycling races. I am wearing
my nbtree hat in assessing the GIN code, which admittedly might not
always be the best way of looking at it. I'm explaining my perspective
because I think it might be useful for you to understand my thought
process. I hope I don't come off as peevish.

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-11-25 05:40:00 Re: [HACKERS] Custom compression methods
Previous Message Jeff Janes 2017-11-25 02:09:42 Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted