Re: Reduce pinning in btree indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce pinning in btree indexes
Date: 2015-02-23 19:48:48
Message-ID: 54EB8420.7090907@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/15/2015 02:19 AM, Kevin Grittner wrote:
> Interestingly, the btree README points out that using the old TID
> with a new tuple poses no hazard for a scan using an MVCC snapshot,
> because the new tuple would not be visible to a snapshot created
> that long ago.

The first question is: Do we really need that interlock for the non-MVCC
snapshots either?

If we do: For non-MVCC snapshots, we need to ensure that all index scans
that started before the VACUUM did complete before the VACUUM does. I
wonder if we could find a different mechanism to enforce that. Using the
pin-interlock for that has always seemed a bit silly to me. Perhaps grab
a new heavy-weight lock on the table whenever a non-MVCC index scan on
the table begins, and have VACUUM wait on it.

> I found that the LP_DEAD hinting
> would be a problem with an old TID, but I figured we could work
> around that by storing the page LSN into the scan position structure
> when the page contents were read, and only doing hinting if that
> matched when we were ready to do the hinting. That wouldn't work
> for an index which was not WAL-logged, so the patch still holds
> pins for those.

Or you could use GetFakeLSNForUnloggedRel().

> Robert pointed out that the visibility information
> for an index-only scan wasn't checked while the index page READ
> lock was held, so those scans also still hold the pins.

Why does an index-only scan need to hold the pin?

> Finally, there was an "optimization" for marking buffer position
> for possible restore that was incompatible with releasing the pin.
> I use quotes because the optimization adds overhead to every move
> to the next page in order set some variables in a structure when a
> mark is requested instead of running two fairly small memcpy()
> statements. The two-day benchmark of the customer showed no
> performance hit, and looking at the code I would be amazed if the
> optimization yielded a measurable benefit. In general, optimization
> by adding overhead to moving through a scan to save time in a mark
> operation seems dubious.

Hmm. Did your test case actually exercise mark/restore? The memcpy()s
are not that small. Especially if it's an index-only scan, you're
copying a large portion of the page. Some scans call markpos on every
tuple, so that could add up.

> At some point we could consider building on this patch to recheck
> index conditions for heap access when a non-MVCC snapshot is used,
> check the visibility map for referenced heap pages when the TIDs
> are read for an index-only scan, and/or skip LP_DEAD hinting for
> non-WAL-logged indexes. But all those are speculative future work;
> this is a conservative implementation that just didn't modify
> pinning where there were any confounding factors.

Understood. Still, I'd like to see if we can easily get rid of the
pinning altogether.
- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-02-23 20:08:19 Re: Precedence of NOT LIKE, NOT BETWEEN, etc
Previous Message Tom Lane 2015-02-23 19:43:17 Precedence of NOT LIKE, NOT BETWEEN, etc