Re: Reduce pinning in btree indexes

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce pinning in btree indexes
Date: 2015-02-26 17:49:23
Message-ID: 440831854.629116.1424972963735.JavaMail.yahoo@mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:> On 02/15/2015 02:19 AM, Kevin Grittner wrote:
>> Interestingly, the btree README points out that using the old TID
>> with a new tuple poses no hazard for a scan using an MVCC snapshot,
>> because the new tuple would not be visible to a snapshot created
>> that long ago.
>
> The first question is: Do we really need that interlock for the non-MVCC

> snapshots either?

We don't need the interlock for non-MVCC snapshots if the conditions we
limit or filter on at the index level are rechecked when we get to the heap
tuple; otherwise we do.

> If we do: For non-MVCC snapshots, we need to ensure that all index scans

> that started before the VACUUM did complete before the VACUUM does.

Isn't it enough to be sure that if we're not going to recheck any index
tests against the heap tuple that any process-local copies of index entries
which exist at the start of the index scan phase of the vacuum are not used
after the vacuum enters the phase of freeing line pointers -- that is, any
scan which has read a page when the vacuum starts to scan the index moves
on to another page before vacuum finishes scanning that index?

> I wonder if we could find a different mechanism to enforce that. Using the

> pin-interlock for that has always seemed a bit silly to me.

It certainly is abusing the semantics of the pin to treat it as a type of
lock on the contents.

> Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index
> scan on the table begins, and have VACUUM wait on it.

-1

The problem I'm most interested in fixing based on user feedback is a vacuum
blocking when a cursor which is using an index scan is sitting idle. Not
only does the vacuum of that table stall, but if it is an autovacuum worker,
that worker is prevented from cleaning up any other tables. I have seen all
autovacuum workers blocked in exactly this way, leading to massive bloat.
What you suggest is just using a less awkward way to keep the problem.

>> I found that the LP_DEAD hinting
>> would be a problem with an old TID, but I figured we could work
>> around that by storing the page LSN into the scan position structure
>> when the page contents were read, and only doing hinting if that
>> matched when we were ready to do the hinting. That wouldn't work
>> for an index which was not WAL-logged, so the patch still holds
>> pins for those.
>

> Or you could use GetFakeLSNForUnloggedRel().

I'm unfamiliar with that, but will take a look. (I will be back at
my usual development machine late tomorrow.)

>> Robert pointed out that the visibility information
>> for an index-only scan wasn't checked while the index page READ
>> lock was held, so those scans also still hold the pins.
>

> Why does an index-only scan need to hold the pin?

Robert already answered this one, but there is a possible solution
-- provide some way to check the heap's visibility map for the TIDs
read from an index page before releasing the read lock on that page.

>> Finally, there was an "optimization" for marking buffer position
>> for possible restore that was incompatible with releasing the pin.
>> I use quotes because the optimization adds overhead to every move
>> to the next page in order set some variables in a structure when a
>> mark is requested instead of running two fairly small memcpy()
>> statements. The two-day benchmark of the customer showed no
>> performance hit, and looking at the code I would be amazed if the
>> optimization yielded a measurable benefit. In general, optimization
>> by adding overhead to moving through a scan to save time in a mark
>> operation seems dubious.
>
> Hmm. Did your test case actually exercise mark/restore? The memcpy()s
> are not that small. Especially if it's an index-only scan, you're
> copying a large portion of the page. Some scans call markpos on every
> tuple, so that could add up.

I have results from the `make world` regression tests and a 48-hour
customer test. Unfortunately I don't know how heavily either of those
exercise this code. Do you have a suggestion for a way to test whether
there is a serious regression for something approaching a "worst case"?

>> At some point we could consider building on this patch to recheck
>> index conditions for heap access when a non-MVCC snapshot is used,
>> check the visibility map for referenced heap pages when the TIDs
>> are read for an index-only scan, and/or skip LP_DEAD hinting for
>> non-WAL-logged indexes. But all those are speculative future work;
>> this is a conservative implementation that just didn't modify
>> pinning where there were any confounding factors.
>
> Understood. Still, I'd like to see if we can easily get rid of the
> pinning altogether.

That would be great, but I felt that taking care of the easy cases in
on patch and following with patches to handle the trickier cases as
separate follow-on patches made more sense than trying to do everything
at once. Assuming we sort out the mark/restore issues for the initial
patch, it provides infrastructure to keep the other cases simpler.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2015-02-26 18:01:36 Re: Partitioning WIP patch
Previous Message Tomas Vondra 2015-02-26 17:48:28 Re: Performance improvement for joins where outer side is unique