Reduce pinning in btree indexes

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Reduce pinning in btree indexes
Date: 2015-02-15 00:19:45
Message-ID: 721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com

We have a customer who was unable to migrate from a well-known
commercial database product to pg because they have a very large
software base that holds cursors open for very long periods of
time, preventing GlobalXmin values from advancing, leading to
bloat. They have a standard test that exercises hundreds of client
connections for days at a time, and saw disk space and CPU usage
climbing to unacceptable levels[1]. The other database product had
no such problems. We suggested the obvious solutions that we
always suggest on the community lists, and they had perfectly good
reasons why those were not viable for them -- we either needed to
prevent bloat under their current software or they could not
migrate.

What they wanted was what happened in the other database product --
if a snapshot got sufficiently old that cleaning up the MVCC data
was a problem *and* the snapshot was used again *and* it read a
page which had been modified since the snapshot was taken, so that
correct data could no longer be returned, then they wanted to
receive a "snapshot too old" error. I wrote a patch to do that, but
they didn't see much improvement, because all (auto)vacuum processes
blocked indefinitely on the btree index pages where a cursor was
parked. So they still got massive bloat and consequent problems.
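
To make the shape of that check concrete, here is a rough sketch --
not the actual patch. SnapshotAgeExceedsThreshold(), the safe_lsn
parameter, and even the error code are made-up names for
illustration; the idea is to error out only when the snapshot has
passed the age threshold *and* the page has been modified past the
point up to which the snapshot can still read correct data:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "utils/snapshot.h"

/* Hypothetical helper: true once the snapshot passes the threshold. */
extern bool SnapshotAgeExceedsThreshold(Snapshot snap);

/*
 * Sketch only, not the patch: "safe_lsn" stands for the point up to
 * which the snapshot can still read correct data, and
 * ERRCODE_SNAPSHOT_TOO_OLD is an assumed error code.
 */
static void
TestForOldSnapshot(Page page, Snapshot snap, XLogRecPtr safe_lsn)
{
    if (SnapshotAgeExceedsThreshold(snap) &&
        PageGetLSN(page) > safe_lsn)
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));
}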

It seemed odd to me that a btree implementation based on the
Lehman & Yao techniques would block anything, because the concept
is that it reads everything it needs off the index page and pauses
"between" pages. We sorta do that, but the "interlock" between
vacuum processes and other index usages basically involves holding
a pin on the page we just read until we are done using the values
we read off that page, and treating the pin as a lock weaker than a
shared (or READ) lock -- a pin conflicts only with a "super
exclusive" lock, which is acquired by taking an exclusive lock only
once no other process holds a pin on the page. This ensures that at
the end of a vacuum pass over a btree index there are no TIDs in
process-local memory from before the start of the pass, and
consequently no scan can read a new tuple assigned to the same TID
value after the rest of the vacuum phases run. So a pin on a btree
page can block a vacuum process indefinitely.
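
In buffer-manager terms, the two sides of that interlock look
roughly like this. The buffer calls are the real entry points, but
everything else is a sketch, with error handling and all
btree-specific logic omitted:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Scan side: the lock is dropped between pages, the pin is not. */
static void
scan_page_sketch(Relation indexRel, BlockNumber blkno)
{
    Buffer      buf = ReadBuffer(indexRel, blkno);

    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* the "READ" lock */
    /* ... copy matching TIDs into process-local memory ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* lock dropped, pin kept */

    /* ... the TIDs are used while the pin still blocks vacuum ... */

    ReleaseBuffer(buf);                     /* pin finally dropped */
}

/* Vacuum side: the "super exclusive" (cleanup) lock. */
static void
vacuum_page_sketch(Relation indexRel, BlockNumber blkno)
{
    Buffer      buf = ReadBuffer(indexRel, blkno);

    /*
     * Blocks until no other backend holds a pin on the page, so a
     * parked cursor's pin can stall this indefinitely.
     */
    LockBufferForCleanup(buf);
    /* ... remove dead index tuples ... */
    UnlockReleaseBuffer(buf);
}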

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago. The customer's cursors which were causing the
problems all use MVCC snapshots, so I went looking to see whether
we couldn't take advantage of this fact. For the most part it
seemed we were OK if we dropped the pin along with the READ lock for
a btree scan which used an MVCC snapshot. I found that LP_DEAD
hinting would be a problem with an old TID, but I figured we could
work around that by storing the page's LSN into the scan position
structure when the page contents were read, and only doing the
hinting if the page's LSN still matched when we were ready to do
it. That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those. Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.
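
The guard amounts to remembering the page's LSN in the scan position
when the contents are read and comparing it again before setting any
hints. A simplified sketch for a single item (the real code in
_bt_killitems re-pins the page and loops over all killed items, and
the "lsn" field in the scan position is what the patch adds):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

static void
kill_item_if_safe(BTScanOpaque so, Buffer buf, OffsetNumber offnum)
{
    Page        page;

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
    {
        /*
         * The page is provably unchanged since we read it, so the
         * TID at offnum still refers to the tuple we found to be
         * dead and the hint is safe.
         */
        ItemIdMarkDead(PageGetItemId(page, offnum));
        MarkBufferDirtyHint(buf, true);
    }
    /* Otherwise the page changed under us; skip the hint. */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}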

There was also a problem with the fact that the code conflated the
notion of having a valid scan position with holding a pin on a
block in the index. Those two concepts needed to be teased apart,
which I did by adding some new macros and sorting out which one
applies where. Without a pin on the block, we also needed to remember what
block we had been processing to be able to do the LP_DEAD hinting;
for that the block number was added to the structure which tracks
scan position.
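
The separation looks roughly like this -- close to what the patch
does, though the names could still change before commit. Whether a
position is pinned is judged from the buffer; whether it is valid is
judged from the remembered block number:

/* A scan position is pinned only while it holds a buffer ... */
#define BTScanPosIsPinned(scanpos) \
    BufferIsValid((scanpos).buf)

/* ... but it stays valid as long as we remember the block. */
#define BTScanPosIsValid(scanpos) \
    BlockNumberIsValid((scanpos).currPage)

/* Drop the pin, if any, without invalidating the position. */
#define BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (BTScanPosIsPinned(scanpos)) \
        { \
            ReleaseBuffer((scanpos).buf); \
            (scanpos).buf = InvalidBuffer; \
        } \
    } while (0)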

Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the "optimization" adds overhead to every move
to the next page so that a mark request can just set a few variables
in a structure instead of running two fairly small memcpy()
statements. The customer's two-day benchmark showed no performance
hit, and looking at the code I would be amazed if the optimization
yielded a measurable benefit. In general, speeding up a mark
operation by adding overhead to every move through a scan seems
dubious.
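
For reference, the two memcpy() statements in question amount to no
more than this sketch (copying sizeof the whole struct here for
simplicity; the actual code can limit the copy to the used portion
of the item array):

#include <string.h>
#include "access/nbtree.h"

/* Mark: snapshot the current scan position. */
static void
mark_sketch(BTScanOpaque so)
{
    memcpy(&so->markPos, &so->currPos, sizeof(BTScanPosData));
}

/* Restore: jump back to the marked position. */
static void
restore_sketch(BTScanOpaque so)
{
    memcpy(&so->currPos, &so->markPos, sizeof(BTScanPosData));
}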

At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes. But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.

In combination with the snapshot-too-old patch, the two-day customer
benchmark ran without problematic levels of bloat, controlling disk
space growth and keeping CPU usage level. On its own, this patch
would at least allow autovacuum to maintain statistics, which
probably makes it worthwhile by itself, but it is unlikely to have
much impact on bloat unless both are applied.

At least two README files and a lot of comments need to be modified
before commit to reflect the change in approach if this is accepted
by the community. Since this work is so recent, I haven't had time
to do that yet.

git diff --stat tells me this patch has:

4 files changed, 208 insertions(+), 128 deletions(-)

... and the related "snapshot too old" patch has:

16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch
is really still at "proof of concept" phase, it may be too late to
consider them for PostgreSQL 9.5, but I would still like a serious
review now in case people feel strongly about controlling bloat in
the face of old snapshots, and to know whether to work on this just
for the Postgres Plus Advanced Server fork or for the community.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] This test maintains a certain transaction rate, with user think
times on the connections, so performance isn't measured by a tps
rate but by how much CPU is consumed to maintain that rate.

Attachment: bt-nopin-v1.patch (text/x-diff, 28.1 KB)
