Re: Avoiding pin scan during btree vacuum

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding pin scan during btree vacuum
Date: 2015-12-21 21:28:16
Message-ID: 20151221212816.GY2618@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs wrote:
> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum. That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed). In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything. This is supposed to optimize
the case in question.

One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently state that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition. (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)

Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage(). I assume you should be changing both.

I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-12-21 21:36:50 Re: Avoiding pin scan during btree vacuum
Previous Message Tom Lane 2015-12-21 20:47:56 Coding note: truncating with strlcpy() is not such a hot idea