Re: Deleting older versions in unique indexes to avoid page splits

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2020-10-21 19:36:29
Message-ID: CAH2-Wz=zOxjVdz0ZDch=Hyh07SWkorRvD+HGUcT5xMj-sSFt5w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 21, 2020 at 8:25 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> That certainly isn't great. I mean, it might not be too terrible,
> because a leaf index page isn't nearly as potentially hot as a VM
> page or a clog page, but it hurts interruptibility and risks hurting
> concurrency. If it were possible to arrange to hold only a pin on
> the page during all this rather than a lock, it would be better. I'm
> not sure how realistic that is, though.

I don't think that it's realistic. Well, technically you could do
something like that, but you'd end up with some logically equivalent
mechanism which would probably be slower. As you know, in nbtree pins
are generally much less helpful than within heapam (you cannot read a
page without a shared buffer lock, no matter what). Holding a pin only
provides a very weak guarantee about VACUUM and TID recycling that
usually doesn't come up.
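
To illustrate (just a sketch of the generic buffer access pattern,
not code from the patch; rel and blkno are placeholders):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch only: you can't examine an nbtree page with a bare pin; you
     * need a buffer content lock.  Dropping the lock while keeping the
     * pin only leaves you with the weak VACUUM/TID-recycling interlock.
     */
    static void
    examine_page(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        LockBuffer(buf, BUFFER_LOCK_SHARE);     /* content lock to read */
        /* ... inspect the page via BufferGetPage(buf) ... */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* pin still held here */

        ReleaseBuffer(buf);                     /* finally drop the pin */
    }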

Bear in mind that we actually do practically the same thing all the
time with the current LP_DEAD setting stuff, where we need to call
compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples
with a leaf buffer lock held in almost the same way. That's
potentially far worse if you look at it in isolation, because it could
involve hundreds of heap pages, whereas this is just one to three.
(BTW, the next version will also do that work in passing, so you're
practically guaranteed to do less with a buffer lock held compared to
the typical case of nbtree LP_DEAD setting, even without counting how
the LP_DEAD bits get set in the first place.)
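
Schematically, the existing code path looks something like this (a
simplified sketch of the Postgres 12/13 pattern, not verbatim nbtree
code; leafbuf, heapRel, deadtids and ndead are placeholders):

    /*
     * Simplified sketch: the XID horizon for already-dead index tuples
     * is computed while the leaf buffer's exclusive lock is held, and
     * that computation may visit one heap page per distinct heap block
     * among the dead TIDs.
     */
    TransactionId latestRemovedXid;

    LockBuffer(leafbuf, BUFFER_LOCK_EXCLUSIVE);

    latestRemovedXid = table_compute_xid_horizon_for_tuples(heapRel,
                                                            deadtids,
                                                            ndead);

    /* ... WAL-log and remove the dead index tuples from the leaf ... */

    LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);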

I could also point out that something very similar happens in
_bt_check_unique().

Also bear in mind that the alternative is pretty much a page split, which means:

* Lock the leaf page

* Obtain the relation extension lock

* Lock the newly allocated right sibling page

* Release the relation extension lock

* Lock the original right sibling page

* Release the original right sibling page

* Release the new right sibling page

* Lock the parent page

* Release the original, now-split page

* Release the parent page

(For now I will refrain from going into all of the absurd and
near-permanent secondary costs that just giving up and splitting the
page imposes. I also didn't include every detail of the locking --
there is one step that didn't seem worth mentioning.)
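
To make the asymmetry concrete, the locking dance has roughly the
following shape (a heavily simplified sketch along the lines of
_bt_split() and _bt_insert_parent(), not the real code; buffer and
block variables are placeholders, and error handling and WAL logging
are omitted):

    /* the leaf page ("buf") is already exclusively locked here */

    /*
     * Allocate and exclusively lock the new right sibling.  With P_NEW,
     * _bt_getbuf() acquires and releases the relation extension lock
     * internally when it has to extend the index.
     */
    rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);

    /* lock the original right sibling, to repair its left-link */
    sbuf = _bt_getbuf(rel, origrightblkno, BT_WRITE);

    /* ... move half the items, fix sibling links, WAL-log the split ... */

    _bt_relbuf(rel, sbuf);      /* release original right sibling */
    _bt_relbuf(rel, rbuf);      /* release new right sibling */

    /* insert the new downlink one level up */
    pbuf = _bt_getbuf(rel, parentblkno, BT_WRITE);  /* lock parent */
    _bt_relbuf(rel, buf);       /* release original, now-split page */
    /* ... add the downlink to the parent ... */
    _bt_relbuf(rel, pbuf);      /* release parent page */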

The key concept here is of course asymmetry. The asymmetry is not
merely favorable; it's outrageous. The other key concept is that it's
fundamentally impossible to pay more than a very small fixed cost
without getting a benefit.

That said, I accept that there is still some uncertainty about
whether all workloads that get a benefit will be happy with the
trade-off. I am still fine-tuning how this works in cases with high
contention, and I welcome any help with that part. But note that this
doesn't necessarily have much to do with the heap page accesses. It's
not always strictly better to never have any bloat at all (it's pretty
close to that, but not quite). We saw this with the Postgres 12 work,
where small TPC-C test cases had some queries go slower simply because
a small, highly contended index did not get bloated, thanks to a
smarter split algorithm. There is no reason to believe that it had
anything to do with the cost of making better decisions. It was the
decisions themselves.

I don't want to completely prevent "version-driven page splits"
(though a person could reasonably imagine that that is in fact my
precise goal); rather, I want to make non-HOT updaters work to prove
that splitting the page due to version churn is almost certainly
necessary -- then and only then should the split be accepted.
Currently we meekly roll over and let non-HOT updaters impose negative
externalities on the system as a whole. The patch usually clearly
benefits even workloads that consist entirely of non-HOT updaters.
Negative externalities are only good for the individual trying to
impose costs on the collective when they can be a true freeloader.
It's always bad for the collective, and it's bad even for the bad
actors once they're more than a small minority.

Currently, non-HOT updaters are not merely selfish in the sense of
imposing a downside on the collective (or the system as a whole) that
is roughly proportionate to the upside benefit they get. By not
cleaning up their mess as they go, they create a downside that is a
huge multiple of any possible upside for them. To me this seems
incontrovertible. Worrying about the precise extent to which this is
true in each situation doesn't seem particularly productive to me.
Whatever the actual extent of the imbalance is, the solution is to not
let them do that.

This patch is not really about overall throughput. It could be
justified on that basis, but that's not how I like to think of it.
Rather, it's about providing a stabilizing backstop mechanism, which
tends to bound the amount of index bloat and the number of versions in
each index for each *logical row* -- that's the most important benefit
of the patch. There are workloads that will greatly benefit despite
only invoking the new mechanism very occasionally, as a backstop. And
even cases with a fair amount of contention don't really use it that
often (which is why the heap page access cost is pretty much a
question about specific high contention patterns only). The proposed
new cleanup mechanism may only be used in certain parts of the key
space for certain indexes at certain times, in a bottom-up fashion. We
don't have to be eager about cleaning up bloat most of the time, but
it's also true that there are cases where we ought to work very hard
at it in a localized way.

This explanation may sound unlikely, but the existing behaviors taken
together present us with outrageous cost/benefit asymmetry, arguably
in multiple dimensions.

I think that having this backstop cleanup mechanism (and likely others
in other areas) will help to make the assumptions underlying
autovacuum scheduling much more reasonable in realistic settings. Now
it really is okay that autovacuum doesn't really care about the needs
of queries, and is largely concerned with macro-level things like free
space management. Its top-down approach isn't so bad once it has true
bottom-up complementary mechanisms. The LP_DEAD microvacuum stuff is
nice because it marks things as dead in passing, pretty much for free.
That's not enough on its own -- it's no backstop. The current LP_DEAD
stuff appears to work rather well, until one day it suddenly doesn't
and you curse Postgres for it. I could go on about the non-linear
nature of the system as a whole, hidden tipping points, and other
stuff like that. But I won't right now.

--
Peter Geoghegan
