Re: Deleting older versions in unique indexes to avoid page splits

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Victor Yegorov <vyegorov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2020-12-01 22:18:36
Message-ID: CAH2-WzkJ1sO6inwECttDDGdts9YSsQQM_8h3_+8DO1Bmykj_ug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> This is a wholly new concept with a lot of heuristics. It's a lot of
> swallow.

Thanks for taking a look! Yes, it's a little unorthodox.

Ideally, you'd find time to grok the patch and help me codify the
design in some general kind of way. If there are general lessons to be
learned here (and I suspect that there are), then this should not be
left to chance. The same principles can probably be applied in heapam,
for example. Even if I'm wrong about the techniques being highly
generalizable, it should still be interesting! (Something to think
about a little later.)

Some of the intuitions behind the design might be vaguely familiar to
you as the reviewer of my earlier B-Tree work. In particular, the
whole way that we reason about how successive related operations (in
this case bottom-up deletion passes) affect individual leaf pages over
time might give you a feeling of déjà vu. It's a little like the
nbtsplitloc.c stuff that we worked on together during the Postgres 12
cycle. We can make what might seem like rather bold assumptions about
what's going on, and adapt to the workload. This is okay because we're
sure that the downside of our being wrong is a fixed, low performance
penalty. (To a lesser degree it's okay because the empirical evidence
shows that we're almost always right, because we apply the
optimization again and again precisely because it worked out last
time.)

I like to compare it to the familiar "rightmost leaf page applies leaf
fillfactor" heuristic, which applies an assumption that is wrong in
the general case, but nevertheless obviously helps enormously as a
practical matter. Of course it's still true that anybody reviewing
this patch ought to start with a principled skepticism of this claim
-- that's how you review any patch. I can say for sure that that's the
idea behind the patch, though. I want to be clear about that up front,
to save you time -- if this claim is wrong, then I'm wrong about
everything.

Generational garbage collection influenced this work, too. It also
applies pragmatic assumptions about where garbage is likely to appear.
Assumptions that are based on nothing more than empirical observations
about what is likely to happen in the real world, that are validated
by experience and by benchmarking.

> On 30/11/2020 21:50, Peter Geoghegan wrote:
> > +} TM_IndexDelete;

> > +} TM_IndexStatus;

> Is it really significantly faster to have two arrays? If you had just
> one array, each element would be only 12 bytes long. That's not much
> smaller than TM_IndexDeletes, which is 8 bytes.

Yeah, but the swap operations really matter here. At one point I found
that to be the case, anyway. That might no longer be true, though. It
might be that the code became less performance critical because other
parts of the design improved, resulting in the code getting called
less frequently. But if that is true then it has a lot to do with the
power-of-two bucketing that you go on to ask about -- that helped
performance a lot in certain specific cases (as I go into below).

I will add a TODO item for myself, to look into this again. You may
well be right.

> > + /* First sort caller's array by TID */
> > + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);

> Instead of sorting the array by TID, wouldn't it be enough to sort by
> just the block numbers?

I don't understand. Yeah, I guess that we could do our initial sort of
the deltids array (the heap_tid_shellsort() call) just using
BlockNumber (not TID). But OTOH there might be some small locality
benefit to doing a proper TID sort at the level of each heap page. And
even if there isn't any such benefit, does it really matter?

If you are asking about the later sort of the block counts array
(which helps us sort the deltids array a second time, leaving it in
its final order for bottom-up deletion heapam.c processing), then the
answer is no. This block counts metadata array sort is useful because
it allows us to leave the deltids array in what I believe to be the
most useful order for processing. We'll access heap blocks primarily
based on the number of promising tuples (though as I go into below,
sometimes the number of promising tuples isn't a decisive influence on
processing order).

> > * While in general the presence of promising tuples (the hint that index
> > + * AMs provide) is the best information that we have to go on, it is based
> > + * on simple heuristics about duplicates in indexes that are understood to
> > + * have specific flaws. We should not let the most promising heap pages
> > + * win or lose on the basis of _relatively_ small differences in the total
> > + * number of promising tuples. Small differences between the most
> > + * promising few heap pages are effectively ignored by applying this
> > + * power-of-two bucketing scheme.
> > + *
>
> What are the "specific flaws"?

I just meant the obvious: the index AM doesn't actually know for sure
that there are any old versions on the leaf page that it will
ultimately be able to delete. This uncertainty needs to be managed,
including inside heapam.c. Feel free to suggest a better way of
expressing that sentiment.

> I understand that this is all based on rough heuristics, but is there
> any advantage to rounding/bucketing the numbers like this? Even if small
> differences in the total number of promising tuple is essentially noise
> that can be safely ignored, is there any harm in letting those small
> differences guide the choice? Wouldn't it be the essentially the same as
> picking at random, or are those small differences somehow biased?

Excellent question! It actually helps enormously, especially with low
cardinality data that makes good use of the deduplication optimization
(where it is especially important to keep the costs and the benefits
in balance). This has been validated by benchmarking.

This design naturally allows the heapam.c code to take advantage of
both temporal and spatial locality. For example, let's say that you
have several indexes all on the same table, which get lots of non-HOT
UPDATEs, which are skewed. Naturally, the heap TIDs will match across
each index -- these are index entries that are needed to represent
successor versions (which are logically unchanged/version duplicate
index tuples from the point of view of nbtree/nbtdedup.c). Introducing
a degree of determinism to the order in which heap blocks are
processed naturally takes advantage of the correlated nature of the
index bloat. It naturally makes it much more likely that the
also-correlated bottom-up index deletion passes (that occur across
indexes on the same table) each process the same heap blocks close
together in time -- with obvious performance benefits.

In the extreme (but not particularly uncommon) case of non-HOT UPDATEs
with many low cardinality indexes, each heapam.c call will end up
doing *almost the same thing* across indexes. So we're making the
correlated nature of the bloat (which is currently a big problem) work
in our favor -- turning it on its head, you could say. Highly
correlated bloat is not the exception -- it's actually the norm in the
cases we're targeting here. If it wasn't highly correlated then it
would already be okay to rely on VACUUM to get around to cleaning it
later.

This power-of-two bucketing design probably also helps when there is
only one index. I could go into more detail on that, plus other
variations, but perhaps the multiple index example suffices for now. I
believe that there are a few interesting kinds of correlations here --
I bet you can think of some yourself. (Of course it's also possible
and even likely that heap block correlation won't be important at all,
but my response is "what specifically is the harm in being open to the
possibility?".)

BTW, I tried to make the tableam interface changes more or less
compatible with Zedstore, which is notable for not using TIDs in the
same way as heapam (or zheap). Let me know what you think about that.
I can go into detail about it if it isn't obvious to you.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Patrick Handja 2020-12-01 22:19:02 Re: Setof RangeType returns
Previous Message Zidenberg, Tsahi 2020-12-01 21:43:53 Re: Improving spin-lock implementation on ARM.