Re: Deleting older versions in unique indexes to avoid page splits

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Victor Yegorov <vyegorov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2021-01-04 12:08:18
Message-ID: 42d3b795-d4c1-7abd-2280-6b0f2df30478@iki.fi
Lists: pgsql-hackers

On 02/12/2020 00:18, Peter Geoghegan wrote:
> On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> On 30/11/2020 21:50, Peter Geoghegan wrote:
>>> +} TM_IndexDelete;
>
>>> +} TM_IndexStatus;
>
>> Is it really significantly faster to have two arrays? If you had just
>> one array, each element would be only 12 bytes long. That's not much
>> bigger than TM_IndexDelete, which is 8 bytes.
>
> Yeah, but the swap operations really matter here. At one point I found
> that to be the case, anyway. That might no longer be true, though. It
> might be that the code became less performance critical because other
> parts of the design improved, resulting in the code getting called
> less frequently. But if that is true then it has a lot to do with the
> power-of-two bucketing that you go on to ask about -- that helped
> performance a lot in certain specific cases (as I go into below).
>
> I will add a TODO item for myself, to look into this again. You may
> well be right.
>
>>> + /* First sort caller's array by TID */
>>> + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
>
>> Instead of sorting the array by TID, wouldn't it be enough to sort by
>> just the block numbers?
>
> I don't understand. Yeah, I guess that we could do our initial sort of
> the deltids array (the heap_tid_shellsort() call) just using
> BlockNumber (not TID). But OTOH there might be some small locality
> benefit to doing a proper TID sort at the level of each heap page. And
> even if there isn't any such benefit, does it really matter?

You said above that heap_tid_shellsort() is very performance critical,
and that's why you use the two-arrays approach. If it's so performance
critical that swapping 8-byte vs. 12-byte array elements makes a
difference, I would guess that comparing full TIDs vs. just the block
numbers would also make a difference.

If you really want to shave cycles, you could also store BlockNumber and
OffsetNumber in the TM_IndexDelete array, instead of ItemPointerData.
What's the difference, you might ask? ItemPointerData stores the block
number as two 16-bit integers that need to be reassembled into a 32-bit
integer in the ItemPointerGetBlockNumber() macro.
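
To illustrate, paraphrasing the definitions in src/include/storage/block.h,
off.h and itemptr.h (abridged here, not quoted verbatim):

    typedef uint32 BlockNumber;
    typedef uint16 OffsetNumber;

    typedef struct BlockIdData
    {
        uint16      bi_hi;      /* high 16 bits of the block number */
        uint16      bi_lo;      /* low 16 bits of the block number */
    } BlockIdData;

    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;  /* block number, split in two halves */
        OffsetNumber ip_posid;  /* offset within the block */
    } ItemPointerData;

    /* every ItemPointerGetBlockNumber() call does this reassembly */
    #define BlockIdGetBlockNumber(blockId) \
        ((BlockNumber) (((blockId)->bi_hi << 16) | ((uint16) (blockId)->bi_lo)))

A comparator working on a plain BlockNumber field avoids that
shift-and-or on every comparison.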

My argument here is two-pronged: If this is performance critical, you
could do these additional optimizations. If it's not, then you don't
need the two-arrays trick; one array would be simpler.

I'm not sure how performance critical this really is, or how many
instructions each of these optimizations might shave off, so of course I
might be wrong and the tradeoff you have in the patch now really is the
best. However, my intuition would be to use a single array, because
that's simpler, and do all the other optimizations instead: store the
TID in the struct as separate BlockNumber and OffsetNumber fields, and
sort only by the block number. Those optimizations are very deep in the
low-level functions and won't add much to the mental effort of
understanding the algorithm at a high level.
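
Something along these lines (just a sketch to illustrate the idea; the
field names are made up, mostly merging in what TM_IndexStatus carries
in your patch):

    typedef struct TM_IndexDelete
    {
        BlockNumber  blkno;          /* heap block, one plain uint32 */
        OffsetNumber offnum;         /* heap offset number */
        OffsetNumber idxoffnum;      /* index AM page offset number */
        bool         knowndeletable; /* currently known to be deletable? */
        bool         promising;      /* promising (duplicate) index tuple? */
        int16        freespace;      /* space freed in index if deleted */
    } TM_IndexDelete;                /* 12 bytes */

    /* shellsort comparator: the block number alone decides the order */
    static inline int
    tm_indexdelete_cmp(const TM_IndexDelete *a, const TM_IndexDelete *b)
    {
        if (a->blkno < b->blkno)
            return -1;
        if (a->blkno > b->blkno)
            return 1;
        return 0;
    }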

>>> + * While in general the presence of promising tuples (the hint that index
>>> + * AMs provide) is the best information that we have to go on, it is based
>>> + * on simple heuristics about duplicates in indexes that are understood to
>>> + * have specific flaws. We should not let the most promising heap pages
>>> + * win or lose on the basis of _relatively_ small differences in the total
>>> + * number of promising tuples. Small differences between the most
>>> + * promising few heap pages are effectively ignored by applying this
>>> + * power-of-two bucketing scheme.
>>> + *
>>
>> What are the "specific flaws"?
>
> I just meant the obvious: the index AM doesn't actually know for sure
> that there are any old versions on the leaf page that it will
> ultimately be able to delete. This uncertainty needs to be managed,
> including inside heapam.c. Feel free to suggest a better way of
> expressing that sentiment.

Hmm, maybe: "... is the best information that we have to go on, it is
just a guess based on simple heuristics about duplicates in indexes".

>> I understand that this is all based on rough heuristics, but is there
>> any advantage to rounding/bucketing the numbers like this? Even if small
>> differences in the total number of promising tuples are essentially noise
>> that can be safely ignored, is there any harm in letting those small
>> differences guide the choice? Wouldn't it be essentially the same as
>> picking at random, or are those small differences somehow biased?
>
> Excellent question! It actually helps enormously, especially with low
> cardinality data that makes good use of the deduplication optimization
> (where it is especially important to keep the costs and the benefits
> in balance). This has been validated by benchmarking.
>
> This design naturally allows the heapam.c code to take advantage of
> both temporal and spatial locality. For example, let's say that you
> have several indexes all on the same table, which get lots of non-HOT
> UPDATEs, which are skewed. Naturally, the heap TIDs will match across
> each index -- these are index entries that are needed to represent
> successor versions (which are logically unchanged/version duplicate
> index tuples from the point of view of nbtree/nbtdedup.c). Introducing
> a degree of determinism to the order in which heap blocks are
> processed naturally takes advantage of the correlated nature of the
> index bloat. It naturally makes it much more likely that the
> also-correlated bottom-up index deletion passes (that occur across
> indexes on the same table) each process the same heap blocks close
> together in time -- with obvious performance benefits.
>
> In the extreme (but not particularly uncommon) case of non-HOT UPDATEs
> with many low cardinality indexes, each heapam.c call will end up
> doing *almost the same thing* across indexes. So we're making the
> correlated nature of the bloat (which is currently a big problem) work
> in our favor -- turning it on its head, you could say. Highly
> correlated bloat is not the exception -- it's actually the norm in the
> cases we're targeting here. If it wasn't highly correlated then it
> would already be okay to rely on VACUUM to get around to cleaning it
> later.
>
> This power-of-two bucketing design probably also helps when there is
> only one index. I could go into more detail on that, plus other
> variations, but perhaps the multiple index example suffices for now. I
> believe that there are a few interesting kinds of correlations here --
> I bet you can think of some yourself. (Of course it's also possible
> and even likely that heap block correlation won't be important at all,
> but my response is "what specifically is the harm in being open to the
> possibility?".)

I see. Would be good to explain that pattern with multiple indexes in
the comment.
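
Just to check my understanding of the bucketing itself, I take it to
amount to something like this (my own sketch, not code from the patch):

    /*
     * Round the count of promising tuples down to the nearest power of
     * two.  Heap pages whose counts land in the same bucket compare as
     * equally promising, so small differences between the most promising
     * pages cannot decide which page wins or loses.
     */
    static uint32
    npromising_bucket(uint32 npromising)
    {
        uint32      bucket = 1;

        if (npromising == 0)
            return 0;
        while (bucket * 2 <= npromising)
            bucket *= 2;
        return bucket;
    }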

- Heikki
