Re: Deleting older versions in unique indexes to avoid page splits

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2020-10-20 02:37:16
Message-ID: CAH2-WzmUAf=MSWxh28xV_veWiOmn-fmUPLbfhqX=t+RG6X9kbg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 16, 2020 at 1:58 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v3, which is rebased against the master branch as of
> today. No real changes, though.

And now here's v4.

This version adds handling of posting list tuples, which I was
skipping over before. Highly contended leaf pages with posting list
tuples could still sometimes get "logically unnecessary" page splits
in v3, which this seems to fix (barring the most extreme cases of
version churn, where the patch cannot and should not prevent page
splits). Actually, posting list tuple handling was something that we
had in v1 but lost in v2, because v2 changed our general strategy to
focus on what is convenient to the heap/tableam, which is the most
important thing by far (the cost of failing must be low, fixed, well
understood, and well targeted). The fix was to include TIDs from
posting lists, while marking them "non-promising". Only plain tuples
that are duplicates are considered promising. Most of the time, only
promising tuples influence where we look for tuples to kill. The
exception is when two heap pages have an equal number of promising
tuples, in which case we tiebreak on the total number of TIDs on the
leaf page that point to each heap page.
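The ordering heuristic described above can be sketched roughly as
follows. This is a minimal illustration in C, not code from the patch;
the struct and function names are invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical summary of one candidate heap block: how many
 * "promising" (duplicate) TIDs on the leaf page point to it, and how
 * many TIDs point to it in total.
 */
typedef struct BlockScore
{
    unsigned int blkno;     /* heap block number */
    int     npromising;     /* promising (duplicate) TIDs pointing here */
    int     ntids;          /* all TIDs pointing here, promising or not */
} BlockScore;

/*
 * qsort() comparator: most promising heap blocks sort first; blocks
 * with an equal number of promising TIDs tiebreak on total TID count,
 * since visiting a block with more TIDs can delete more index entries.
 */
static int
blockscore_cmp(const void *a, const void *b)
{
    const BlockScore *ba = (const BlockScore *) a;
    const BlockScore *bb = (const BlockScore *) b;

    if (ba->npromising != bb->npromising)
        return (bb->npromising > ba->npromising) ? 1 : -1;
    if (ba->ntids != bb->ntids)
        return (bb->ntids > ba->ntids) ? 1 : -1;
    return 0;
}
```

Sorting candidate blocks with a comparator along these lines would make
the best heap page the first one visited, which matters given the
strict cap on heap page visits.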

I seem to be able to cap the costs paid when the optimization doesn't
work out to the extent that we can get away with visiting only *one*
heap page before giving up. And, we now never visit more than 3 total
(2 is the common case when the optimization is really effective). This
may sound conservative -- because it is -- but it seems to work in
practice. I may change my mind about that and decide to be less
conservative, but so far all of the evidence I've seen suggests that
it doesn't matter -- the heuristics seem to really work. Might as well
be completely conservative. I'm *sure* that we can afford one heap
access here -- we currently *always* visit at least one heap tuple in
roughly the same way during each and every unique index tuple insert
(not just when the page fills).
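Concretely, the cap amounts to something like the following loop. This
is only an illustrative sketch, with invented names and constants
standing in for the limits described above (give up after one
fruitless heap page, never visit more than 3 total):

```c
#include <assert.h>

/* Invented constants standing in for the limits described above */
#define MAX_PAGES_BEFORE_GIVING_UP 1    /* fruitless pages tolerated */
#define MAX_PAGES_TOTAL            3    /* hard ceiling, success or not */

/*
 * Given per-page counts of index entries we could delete (best pages
 * first), return how many heap pages get visited under the cap.
 */
static int
pages_to_visit(const int ndeletable[], int npages)
{
    int     visited = 0;
    int     deleted = 0;

    while (visited < npages && visited < MAX_PAGES_TOTAL)
    {
        /* give up early when the page(s) so far yielded nothing */
        if (deleted == 0 && visited >= MAX_PAGES_BEFORE_GIVING_UP)
            break;
        deleted += ndeletable[visited];
        visited++;
    }
    return visited;
}
```

Under limits like these a totally fruitless attempt costs exactly one
heap page access, which (as noted above) is no more than what every
unique index tuple insert already pays.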

Posting list TIDs are not the only kind of TIDs that are marked
non-promising. We now also include TIDs from non-duplicate tuples. So
we're prepared to kill any of the TIDs on the page, though we only
really expect it to happen with the "promising" tuples (duplicates).
But why not be open to the possibility of killing some extra TIDs in
passing? We don't have to visit any extra heap pages to do so, so it's
practically free. Empirical evidence shows that this happens quite
often.
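The "free" extra kills fall out of the fact that visibility checking
happens per heap page: once a page is read, every leaf-page TID
pointing to it can be checked, promising or not. A hypothetical sketch
(the type and function here are mine, not the patch's):

```c
#include <assert.h>

/* Hypothetical flattened leaf-page TID, tagged with its status */
typedef struct LeafTid
{
    unsigned int blkno;     /* heap block the TID points to */
    unsigned int offnum;    /* offset within that heap block */
    int     promising;      /* 1 = duplicate ("promising"), 0 = not */
} LeafTid;

/*
 * Once heap block 'blkno' has been chosen (on the strength of its
 * promising TIDs) and read in, count how many TIDs we get to check
 * there in total -- the non-promising ones come along for free.
 */
static int
ntids_checkable(const LeafTid tids[], int ntids, unsigned int blkno)
{
    int     checked = 0;

    for (int i = 0; i < ntids; i++)
    {
        if (tids[i].blkno == blkno)
            checked++;      /* no extra heap page access needed */
    }
    return checked;
}
```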

Here's why this posting list strategy is a good one: we consider
posting list TIDs non-promising to reflect our belief that they
probably point to multiple distinct logical rows, or at least that
they didn't work out before -- simple trial and error suggests that
they aren't very "promising", whatever the true reason happens to be.
But why not "keep an open mind" about the TIDs not each being for
distinct logical rows? If it just so happens that the
posting list TIDs really were multiple versions of the same logical
row all along, then there is a reasonable chance that there'll be even
more versions on that same heap page later on. When that happens, and
we end up back on the same B-Tree leaf page to consider delete
deduplication once again, it's pretty likely that we'll naturally end
up looking at those later versions on the same heap page. At that
point we won't miss the opportunity to check the posting list TIDs in
passing. So we get to delete the posting list after all!

(If you're thinking "but we probably won't get lucky like that", then
consider that it doesn't have to happen on the next occasion when
delete deduplication happens on the same leaf page. Just at some point
in the future. This is due to the way we visit the heap pages that
look most promising first. It might take a couple of rounds of delete
deduplication to get back to the same heap page, but that's okay. The
simple heuristics proposed in the patch seem to have some really
interesting and useful consequences in practice. It's hard to quantify
how important these kinds of accidents of locality will be. I haven't
targeted this or that effect -- my heuristics are dead simple, and
based almost entirely on keeping costs down. You can think of it as
"increase the number of TIDs to increase our chances of success" if
you prefer.)

The life cycle of logical rows/heap tuples seems to naturally lead to
these kinds of opportunities. Obviously heapam is naturally very keen
on storing related versions on the same heap block already, without
any input from this patch. The patch is still evolving, and the
overall structure and interface certainly still needs work -- I've
focussed on the algorithm so far. I could really use some feedback on
high level direction, though. It's a relatively short patch, even with
all of my README commentary. But...it's not an easy concept to digest.

Note: I'm not sure whether it's necessary to provide specialized
qsort routines for the sorting we do on lists of TIDs, etc. I did some
experimentation with that, but it remains an open question. For now I
rely on the sort_template.h patch that Thomas Munro posted a little
while back, which is why it's included here.

--
Peter Geoghegan

Attachment Content-Type Size
v4-0001-Add-sort_template.h-for-making-fast-sort-function.patch application/x-patch 13.7 KB
v4-0002-Add-delete-deduplication-to-nbtree.patch application/x-patch 63.1 KB
