Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Юрий Соколов <funny(dot)falcon(at)gmail(dot)com>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2020-02-07 02:18:20
Message-ID: CAH2-Wzk42XV2Dg4tYGsLV2X0hf8PSLZ=9gOTOSnp+d6QB_nTyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 28, 2020 at 5:29 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> In my opinion, the patch is now pretty close to being committable.

Attached is v32, which is even closer to being committable.

> I do have two outstanding open items for the patch, though. These items
> are:
>
> * We still need infrastructure that marks B-Tree opclasses as safe for
> deduplication, to avoid things like the numeric display scale problem,
> collations that are unsafe for deduplication because they're
> nondeterministic, etc.

No progress on this item for v32, though. It's now my only open item
for this entire project. Getting very close.

> * Make VACUUM's WAL record more space efficient when it contains one
> or more "updates" to an existing posting list tuple.

* I've focussed on this item in v32 -- it has been closed out. v32
doesn't explicitly WAL-log post-update index tuples during vacuuming
of posting list tuples, making the WAL records a lot smaller in some
cases.

v32 represents the posting list TIDs that must be deleted instead. It
does this in the most WAL-space-efficient manner possible: by storing
an array of uint16 offsets for each "updated" posting list within
xl_btree_vacuum records -- each entry in each array is an offset to
remove (i.e. a TID that should not appear in the updated version of
the tuple). We use a new nbtdedup.c utility function for this,
_bt_update_posting(). The new function is similar to its neighbor
function, _bt_swap_posting(), which is the nbtdedup.c utility function
used during posting list splits. Just like _bt_swap_posting(), we call
_bt_update_posting() both during the initial action, and from the REDO
routine that replays that action.

Performing vacuuming of posting list tuples this way seems to matter
with larger databases that depend on deduplication to control bloat,
though I haven't taken the time to figure out exactly how much it
matters. I'm satisfied that this is worth having based on
microbenchmarks that measure WAL volume using pg_waldump. One
microbenchmark showed something like a 10x decrease in the size of all
xl_btree_vacuum records taken together compared to v31.

I'm pretty sure that v32 makes it all but impossible for deduplication
to write out more WAL than an equivalent case with deduplication
disabled (I'm excluding FPIs here, of course -- full_page_writes=on
cases will see significant benefits from reduced FPIs, simply by
having fewer index pages). The per-leaf-page WAL record header
accounts for a lot of the space overhead of xl_btree_vacuum records,
and we naturally reduce that overhead when deduplicating, so we can
now noticeably come out ahead when it comes to overall WAL volume. I
wouldn't say that reducing WAL volume (other than FPIs) is actually a
goal of this project, but it might end up happening anyway. Apparently
Microsoft Azure PostgreSQL uses full_page_writes=off, so not everyone
cares about the number of FPIs (everyone cares about raw record size,
though).

* Removed the GUC that controls the use of deduplication in this new
version, per discussion with Robert over on the "Enabling B-Tree
deduplication by default" thread.

Perhaps we can get by with only an index storage parameter. Let's
defer this until after the Postgres 13 beta period is over, and we get
feedback from testers.

* Turned the documentation on deduplication in the B-Tree internals
chapter into a more general discussion of the on-disk format that
covers deduplication.

Deduplication enhances this on-disk representation, and discussing it
outside that wider context always felt awkward to me. Having this kind
of discussion in the docs seems like a good idea anyway.

--
Peter Geoghegan

Attachment Content-Type Size
v32-0002-Teach-pageinspect-about-nbtree-posting-lists.patch application/x-patch 18.5 KB
v32-0003-DEBUG-Show-index-values-in-pageinspect.patch application/x-patch 4.3 KB
v32-0001-Add-deduplication-to-nbtree.patch application/x-patch 214.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2020-02-07 02:18:29 Re: Internal key management system
Previous Message Andres Freund 2020-02-07 02:08:08 Re: Prevent pg_basebackup running as root