Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-24 00:13:32
Message-ID: CAH2-Wz=SQ_vTESu5cE8cH6QVGDWp22hsOTA_sJ2rpmS9BT+ymQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I attach version 16. This revision merges your recent work on WAL
> logging with my recent work on simplifying _bt_dedup_one_page(). See
> my e-mail from earlier today for details.

I attach version 17. This version has changes that are focussed on
further polishing certain things, including fixing some minor bugs. It
seemed worth creating a new version for that. (I didn't get very far
with the space utilization stuff I talked about, so no changes there.)

Changes in v17:

* nbtsort.c now has a loop structure that closely matches
_bt_dedup_one_page() (I put this off in v16).

We now reuse most of the nbtinsert.c deduplication routines.

* Further simplification of btree_xlog_dedup() loop.

Recovery no longer relies on local variables to track the progress of
deduplication -- it uses dedup state (the state managed by
nbtinsert.c's dedup routines) instead. This is easier to follow.

* Reworked _bt_split() comments on posting list splits that coincide
with page splits.

* Fixed memory leaks in recovery code by creating a dedicated memory
context that gets reset regularly. The context is create in a new rmgr
"startup" callback I created for the B-Tree rmgr. We already do this
for both GIN and GiST.

More specifically, the REDO code calls MemoryContextReset() against
its dedicated memory context after every record is processed by REDO,
no matter what. The MemoryContextReset() call usually won't have to
actually free anything, but that's okay because the no-free case does
almost no work. I think that it makes sense to keep things as simple
as possible for memory management during recovery -- it's too easy for
a new memory leak to get introduced when a small change is made to the
nbtinsert.c routines later on.

* Optimize VACUUMing of posting lists: we now only allocate memory for
an array of still-live posting list items when the array will actually
be needed. It is only needed when there are tuples to remove from the
posting list, because only then do we need to create a replacement
posting list that lacks the heap TIDs that VACUUM needs to delete.

It seemed like a really good idea to not allocate any memory in the
common case where VACUUM doesn't need to change a posting list tuple
at all. ginVacuumItemPointers() has exactly the same optimization.

* Fixed an accounting bug in the output of VACCUM VERBOSE by changing
some code in nbtree.c.

The tuples_removed and num_index_tuples fields in
IndexBulkDeleteResult are reported as "index row versions" by VACUUM
VERBOSE. Everything but the index pages stat works at the level of
"index row versions", which should not be affected by the
deduplication patch. Of course, deduplication only changes the
physical representation of items in the index -- never the logical
contents of the index. This is what GIN does already.

Another infrastructure thing that the patch needs to handle to be committable:

We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter. It's not obvious how to do this in a way that's
easy to work with, though. Maybe we could do something like copy GIN's
GinGetUseFastUpdate() macro, but the situation with nbtree is actually
quite different. There are two questions for nbtree when it comes to
deduplication within an inde: 1) Does the user want to use
deduplication, because that will help performance?, and 2) Is it
safe/possible to use deduplication at all?

I think that we should probably stash this information (deduplication
is both possible and safe) in the metapage. Maybe we can copy it over
to our insertion scankey, just like the "heapkeyspace" field -- that
information also comes from the metapage (it's based on the nbtree
version). The "heapkeyspace" field is a bit ugly, so maybe we
shouldn't go further by adding something similar, but I don't see any
great alternative right now.

--
Peter Geoghegan

Attachment Content-Type Size
v17-0001-Add-deduplication-to-nbtree.patch application/octet-stream 141.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Finnerty, Jim 2019-09-24 01:19:41 Re: Unwanted expression simplification in PG12b2
Previous Message Tsunakawa, Takayuki 2019-09-23 23:59:07 RE: [bug fix??] Fishy code in tts_cirtual_copyslot()