Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-07 22:50:48
Message-ID: CAH2-WzncZxY_VP30A=YQFy--X3K39hrCc2a4Uks1Sk7rb4XXJg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
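
To make that shape concrete, here is a minimal self-contained C
sketch (batch_lookup and everything else here are invented names, not
the actual nbtree/vacuumlazy.c interfaces -- it just shows the
two-loop structure):

#include <stdbool.h>
#include <stddef.h>

typedef struct TID
{
    unsigned int    block;
    unsigned short  offset;
} TID;

/*
 * Stand-in for the batched callback into vacuumlazy.c: one call per
 * leaf page, setting dead[i] for each tids[i] that must be removed.
 */
extern void batch_lookup(const TID *tids, size_t ntids, bool *dead);

/*
 * Loop 1 (not shown) builds page_tids from the leaf page.  Returns
 * the number of deletable entries; deletable[] receives the positions
 * that loop 2 physically deletes in a single pass.
 */
static size_t
vacuum_leaf_page(const TID *page_tids, size_t ntids, size_t *deletable)
{
    bool        dead[1024];     /* assume ntids <= 1024 in this sketch */
    size_t      ndeletable = 0;
    size_t      i;

    batch_lookup(page_tids, ntids, dead);

    for (i = 0; i < ntids; i++)
    {
        if (dead[i])
            deletable[ndeletable++] = i;
    }

    return ndeletable;
}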

Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
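
In interface terms, I'm imagining something like this (rtbm_intersect
and RTbm are placeholder names -- not necessarily what rtbm.c actually
exposes):

typedef struct RTbm RTbm;       /* opaque; stand-in for the rtbm.c type */

/*
 * Intersect 'deadtuples' (all TIDs that lazy vacuum must remove) with
 * 'page' (all TIDs found on one leaf page).  On return, 'result'
 * holds exactly the TIDs that btvacuumpage() must physically delete.
 */
extern void rtbm_intersect(const RTbm *deadtuples, const RTbm *page,
                           RTbm *result);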

Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
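
The shape of that negative cache trick, as a self-contained sketch
(bloom_contains and full_lookup are made-up names):

#include <stdbool.h>

typedef struct BloomFilter BloomFilter;     /* opaque, invented */
typedef struct DeadTuples DeadTuples;       /* opaque, invented */
typedef struct TID TID;                     /* as in the earlier sketch */

extern bool bloom_contains(const BloomFilter *bf, const TID *tid);
extern bool full_lookup(const DeadTuples *dt, const TID *tid);

/*
 * Cheap filter check first; the expensive lookup runs only on a
 * possible hit.  A Bloom filter never gives false negatives, so the
 * "definitely not dead" answer is safe to trust.
 */
static bool
tid_is_dead(const BloomFilter *bf, const DeadTuples *dt, const TID *tid)
{
    if (!bloom_contains(bf, tid))
        return false;               /* common case: definitely not dead */

    return full_lookup(dt, tid);    /* rare case: might really be dead */
}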

If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
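
Roughly like this -- a lockstep merge over sorted chunk keys, where
whole chunks are skipped cheaply (again just a sketch, with invented
names and types):

#include <stddef.h>

typedef struct Chunk
{
    unsigned int    key;        /* high bits of the heap block number */
    const void     *container;  /* per-chunk TID container, elided */
} Chunk;

extern void intersect_containers(const Chunk *a, const Chunk *b);

/*
 * Lockstep merge over two sorted chunk arrays.  Chunks whose keys
 * appear on only one side are skipped in O(1), which is how large
 * groups of leaf page TIDs get eliminated without per-TID work.
 */
static void
merge_chunks(const Chunk *page, size_t npage,
             const Chunk *dead, size_t ndead)
{
    size_t      i = 0,
                j = 0;

    while (i < npage && j < ndead)
    {
        if (page[i].key < dead[j].key)
            i++;                /* nothing deletable in this chunk */
        else if (page[i].key > dead[j].key)
            j++;
        else
        {
            /* keys match: do the container-level intersection work */
            intersect_containers(&page[i], &dead[j]);
            i++;
            j++;
        }
    }
}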

When this merging process finds one TID that might really be
deletable, it's probably going to find many more than one -- better to
have that single cache miss take care of all of those TIDs together.
It also seems like the CPU could do some clever prefetching with this
approach: it could prefetch the TIDs whose initial chunk metadata is
insufficient to eliminate them early, since those are exactly the
groups that will contain many TIDs we actually need to delete. ISTM
that improving temporal locality through batching could matter a lot
here.
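
That's mostly about what the hardware prefetcher could pick up on, but
the same idea could also be made explicit in software. Reusing the
Chunk type from the merge sketch above (the prefetch distance is a
pure guess; __builtin_prefetch is the GCC/Clang builtin):

/*
 * Issue an explicit prefetch for a container a few chunks ahead of
 * the current merge position, so its cache miss overlaps with the
 * container-level work we're doing now.
 */
static inline void
prefetch_ahead(const Chunk *dead, size_t j, size_t ndead)
{
    enum { DIST = 4 };          /* lookahead distance: a pure guess */

    if (j + DIST < ndead)
        __builtin_prefetch(dead[j + DIST].container, 0 /* read */, 1);
}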

--
Peter Geoghegan
