[PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-07 11:46:38
Message-ID: CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA@mail.gmail.com
Lists: pgsql-hackers

Hi all,

Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() accounts for a large part of it. The
attached flame graph shows a profile of a vacuum on a table with one
index, 80 million live rows, and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.

lazy_tid_reaped() is essentially an existence check; for every index
tuple, it checks whether the heap TID it points to exists in the set
of dead tuple TIDs. The maximum size of the dead tuple set is limited
by maintenance_work_mem, and if the upper limit is reached, the heap
scan is suspended, index vacuum and heap vacuum are performed, and
then the heap scan is resumed. Therefore, in terms of the performance
of index vacuuming, there are two important factors: the performance
of looking up TIDs in the set of dead tuples, and the memory usage of
that set. The former is obvious, whereas the latter affects the number
of index vacuum passes. In many index AMs, index vacuuming (i.e.,
ambulkdelete) performs a full scan of the index, so it is important
for performance to avoid executing index vacuuming more than once
during a lazy vacuum.

Currently, the TIDs of dead tuples are stored in an array that is
allocated in one chunk at the start of lazy vacuum, and TID lookup
uses bsearch(). This has the following challenges and limitations:

1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2[1].

2. The whole memory space is allocated at once.

3. Lookup performance is slow (O(log N)).
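
For reference, the current lookup is essentially a bsearch() over the
sorted TID array, roughly like the following (a simplified sketch of
the code in vacuumlazy.c; struct and field names are from memory and
may differ slightly between versions):

static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVDeadTuples *dead_tuples = (LVDeadTuples *) state;
    ItemPointer   res;

    /*
     * Binary search over the sorted array of dead tuple TIDs.
     * vac_cmp_itemptr() is the ItemPointer comparator defined
     * alongside this function in vacuumlazy.c.
     */
    res = (ItemPointer) bsearch((void *) itemptr,
                                (void *) dead_tuples->itemptrs,
                                dead_tuples->num_tuples,
                                sizeof(ItemPointerData),
                                vac_cmp_itemptr);

    return (res != NULL);
}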

I’ve done some experiments in this area and would like to share the
results and discuss ideas.

Problems and Solutions
======================

Firstly, I considered using existing data structures:
IntegerSet (src/backend/lib/integerset.c) and
TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
each solves only one of points 2 and 3. IntegerSet uses less memory
thanks to Simple-8b encoding but is slow at lookup, still O(log N),
since it's a tree structure. On the other hand, TIDBitmap has good
lookup performance, O(1), but could use unnecessarily large amounts of
memory in some cases since it always allocates a bitmap large enough
to store all possible offsets. With 8kB blocks, the maximum number of
line pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other metadata.
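
The 40-byte figure comes from rounding the per-page bitmap up to whole
bitmap words; a back-of-the-envelope version of the arithmetic
(assuming 64-bit bitmapwords, as tidbitmap.c uses on most platforms)
looks like this:

/* Rough arithmetic behind the 46-bytes-per-block figure for TIDBitmap. */
#include <stdio.h>

#define MAX_HEAP_TUPLES_PER_PAGE 291  /* MaxHeapTuplesPerPage for 8kB blocks */
#define BITS_PER_BITMAPWORD      64   /* assuming 64-bit bitmapwords */

int
main(void)
{
    /* round 291 bits up to whole bitmapwords: 5 words * 8 bytes = 40 bytes */
    int words = (MAX_HEAP_TUPLES_PER_PAGE + BITS_PER_BITMAPWORD - 1) /
                BITS_PER_BITMAPWORD;
    int bitmap_bytes = words * (BITS_PER_BITMAPWORD / 8);

    /* plus roughly 6 bytes of per-entry metadata (block number, flags) */
    printf("bitmap: %d bytes, per-block total: ~%d bytes\n",
           bitmap_bytes, bitmap_bytes + 6);
    return 0;
}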

So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The
authors provide an implementation of Roaring Bitmap[3] (Apache 2.0
license), but I've implemented this idea from scratch because we need
to integrate it with Dynamic Shared Memory/Areas to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer in
total, whereas that implementation supports only 4-byte integers.
Also, when it comes to vacuum, we don't need to compute the
intersection, union, or difference of sets; we need only an existence
check.

The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area; the hash table has one entry per
block, and each block entry allocates its own memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of the offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.
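
To illustrate the layout, a per-block hash entry for such a structure
might look roughly like the following. This is only a sketch of the
idea described above; all type and field names here are hypothetical
and do not come from the actual patch:

/* Container types for one block's dead offset numbers (hypothetical). */
typedef enum RTbmContainerType
{
    RTBM_CONTAINER_ARRAY,   /* sorted array of 2-byte offset numbers */
    RTBM_CONTAINER_BITMAP,  /* uncompressed bitmap of offset numbers */
    RTBM_CONTAINER_RUN      /* pairs of (start offset, run length) */
} RTbmContainerType;

/* One hash table entry per heap block. */
typedef struct RTbmBlockEntry
{
    BlockNumber       blkno;    /* hash key */
    RTbmContainerType type;     /* which representation is used */
    uint32            offset;   /* byte offset into the container area */
    uint16            len;      /* container length in bytes */
} RTbmBlockEntry;

/* The whole structure: a hash table plus a growable byte array. */
typedef struct RTbm
{
    HTAB   *blocks;           /* BlockNumber -> RTbmBlockEntry */
    char   *containers;       /* container area, enlarged as needed */
    Size    containers_used;  /* bytes used in the container area */
    Size    containers_size;  /* allocated size of the container area */
} RTbm;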

For example, if there are two dead tuples at offsets 1 and 150, it
uses the array container, which holds an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are 20 consecutive dead tuples from offset 1 to 20, it
uses the run container, which holds an array of pairs of 2-byte
integers: the first value in each pair is a starting offset number and
the second is the run length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container, which holds an uncompressed bitmap, using 13 bytes. In
addition, each block needs another 16 bytes for its hash table entry.

The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(log N) or O(N); but since the number
of elements in those two containers should not be large, this should
not be a problem.
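
As an illustration of the lookup path, an existence check against a
structure like the one sketched above could look roughly like this
(again a hypothetical sketch, not the actual patch):

static bool
rtbm_lookup(RTbm *rtbm, ItemPointer tid)
{
    BlockNumber     blkno = ItemPointerGetBlockNumber(tid);
    OffsetNumber    off = ItemPointerGetOffsetNumber(tid);
    RTbmBlockEntry *entry;
    char           *container;

    entry = (RTbmBlockEntry *) hash_search(rtbm->blocks, &blkno,
                                           HASH_FIND, NULL);
    if (entry == NULL)
        return false;           /* no dead tuples on this block */

    container = rtbm->containers + entry->offset;

    switch (entry->type)
    {
        case RTBM_CONTAINER_BITMAP:
            /* O(1): test the bit for this (1-based) offset number */
            return (container[(off - 1) / 8] & (1 << ((off - 1) % 8))) != 0;

        case RTBM_CONTAINER_ARRAY:
            {
                /* O(log N): binary search over sorted 2-byte offsets */
                uint16 *offs = (uint16 *) container;
                int     lo = 0;
                int     hi = entry->len / sizeof(uint16) - 1;

                while (lo <= hi)
                {
                    int mid = (lo + hi) / 2;

                    if (offs[mid] == off)
                        return true;
                    else if (offs[mid] < off)
                        lo = mid + 1;
                    else
                        hi = mid - 1;
                }
                return false;
            }

        case RTBM_CONTAINER_RUN:
            {
                /* O(N): scan the (start offset, length) pairs */
                uint16 *runs = (uint16 *) container;
                int     npairs = entry->len / (2 * sizeof(uint16));

                for (int i = 0; i < npairs; i++)
                {
                    uint16 start = runs[2 * i];
                    uint16 length = runs[2 * i + 1];

                    if (off >= start && off < start + length)
                        return true;
                }
                return false;
            }
    }

    return false;               /* unreachable */
}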

Evaluation
========

Before implementing this idea and integrating it with the lazy vacuum
code, I implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4]. It has functions for generating TIDs
for both index tuples and dead tuples, loading the dead tuples into
the data structure, and simulating lazy_tid_reaped() using those
virtual index tuples and dead tuples. The code still lacks many
features such as iteration and DSM/DSA support, but it makes testing
the data structures easier.

FYI, I've confirmed the validity of this tool: when I ran a vacuum on
a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of the function
calls, it seems to simulate the real workload reasonably well.

I've evaluated the lookup performance and memory footprint of four
types of data structures: array, integerset (intset), tidbitmap (tbm),
and roaring tidbitmap (rtbm), while changing the distribution of dead
tuples within blocks. Since tbm doesn't have a function for an
existence check, I added one and allocated enough memory to make sure
that tbm never becomes lossy during the evaluation. In all test cases,
I simulated a table with 1,000,000 blocks where every block has at
least one dead tuple. The benchmark scenario is that, for each virtual
heap tuple, we check whether its TID is in the dead tuple storage.
Here are the results, with execution time in milliseconds and memory
usage in bytes:

* Test-case 1 (10 dead tuples per block, spaced 20 offsets apart)

An array container is selected in this test case, using 20 bytes for each block.

         Execution Time (ms)   Memory Usage (bytes)
array              14,140.91             60,008,248
intset              9,350.08             50,339,840
tbm                 1,299.62            100,671,544
rtbm                1,892.52             58,744,944

* Test-case 2 (10 consecutive dead tuples from offset 1)

A bitmap container is selected in this test case, using 2 bytes for each block.

         Execution Time (ms)   Memory Usage (bytes)
array               1,056.60             60,008,248
intset                650.85             50,339,840
tbm                   194.61            100,671,544
rtbm                  154.57             27,287,664

* Test-case 3 (2 dead tuples per block, at offsets 1 and 100)

An array container is selected in this test case, using 4 bytes for
each block. Note that the plain 'array' data structure (not the array
container of rtbm) uses only 12 bytes per block, so given the
additional per-block hash table entry in 'rtbm', the 'array' data
structure uses less memory here.

         Execution Time (ms)   Memory Usage (bytes)
array               6,054.22             12,008,248
intset              4,203.41             16,785,408
tbm                   759.17            100,671,544
rtbm                  750.08             29,384,816

* Test-case 4 (100 consecutive dead tuples from offset 1)

A run container is selected in this test case, using 4 bytes for each block.

         Execution Time (ms)   Memory Usage (bytes)
array               8,883.03            600,008,248
intset              7,358.23            100,671,488
tbm                   758.81            100,671,544
rtbm                  764.33             29,384,816

Overall, 'rtbm' has much better lookup performance and good memory
usage, especially when there are relatively many dead tuples. However,
in some cases 'intset' and 'array' use less memory.

Feedback is very welcome. Thank you for reading the email through to the end.

Regards,

[1] https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com
[2] http://roaringbitmap.org/
[3] https://github.com/RoaringBitmap/CRoaring
[4] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment: vacuum.svg (image/svg+xml, 47.1 KB)
