Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-25 16:07:18
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I've dreamed to write more compact structure for vacuum for three
years, but life didn't give me a time to.

Let me join to friendly competition.

I've bet on HATM approach: popcount-ing bitmaps for non-empty elements.

- 32 consecutive pages are stored together in a single sparse array
(called "chunks").
Chunk contains:
- its number,
- 4 byte bitmap of non-empty pages,
- array of non-empty page headers 2 byte each.
Page header contains offset of page's bitmap in bitmaps container.
(Except if there is just one dead tuple in a page. Then it is
written into header itself).
- container of concatenated bitmaps.

Ie, page metadata overhead varies from 2.4byte (32pages in single
to 18byte (1 page in single chunk) per page.

- If page's bitmap is sparse ie contains a lot of "all-zero" bytes,
it is compressed by removing zero byte and indexing with two-level
bitmap index.
Two-level index - zero bytes in first level are removed using
second level. It is mostly done for 32kb pages, but let it stay since
it is almost free.

- If page's bitmaps contains a lot of "all-one" bytes, it is inverted
and then encoded as sparse.

- Chunks are allocated with custom "allocator" that has no
per-allocation overhead. It is possible because there is no need
to perform "free": allocator is freed as whole at once.

- Array of pointers to chunks is also bitmap indexed. It saves cpu time
when not every 32 consecutive pages has at least one dead tuple.
But consumes time otherwise. Therefore additional optimization is
to quick skip lookup for first non-empty run of chunks.
(Ahhh, I believe this explanation is awful).

Andres Freund wrote 2021-07-20 02:49:
> Hi,
> On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
>> BTW is the implementation of the radix tree approach available
>> somewhere? If so I'd like to experiment with that too.
>> >
>> > I have toyed with implementing adaptively large radix nodes like
>> > proposed in - but haven't
>> > gotten it quite working.
>> That seems promising approach.
> I've since implemented some, but not all of the ideas of that paper
> (adaptive node sizes, but not the tree compression pieces).
> E.g. for
> select prepare(
> 1000000, -- max block
> 20, -- # of dead tuples per page
> 10, -- dead tuples interval within a page
> 1 -- page inteval
> );
> attach size shuffled ordered
> array 69 ms 120 MB 84.87 s 8.66 s
> intset 173 ms 65 MB 68.82 s 11.75 s
> rtbm 201 ms 67 MB 11.54 s 1.35 s
> tbm 232 ms 100 MB 8.33 s 1.26 s
> vtbm 162 ms 58 MB 10.01 s 1.22 s
> radix 88 ms 42 MB 11.49 s 1.67 s
> and for
> select prepare(
> 1000000, -- max block
> 10, -- # of dead tuples per page
> 1, -- dead tuples interval within a page
> 1 -- page inteval
> );
> attach size shuffled ordered
> array 24 ms 60MB 3.74s 1.02 s
> intset 97 ms 49MB 3.14s 0.75 s
> rtbm 138 ms 36MB 0.41s 0.14 s
> tbm 198 ms 101MB 0.41s 0.14 s
> vtbm 118 ms 27MB 0.39s 0.12 s
> radix 33 ms 10MB 0.28s 0.10 s
> (this is an almost unfairly good case for radix)
> Running out of time to format the results of the other testcases before
> I have to run, unfortunately. radix uses 42MB both in test case 3 and
> 4.

My results (Ubuntu 20.04 Intel Core i7-1165G7):


select prepare(1000000, 10, 20, 1); -- original

attach size shuffled
array 29ms 60MB 93.99s
intset 93ms 49MB 80.94s
rtbm 171ms 67MB 14.05s
tbm 238ms 100MB 8.36s
vtbm 148ms 59MB 9.12s
radix 100ms 42MB 11.81s
svtm 75ms 29MB 8.90s

select prepare(1000000, 20, 10, 1); -- Andres's variant

attach size shuffled
array 61ms 120MB 111.91s
intset 163ms 66MB 85.00s
rtbm 236ms 67MB 10.72s
tbm 290ms 100MB 8.40s
vtbm 190ms 59MB 9.28s
radix 117ms 42MB 12.00s
svtm 98ms 29MB 8.77s


select prepare(1000000, 10, 1, 1);

attach size shuffled
array 31ms 60MB 4.68s
intset 97ms 49MB 4.03s
rtbm 163ms 36MB 0.42s
tbm 240ms 100MB 0.42s
vtbm 136ms 27MB 0.36s
radix 60ms 10MB 0.72s
svtm 39ms 6MB 0.19s

(Bad radix result probably due to smaller cache in notebook's CPU ?)


select prepare(1000000, 2, 100, 1);

attach size shuffled
array 6ms 12MB 53.42s
intset 23ms 16MB 54.99s
rtbm 115ms 38MB 8.19s
tbm 186ms 100MB 8.37s
vtbm 105ms 59MB 9.08s
radix 64ms 42MB 10.41s
svtm 73ms 10MB 7.49s


select prepare(1000000, 100, 1, 1);

attach size shuffled
array 304ms 600MB 75.12s
intset 775ms 98MB 47.49s
rtbm 356ms 38MB 4.11s
tbm 539ms 100MB 4.20s
vtbm 493ms 42MB 4.44s
radix 263ms 42MB 6.05s
svtm 360ms 8MB 3.49s

Therefore Specialized Vaccum Tid Map always consumes least memory amount
and usually faster.

(I've applied Andres's patch for slab allocator before testing)

Attached patch is against 6753911a444e12e4b55 commit of your pgtools
applied Andres's patches for radix method.

I've also pushed it to github:

Yura Sokolov

Attachment Content-Type Size
0001-svtm-specialized-vacuum-tid-map.patch text/x-diff 22.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-07-25 16:28:04 Re: Removing "long int"-related limit on hash table sizes
Previous Message Tom Lane 2021-07-25 16:03:25 Re: Planning counters in pg_stat_statements (using pgss_store)