Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-29 15:29:51
Message-ID: 0deb019c8d292ed335d26153a879b532@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Masahiko Sawada писал 2021-07-29 17:29:
> On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
> wrote:
>>
>> Masahiko Sawada писал 2021-07-29 12:11:
>> > On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres(at)anarazel(dot)de>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
>> >> > Apart from performance and memory usage points of view, we also need
>> >> > to consider the reusability of the code. When I started this thread, I
>> >> > thought the best data structure would be the one optimized for
>> >> > vacuum's dead tuple storage. However, if we can use a data structure
>> >> > that can also be used in general, we can use it also for other
>> >> > purposes. Moreover, if it's too optimized for the current TID system
>> >> > (32 bits block number, 16 bits offset number, maximum block/offset
>> >> > number, etc.) it may become a blocker for future changes.
>> >>
>> >> Indeed.
>> >>
>> >>
>> >> > In that sense, radix tree also seems good since it can also be used in
>> >> > gist vacuum as a replacement for intset, or a replacement for hash
>> >> > table for shared buffer as discussed before. Are there any other use
>> >> > cases?
>> >>
>> >> Yes, I think there are. Whenever there is some spatial locality it has
>> >> a
>> >> decent chance of winning over a hash table, and it will most of the
>> >> time
>> >> win over ordered datastructures like rbtrees (which perform very
>> >> poorly
>> >> due to the number of branches and pointer dispatches). There's plenty
>> >> hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
>> >> degree of locality, so I'd expect a few potential uses. When adding
>> >> "tree compression" (i.e. skip inner nodes that have a single incoming
>> >> &
>> >> outgoing node) radix trees even can deal quite performantly with
>> >> variable width keys.
>> >
>> > Good point.
>> >
>> >>
>> >> > On the other hand, I’m concerned that radix tree would be an
>> >> > over-engineering in terms of vacuum's dead tuples storage since the
>> >> > dead tuple storage is static data and requires only lookup operation,
>> >> > so if we want to use radix tree as dead tuple storage, I'd like to see
>> >> > further use cases.
>> >>
>> >> I don't think we should rely on the read-only-ness. It seems pretty
>> >> clear that we'd want parallel dead-tuple scans at a point not too far
>> >> into the future?
>> >
>> > Indeed. Given that the radix tree itself has other use cases, I have
>> > no concern about using radix tree for vacuum's dead tuples storage. It
>> > will be better to have one that can be generally used and has some
>> > optimizations that are helpful also for vacuum's use case, rather than
>> > having one that is very optimized only for vacuum's use case.
>>
>> Main portion of svtm that leads to memory saving is compression of
>> many
>> pages at once (CHUNK). It could be combined with radix as a storage
>> for
>> pointers to CHUNKs., bute
>>
>> For a moment I'm benchmarking IntegerSet replacement based on Trie
>> (HATM
>> like)
>> and CHUNK compression, therefore datastructure could be used for gist
>> vacuum as well.
>>
>> Since it is generic (allows to index all 64bit) it lacks of trick used
>> to speedup svtm. Still on 10x test it is faster than radix.

I've attached IntegerSet2 patch for pgtools repo and benchmark results.
Branch https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with couple of changes from commit
5055ef72d23482dd3e11ce
in that branch: 1) more often compress bitmap, but slower, 2) couple of
popcount tricks.

IntegerSet consists of trie index to CHUNKS. CHUNKS is compressed bitmap
of 2^15 (6+9) bits (almost like in SVTM, but for fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in
couple of tests and uses comparable in others).

IntegerSet2 is not always faster than radix. It is more like radix.
That it because both are generic prefix trees with comparable amount of
memory accesses. SVTM did the trick being not multilevel prefix tree,
but
just 1 level bitmap index to chunks.

I believe, trie part of IntegerSet could be replaced with radix.
Ie use radix as storage for pointers to CHUNKS.

> BTW, how does svtm work when we add two sets of dead tuple TIDs to one
> svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
> of the different offsets on the same block. The case I imagine is the
> idea discussed on this thread[1]. With this idea, we store the
> collected dead tuple TIDs somewhere and skip index vacuuming for some
> reason (index skipping optimization, failsafe mode, or interruptions
> etc.). Then, in the next lazy vacuum timing, we load the dead tuple
> TIDs and start to scan the heap. During the heap scan in the second
> lazy vacuum, it's possible that new dead tuples will be found on the
> pages that we have already stored in svtm during the first lazy
> vacuum. How can we efficiently update the chunk in the svtm?

If we store tidmap to disk, then it will be serialized. Since SVTM/
IntegerSet2 are ordered, they could be loaded in order. Then we
can just merge tuples in per page basis: deserialize page (or CHUNK),
put new tuples, store again. Since both scan (scan of serilized map
and scan of table) are in order, merging will be cheap enough.

SVTM and IntegerSet2 already works in "buffered" way on insertion.
(As well as IntegerSet that also does compression but in small parts).

regards,

Yura Sokolov
y(dot)sokolov(at)postgrespro(dot)ru
funny(dot)falcon(at)gmail(dot)com

Attachment Content-Type Size
0001-integerset2.patch text/x-diff 26.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Cramer 2021-07-29 15:36:19 Re: pg_upgrade does not upgrade pg_stat_statements properly
Previous Message David G. Johnston 2021-07-29 15:28:12 Re: pg_upgrade does not upgrade pg_stat_statements properly