Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-05-08 10:23:23
Message-ID: CAFBsxsGuBoa5rrHtKn4ajYN-OXK8bJRN2gpmZk60fkQDerzotw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 7, 2023 at 4:55 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
wrote:

> - Fixed-size PagetableEntry's are pretty large, but the tid compression
scheme used in this thread (in addition to being complex) is not a great
fit for tidbitmap because it makes it more difficult to track per-block
metadata (see also next point). With the "combined pointer-value slots"
technique, if a page's max tid offset is 63 or less, the offsets can be
stored directly in the pointer for the exact case. The lowest bit can tag
to indicate a pointer to a single-value leaf. That would complicate
operations like union/intersection and tracking "needs recheck", but it
would reduce memory use and node-traversal in common cases.

[just getting some thoughts out there before I have something concrete]

Thinking some more, this needn't be complicated at all. We'd just need to
reserve some bits of a bitmapword for the tag, as well as flags for
"ischunk" and "recheck". The other bits can be used for offsets.
Getting/storing the offsets basically amounts to adjusting the shift by a
constant. That way, this "embeddable PTE" could serve as both "PTE embedded
in a node pointer" and also the first member of a full PTE. A full PTE is
now just an array of embedded PTEs, except only the first one has the flags
we need. That reduces the number of places that have to be different.
Storing any set of offsets all less than ~60 would save
allocation/traversal in a large number of real cases. Furthermore, that
would reduce a full PTE to 40 bytes because there would be no padding.

This all assumes the key (block number) is no longer stored in the PTE,
whether embedded or not. That would mean this technique:

> - With lazy expansion and single-value leaves, the root of a radix tree
can point to a single leaf. That might get rid of the need to track
TBMStatus, since setting a single-leaf tree should be cheap.

...is not a good trade off because it requires each leaf to have the key,
and would thus reduce the utility of embedded leaves. We just need to make
sure storing a single value is not costly, and I suspect it's not.
(Currently the overhead avoided is allocating and zeroing a few kilobytes
for a hash table). If it is not, then we don't need a special case in
tidbitmap, which would be a great simplification. If it is, there are other
ways to mitigate.

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-05-08 10:28:54 Re: Support logical replication of DDLs
Previous Message Daniel Gustafsson 2023-05-08 08:39:42 Re: evtcache: EventTriggerCache vs Event Trigger Cache