Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-07-08 02:09:44
Message-ID: CAD21AoB+M2S4RWHF0ofizq64nJdD_GKUtRHSwAnSGq5tpZnH_w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 5, 2022 at 5:49 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > > Looking at the node stats, and then your benchmark code, I think key
> > > construction is a major influence, maybe more than node type. The
> > > key/value scheme tested now makes sense:
> > >
> > > blockhi || blocklo || 9 bits of item offset
> > >
> > > (with the leaf nodes containing a bit map of the lowest few bits of
> > > this whole thing)
> > >
> > > We want the lower fanout nodes at the top of the tree and higher
> > > fanout ones at the bottom.
> >
> > So more inner nodes can fit in CPU cache, right?
>
> My thinking is, on average, there will be more dense space utilization
> in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
> cache, since with my idea a search might have to visit more nodes to
> get the common negative result (indexed tid not found in vacuum's
> list).
>
> > > Note some consequences: If the table has enough columns such that much
> > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > > dense case the nodes above the leaves will have lower fanout (maybe
> > > they will fit in a node32). Also, the bitmap values in the leaves will
> > > be more empty. In other words, many tables in the wild *resemble* the
> > > sparse case a bit, even if truly all tuples on the page are dead.
> > >
> > > Note also that the dense case in the benchmark above has ~4500 times
> > > more keys than the sparse case, and uses about ~1000 times more
> > > memory. But the runtime is only 2-3 times longer. That's interesting
> > > to me.
> > >
> > > To optimize for the sparse case, it seems to me that the key/value would be
> > >
> > > blockhi || 9 bits of item offset || blocklo
> > >
> > > I believe that would make the leaf nodes more dense, with fewer inner
> > > nodes, and could drastically speed up the sparse case, and maybe many
> > > realistic dense cases.
> >
> > Does it have an effect on the number of inner nodes?
> >
> > > I'm curious to hear your thoughts.
> >
> > Thank you for your analysis. It's worth trying. We use 9 bits for item
> > offset but most pages don't use all bits in practice. So probably it
> > might be better to move the most significant bit of item offset to the
> > left of blockhi. Or more simply:
> >
> > 9 bits of item offset || blockhi || blocklo
>
> A concern here is most tids won't use many bits in blockhi either,
> most often far fewer, so this would make the tree higher, I think.
> Each value of blockhi represents 0.5GB of heap (32TB max). Even with
> very large tables I'm guessing most pages of interest to vacuum are
> concentrated in a few of these 0.5GB "segments".

Right.

I guess the tree height is also affected by where the garbage is,
right? For example, even if all garbage in the table is concentrated
within 0.5GB, if it lies between blocks 2^17 and 2^18 we use the
first byte of blockhi. If the table is larger than 128GB, the second
byte of blockhi could come into use as well, depending on where the
garbage is.
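
To make sure we're talking about the same encoding, here is a minimal
sketch of the "blockhi || blocklo || 9 bits of item offset" layout
with a 64-bit key (the function name and constants are mine, not from
any patch):

#include <stdint.h>

#define OFFSET_BITS 9   /* 9 bits are enough for MaxHeapTuplesPerPage */

static inline uint64_t
encode_tid_key(uint32_t blkno, uint16_t offnum)
{
    uint32_t blockhi = blkno >> 16;     /* upper 16 bits of block number */
    uint32_t blocklo = blkno & 0xFFFF;  /* lower 16 bits of block number */

    return ((uint64_t) blockhi << (16 + OFFSET_BITS)) |
           ((uint64_t) blocklo << OFFSET_BITS) |
           (uint64_t) (offnum & ((1 << OFFSET_BITS) - 1));
}

With this encoding, blocks 2^17..2^18 produce blockhi values 2..3, so
only the low byte of blockhi is non-zero; once the block number
reaches 2^24 (128GB at 8KB pages), blockhi reaches 256 and its second
byte comes into use.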

Another variation on how to store TIDs would be to use the block
number as the key and store a bitmap of the offsets as the value. We
could use Bitmapset, for example, or an approach like Roaring bitmaps.
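
For illustration, a self-contained sketch of that variation, using a
plain word array in place of Bitmapset so it compiles outside the
backend (all names here are made up):

#include <stdint.h>
#include <stdbool.h>

#define MAX_OFFSET   512                 /* 9 bits of item offset */
#define BITMAP_WORDS (MAX_OFFSET / 64)

typedef struct BlockEntry
{
    uint32_t blkno;                      /* the tree key */
    uint64_t offbitmap[BITMAP_WORDS];    /* one bit per item offset */
} BlockEntry;

static inline void
entry_set_offset(BlockEntry *entry, uint16_t offnum)
{
    entry->offbitmap[offnum / 64] |= UINT64_C(1) << (offnum % 64);
}

static inline bool
entry_test_offset(const BlockEntry *entry, uint16_t offnum)
{
    return (entry->offbitmap[offnum / 64] >> (offnum % 64)) & 1;
}

An approach like Roaring bitmaps would additionally switch each
container between a sorted array of offsets and a bitmap depending on
how many bits are set.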

I think that at this stage it's better to settle the design first:
for example, what are the key size and value size, and are they fixed
or can they be set to arbitrary sizes? Given the buffer mapping use
case, we would need a wider key to store RelFileNode, ForkNumber, and
BlockNumber. On the other hand, limiting the key to a 64-bit integer
keeps the logic simple, and it could possibly still serve the buffer
mapping case by using a tree of trees. As for the value size, if we
support different value sizes specified by the user, we can either
embed multiple values in the leaf node (called multi-value leaves in
the ART paper) or introduce a leaf node that stores a single value
(called single-value leaves).
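
To illustrate the distinction, here is a rough sketch of the two leaf
strategies described in the ART paper; these struct layouts are only
illustrative, not a proposal:

#include <stdint.h>

/*
 * Multi-value leaves: fixed-size values are embedded directly in the
 * last-level node, parallel to the key chunks, so no separate leaf
 * allocation is needed.
 */
typedef struct MultiValueLeaf32
{
    uint8_t  nkeys;
    uint8_t  chunks[32];    /* last key byte of each entry */
    uint64_t values[32];    /* fixed-size values stored inline */
} MultiValueLeaf32;

/*
 * Single-value leaves: each value lives in its own leaf node pointed
 * to by the inner nodes, which allows arbitrary value sizes at the
 * cost of an extra pointer chase.
 */
typedef struct SingleValueLeaf
{
    uint32_t value_len;     /* length of the value that follows */
    char     value[];       /* variable-size value */
} SingleValueLeaf;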

> And it's possible path compression would change the tradeoffs here.

Agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
