Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-07-12 01:16:21
Message-ID: CAD21AoATLuGOk7mEXXfXXqr7cq+1vWG4bh+YKrrFgukpbyjGeQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 8, 2022 at 3:43 PM John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > I guess that the tree height is affected by where the garbage is, right?
> > For example, even if all garbage in the table is concentrated in
> > 0.5GB, if it lies between blocks 2^17 and 2^18, we use the first
> > byte of blockhi. If the table is larger than 128GB, the second byte of
> > blockhi could be used, depending on where the garbage is.
>
> Right.
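
To make the arithmetic above concrete, here is a rough sketch (not taken from any patch in this thread). It assumes the default 8kB block size and a tree whose height grows with the number of non-zero key bytes; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption for illustration: default 8kB heap blocks. */
#define SKETCH_BLOCK_SIZE 8192

/*
 * How many low-order bytes are needed to represent a block number.
 * Bytes 1-2 correspond to blocklo, bytes 3-4 to blockhi.
 */
static int
significant_bytes(uint32_t blkno)
{
    int n = 1;

    while ((blkno >>= 8) != 0)
        n++;
    return n;
}

/* Block number at which a table reaches the given size in bytes. */
static uint32_t
block_at_size(uint64_t bytes)
{
    return (uint32_t) (bytes / SKETCH_BLOCK_SIZE);
}
```

Under these assumptions, any block from 2^17 up to just below 2^18 needs three bytes (so the first byte of blockhi is in use), and the fourth byte only comes into play from block 2^24 onward, which is where a table crosses 128GB.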
>
> > Another variation of how to store TID would be that we use the block
> > number as a key and store a bitmap of the offset as a value. We can
> > use Bitmapset for example,
>
> I like the idea of using existing code to set/check a bitmap if it's
> convenient. But (in case that was implied here) I'd really like to
> stay away from variable-length values, which would require
> "Single-value leaves" (slow). I also think it's fine to treat the
> key/value as just bits, and not care where exactly they came from, as
> we've been talking about.
>
> > or an approach like Roaring bitmap.
>
> This would require two new data structures instead of one. That
> doesn't seem like a path to success.

Agreed.
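
For illustration only, one way to keep both key and value at a fixed 64 bits while still storing an offset bitmap per block is sketched below. The layout is an assumption, not something settled in this thread: the low 6 bits of the offset select a bit in the 64-bit value, and the remaining offset bits are folded into the key below the block number. It assumes offsets are below 512; all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical split of the offset number between value and key. */
#define SKETCH_OFFSET_BITS_IN_VALUE 6   /* 2^6 = 64 bits per value */
#define SKETCH_OFFSET_BITS_IN_KEY   3   /* assumes offsets < 512 */

/* Key: block number, followed by the upper bits of the offset. */
static uint64_t
tid_to_key(uint32_t blkno, uint16_t offnum)
{
    uint64_t key = (uint64_t) blkno << SKETCH_OFFSET_BITS_IN_KEY;

    return key | (uint64_t) (offnum >> SKETCH_OFFSET_BITS_IN_VALUE);
}

/* Value: a single bit within the 64-bit bitmap stored under that key. */
static uint64_t
tid_to_bit(uint16_t offnum)
{
    return UINT64_C(1) << (offnum & ((1 << SKETCH_OFFSET_BITS_IN_VALUE) - 1));
}
```

A caller would OR tid_to_bit() into whatever value is already stored under tid_to_key(), so the tree itself only ever sees fixed-size bits on both sides.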

>
> > I think that at this stage it's better to define the design first. For
> > example, key size and value size, and whether these sizes are fixed or
> > can be set to an arbitrary size.
>
> I don't think we need to start over. Andres' prototype had certain
> design decisions built in for the intended use case (although maybe
> not clearly documented as such). Subsequent patches in this thread
> substantially changed many design aspects. If there were any changes
> that made things wonderful for vacuum, it wasn't explained, but Andres
> did explain how some of these changes were not good for other uses.
> Going to fixed 64-bit keys and values should still allow many future
> applications, so let's do that if there's no reason not to.

I thought Andres pointed out that, given that we would store a BufferTag
(or part of one) in the key, fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to support keys wider than
64 bits, we need to take that into account.
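
As a rough size check, the sketch below uses a stand-in struct loosely modelled on what a buffer tag carries (tablespace, database, relation, fork, block). The struct and its field names are hypothetical, not the real definition; the point is only that five 4-byte fields cannot fit in one 64-bit key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer tag; not the real struct. */
typedef struct SketchBufferTag
{
    uint32_t spc_oid;       /* tablespace */
    uint32_t db_oid;        /* database */
    uint32_t rel_number;    /* relation file */
    int32_t  fork_number;   /* fork within the relation */
    uint32_t block_number;  /* block within the fork */
} SketchBufferTag;

/* Fails to compile if the stand-in tag ever fit into a 64-bit key. */
_Static_assert(sizeof(SketchBufferTag) > sizeof(uint64_t),
               "stand-in tag fits in 64 bits");

/* Number of 64-bit words a key would need to hold the whole tag. */
static size_t
sketch_tag_key_words(void)
{
    return (sizeof(SketchBufferTag) + sizeof(uint64_t) - 1) / sizeof(uint64_t);
}
```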

>
> > For value size, if we support
> > different value sizes specified by the user, we can either embed
> > multiple values in the leaf node (called Multi-value leaves in ART
> > paper)
>
> I don't think "Multi-value leaves" allow for variable-length values,
> FWIW. And now I see I also used this term wrong in my earlier review
> comment -- v3/4 don't actually use "multi-value leaves", but Andres'
> does (going by the multiple leaf types). From the paper: "Multi-value
> leaves: The values are stored in one of four different leaf node
> types, which mirror the structure of inner nodes, but contain values
> instead of pointers."

Right, but sorry, I meant that the user specifies an arbitrary but fixed
value length at creation time, as we do in dynahash.c.
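
A minimal sketch of that idea follows, using a flat array instead of a tree to keep it short. The API is entirely hypothetical; it only shows the value size being chosen once at creation (much as entrysize is for a dynahash table) and then used for every store and fetch. Error handling is deliberately thin.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container whose value size is fixed at creation. */
typedef struct sketch_store
{
    size_t value_size;  /* bytes per value, chosen by the caller */
    size_t nslots;      /* number of value slots */
    char  *values;      /* nslots * value_size bytes */
} sketch_store;

static sketch_store *
sketch_create(size_t value_size, size_t nslots)
{
    sketch_store *s = malloc(sizeof(sketch_store));

    if (s == NULL)
        return NULL;
    s->value_size = value_size;
    s->nslots = nslots;
    s->values = calloc(nslots, value_size);
    if (s->values == NULL)
    {
        free(s);
        return NULL;
    }
    return s;
}

/* Copy exactly value_size bytes into the slot; the caller never passes a length. */
static void
sketch_set(sketch_store *s, size_t slot, const void *value)
{
    memcpy(s->values + slot * s->value_size, value, s->value_size);
}

/* Return a pointer to the slot's value_size bytes. */
static void *
sketch_get(sketch_store *s, size_t slot)
{
    return s->values + slot * s->value_size;
}

static void
sketch_free(sketch_store *s)
{
    free(s->values);
    free(s);
}
```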

>
> (It seems v3/v4 could be called a variation of "Combined pointer/value
> slots: If values fit into pointers, no separate node types are
> necessary. Instead, each pointer storage location in an inner node can
> either store a pointer or a value." But without the advantage of
> variable length keys).

Agreed.
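
One way to read the fixed-key-length variant of this, as a sketch only and not a description of how v3/v4 actually do it: when every key has the same length, the level alone tells us whether a slot holds a child pointer or a value, so no separate leaf node type or tag bit is needed. The example below uses 16-bit keys (one inner level, then one level of values); all names are hypothetical and allocation failures are not handled.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct sketch_node sketch_node;

/* The same slot stores either a child pointer or a value. */
typedef union sketch_slot
{
    sketch_node *child;  /* meaning at the inner level */
    uint64_t     value;  /* meaning at the last level */
} sketch_slot;

struct sketch_node
{
    uint8_t     present[256];  /* 1 if the slot is in use */
    sketch_slot slots[256];
};

static sketch_node *
slot_new_node(void)
{
    return calloc(1, sizeof(sketch_node));
}

/* Insert or overwrite the value stored under a 16-bit key. */
static void
slot_set(sketch_node *root, uint16_t key, uint64_t value)
{
    uint8_t hi = (uint8_t) (key >> 8);
    uint8_t lo = (uint8_t) (key & 0xFF);
    sketch_node *last;

    if (!root->present[hi])
    {
        root->slots[hi].child = slot_new_node();
        root->present[hi] = 1;
    }
    last = root->slots[hi].child;
    last->slots[lo].value = value;
    last->present[lo] = 1;
}

/* Return 1 and fill *value if the key exists, else return 0. */
static int
slot_get(const sketch_node *root, uint16_t key, uint64_t *value)
{
    uint8_t hi = (uint8_t) (key >> 8);
    uint8_t lo = (uint8_t) (key & 0xFF);
    const sketch_node *last;

    if (!root->present[hi])
        return 0;
    last = root->slots[hi].child;
    if (!last->present[lo])
        return 0;
    *value = last->slots[lo].value;
    return 1;
}
```

With variable-length keys the level would no longer identify the slot's contents, which is where the paper's version needs some way to distinguish a pointer from a value.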

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
