Re: Make relfile tombstone files conditional on WAL level

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Make relfile tombstone files conditional on WAL level
Date: 2021-08-03 15:22:31
Message-ID: CA+TgmoZAa8fEWpdMJtw87gXbGQ_zYVd=5Rx=ys7nr4n8OAOwRQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. It's not the most beautiful thing in the world, but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes. It'd give us a buffertag consisting of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.
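
To make the size difference concrete, here is a rough sketch of the two
layouts described above. The struct and field names are hypothetical, and
Oid is a stand-in typedef, so none of this is the actual PostgreSQL
definition. It only shows that five 4-byte fields take 20 bytes, while the
proposed three-field tag fits in 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's Oid, assumed here to be 4 bytes. */
typedef uint32_t Oid;

/* Layout as listed in the quoted text: five 4-byte fields. */
typedef struct CurrentBufferTagSketch
{
    Oid      spcOid;        /* tablespace oid */
    Oid      dbOid;         /* database oid */
    Oid      relfilenode;   /* "relfilenode oid" */
    int32_t  forkNum;       /* fork number */
    uint32_t blockNum;      /* block number */
} CurrentBufferTagSketch;

/*
 * Proposed layout: tablespace and fork number move into the shared table,
 * which maps them to an 8-byte relfilenode that is unique per database.
 * The 8-byte field is placed first so that no alignment padding is needed
 * on common ABIs; the quoted list puts the database oid first, which would
 * typically pad the struct out to 24 bytes.
 */
typedef struct ProposedBufferTagSketch
{
    uint64_t relfilenode;   /* 8 byte "new relfilenode" */
    Oid      dbOid;         /* database oid */
    uint32_t blockNum;      /* block number */
} ProposedBufferTagSketch;

/* Compile-time checks of the sizes discussed above. */
_Static_assert(sizeof(CurrentBufferTagSketch) == 20,
               "five 4-byte fields");
_Static_assert(sizeof(ProposedBufferTagSketch) == 16,
               "8 + 4 + 4 bytes, no padding");
```

The field order in the second struct is my own choice for the sketch, not
something anyone has proposed.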

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE trickier - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.
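
A toy sketch of the idea, to show the shape of it: resolve the database
OID once, cache the resulting pointer, and then do lookups that never look
at the database OID again. Small fixed-size arrays stand in for the radix
tree here, there is no locking, and every name below is made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

#define MAX_DATABASES 4         /* arbitrary sizes for the sketch */
#define SLOTS_PER_DB  8

/* Second level: all mappings for one database. */
typedef struct DbLevel
{
    int      in_use;
    Oid      dbOid;
    int      nentries;
    uint64_t relfilenode[SLOTS_PER_DB];
    uint32_t blockNum[SLOTS_PER_DB];
    int      buffer_id[SLOTS_PER_DB];
} DbLevel;

/* Top level: one slot per database. */
typedef struct BufferMapping
{
    DbLevel dbs[MAX_DATABASES];
} BufferMapping;

/*
 * Find (or create) the second level for a database. A backend would call
 * this once and cache the returned pointer. Returns NULL if no slot is free.
 */
DbLevel *
resolve_database(BufferMapping *map, Oid dbOid)
{
    DbLevel *free_slot = NULL;

    for (int i = 0; i < MAX_DATABASES; i++)
    {
        DbLevel *db = &map->dbs[i];

        if (db->in_use && db->dbOid == dbOid)
            return db;
        if (!db->in_use && free_slot == NULL)
            free_slot = db;
    }
    if (free_slot != NULL)
    {
        free_slot->in_use = 1;
        free_slot->dbOid = dbOid;
        free_slot->nentries = 0;
    }
    return free_slot;
}

/* Add a mapping to one database's level; returns 0, or -1 if it is full. */
int
db_insert(DbLevel *db, uint64_t relfilenode, uint32_t blockNum, int buffer_id)
{
    if (db->nentries >= SLOTS_PER_DB)
        return -1;
    db->relfilenode[db->nentries] = relfilenode;
    db->blockNum[db->nentries] = blockNum;
    db->buffer_id[db->nentries] = buffer_id;
    db->nentries++;
    return 0;
}

/* Lookup through the cached pointer: no database OID comparison needed. */
int
db_lookup(const DbLevel *db, uint64_t relfilenode, uint32_t blockNum)
{
    for (int i = 0; i < db->nentries; i++)
    {
        if (db->relfilenode[i] == relfilenode && db->blockNum[i] == blockNum)
            return db->buffer_id[i];
    }
    return -1;                  /* not mapped */
}
```

Two backends connected to different databases would hold different DbLevel
pointers and so touch disjoint memory on every lookup, which is the
contention benefit I'm speculating about.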

--
Robert Haas
EDB: http://www.enterprisedb.com
