Re: Make relfile tombstone files conditional on WAL level

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Make relfile tombstone files conditional on WAL level
Date: 2021-08-02 22:38:19
Message-ID: 20210802223819.lpo4z5kigxyniytd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty low hanging fruits.

> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.

> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.

> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes. It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-08-02 22:42:04 Re: make MaxBackends available in _PG_init
Previous Message Bossart, Nathan 2021-08-02 22:35:13 Re: make MaxBackends available in _PG_init