Re: problems with making relfilenodes 56-bits

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, David Rowley <dgrowleyml(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-10-13 22:53:48
Message-ID: CAEze2WihN2NCCFpUYtFRS0Bs5eV0KVYmbkxiH7oBdPJ32McPbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 12 Oct 2022 at 23:13, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:
> > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > I light dusted off my old varint implementation from [1] and converted the
> > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> > > isn't meant as a serious patch, but an experiment to see if this is a path
> > > worth pursuing.
> > >
> > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> > > (for increased reproducability) shows a decent saving:
> > >
> > > master: 241106544 - 230 MB
> > > varint: 227858640 - 217 MB
> >
> > I think a signficant part of this improvement comes from the premise
> > of starting with a fresh database. tablespace OID will indeed most
> > likely be low, but database OID may very well be linearly distributed
> > if concurrent workloads in the cluster include updating (potentially
> > unlogged) TOASTed columns and the databases are not created in one
> > "big bang" but over the lifetime of the cluster. In that case DBOID
> > will consume 5B for a significant fraction of databases (anything with
> > OID >=2^28).
> >
> > My point being: I don't think that we should have different WAL
> > performance in databases which is dependent on which OID was assigned
> > to that database.
>
> To me this is raising the bar to an absurd level. Some minor space usage
> increase after oid wraparound and for very large block numbers isn't a huge
> issue - if you're in that situation you already have a huge amount of wal.

I didn't want to block all varlen encoding, I just want to make clear
that I don't think it's great for performance testing and consistency
across installations if WAL size (and thus part of your performance)
is dependent on which actual database/relation/tablespace combination
you're running your workload in.

With the 56-bit relfilenode, the size of a block reference would
realistically differ between 7 bytes and 23 bytes:

- tblspc=0=1B
db=16386=3B
rel=797=2B (787 = 4 * default # of data relations in a fresh DB in PG14, + 1)
block=0=1B

vs

- tsp>=2^28 = 5B
dat>=2^28 =5B
rel>=2^49 =8B
block>=2^28 =5B

That's a difference of 16 bytes, of which only the block number can
realistically be directly influenced by the user ("just don't have
relations larger than X blocks").

If applied to Dilip's pgbench transaction data, that would imply a
minimum per transaction wal usage of 509 bytes, and a maximum per
transaction wal usage of 609 bytes. That is nearly a 20% difference in
WAL size based only on the location of your data, and I'm just not
comfortable with that. Users have little or zero control over the
internal IDs we assign to these fields, while it would affect
performance fairly significantly.

(difference % between min/max wal size is unchanged (within 0.1%)
after accounting for record alignment)

> > 0002 - Rework XLogRecord
> > This makes many fields in the xlog header optional, reducing the size
> > of many xlog records by several bytes. This implements the design I
> > shared in my earlier message [1].
> >
> > 0003 - Rework XLogRecordBlockHeader.
> > This patch could be applied on current head, and saves some bytes in
> > per-block data. It potentially saves some bytes per registered
> > block/buffer in the WAL record (max 2 bytes for the first block, after
> > that up to 3). See the patch's commit message in the patch for
> > detailed information.
>
> The amount of complexity these two introduce seems quite substantial to
> me. Both from an maintenance and a runtime perspective. I think we'd be better
> off using building blocks like variable lengths encoded values than open
> coding it in many places.

I guess that's true for length fields, but I don't think dynamic
header field presence (the 0002 rewrite, and the omission of
data_length in 0003) is that bad. We already have dynamic data
inclusion through block ids 25x; I'm not sure why we couldn't do that
more compactly with bitfields as indicators instead (hence the dynamic
header size).

As for complexity, I think my current patchset is mostly complex due
to a lack of tooling. Note that decoding makes common use of
COPY_HEADER_FIELD, which we don't really have an equivalent for in
XLogRecordAssemble. I think the code for 0002 would improve
significantly in readability if such construct would be available.

To reduce complexity in 0003, I could drop the 'repeat id'
optimization, as that reduces the complexity significantly, at the
cost of not saving that 1 byte per registered block after the first.

Kind regards,

Matthias van de Meent

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-10-14 00:30:30 Re: archive modules
Previous Message Mark Dilger 2022-10-13 22:38:40 Re: Incorrect comment regarding command completion tags