Re: problems with making relfilenodes 56-bits

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-10-03 17:01:25
Message-ID: 20221003170125.7xd2vdadjak74tnm@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workloads, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter much if an already large record gets a bit
bigger, whereas it does matter for a small record. Focusing on optimizing the
record types where the relative increase is largest seems like a potential way
forward to me, even if we can't find something generic.
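
To put rough numbers on that, using the per-type averages from the stats at
the bottom of this mail (assuming ~4 bytes of increase per record):

Heap/DELETE:          13546008 / 250852  ~  54 bytes/record -> +4 bytes ~ +7.4%
Heap/INSERT:         106565255 / 1041666 ~ 102 bytes/record -> +4 bytes ~ +3.9%
Transaction/COMMIT:    6402657 / 16053   ~ 399 bytes/record -> +4 bytes ~ +1.0%

So the proportional hit on the small, frequent records is several times that
on the larger ones.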

> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.

I think there's plenty of potential...

> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative, but I think the stats still decently highlight how improving
just a few record types could buy back more than enough.
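
To make that concrete with the numbers below, assuming roughly 4 bytes of
increase for every record:

  4 bytes * 2063609 records ~ 8.3 MB, i.e. ~3.9% of the 210.8 MB of record data
  Heap/INSERT + Btree/INSERT_LEAF + Heap/DELETE = 1644714 records (~80% of all)
  8.3 MB / 1644714 records  ~ 5 bytes/record

I.e. shaving ~5 bytes off just those three record types would already offset
the whole increase.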

> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, and
within each record store only a 2-byte offset relative to that page-header
xl_prev.
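
A very rough sketch of what that layout could look like (struct and field
names like xlp_prev / xl_prev_offset are made up here for illustration; the
real XLogRecord / XLogPageHeaderData obviously have alignment, torn-page and
pg_upgrade constraints that this ignores):

#include "postgres.h"
#include "access/rmgr.h"        /* RmgrId */
#include "access/xlogdefs.h"    /* XLogRecPtr, TimeLineID */
#include "port/pg_crc32c.h"     /* pg_crc32c */

/* page header carries one full previous-record pointer per WAL page */
typedef struct XLogPageHeaderDataSketch
{
    uint16      xlp_magic;
    uint16      xlp_info;
    TimeLineID  xlp_tli;
    XLogRecPtr  xlp_pageaddr;   /* WAL address of this page */
    uint32      xlp_rem_len;    /* data remaining from a record crossing pages */
    XLogRecPtr  xlp_prev;       /* NEW: full prev pointer for this page */
} XLogPageHeaderDataSketch;

/* records only store a 2-byte offset relative to the page */
typedef struct XLogRecordSketch
{
    uint32      xl_tot_len;
    TransactionId xl_xid;
    uint16      xl_prev_offset; /* NEW: start of previous record within this
                                 * page; 0 = previous record started on an
                                 * earlier page, use xlp_prev */
    uint8       xl_info;
    RmgrId      xl_rmid;
    pg_crc32c   xl_crc;
} XLogRecordSketch;

/* reconstruct the full 8-byte pointer when reading/validating WAL */
static inline XLogRecPtr
sketch_decode_prev(const XLogPageHeaderDataSketch *hdr,
                   const XLogRecordSketch *rec)
{
    if (rec->xl_prev_offset == 0)
        return hdr->xlp_prev;
    return hdr->xlp_pageaddr + rec->xl_prev_offset;
}

With 8 KB WAL pages any in-page record start fits in 13 bits, and in this
particular layout the fixed record header would shrink from 24 to 16 bytes,
which would more than buy back the 4 bytes per first block reference.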

Greetings,

Andres Freund

by total size:

Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gist/PAGE_SPLIT 448 ( 0.02) 3391244 ( 1.61) 0 ( 0.00) 3391244 ( 1.40)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
...
XLOG/FPI 7228 ( 0.35) 378924 ( 0.18) 29788166 ( 93.67) 30167090 ( 12.43)
...
Gin/SPLIT 141 ( 0.01) 13011 ( 0.01) 1187588 ( 3.73) 1200599 ( 0.49)
...
-------- -------- -------- --------
Total 2063609 210848282 [86.89%] 31802766 [13.11%] 242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there are FPIs despite running with fpw=off)

sorted by number of records:
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Standby/LOCK 18451 ( 0.89) 775026 ( 0.37) 0 ( 0.00) 775026 ( 0.32)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)
