| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> | 
| Subject: | Re: problems with making relfilenodes 56-bits | 
| Date: | 2022-10-03 17:01:25 | 
| Message-ID: | 20221003170125.7xd2vdadjak74tnm@awork3.anarazel.de | 
| Lists: | pgsql-hackers | 
Hi,
On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workloads, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.
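(For anyone following along: the saving Robert describes comes from the
BKPBLOCK_SAME_REL flag -- when a block reference targets the same relation
as the previous one, the relation identifier is simply not emitted again.
Roughly, on the decode side (paraphrasing what xlogreader.c does, not
quoting it):

    if (fork_flags & BKPBLOCK_SAME_REL)
    {
        /* reuse the relation identifier from the previous block reference */
        blk->rlocator = *last_rlocator;
    }
    else
    {
        /* read a full relation identifier -- the part that grows by 4 bytes
         * with 56-bit relfilenumbers */
        memcpy(&blk->rlocator, ptr, sizeof(RelFileLocator));
        ptr += sizeof(RelFileLocator);
        last_rlocator = &blk->rlocator;
    }
    memcpy(&blk->blkno, ptr, sizeof(BlockNumber));
    ptr += sizeof(BlockNumber);

so the wider identifier is typically paid once per record.)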
But it doesn't really matter that much if an already large record gets a bit
bigger, whereas it does matter for a small record. Focusing on optimizing the
record types where the relative increase is large seems like a potential way
forward to me, even if we can't find something generic.
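(For scale, using the numbers at the bottom: Heap/DELETE averages
13546008/250852 ≈ 54 bytes per record, so 4 extra bytes is roughly 7%;
Btree/INSERT_LEAF averages ≈ 68 bytes per record, so roughly 6%; for a
record carrying a full-page image the same 4 bytes is lost in the noise.)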
> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.
I think there's plenty of potential...
> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.
I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.
I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Our regression tests certainly aren't generally representative,
but I think the numbers still decently highlight how improving just a few
record types could buy back more than enough.
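(Breakdowns like the ones below can be reproduced with pg_waldump's
record-level statistics mode, i.e. pg_waldump --stats=record <startseg>
<endseg>, run over the WAL generated with full_page_writes=off.)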
> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.
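(For reference, the fixed-size record header is currently 24 bytes, with
xl_prev taking a third of it and the 2-byte hole sitting after xl_rmid:

    typedef struct XLogRecord
    {
        uint32          xl_tot_len;   /* total len of entire record */
        TransactionId   xl_xid;       /* xact id */
        XLogRecPtr      xl_prev;      /* ptr to previous record in log */
        uint8           xl_info;      /* flag bits */
        RmgrId          xl_rmid;      /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c       xl_crc;       /* CRC for this record */
    } XLogRecord;
)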
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
within each record store only a 2-byte offset from that page-header xl_prev.
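One way the read side could look -- just a sketch of the idea with made-up
names, ignoring edge cases such as records continued across page boundaries:

    /*
     * Hypothetical: the page header carries a full XLogRecPtr
     * (xlp_full_prev), and each record carries only a 2-byte offset
     * relative to it (xl_prev_off).  Reconstructing the full pointer
     * while decoding would then be:
     */
    static XLogRecPtr
    reconstruct_xl_prev(XLogPageHeader pagehdr, uint16 xl_prev_off)
    {
        return pagehdr->xlp_full_prev + xl_prev_off;
    }

If the 2-byte offset could live in the existing padding, the 8-byte xl_prev
could potentially go away entirely, shrinking the fixed header from 24 to 16
bytes.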
Greetings,
Andres Freund
by total size:
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
Heap/INSERT                              1041666 ( 50.48)            106565255 ( 50.54)                    0 (  0.00)            106565255 ( 43.92)
Btree/INSERT_LEAF                         352196 ( 17.07)             24067672 ( 11.41)                    0 (  0.00)             24067672 (  9.92)
Heap/DELETE                               250852 ( 12.16)             13546008 (  6.42)                    0 (  0.00)             13546008 (  5.58)
Hash/INSERT                               108499 (  5.26)              7811928 (  3.70)                    0 (  0.00)              7811928 (  3.22)
Transaction/COMMIT                         16053 (  0.78)              6402657 (  3.04)                    0 (  0.00)              6402657 (  2.64)
Gist/PAGE_UPDATE                           57225 (  2.77)              5217100 (  2.47)                    0 (  0.00)              5217100 (  2.15)
Gin/UPDATE_META_PAGE                       23943 (  1.16)              4539970 (  2.15)                    0 (  0.00)              4539970 (  1.87)
Gin/INSERT                                 27004 (  1.31)              3623998 (  1.72)                    0 (  0.00)              3623998 (  1.49)
Gist/PAGE_SPLIT                              448 (  0.02)              3391244 (  1.61)                    0 (  0.00)              3391244 (  1.40)
SPGist/ADD_LEAF                            38968 (  1.89)              3341696 (  1.58)                    0 (  0.00)              3341696 (  1.38)
...
XLOG/FPI                                    7228 (  0.35)               378924 (  0.18)             29788166 ( 93.67)             30167090 ( 12.43)
...
Gin/SPLIT                                    141 (  0.01)                13011 (  0.01)              1187588 (  3.73)              1200599 (  0.49)
...
                                        --------                      --------                      --------                      --------
Total                                    2063609                     210848282 [86.89%]             31802766 [13.11%]            242651048 [100%]
(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)
sorted by number of records:
Heap/INSERT                              1041666 ( 50.48)            106565255 ( 50.54)                    0 (  0.00)            106565255 ( 43.92)
Btree/INSERT_LEAF                         352196 ( 17.07)             24067672 ( 11.41)                    0 (  0.00)             24067672 (  9.92)
Heap/DELETE                               250852 ( 12.16)             13546008 (  6.42)                    0 (  0.00)             13546008 (  5.58)
Hash/INSERT                               108499 (  5.26)              7811928 (  3.70)                    0 (  0.00)              7811928 (  3.22)
Gist/PAGE_UPDATE                           57225 (  2.77)              5217100 (  2.47)                    0 (  0.00)              5217100 (  2.15)
SPGist/ADD_LEAF                            38968 (  1.89)              3341696 (  1.58)                    0 (  0.00)              3341696 (  1.38)
Gin/INSERT                                 27004 (  1.31)              3623998 (  1.72)                    0 (  0.00)              3623998 (  1.49)
Gin/UPDATE_META_PAGE                       23943 (  1.16)              4539970 (  2.15)                    0 (  0.00)              4539970 (  1.87)
Standby/LOCK                               18451 (  0.89)               775026 (  0.37)                    0 (  0.00)               775026 (  0.32)
Transaction/COMMIT                         16053 (  0.78)              6402657 (  3.04)                    0 (  0.00)              6402657 (  2.64)