Re: Improvements in pg_dump/pg_restore toc format and performances

From: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improvements in pg_dump/pg_restore toc format and performances
Date: 2023-09-18 22:41:24
Message-ID: 8312205.NyiUUSuA9g@peanuts2
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Monday, September 18, 2023 11:52:47 PM CEST Nathan Bossart wrote:
> On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> > I ended up writing several patches that shaved some time for pg_restore
> > -l,
> > and reduced the toc.dat size.
>
> I've only just started taking a look at these patches, and I intend to do a
> more thorough review in the hopefully-not-too-distant future.

Thank you very much.

> Since cfbot is failing on some pg_upgrade and pg_dump tests, I've set this
> to waiting-on-author.

I did not notice anything running meson test -v, I'll look further into it in
the next days.

> > First patch is "finishing" the job of removing has oids support. When this
> > support was removed, instead of dropping the field from the dumps and
> > increasing the dump versions, the field was kept as is. This field stores
> > a
> > boolean as a string, "true" or "false". This is not free, and requires 10
> > bytes per toc entry.
>
> This sounds reasonable to me. I wonder why this wasn't done when WITH OIDS
> was removed in v12.

I suppose it is an oversight, or not wanting to increase the dump version for
no reason.

> > The second patch removes calls to sscanf and replaces them with strtoul.
> > This was the biggest speedup for pg_restore -l.
>
> Nice.
>
> > The third patch changes the dump format further to remove these strtoul
> > calls and store the integers as is instead.
>
> Do we need to worry about endianness here?

I used the ReadInt/WriteInt functions already defined in pg_dump that take care
of this issue, so there should be no need to worry.

> > The fourth patch is dirtier and does more changes to the dump format.
> > Instead of storing the owner, tablespace, table access method and schema
> > of each object as a string, pg_dump builds an array of these, stores them
> > at the beginning of the file and replaces the strings with integer fields
> > in the dump. This reduces the file size further, and removes a lot of
> > calls to ReadStr, thus saving quite some time.
>
> This sounds promising.
>
> > Patch Toc size Dump -s duration pg_restore -l duration
> > HEAD 214M 23.1s 1.27s
> > #1 (has oid) 210M 22.9s 1.26s
> > #2 (scanf) 210M 22.9s 1.07s
> > #3 (no strtoul) 202M 22.8s 0.94s
> > #4 (string list) 181M 23.1s 0.87s
>
> At a glance, the size improvements in 0004 look the most interesting to me.

Yes it is, and the speed benefits are interesting too (at least for my usecase)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-09-18 22:48:14 Re: Index range search optimization
Previous Message Nathan Bossart 2023-09-18 21:54:42 Re: Improvements in pg_dump/pg_restore toc format and performances