Improvements in pg_dump/pg_restore toc format and performances

From: Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Improvements in pg_dump/pg_restore toc format and performances
Date: 2023-07-27 08:51:11
Message-ID: 2656000.KRxA6XjA2N@peanuts2
Lists: pgsql-hackers


Following the thread "Inefficiency in parallel pg_restore with many tables", I
started digging into why the toc.dat files are so big and where time is spent
when parsing them.

I ended up writing several patches that shave some time off pg_restore -l
and reduce the toc.dat size.

The first patch finishes the job of removing "has oids" support. When this
support was removed, the field was kept as is, instead of being dropped from
the dumps with a bump of the dump version. This field stores a boolean as a
string, "true" or "false". This is not free, and requires 10 bytes per toc
entry.

The second patch removes calls to sscanf and replaces them with strtoul. This
was the biggest speedup for pg_restore -l.
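
To illustrate the kind of change involved, here is a minimal sketch, not the
code from the patch. The Oid typedef and both function names are hypothetical
stand-ins. sscanf has to interpret its format string on every call, while
strtoul converts the digits directly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for PostgreSQL's Oid type (unsigned 32-bit). */
typedef unsigned int Oid;

/* Before: sscanf interprets the "%u" format string on every call. */
static Oid
parse_oid_sscanf(const char *s)
{
    Oid         oid = 0;

    assert(s != NULL);
    if (sscanf(s, "%u", &oid) != 1)
        return 0;               /* no digits found: fall back to 0 */
    return oid;
}

/* After: strtoul converts the decimal digits directly. */
static Oid
parse_oid_strtoul(const char *s)
{
    assert(s != NULL);
    return (Oid) strtoul(s, NULL, 10);
}
```

Both functions return the same value for a well-formed decimal string, so the
replacement should not change what is read from the toc.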

The third patch changes the dump format further to remove these strtoul calls
and store the integers as is instead.
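
A rough sketch of what storing an integer "as is" can look like, assuming a
fixed 4-byte, least-significant-byte-first layout. That layout and the names
write_u32/read_u32 are assumptions made for illustration, not necessarily
what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write v into buf as 4 bytes, least significant byte first.
 * Returns the number of bytes written (always 4).
 */
static size_t
write_u32(unsigned char *buf, uint32_t v)
{
    assert(buf != NULL);
    for (int i = 0; i < 4; i++)
    {
        buf[i] = (unsigned char) (v & 0xFF);
        v >>= 8;
    }
    return 4;
}

/* Read back a value written by write_u32; no text parsing is needed. */
static uint32_t
read_u32(const unsigned char *buf)
{
    uint32_t    v = 0;

    assert(buf != NULL);
    for (int i = 0; i < 4; i++)
        v |= (uint32_t) buf[i] << (8 * i);
    return v;
}
```

With such a layout the reader no longer converts text to a number at all, and
every value takes the same number of bytes regardless of how many digits it
has.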

The fourth patch is dirtier and makes more changes to the dump format. Instead
of storing the owner, tablespace, table access method and schema of each
object as a string, pg_dump builds an array of these, stores them at the
beginning of the file and replaces the strings with integer fields in the dump.
This reduces the file size further and removes a lot of calls to ReadStr, thus
saving quite some time.
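
The idea can be sketched as a small string table that hands out an integer
index for each distinct string. This is only an illustration using plain
malloc/realloc and a linear search. The StringTable type and the strtab_*
functions are hypothetical names, not the ones used in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of distinct strings (owners, schemas, ...). */
typedef struct StringTable
{
    char      **items;          /* array of owned string copies */
    size_t      count;          /* number of strings stored */
    size_t      capacity;       /* allocated slots in items */
} StringTable;

/*
 * Return the index of s in the table, adding it if it is not there yet.
 * Returns -1 on allocation failure. The linear search keeps the sketch
 * short; a hash table would scale better with many distinct strings.
 */
static int
strtab_intern(StringTable *t, const char *s)
{
    size_t      len;
    char       *copy;

    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->items[i], s) == 0)
            return (int) i;

    if (t->count == t->capacity)
    {
        size_t      newcap = t->capacity ? t->capacity * 2 : 8;
        char      **grown = realloc(t->items, newcap * sizeof(char *));

        if (grown == NULL)
            return -1;
        t->items = grown;
        t->capacity = newcap;
    }

    len = strlen(s) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len);
    t->items[t->count] = copy;
    return (int) t->count++;
}

/* Look up the string for an index previously returned by strtab_intern. */
static const char *
strtab_get(const StringTable *t, int idx)
{
    assert(idx >= 0 && (size_t) idx < t->count);
    return t->items[idx];
}

/* Release every string and the array itself. */
static void
strtab_free(StringTable *t)
{
    for (size_t i = 0; i < t->count; i++)
        free(t->items[i]);
    free(t->items);
    t->items = NULL;
    t->count = t->capacity = 0;
}
```

Each toc entry would then carry the small integer returned by strtab_intern
instead of the full string, and the table itself would be written once.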

Toc has 453999 entries.

Patch             Toc size  Dump -s duration  pg_restore -l duration
HEAD              214M      23.1s             1.27s
#1 (has oid)      210M      22.9s             1.26s
#2 (scanf)        210M      22.9s             1.07s
#3 (no strtoul)   202M      22.8s             0.94s
#4 (string list)  181M      23.1s             0.87s

Patch four is likely to require more changes. I don't know the PostgreSQL code
well enough to do better than calling pg_malloc/pg_realloc and maintaining a
char** manually; I guess there are structs and functions that do that in a
better way. The location of the string tables in the file and in the
structures is probably not acceptable either; I suppose these should go into
the toc header instead.

I am still submitting these for comments and a first review.

Best regards

Pierre Ducroquet

Attachment                                                   Content-Type  Size
0001-drop-has-oids-field-instead-of-having-static-values.patch  text/x-patch  1.7 KB
0002-convert-sscanf-to-strtoul.patch                            text/x-patch  1.2 KB
0003-store-oids-as-integer-instead-of-string.patch              text/x-patch  3.3 KB
0004-move-static-strings-to-arrays-at-beginning.patch           text/x-patch  10.0 KB

