Re: Experimenting with hash tables inside pg_dump

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Experimenting with hash tables inside pg_dump
Date: 2021-10-21 23:37:57
Message-ID: 20211021233757.s7i55dpubw7rjf4a@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> Today, pg_dump does a lot of internal lookups via binary search
> in presorted arrays. I thought it might improve matters
> to replace those binary searches with hash tables, theoretically
> converting O(log N) searches into O(1) searches. So I tried making
> a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
> and replacing as many data structures as I could with that.

That does sound like a good idea in theory...

> This makes the code shorter and (IMO anyway) cleaner, but
>
> (a) the executable size increases by a few KB --- apparently, even
> the minimum subset of simplehash.h's functionality is code-wasteful.

Hm. Surprised a bit by that. In an optimized build the difference is a
smaller, at least.

optimized:
text data bss dec hex filename
448066 7048 1368 456482 6f722 src/bin/pg_dump/pg_dump
447530 7048 1496 456074 6f58a src/bin/pg_dump/pg_dump.orig

debug:
text data bss dec hex filename
516883 7024 1352 525259 803cb src/bin/pg_dump/pg_dump
509819 7024 1480 518323 7e8b3 src/bin/pg_dump/pg_dump.orig

The fact that optimization plays such a role makes me wonder if a good chunk
of the difference is the slightly more complicated find{Type,Func,...}ByOid()
functions.

> (b) I couldn't measure any change in performance at all. I tried
> it on the regression database and on a toy DB with 10000 simple
> tables. Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.

Did you measure runtime of pg_dump, or how much CPU it used? I think a lot of
the time the backend is a bigger bottleneck than pg_dump...

For the regression test DB the majority of the time seems to be spent below
two things:
1) libpq
2) sortDumpableObjects().

I don't think 2) hits the binary search / hashtable path?

It does seem interesting that a substantial part of the time is spent in/below
PQexec() and PQfnumber(). Especially the latter shouldn't be too hard to
optimize away...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2021-10-21 23:59:57 Re: Experimenting with hash tables inside pg_dump
Previous Message Mark Dilger 2021-10-21 23:21:09 Re: CREATEROLE and role ownership hierarchies