Re: Experimenting with hash tables inside pg_dump

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Experimenting with hash tables inside pg_dump
Date: 2021-10-22 05:59:39
Message-ID: 20211022055939.z6fihsm7hdzbjttf@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-10-21 22:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > I wonder though if for some of them we should instead replace the per-object
> > queries with one query returning the information for all objects of a type. It
> > doesn't make all that much sense that we build and send one query for each
> > table and index.
>
> The trick is the problem I alluded to in another thread: it's not safe to
> do stuff like pg_get_expr() on tables we don't have lock on.

I was looking at getTableAttrs() - sending one query instead of #tables
queries yields a quite substantial speedup in a quick prototype. And I don't
think it changes anything around locking semantics.

> I've thought about doing something like
>
> SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)
>
> but in cases with tens of thousands of tables, it seems unlikely that
> that's going to behave all that nicely.

That's kinda what I'm doing in the quick hack. But instead of using IN(...) I
made it unnest('{oid, oid, ...}'), which scales much better.
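
Roughly this shape, to spell it out (a trimmed sketch, not the attached patch;
the column list is cut down, and fout / tblinfo / numTables are just
getTableAttrs()'s existing parameters):

PQExpBuffer q = createPQExpBuffer();
PQExpBuffer tbloids = createPQExpBuffer();
PGresult   *res;

/* build a '{oid,oid,...}' array literal of the relations we care about */
appendPQExpBufferChar(tbloids, '{');
for (int i = 0; i < numTables; i++)
{
    TableInfo  *tbinfo = &tblinfo[i];

    if (!tbinfo->interesting)
        continue;
    if (tbloids->len > 1)
        appendPQExpBufferChar(tbloids, ',');
    appendPQExpBuffer(tbloids, "%u", tbinfo->dobj.catId.oid);
}
appendPQExpBufferChar(tbloids, '}');

/* one query for all tables, joined against unnest() of the OID array */
appendPQExpBuffer(q,
                  "SELECT a.attrelid, a.attnum, a.attname, a.atttypmod,\n"
                  "  a.attnotnull, a.atthasdef,\n"
                  "  pg_catalog.format_type(a.atttypid, a.atttypmod) AS atttypname\n"
                  "FROM unnest('%s'::pg_catalog.oid[]) AS src(tbloid)\n"
                  "JOIN pg_catalog.pg_attribute a ON a.attrelid = src.tbloid\n"
                  "WHERE a.attnum > 0 AND NOT a.attisdropped\n"
                  "ORDER BY a.attrelid, a.attnum",
                  tbloids->data);

res = ExecuteSqlQuery(fout, q->data, PGRES_TUPLES_OK);
/* then walk the result once, advancing through tblinfo[] by attrelid */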

A pg_dump --schema-only of the regression database goes from

real 0m0.675s
user 0m0.039s
sys 0m0.029s

to

real 0m0.477s
user 0m0.037s
sys 0m0.020s

which isn't half-bad.

There are a few more cases like this I think. But most are harder because the
dumping happens one-by-one from dumpDumpableObject(). The relatively easy but
substantial cases I could find quickly were getIndexes(), getConstraints() and
getTriggers().

To see where it's worth putting in time, it'd be useful if getSchemaData()
printed timing information in verbose mode...
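
Something along these lines around each step is probably all that's needed (a
sketch only; instr_time is from portability/instr_time.h, and I'm hand-waving
how exactly the message gets gated on verbose):

instr_time	before, after;

INSTR_TIME_SET_CURRENT(before);
getTableAttrs(fout, tblinfo, numTables);
INSTR_TIME_SET_CURRENT(after);
INSTR_TIME_SUBTRACT(after, before);

/* printed alongside the existing "reading ..." verbose messages */
pg_log_info("getTableAttrs: %.1f ms", INSTR_TIME_GET_MILLISEC(after));

That'd make it obvious which of the get*() steps dominate on databases with
lots of objects.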

> The *real* fix, I suppose, would be to fix all those catalog-inspection
> functions so that they operate with respect to the query's snapshot.
> But that's not a job I'm volunteering for. Besides which, pg_dump
> still has to cope with back-rev servers where it wouldn't be safe.

Yea, that's not a small change :(. I suspect that we'd need a bunch of new
caching infrastructure to make that reasonably performant, since this
presumably couldn't use syscache etc.

Greetings,

Andres Freund

Attachment Content-Type Size
pg_dump-bulk-gettableattrs.diff text/x-diff 25.1 KB
