Re: pg_dump additional options for performance

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_dump additional options for performance
Date: 2008-02-26 14:15:33
Message-ID: 1204035333.4252.285.camel@ebony.site
Lists: pgsql-hackers

On Tue, 2008-02-26 at 08:28 -0500, Andrew Dunstan wrote:
>
> Simon Riggs wrote:
> > Separate files seems much simpler...
> >
> >
>
> Yes, We need to stick to the KISS principle.
>
> ISTM that we could simply invent a new archive format of "d" for directory.

...and then dump each table to a separate file?

If we dumped all the tables to separate files, we would then be able to
have a COPY statement *without* inline data - we would just give a
relative path to the data file (I know we can't do that currently). So,
yes, I like it.
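
To make that concrete, the dump directory could hold one data file per
table plus a small SQL file that references them. The file names and
the relative-path COPY form below are purely hypothetical - as noted,
COPY can't do that today:

    -- restore.sql at the top of the dump directory (hypothetical)
    COPY public.customers (id, name) FROM 'customers.dat';
    COPY public.orders (id, customer_id, total) FROM 'orders.dat';
    -- customers.dat and orders.dat hold the raw COPY data,
    -- one file per table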

> BTW, parallel dumping might be important, but is really much less so
> than parallel restoring in my book.

Yeah, but the unload creates a long-running transaction that is a PITA,
so unload time matters for two reasons: the elapsed time itself and the
length of the transaction it holds open.

Splitting up the dump is the enabler for splitting up the load.

If we had concurrent psql *and* out-of-line data then we would be able
to load many tables at once. But the single psql process would then be
the bottleneck for reading data, so we really need multiple client
processes, each with its own session to the database. Moving the
bottleneck doesn't solve the problem in the longer term. Sure, psql
could spawn lots of additional processes, but that's a lot more work
than just splitting up the files.
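
For illustration, with one file per table the load could already be
driven from the shell with multiple psql sessions, one per file (the
directory and database names here are just placeholders):

    # load each per-table file in its own psql session, in parallel
    for f in dumpdir/*.sql; do
        psql -d mydb -f "$f" &
    done
    wait   # block until all loader sessions have finished

Each psql gets its own connection, so no single client process sits in
front of all the data.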

My feeling is that trying to do everything automagically will end up as
a long project with some restrictions in the final result.

I would very much like to KISS and just split the files up at dump time
and then allow them to be executed concurrently on load. It's not really
a problem to put the various files in a directory manually.

Most use cases for this involve a few really big tables and lots of
small ones, so manually coding an unload/reload is not a big issue in
the scheme of things. It will be easy to use, performant and reliable,
and we can work on other backend software in the time saved by doing
this.

There's no real problem with allowing split-file unloads *and* later
adding an automagical solution.

So I'm saying:

* let's go for the 3 split-file options (called whatever...); a rough
sketch of the idea follows below

* we might find a really cool automagical solution *later*. That would
probably look like this:
- parallel dumping from spawned sessions on pg_dump
- dumps to multiple files in same directory
- parallel reload using pg_restore
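
For the record, the three-way split might be invoked roughly like this.
The --schema-pre-data and --schema-post-data names are placeholders,
since the naming is left open above; only --data-only exists as
written:

    # option names are placeholders, per the above
    pg_dump --schema-pre-data mydb  > pre.sql    # CREATE TABLE etc.
    pg_dump --data-only mydb        > data.sql   # the COPY data
    pg_dump --schema-post-data mydb > post.sql   # indexes, constraints, triggers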

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
