Re: pg_dump additional options for performance

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_dump additional options for performance
Date: 2008-02-26 18:13:21
Message-ID: 1204049601.4252.399.camel@ebony.site
Lists: pgsql-hackers

On Tue, 2008-02-26 at 12:27 -0500, Greg Smith wrote:
> On Tue, 26 Feb 2008, Simon Riggs wrote:
>
> > Splitting up the dump is the enabler for splitting up the load.
>
> While the pg_dump split train seems to be leaving the station, I feel
> compelled to point out that focus does nothing to help people who are
> bulk-loading data that came from somewhere else. If my data is already in
> PostgreSQL, and I'm doing a dump/load, I can usually split the data easily
> enough with existing tools to handle that right now via COPY (SELECT...)
> TO. Some tools within pg_dump would be nice, but I don't need them that
> much. It's gigantic files that came from some other DB I don't even have
> access to that I struggle with loading efficiently.

Don't write them as gigantic files... :-)
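
For a flavour of that, here is a minimal sketch of splitting a table at
dump time with COPY (SELECT ...) TO, as Greg describes. The database
name, table name and integer id column are made up for illustration,
and psql is assumed to be on the PATH:

#!/usr/bin/env python
# Split one big table into N chunk files by hashing on an integer key.
import subprocess

DB = "mydb"        # hypothetical database
TABLE = "orders"   # hypothetical table with an integer "id" column
CHUNKS = 4

for i in range(CHUNKS):
    sql = ("COPY (SELECT * FROM %s WHERE id %% %d = %d) TO STDOUT"
           % (TABLE, CHUNKS, i))
    with open("%s.chunk%d.copy" % (TABLE, i), "wb") as out:
        # COPY ... TO STDOUT writes the rows to psql's stdout;
        # redirect each chunk to its own file.
        subprocess.check_call(["psql", "-d", DB, "-c", sql], stdout=out)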

> The work Dimitri is doing is wandering in that direction and that may be
> enough. I note that something that addresses loading big files regardless
> of source could also work on PostgreSQL dumps, while a pg_dump focused
> effort helps nothing but that specific workflow. I wonder if doing too
> much work on the pg_dump path is the best use of someone's time when the
> more general case will need to be addressed one day anyway.

I take your points. I've spent as much time discussing this as coding
it, which is OK, because coding the wrong thing is also a waste of time.

If we have individual load files then we will be able to use a parallel
load utility on them, whether they come from pg_dump or otherwise.
Neither of the current pg_dump output formats allows that on its own:
--data-only gets you the data, but there's no way to dump a sensible
schema to go with it, which is what this thread is about.
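
To make "individual load files" concrete with the tools that exist
today: dump the schema once with pg_dump --schema-only and write each
table's data to its own file. This is only a rough sketch (database and
table names are hypothetical), and it doesn't give the pre-/post-load
schema split the thread is about, but it produces per-table files a
parallel loader could consume:

#!/usr/bin/env python
# One schema file plus one data file per table.
import subprocess

DB = "mydb"
TABLES = ["orders", "customers", "line_items"]  # assumed table list

with open("schema.sql", "wb") as f:
    subprocess.check_call(["pg_dump", "--schema-only", DB], stdout=f)

for t in TABLES:
    with open("%s.copy" % t, "wb") as f:
        subprocess.check_call(
            ["psql", "-d", DB, "-c", "COPY %s TO STDOUT" % t],
            stdout=f)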

Dimitri is working on a parallel loader that allows error handling. I
don't think it's worth duplicating that effort by attempting to make
COPY itself work in parallel. Even if we did, the single client could
easily become a bottleneck. So an external utility that spawns multiple
processes and runs multiple concurrent COPY statements sounds quite good
to me.
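
A minimal sketch of such a utility, assuming per-table data files like
the ones above (names hypothetical) and deliberately omitting the error
handling Dimitri's loader adds: spawn one psql per file, so each COPY
runs in its own backend concurrently.

#!/usr/bin/env python
# Run one COPY ... FROM per file, each in its own psql process.
import subprocess

DB = "mydb"
FILES = {"orders": "orders.copy",       # hypothetical table -> file map
         "customers": "customers.copy",
         "line_items": "line_items.copy"}

procs = []
for table, path in FILES.items():
    f = open(path, "rb")
    # psql reads the COPY ... FROM STDIN data from its stdin.
    p = subprocess.Popen(
        ["psql", "-d", DB, "-c", "COPY %s FROM STDIN" % table],
        stdin=f)
    procs.append((p, f))

for p, f in procs:
    p.wait()
    f.close()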

He's asked for help, if anybody can assist.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
