Re: pg_dump additional options for performance

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_dump additional options for performance
Date: 2008-02-26 18:13:21
Message-ID: 1204049601.4252.399.camel@ebony.site
Lists: pgsql-hackers

On Tue, 2008-02-26 at 12:27 -0500, Greg Smith wrote:
> On Tue, 26 Feb 2008, Simon Riggs wrote:
>
> > Splitting up the dump is the enabler for splitting up the load.
>
> While the pg_dump split train seems to be leaving the station, I feel
> compelled to point out that focus does nothing to help people who are
> bulk-loading data that came from somewhere else. If my data is already in
> PostgreSQL, and I'm doing a dump/load, I can usually split the data easily
> enough with existing tools to handle that right now via COPY (SELECT...)
> TO. Some tools within pg_dump would be nice, but I don't need them that
> much. It's gigantic files that came from some other DB I don't even have
> access to that I struggle with loading efficiently.

Don't write them as gigantic files... :-)
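
For a flavour of that, here is a minimal sketch of splitting a table at
dump time with COPY (SELECT ...) TO, as Greg describes. The database
name, table name and integer id column are made up for illustration,
and psql is assumed to be on the PATH:

#!/usr/bin/env python
# Split one big table into N chunk files by hashing on an integer key.
import subprocess

DB = "mydb"        # hypothetical database
TABLE = "orders"   # hypothetical table with an integer "id" column
CHUNKS = 4

for i in range(CHUNKS):
    sql = ("COPY (SELECT * FROM %s WHERE id %% %d = %d) TO STDOUT"
           % (TABLE, CHUNKS, i))
    with open("%s.chunk%d.copy" % (TABLE, i), "wb") as out:
        # COPY ... TO STDOUT writes the rows to psql's stdout;
        # redirect each chunk to its own file.
        subprocess.check_call(["psql", "-d", DB, "-c", sql], stdout=out)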

> The work Dimitri is doing is wandering in that direction and that may be
> enough. I note that something that addresses loading big files regardless
> of source could also work on PostgreSQL dumps, while a pg_dump focused
> effort helps nothing but that specific workflow. I wonder if doing too
> much work on the pg_dump path is the best use of someone's time when the
> more general case will need to be addressed one day anyway.

I take your points. I've spent as much time discussing this as coding
it, which is OK, because coding the wrong thing is also a waste of time.

If we have individual load files then we will be able to use a parallel
load utility on them, whether they come from pg_dump or otherwise.
Neither of the current pg_dump output formats allows that on its own:
--data-only gets you the data, but there's no way to dump a sensible
schema to go with it, which is what this thread is about.
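
To make "individual load files" concrete with the tools that exist
today: dump the schema once with pg_dump --schema-only and write each
table's data to its own file. This is only a rough sketch (database and
table names are hypothetical), and it doesn't give the pre-/post-load
schema split the thread is about, but it produces per-table files a
parallel loader could consume:

#!/usr/bin/env python
# One schema file plus one data file per table.
import subprocess

DB = "mydb"
TABLES = ["orders", "customers", "line_items"]  # assumed table list

with open("schema.sql", "wb") as f:
    subprocess.check_call(["pg_dump", "--schema-only", DB], stdout=f)

for t in TABLES:
    with open("%s.copy" % t, "wb") as f:
        subprocess.check_call(
            ["psql", "-d", DB, "-c", "COPY %s TO STDOUT" % t],
            stdout=f)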

Dimitri is working on a parallel loader that allows error handling. I
don't think it's worth duplicating that effort by attempting to make
COPY itself work in parallel. Even if we did, the single client could
easily become a bottleneck. So an external utility that spawns multiple
processes and runs multiple concurrent COPY statements sounds quite good
to me.
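
A minimal sketch of such a utility, assuming per-table data files like
the ones above (names hypothetical) and deliberately omitting the error
handling Dimitri's loader adds: spawn one psql per file, so each COPY
runs in its own backend concurrently.

#!/usr/bin/env python
# Run one COPY ... FROM per file, each in its own psql process.
import subprocess

DB = "mydb"
FILES = {"orders": "orders.copy",       # hypothetical table -> file map
         "customers": "customers.copy",
         "line_items": "line_items.copy"}

procs = []
for table, path in FILES.items():
    f = open(path, "rb")
    # psql reads the COPY ... FROM STDIN data from its stdin.
    p = subprocess.Popen(
        ["psql", "-d", DB, "-c", "COPY %s FROM STDIN" % table],
        stdin=f)
    procs.append((p, f))

for p, f in procs:
    p.wait()
    f.close()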

He's asked for help, if anybody can assist.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
