Re: pg_dump additional options for performance

From: chris <cbbrowne(at)ca(dot)afilias(dot)info>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: pg_dump additional options for performance
Date: 2008-08-01 18:02:30
Message-ID: 87ej58obpl.fsf@dba2.int.libertyrms.com
Lists: pgsql-hackers pgsql-patches

tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) writes:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> I want to dump tables separately for performance reasons. There are
>> documented tests showing 100% gains using this method. There is no gain
>> adding this to pg_restore. There is a gain to be had - parallelising
>> index creation, but this patch doesn't provide parallelisation.
>
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore. So I think it's pretty
> silly to argue that no one will ever want this feature to work in
> pg_restore.

"Never" is a long time, agreed.

> To extend the example I just gave to Stephen, I think a fairly probable
> scenario is where you only need to tweak some "before" object
> definitions, and then you could do
>
> pg_restore --schema-before-data whole.dump >before.sql
> edit before.sql
> psql -f before.sql target_db
> pg_restore --data-only --schema-after-data -d target_db whole.dump
>
> which (given a parallelizing pg_restore) would do all the time-consuming
> steps in a fully parallelized fashion.

Do we need to wait until a fully-parallelizing pg_restore is
implemented before adding this functionality to pg_dump?

The particular extension I'm interested in from pg_dump, here, is the
ability to dump multiple tables concurrently. I've got disk arrays
with enough I/O bandwidth that this form of parallelization does
provide a performance benefit.

The result of that will be that *many* files are generated, and I
don't imagine we want to change pg_restore to try to make it read from
multiple files concurrently.
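
[Editor's sketch, not part of the original message.] The kind of
concurrent per-table dump described above can be approximated today by
running several independent pg_dump processes, one per table, each
writing its own file. The sketch below assumes that approach; the
function names, the `.dump` file naming, and the worker count are
hypothetical choices made for illustration. It uses only the existing
pg_dump options -Fc, -t and -f. Note that separate pg_dump processes
each take their own snapshot, so the resulting files are not guaranteed
to be mutually consistent if the database is being written to.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_dump_command(dbname, table, out_dir):
    """Return the pg_dump argv that writes one table to its own file."""
    out_file = Path(out_dir) / f"{table}.dump"  # hypothetical naming scheme
    # -Fc: custom archive format, -t: restrict to one table, -f: output file
    return ["pg_dump", "-Fc", "-t", table, "-f", str(out_file), dbname]


def dump_tables_concurrently(dbname, tables, out_dir, workers=4,
                             runner=subprocess.run):
    """Run one pg_dump per table, at most `workers` at a time.

    `runner` is injectable so the function can be exercised without a
    database; by default it is subprocess.run.
    Returns a dict mapping table name -> process exit code.
    """
    commands = {t: build_dump_command(dbname, t, out_dir) for t in tables}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {t: pool.submit(runner, cmd) for t, cmd in commands.items()}
        return {t: f.result().returncode for t, f in futures.items()}
```

Something like dump_tables_concurrently("mydb", ["orders", "customers"],
"/backups") would then leave one file per table in the output directory,
which is the "many files" situation discussed here. Whether it is faster
than a single pg_dump depends on the available I/O bandwidth.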

Further, it's actually not obvious that we *necessarily* care about
parallelizing loading data. The thing that happens every day is
backups. I care rather a lot about optimizing that; we do backups
each and every day, and optimizations to that process will accrue
benefits each and every day.

In contrast, restoring databases does not take place every day. When
it happens, yes, there's considerable value to making *that* go as
quickly as possible, but I'm quite willing to consider optimizing that
to be separate from optimizing backups.

I daresay I haven't used pg_restore any time recently, either. Any
time we have thought about using it, we've concluded that the
perceivable benefits were actually more of a mirage.
--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://cbbrowne.com/info/lsf.html
Rules of the Evil Overlord #145. "My dungeon cell decor will not
feature exposed pipes. While they add to the gloomy atmosphere, they
are good conductors of vibrations and a lot of prisoners know Morse
code." <http://www.eviloverlord.com/>
