Re: pg_dump additional options for performance

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_dump additional options for performance
Date: 2008-02-26 11:22:56
Message-ID: 1204024976.4252.241.camel@ebony.site
Lists: pgsql-hackers

On Tue, 2008-02-19 at 14:18 -0800, Jeff Davis wrote:
> On Wed, 2008-02-06 at 15:13 +0000, Simon Riggs wrote:
> > The -s option creates the table, along with its constraints and
> > indexes. These objects need to be dropped prior to loading, if we are to
> > follow the performance recommendations in the docs. But the only way to
> > do that is to manually edit the script into a cut-down version.
> >
> > So it would be good if we could dump objects in 3 groups
> > 1. all commands required to re-create table
> > 2. data
> > 3. all commands required to complete table after data load
> >
> > My proposal is to provide two additional modes:
> > --schema-pre-load corresponding to (1) above
> > --schema-post-load corresponding to (3) above
>
> Another thought:
>
> We could also break step #3 into two steps: those objects required for
> correctness (e.g. unique indexes, other constraints, etc); and those
> objects that are merely for performance (i.e. non-constraining indexes).
> Let's call these steps #3A and #3B.
>
> After reading the thread here:
>
> http://archives.postgresql.org/pgsql-performance/2008-02/msg00211.php
>
> it's clear that building indexes can take a long time, and they aren't
> strictly required for correct database operation. Often, it's perfectly
> reasonable to operate the database without a few of the indexes, so long
> as they don't imply a constraint.
>
> Step #3B could be done with "CONCURRENTLY" to allow uninterrupted
> operation.
>
> We could even allow pg_restore to run step #3B in multiple sessions
> building different indexes to use multiple processor cores.
>
> This is just a thought. I can see that it's getting fairly complex, and
> it may be better to just leave these things up to the DBA.
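
To make the #3A/#3B split concrete, it is the difference between
statements like these (the table and index names here are just made-up
examples):

  -- #3A: backs a constraint, required for correctness
  ALTER TABLE orders ADD CONSTRAINT orders_pkey PRIMARY KEY (order_id);

  -- #3B: performance only; could be deferred, or even built
  -- CONCURRENTLY once the database is already back in service
  CREATE INDEX CONCURRENTLY orders_customer_idx ON orders (customer_id);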

My thinking is to do one of the following:

* keep it as simple as possible, and let the DBA manually improve
performance

* express dependency information in the pg_dump output, so that some
level of parallelism can take advantage of it automatically

Doing only the second one seems like a lot of work, and it would prevent
a DBA from coding things a particular way for their installation. It
seems possible to get good benefit from the first approach now and still
add the second part later.
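
To show how the simple approach might be used in practice, the proposed
switches would let a DBA split the dump into three scripts and tune each
step by hand, something along these lines (--schema-pre-load and
--schema-post-load are only proposals at this point, and the database
names are made up):

  pg_dump --schema-pre-load mydb  > pre.sql    # (1) bare table definitions
  pg_dump --data-only mydb        > data.sql   # (2) data
  pg_dump --schema-post-load mydb > post.sql   # (3) constraints, indexes, etc.

  psql -f pre.sql newdb      # create empty tables
  psql -f data.sql newdb     # bulk load the data
  psql -f post.sql newdb     # build indexes and constraints last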

If we can work out some notation or grammar to put into the script
output, then we could imagine a concurrent psql that spawns multiple
sessions and executes the script in parallel. With the right grammar,
all we would need is an additional psql option, --parallel=N, and psql
would work its way through the script keeping N sessions busy during
the reload.
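
Purely for illustration, such a notation might be nothing more than
marker comments emitted into the post-load script, for example (neither
the markers nor the --parallel option exist today, and the index names
are invented):

  --@ parallel-group
  CREATE INDEX orders_customer_idx ON orders (customer_id);
  CREATE INDEX orders_date_idx ON orders (order_date);
  --@ end-group

  # a concurrent psql could then spread each group across N sessions
  psql --parallel=4 -f post.sql newdb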

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
