Re: pg_dump additional options for performance

From: daveg <daveg(at)sonic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: pg_dump additional options for performance
Date: 2008-07-27 01:33:51
Message-ID: 20080727013351.GB30335@sonic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

On Sat, Jul 26, 2008 at 01:56:14PM -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I want to dump tables separately for performance reasons. There are
> > documented tests showing 100% gains using this method. There is no gain
> > adding this to pg_restore. There is a gain to be had - parallelising
> > index creation, but this patch doesn't provide parallelisation.
>
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore. So I think it's pretty
> silly to argue that no one will ever want this feature to work in
> pg_restore.
>
> To extend the example I just gave to Stephen, I think a fairly probable
> scenario is where you only need to tweak some "before" object
> definitions, and then you could do
>
> pg_restore --schema-before-data whole.dump >before.sql
> edit before.sql
> psql -f before.sql target_db
> pg_restore --data-only --schema-after-data -d target_db whole.dump
>
> which (given a parallelizing pg_restore) would do all the time-consuming
> steps in a fully parallelized fashion.

A few thoughts about pg_restore performance:

To take advantage of non-logged copy, the create and load should be in
the same transaction.

To take advantage of file and buffer cache, it would be be good to do
indexes immediately after table data. Many tables will be small enough
to fit in cache and this will avoid re-reading them for index builds. This
effect becomes stronger with more indexes on one table. There may also be
some filesytem placement benefit to building the indexes for a table
immediately after loading the data.

The buffer fan file cache advantage also applies to constraint creation,
but this is complicated by the need for indexes and data in the referenced
tables.

It seems that a high performance restore will want to proced in a different
order than the current sort order or that proposed by the before/data/after
patch.

- The simplest unit of work for parallelism may be the table and its
"decorations", eg indexes and relational constraints.

- Sort tables by foreign key dependency so that referenced tables are
loaded before referencing tables.

- Do table creation and data load together in one transaction to use
non-logged copy. Index builds, and constraint creation should follow
immediately, either as part of the same transaction, or possibly
parallelized themselves.

Table creation, data load, index builds, and constraint creation could
be packaged up as the unit of work to be done in a subprocess which either
completes or fails as a unit. The worker process would be called with
connection info, a file pointer to the data, and the DDL for the table.
pg_restore would keep a work queue of tables to be restored in FK dependency
order and also do the other schema operations such as functions and types.

-dg

--
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2008-07-27 03:03:51 Re: Search for version-numbered tclsh?
Previous Message Tom Lane 2008-07-27 00:28:47 Search for version-numbered tclsh?

Browse pgsql-patches by date

  From Date Subject
Next Message Tatsuo Ishii 2008-07-27 08:47:11 Re: WITH RECUSIVE patches 0723
Previous Message Andrew Gierth 2008-07-26 20:05:36 Re: WITH RECUSIVE patches 0723