Re: pg_dump additional options for performance

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: pg_dump additional options for performance
Date: 2008-07-27 09:31:48
Message-ID: 1217151108.3894.1218.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches


On Sat, 2008-07-26 at 13:56 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I want to dump tables separately for performance reasons. There are
> > documented tests showing 100% gains using this method. There is no gain
> > adding this to pg_restore. There is a gain to be had - parallelising
> > index creation, but this patch doesn't provide parallelisation.
>
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore.

I honestly think there is less benefit that way than if we consider
things more as a whole:

To do data dump quickly we need to dump different tables to different
disks simultaneously. By its very nature, that cannot end with just a
single file. So the starting point for any restore must be potentially
more than one file.

There are two ways of dumping: either multi-thread pg_dump, or allow
multiple pg_dumps to work together. Second option much less work, same
result. (Either way we also need a way for multiple concurrent sessions
to share a snapshot.)

When restoring, we can then just use multiple pg_restore sessions to
restore the individual data files. Or again we can write a
multi-threaded pg_restore to do the same thing - why would I bother
doing that when I already can? It gains us nothing.

Parallelising the index creation seems best done using concurrent psql.
We've agreed some mods to psql to put multi-sessions in there. If we do
that right, then we can make pg_restore generate a psql script with
multi-session commands scattered appropriately throughout.

Parallel pg_restore is a lot of work for a narrow use case. Concurrent
psql provides a much wider set of use cases.

So fully parallelising dump/restore can be achieved by

* splitting dump into pieces (this patch)
* allowing sessions to share a common snapshot
* concurrent psql
* changes to pg_restore/psql/pg_dump to allow commands to be inserted
which will use concurrent psql features

If we do things this way then we have some useful tools that can be used
in a range of use cases, not just restore.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2008-07-27 09:37:34 Re: pg_dump additional options for performance
Previous Message Tatsuo Ishii 2008-07-27 08:47:11 Re: WITH RECUSIVE patches 0723

Browse pgsql-patches by date

  From Date Subject
Next Message Simon Riggs 2008-07-27 09:37:34 Re: pg_dump additional options for performance
Previous Message Tatsuo Ishii 2008-07-27 08:47:11 Re: WITH RECUSIVE patches 0723