Re: parallel pg_restore

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: parallel pg_restore
Date: 2008-09-24 15:48:35
Message-ID: 48DA6153.7040003@dunslane.net
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> Hi,
>
> Le mardi 23 septembre 2008, Andrew Dunstan a écrit :
>
>> In any case, my agenda goes something like this:
>>
>> * get it working with a basic selection algorithm on Unix (nearly
>> done - keep your eyes open for a patch soon)
>> * start testing
>> * get it working on Windows
>> * improve the selection algorithm
>> * harden code
>>
>
> I'm not sure whether your work will feature single-table restore splitting,
> but if it does, you could consider having a look at what I've done in
> pgloader. The parallel loading work there was asked for by Simon Riggs and
> Greg Smith, and it lets you test two different parallel algorithms.
> The aim was to have a "simple" testbed allowing PostgreSQL hackers to choose
> what to implement in pg_restore, so I still hope it'll prove useful someday :)
>
>
>

No. The proposal will perform exactly the same set of steps as
single-threaded pg_restore, but in parallel. The individual steps won't
be broken up.
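To sketch what I mean (rough Python, purely illustrative - none of these
names come from the actual patch): whole TOC entries are handed to workers
unchanged, and only the ordering between entries is parallelised, subject
to their dependencies.

from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED

def restore_entry(entry):
    # Run one archive TOC entry (schema statement, table data COPY, index
    # build, ...) exactly as single-threaded pg_restore would.
    pass

def parallel_restore(entries, deps, jobs=4):
    # entries: list of TOC entry ids; deps: id -> set of ids it must wait for.
    # Assumes the dependency graph is acyclic.
    done, running = set(), {}
    pending = list(entries)
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        while pending or running:
            # launch every entry whose dependencies are already satisfied
            for e in [e for e in pending if deps.get(e, set()) <= done]:
                running[pool.submit(restore_entry, e)] = e
                pending.remove(e)
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for f in finished:
                done.add(running.pop(f))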

Quite apart from anything else, parallel data loading of individual
tables will defeat clustering, as well as make it impossible to avoid
WAL logging of the load (something I have made provision for).
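For the record, the server can skip WAL for a COPY only when the same
transaction created or truncated the target table (and archiving is off) -
which a whole-table load can arrange and a set of parallel partial loads
cannot. A hypothetical sketch, using psycopg2 purely for illustration:

import psycopg2

def load_table_without_wal(dsn, table, data_path):
    # Hypothetical helper: because the TRUNCATE and the COPY happen in one
    # transaction (and assuming WAL archiving is off), the server can skip
    # writing WAL for the loaded rows and just sync the table at commit.
    # The table name is assumed trusted; this is not production code.
    conn = psycopg2.connect(dsn)
    try:
        with conn:                                   # one transaction
            with conn.cursor() as cur:
                cur.execute("TRUNCATE " + table)     # makes the COPY WAL-skippable
                with open(data_path) as f:
                    cur.copy_expert("COPY " + table + " FROM STDIN", f)
    finally:
        conn.close()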

The fact that custom archives are compressed by default would in fact
make parallel loading of individual tables' data difficult with the
present format. We'd have to do something like expanding it on the
client (which might not even have enough disk space) and then split it
before loading it to the server. That's pretty yucky. Alternatively,
each loader thread would need to start decompressing the data from the
start and throw away data until it got to the point it wanted to start
restoring from. Also pretty yucky.
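To illustrate why (a toy sketch, not a proposal): gzip-style streams have
no random access, so a worker wanting only a slice of a compressed data
member has to decompress everything that precedes it.

import gzip

def read_rows(path, start, stop):
    # A compressed stream has no random access: to restore rows [start, stop)
    # a worker must decompress and discard every earlier row.
    rows = []
    with gzip.open(path, "rt") as f:
        for lineno, line in enumerate(f):
            if lineno >= stop:
                break
            if lineno >= start:          # everything before this is wasted work
                rows.append(line)
    return rows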

Far better would be to provide for multiple data members in the archive
and teach pg_dump to split large tables as it writes the archive. Then
pg_restore would need comparatively little adjustment.
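Purely as a hypothetical illustration of that idea (the names are invented,
and the real archive-format questions are untouched here), dump-time
splitting could be as simple as spreading one table's COPY lines round-robin
across N chunks that N restore workers then load concurrently:

def split_copy_stream(rows, n_chunks, open_chunk):
    # rows: iterable of COPY text lines for one table, as pg_dump emits them;
    # open_chunk(i) returns a writable file for chunk i (both hypothetical).
    writers = [open_chunk(i) for i in range(n_chunks)]
    for i, line in enumerate(rows):
        writers[i % n_chunks].write(line)   # round-robin keeps chunks similar in size
    for w in writers:
        w.close()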

Also, of course, you can split tables yourself by partitioning them.
That would buy you parallel data load with what I am doing now, with no
extra work.

In any case, data loading is very far from being the only problem. One
of my clients has long-running restores where the data load takes only
about 20% of the time - the rest goes into index creation and the like.
No amount of table splitting will make a huge difference to them, but
parallel processing will. As against that, if your problem is loading
one huge table, this won't help you much. However, that is not a pattern
I see often - most of my clients seem to have several large tables plus a
boatload of indexes, and they will benefit a lot.

cheers

andrew
