Re: why postgresql over other RDBMS

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Erik Jones <erik(at)myemma(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: why postgresql over other RDBMS
Date: 2007-05-26 10:33:26
Message-ID: 46580CF6.1000909@kaltenbrunner.cc
Lists: pgsql-general

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Erik Jones wrote:
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regards to the dependencies of
>> objects. So pg_restore cannot make any assumptions of parallelisability
>> of the restoration process of objects in the dump.
>
> That's true at the level of DDL operations, but AFAIK we could
> parallelize table-loading and index-creation steps pretty effectively
> --- and that's where all the time goes.

Yes, loading the data and creating the indexes are the most time-consuming
parts of a large dump-and-reload cycle.
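Index creation can already be spread over sessions by hand today; a
minimal, hypothetical sketch (table and index names made up), building
two independent indexes concurrently from separate backends:

  # build two independent indexes at the same time, one psql session each
  psql -d testdb -c "CREATE INDEX idx_orders_cust ON orders (customer_id);" &
  psql -d testdb -c "CREATE INDEX idx_orders_date ON orders (order_date);" &
  wait   # block until both builds have finished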

>
> A more interesting question is what sort of hardware you need for that
> actually to be a win, though. Loading a few tables in parallel sounds
> like an ideal recipe for oversaturating your disk bandwidth...

You don't actually need that much disk bandwidth: both COPY and CREATE
INDEX are CPU-bottlenecked on modern boxes with reasonable disk
subsystems, so spreading their work over multiple cores/processes can
give big benefits.
For example, I have managed to load ~2B rows (5 integer columns, no
indexes) at a rate of about 320,000 rows/s on a modern (but already 1.5
years old) 4-core 2.6GHz Opteron box (with 12 disks and BBWC, IIRC).
Using 4 concurrent processes to load the data got that up to about
930,000 rows/s (hitting the disk I/O limit at that rate).
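The parallel load itself needs nothing fancy; a rough sketch of the kind
of thing I mean (database, table and file names are placeholders),
assuming the input is pre-split into one file per loader:

  # load four pre-split chunks into the same table concurrently;
  # each psql session runs its own COPY in a separate backend
  for i in 1 2 3 4; do
      psql -d testdb -c "\copy big_table from 'chunk_$i.dat'" &
  done
  wait   # block until all four loaders are done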
So having the ability to parallelize those operations at both the dump
and the restore level would be a huge win.
A manual experiment I did a while back (i.e. splitting the dump by hand
and feeding it in with a concurrency of 2) on a copy of a production
database brought the restore time down from 3h+ to a bit less than
2 hours.
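One way to do that kind of manual split is with pg_restore's TOC lists;
a rough, from-memory sketch (dump file and database names are
placeholders), assuming a custom-format dump:

  # restore the schema (DDL) serially first, so all dependencies exist
  pg_restore -s -d proddb dump.custom

  # generate the table of contents, then split the data entries
  # by hand into two list files of roughly equal size
  pg_restore -l dump.custom > toc.full
  #   (edit toc.full into toc.part1 and toc.part2)

  # feed the two halves of the data in, in parallel
  pg_restore -a -L toc.part1 -d proddb dump.custom &
  pg_restore -a -L toc.part2 -d proddb dump.custom &
  wait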

Stefan
