From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: why postgresql over other RDBMS
Date: 2007-05-26 04:02:00
Message-ID: 60d50omdvr.fsf@dba2.int.libertyrms.com
Lists: pgsql-general

alvherre(at)commandprompt(dot)com (Alvaro Herrera) writes:
> Erik Jones wrote:
>
>> And, to finish up, is there any reason that pg_restore couldn't
>> already work with separate processes working in parallel?
>
> The problem is that the ordering of objects in the dump is the only
> thing that makes the dump consistent with regards to the dependencies of
> objects. So pg_restore cannot make any assumptions of parallelisability
> of the restoration process of objects in the dump.
>
> pg_dump is the only one who has the dependency information.
>
> If that information were to be saved in the dump, then maybe pg_restore
> could work in parallel. But it seems a fairly non-trivial thing to do.
>
> Mind you, while I am idling at this idea, it seems that just having
> multiple processes generating a dump is not such a hot idea by itself,
> because you then have no clue on how to order the restoration of the
> multiple files that are going to result.

I think it's less bad than you think.

The really time-consuming bits of "pg_restore" are:

1. the loading of table data
2. creating indices on those tables
[distant] 3. setting up R/I constraints

If you look at the present structure of pg_dump output, those are all
pretty visibly separate steps.

pg_dump output [loosely] consists of:
- type definitions & such
- table definitions
- table data loads (i.e., step 1)
- stored function definitions
- indices (i.e., part of step 2)
- primary keys (i.e., the rest of step 2)
- triggers + rules (including step 3)
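As a sketch, that ordering can be modeled as data (the section names here are illustrative, not pg_dump's actual TOC entry types), splitting the output into a serial DDL prefix and the phases that could run in parallel:

```python
# Hypothetical model of pg_dump's output ordering -- section names are
# illustrative stand-ins, not pg_dump's real TOC tags.
SECTIONS = [
    ("type definitions",     False),  # order-dependent, run serially
    ("table definitions",    False),
    ("table data",           True),   # step 1: parallelizable per table
    ("function definitions", False),
    ("indices",              True),   # step 2a: parallelizable per index
    ("primary keys",         True),   # step 2b
    ("triggers and rules",   False),  # step 3: cheap, run last
]

def restore_plan(sections):
    """Group consecutive sections into alternating serial/parallel phases."""
    phases = []
    for name, parallel in sections:
        if phases and phases[-1][0] == parallel:
            phases[-1][1].append(name)
        else:
            phases.append((parallel, [name]))
    return phases
```

Running `restore_plan(SECTIONS)` yields five phases: a serial DDL prefix, a parallel data-load phase, a serial function phase, a parallel index/PK phase, and a final serial trigger phase.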

Thus, a "parallel load" would start by doing some things in a serial
fashion, namely creating types and tables. This isn't a
parallelizable step, but so what? It shouldn't take very long.

The parallel load can then load as many tables concurrently as you
choose; since no indices or R/I triggers exist yet, they impose no
constraints on the ordering.

Generating indices and primary keys could, again, be parallelized
pretty heavily, with (potentially) heavy benefit.
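A minimal sketch of that scheduling, using stand-in callables rather than real SQL execution (`serial_ddl`, `table_loads`, and `index_builds` are hypothetical arguments, not pg_restore internals): the serial DDL prefix runs first, then the table loads fan out across a worker pool, then the index and primary-key builds do the same.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def restore(serial_ddl, table_loads, index_builds, workers=4):
    """Sketch of a parallel restore: a serial prefix followed by two
    embarrassingly parallel phases.  Each argument is a list of
    zero-argument callables standing in for running a SQL script."""
    # Phase 0: types and tables -- cheap, order-dependent, serial.
    for step in serial_ddl:
        step()
    # Phase 1: load table data.  No indices or R/I triggers exist yet,
    # so the per-table loads are independent of one another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        wait([pool.submit(job) for job in table_loads])
    # Phase 2: build indices and primary keys, also independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        wait([pool.submit(job) for job in index_builds])
```

The two `with` blocks act as barriers: all data loads finish before any index build starts, mirroring the phase boundaries in the dump.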

Furthermore, an interesting thing to do might be to use the same
approach that Slony-I does, at present, for subscriptions. It
temporarily deactivates triggers and indices while loading the data,
then reactivates them, and requests a re-index. That would permit
loading the *entire* schema, save for data, and then loading and
indexing at close to the maximum possible efficiency.
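The per-table sequence might look roughly like the following sketch. Hedging heavily: `ALTER TABLE ... DISABLE/ENABLE TRIGGER ALL`, `COPY`, and `REINDEX TABLE` are real PostgreSQL commands, but whether this matches what Slony-I actually does internally is an assumption, and the index-deactivation half (which involves catalog manipulation) is not shown -- only the trigger side and the final reindex are.

```python
def slony_style_load(table):
    """Emit a sketch of the SQL sequence for loading one table with
    trigger firing deferred until after the data is in."""
    return [
        # Stop trigger (and R/I) firing during the bulk load.
        f"ALTER TABLE {table} DISABLE TRIGGER ALL;",
        # Bulk-load the data (COPY is pg_dump's data format).
        f"COPY {table} FROM stdin;",
        # Re-enable triggers now that the loaded data is consistent.
        f"ALTER TABLE {table} ENABLE TRIGGER ALL;",
        # Rebuild the table's indices in one pass over the loaded data.
        f"REINDEX TABLE {table};",
    ]
```

With the R/I machinery out of the way during the load, each table's sequence is independent, which is what makes the whole phase parallelizable.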

That seems like a not-completely-frightening "SMOP" (simple matter of
programming). Not completely trivial, but not frighteningly
non-trivial...
--
(format nil "~S(at)~S" "cbbrowne" "cbbrowne.com")
http://linuxdatabases.info/info/linuxdistributions.html
Developmental Psychology
"Schoolyard behavior resembles adult primate behavior because "Ontogeny
Recapitulates Phylogeny" doesn't stop at birth."
-- Mark Miller
