From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: why postgresql over other RDBMS
Date: 2007-05-26 05:59:07
Message-ID: 4657CCAB.9020609@cox.net
Lists: pgsql-general

On 05/25/07 23:02, Chris Browne wrote:
> alvherre(at)commandprompt(dot)com (Alvaro Herrera) writes:
>> Erik Jones wrote:
>>
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regard to the dependencies
>> of the objects. So pg_restore cannot make any assumptions about the
>> parallelisability of restoring the objects in the dump.
>>
>> pg_dump is the only one that has the dependency information.
>>
>> If that information were to be saved in the dump, then maybe pg_restore
>> could work in parallel. But it seems a fairly non-trivial thing to do.
>>
>> Mind you, while I am idly toying with this idea, it seems that just
>> having multiple processes generate a dump is not such a hot idea by
>> itself, because you then have no clue how to order the restoration
>> of the multiple files that are going to result.
>
> I think it's less bad than you think.
>
> The really time-consuming bits of "pg_restore" are:
>
> 1. the loading of table data
> 2. creating indices on those tables
> [distant] 3. setting up R/I constraints
>
> If you look at the present structure of pg_dump output, those are all
> pretty visibly separate steps.
>
> pg_dump output [loosely] consists of:
> - Type definitions & such
> - Table definitions
> - loading table data (i.e. - 1)
> - stored function definitions
> - indices (i.e. - parts of 2)
> - primary keys (i.e. - the rest of 2)
> - triggers + rules (including 3)
>
> Thus, a "parallel load" would start by doing some things in a serial
> fashion, namely creating types and tables. This isn't a
> parallelizable step, but so what? It shouldn't take very long.

All of which would be sped up by having pg_dump create multiple
output files in the first place.

Of course, as I see it, this is only of real benefit when you are
using tablespaces spread across multiple RAID devices on a SAN or
multiple SCSI cards. But then, organizations with lots of data
usually have that kind of h/w.
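
Just to make the shape of it concrete, a toy driver for that kind of
phased restore might look like the following (an untested sketch; the
schema.sql and data/*.sql file layout, the database name, and the
worker count are all invented):

    # Toy sketch: serial DDL first, then per-table COPY files fed
    # to N psql workers.  Assumes pg_dump had split the data into
    # one file per table (hypothetical layout).
    import glob
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    DB = "mydb"        # made-up target database
    WORKERS = 4        # roughly one per spindle/controller

    def run_psql(script):
        # psql stops and exits non-zero on the first error
        subprocess.run(["psql", "-v", "ON_ERROR_STOP=1",
                        "-f", script, DB], check=True)

    run_psql("schema.sql")        # serial step: types and tables
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        # no indices or R/I triggers exist yet, so the tables can
        # be loaded independently of one another
        list(pool.map(run_psql, sorted(glob.glob("data/*.sql"))))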

> The parallel load can load as many tables concurrently as you choose;
> since there are no indices or R/I triggers, those are immaterial
> factors.
>
> Generating indices and primary keys could, again, be parallelized
> pretty heavily, and have (potentially) heavy benefit.
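
Concretely, that could be as simple as one connection per CREATE
INDEX -- the statement list below is invented, and psycopg2 is used
purely for illustration:

    # Sketch: build indices concurrently, one connection apiece.
    # A real pg_restore would pull the statements from the dump's
    # TOC; this list is made up.
    import psycopg2
    from concurrent.futures import ThreadPoolExecutor

    INDEX_SQL = [
        "CREATE INDEX foo_a_idx ON foo (a)",
        "CREATE INDEX bar_b_idx ON bar (b)",
    ]

    def build(sql):
        conn = psycopg2.connect(dbname="mydb")   # made-up DSN
        try:
            conn.cursor().execute(sql)
            conn.commit()
        finally:
            conn.close()

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(build, INDEX_SQL))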
>
> Furthermore, an interesting thing to do might be to use the same
> approach that Slony-I currently uses for subscriptions: it
> temporarily deactivates triggers and indices while loading the data,
> then reactivates them and requests a re-index. That would permit
> loading the *entire* schema, save for the data, and then loading and
> indexing at pretty much the maximum possible efficiency.
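
Per table, the modern spelling of that trick is roughly the following
(a sketch only -- Slony itself historically poked at
pg_class.reltriggers rather than using ALTER TABLE, and the table,
file, and database names here are invented):

    # Sketch of the Slony-ish sequence: suppress triggers,
    # bulk-load, re-enable, then rebuild the indices in one pass.
    import psycopg2

    def fast_load(table, datafile, dbname="mydb"):
        conn = psycopg2.connect(dbname=dbname)
        cur = conn.cursor()
        cur.execute("ALTER TABLE %s DISABLE TRIGGER ALL" % table)
        with open(datafile) as f:
            cur.copy_expert("COPY %s FROM STDIN" % table, f)
        cur.execute("ALTER TABLE %s ENABLE TRIGGER ALL" % table)
        cur.execute("REINDEX TABLE %s" % table)
        conn.commit()
        conn.close()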
>
> That seems like a not-completely-frightening "SMOP" (simple matter of
> programming). Not completely trivial, but not frighteningly
> non-trivial...

pg_dump would have to be smart enough to split the data rationally
across N output files, and that gets tricky (impossible?) if most of
your data is in one *huge* unpartitioned table in a single
tablespace. Que sera.
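
For the friendly cases, though, the obvious heuristic is greedy
bin-packing by relation size -- biggest table first, always into the
currently smallest file. A toy version (the sizes are invented;
pg_dump would presumably read real ones out of pg_class):

    # Toy greedy split of tables across N dump files by size (MB).
    N = 4
    tables = {"huge": 500, "big": 120, "mid": 80, "small": 3}

    buckets = [{"size": 0, "tables": []} for _ in range(N)]
    for name, size in sorted(tables.items(), key=lambda kv: -kv[1]):
        smallest = min(buckets, key=lambda b: b["size"])
        smallest["tables"].append(name)
        smallest["size"] += size

    for i, b in enumerate(buckets):
        print("file %d: %s (%d MB)" % (i, b["tables"], b["size"]))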

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

