Re: Should pg_dump dump larger tables first?

From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Should pg_dump dump larger tables first?
Date: 2013-01-31 16:50:26
Message-ID: CAFNqd5X5FseXroOYat7P5jEzVseSY2SeLJM2HGqGP6X5eEawsQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 31, 2013 at 5:06 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Also, it's far from obvious to me that "largest first" is the best rule
>> anyhow; it's likely to be more complicated than that.
>>
>> But anyway, the right place to add this sort of consideration is in
>> pg_restore --parallel, not pg_dump. I don't know how hard it would be
>> for the scheduler algorithm in there to take table size into account,
>> but at least in principle it should be possible to find out the size of
>> the (compressed) table data from examination of the archive file.
>
> From some experiences with pgloader and loading data in migration
> processes, often enough the most gains are to be had when you load the
> biggest table in parallel with loading all the little ones. It often
> makes it so that the big table's loading time is not affected, and by the
> time it's done the rest of the database is done too.
>
> Loading several big'o'tables in parallel tends not to give benefits in
> the tests I've done so far, but that might be an artefact of Python
> multithreading; I will do some testing with proper tooling later.
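
To make that concrete, here is a rough Python sketch of the idea (table
names and sizes are invented, and load_table() stands in for whatever
actually runs the COPY): dispatching in descending size order to a
worker pool means the biggest table occupies one worker while the small
ones drain the rest of the queue.

    from concurrent.futures import ThreadPoolExecutor

    def load_table(name):
        print("loading", name)   # placeholder for the real COPY

    # Invented sizes; a real scheduler would get these from the
    # archive's table of contents or from pg_class.relpages.
    tables = [("events", 90000000), ("users", 2000000),
              ("tags", 40000), ("settings", 500)]

    # Largest first: the big table pins one worker while the small
    # ones run in parallel on the remaining workers.
    ordered = sorted(tables, key=lambda t: t[1], reverse=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.map(load_table, [name for name, _ in ordered])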

We had the idea of doing this with Slony, to try to process subscriptions
faster by, yes, indeed, loading large tables first, and throwing
reindexing off onto secondary threads. The big win seemed to come
specifically from the reindexing aspect; that tends to take a fair bit
more time than the indexless COPY.
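
A rough sketch of the shape we had in mind (all names invented;
copy_table() and build_index() are stand-ins for the real work): the
index-less COPYs run one after another on the main path, while each
table's CREATE INDEX statements are handed off to a small pool, so
index builds overlap the next COPY.

    from concurrent.futures import ThreadPoolExecutor

    def copy_table(name):
        print("COPY", name)        # index-less bulk load

    def build_index(ddl):
        print("executing:", ddl)   # one CREATE INDEX statement

    indexes = {"events": ["CREATE INDEX ON events (ts)"],
               "users":  ["CREATE INDEX ON users (email)"]}

    with ThreadPoolExecutor(max_workers=2) as index_pool:
        for table in ("events", "users"):
            copy_table(table)      # COPYs stay on the main thread
            for ddl in indexes.get(table, []):
                # index builds overlap the next table's COPY
                index_pool.submit(build_index, ddl)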

It would make the subscription process a bit more fragile, and would
add quite a bit of development work, for something that didn't seem to
be *that* much of a priority, so we never went past the
"Gedankenexperiment" of establishing that it seemed feasible.

A side-effect that we didn't have to worry about with Slony, but that
would be important for more general use, is how re-establishing
foreign keys gets processed. In Slony, we suppress them on subscriber
nodes; you'd need to handle that for general-purpose use.
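
For illustration, suppressing foreign keys around a load and
re-establishing them afterwards could look roughly like this (the
connection string and do_load() are placeholders, and identifier
quoting is glossed over; the pg_constraint query itself is standard
PostgreSQL):

    import psycopg2

    def do_load(conn):
        pass   # placeholder: the parallel COPY of all tables goes here

    conn = psycopg2.connect("dbname=target")
    cur = conn.cursor()

    # Save every foreign-key definition, then drop the constraints so
    # that load order doesn't matter.
    cur.execute("""SELECT conrelid::regclass::text, conname,
                          pg_get_constraintdef(oid)
                   FROM pg_constraint WHERE contype = 'f'""")
    saved = cur.fetchall()
    for table, name, definition in saved:
        cur.execute('ALTER TABLE %s DROP CONSTRAINT "%s"' % (table, name))

    do_load(conn)

    # Re-establish the constraints once all the data is in place.
    for table, name, definition in saved:
        cur.execute('ALTER TABLE %s ADD CONSTRAINT "%s" %s'
                    % (table, name, definition))
    conn.commit()
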
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
