Re: Parallel pg_restore versus old dump files

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org, Igor Neyman <ineyman(at)perceptron(dot)com>
Subject: Re: Parallel pg_restore versus old dump files
Date: 2010-06-22 22:52:29
Message-ID: AANLkTimu-yJJBB_AhQ9o_wojjfIy4ef_X-LlwE1AERGo@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 22, 2010 at 9:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> 3. Perhaps pg_dump ought to emit a warning when it can't seek, instead
> of just silently not writing the data offsets.  That behavior was okay
> before when lack of data offsets didn't really matter that much, but
> lack of data offsets is a serious performance handicap for parallel
> restore even after we fix the outright failure condition (because each
> worker is going to read through a lot of data to find what it needs).
>

I'm not terribly familiar with the pg_dump format, but... the usual
strategy for storing a TOC on a non-seekable output stream is to store
it at the end of the file. You just accumulate all the offsets in
memory as you generate the file and then write the TOC at the end. Of
course, you then need a seekable input stream when you load it, but
that would narrow the slow case to when you have a non-seekable output
stream when dumping *and* a non-seekable input stream on restore.
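
To make the trailing-TOC idea concrete, here is a minimal sketch in Python. It is not the pg_dump archive format: the JSON TOC, the 16-byte footer layout, and the names CountingWriter, write_archive, and read_entry are all assumptions made up for illustration. The writer only ever calls write(), counting bytes itself instead of using tell() or seek(); the reader is the side that needs a seekable stream.

```python
import io
import json
import struct

# Hypothetical fixed-size footer: TOC offset and TOC length, both uint64.
FOOTER = struct.Struct("<QQ")


class CountingWriter:
    """Wraps a write-only stream and counts bytes, so no tell()/seek() is needed."""

    def __init__(self, raw):
        self.raw = raw
        self.pos = 0

    def write(self, data):
        self.raw.write(data)
        self.pos += len(data)


def write_archive(raw, entries):
    """Write payloads sequentially, then the TOC, then the footer."""
    out = CountingWriter(raw)
    toc = {}
    for name, payload in entries:
        # Record where each payload starts while streaming it out.
        toc[name] = (out.pos, len(payload))
        out.write(payload)
    toc_bytes = json.dumps(toc).encode("utf-8")
    toc_offset = out.pos
    out.write(toc_bytes)
    out.write(FOOTER.pack(toc_offset, len(toc_bytes)))


def read_entry(f, name):
    """Fetch one payload from a seekable stream without scanning the others."""
    f.seek(-FOOTER.size, io.SEEK_END)
    toc_offset, toc_len = FOOTER.unpack(f.read(FOOTER.size))
    f.seek(toc_offset)
    toc = json.loads(f.read(toc_len).decode("utf-8"))
    offset, length = toc[name]
    f.seek(offset)
    return f.read(length)
```

With something like this, a restore worker on a seekable file can jump straight to the entry it wants; only a reader that cannot seek has to scan through the data.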

On the other hand, if we didn't notice this dependency when there was
only one variable, making it depend on two variables would make it that
much more obscure when the slow case hits and users wonder why the
restore is taking so long.

--
greg
