Re: pg_dump far too slow

From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: David Newall <postgresql(at)davidnewall(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, robertmhaas(at)gmail(dot)com
Subject: Re: pg_dump far too slow
Date: 2010-03-21 14:33:35
Message-ID: ca24673e1003210733p2a9e5ba5ib99f383d79462483@mail.gmail.com
Lists: pgsql-performance

One more from me ....

If you think that the pipe to gzip may be causing pg_dump to stall, try
putting something like buffer(1) in the pipeline ... it doesn't generally
come with Linux, but you can download the source or create your own very
easily ... all it needs to do is asynchronously poll stdin and write to
stdout. I wrote one in Perl when I used to do a lot of digital video
hacking, and it helped when chaining together tools like mplayer and the
mpeg tools.
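
Off-the-shelf tools such as mbuffer or pv will also do the job if you'd
rather not roll your own; the pipeline would look something like this (the
database name is a placeholder, and mbuffer usually has to be installed
separately):

    # decouple pg_dump's output rate from gzip's input rate with a 256MB buffer
    pg_dump mydb | mbuffer -m 256M | gzip > mydb.sql.gz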

However, my money says that Tom's point about it being (disk) I/O bound is
correct :-)

Cheers
Dave

On Sun, Mar 21, 2010 at 8:17 AM, David Newall <postgresql(at)davidnewall(dot)com> wrote:

> Thanks for all of the suggestions, guys, which gave me some pointers on new
> directions to look, and I learned some interesting things.
>
> The first interesting thing was that piping (uncompressed) pg_dump into
> gzip, instead of using pg_dump's internal compressor, does bring a lot of
> extra parallelism into play. (Thank you, Matthew Wakeling.) I observed
> gzip using 100% CPU, as expected, and also two, count them, two postgres
> processes collecting data, each consuming a further 80% CPU. It seemed to
> me that Postgres was starting and stopping these to match the capacity of
> the consumer (i.e. pg_dump and gzip). Very nice. Unfortunately one of
> these processes dropped off eventually, and, according to top, the only non-idle
> process running was gzip (100%). Obviously there were postgres and pg_dump
> processes, too, but they were throttled by gzip's rate of output and
> effectively idle (less than 1% CPU). That is also interesting. The final
> output from gzip was being produced at the rate of about 0.5MB/second, which
> seems almost unbelievably slow.
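
For illustration, the pipeline described above is essentially of this shape
(the database name is a placeholder; pv is optional, but if installed it
reports the throughput at that point in the pipe):

    pg_dump mydb | pv | gzip > mydb.sql.gz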
>
> I next tried Tom Lane's suggestion, COPY WITH BINARY, which produced the
> complete 34GB file in 30 minutes (a good result). I then compressed that
> with gzip, which took an hour and reduced the file to 32GB (hardly worth the
> effort) for a total run time of 90 minutes. In that instance, gzip produced
> output at the rate of 10MB/second, so I tried pg_dump -Z0 to see how quickly
> that would dump the file. I had the idea that I'd go on to see how quickly
> gzip would compress it, but unfortunately it filled my disk before finishing
> (87GB at that point), so there's something worth knowing: pg_dump's output
> for binary data is very much less compact than COPY WITH BINARY; all those
> backslashes, as Tom pointed out. For the aforementioned reason, I didn't
> get to see how gzip would perform. For the record, pg_dump with no
> compression produced output at the rate of 26MB/second; a rather meaningless
> number given the 200%+ expansion of final output.
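
A rough sketch of those two runs, with placeholder table, database, and file
names (COPY ... TO STDOUT avoids needing server-side file access):

    # binary copy of the big table straight to a local file, then compress it
    psql -d mydb -c "COPY big_table TO STDOUT WITH BINARY" > big_table.bin
    gzip big_table.bin

    # uncompressed plain-text dump, for comparison
    pg_dump -Z0 mydb > mydb.sql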
>
> I am now confident the performance problem is from gzip, not Postgres, and
> wonder if I should read up on gzip to find why it would work so slowly on a
> pure text stream, albeit a representation of PDFs, which are intrinsically
> fairly compressed. Given the spectacular job that postgres did in adjusting
> its rate of output to match the consumer process, I did wonder if there
> might have been a tragic interaction between postgres and gzip: perhaps
> postgres limits its rate of output to match gzip; gzip tries to compress
> what's available, that being only a few bytes; and perhaps that might be so
> inefficient that it hogs the CPU; but I don't think that likely. I had a
> peek at gzip's source (surprisingly readable) and at first blush it does
> seem that unfortunate input could result in only a few bytes being written
> each time through the loop, meaning only a few more bytes could be read in.
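
One cheap way to separate the two suspects is to run gzip against a dump that
is already on disk, so it cannot be starved by the pipe; if it is still slow
there, gzip itself is the bottleneck, otherwise the interaction with the pipe
is the more likely culprit (the file name is a placeholder):

    time gzip -c mydb.sql > /dev/null       # default compression level
    time gzip -1 -c mydb.sql > /dev/null    # fastest level; often enough for already-compressed data like PDFs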
>
> Just to complete the report, I created a child table to hold the PDFs,
> which are static, took a dump of just that table, and adjusted my backup
> command to exclude it. The total size of the compressed backup sans PDFs is
> circa 7MB, taking around 30 seconds.
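
The dump-and-exclude arrangement could look roughly like this (table and
database names are placeholders):

    # one-off archive of the static PDF child table
    pg_dump -t pdf_archive mydb | gzip > pdf_archive.sql.gz

    # routine backup that skips it
    pg_dump -T pdf_archive mydb | gzip > nightly.sql.gz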
>
