Re: Warn when parallel restoring a custom dump without data offsets

From: David Gilman <dgilman(at)gilslotd(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: David Gilman <davidgilman1(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Warn when parallel restoring a custom dump without data offsets
Date: 2020-05-25 18:54:29
Message-ID: 20200525185429.6pp5vdvfu3vxlwac@dar.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Updated patches are attached, I ditched the gmail web interface so
hopefully this works.

Not mentioned in Justin's feedback: I dropped the extra sort in the test
as it's no longer necessary. I also added a parallel dump -> parallel
restore -> dump test run for the directory format to get some free test
coverage.

On Sat, May 23, 2020 at 05:47:51PM -0500, Justin Pryzby wrote:
> I'm not sure this will be considered a bugfix, since the behavior is known.
> Maybe eligible for backpatch though (?)

I'm not familiar with how your release management works, but I'm
personally fine with whatever version you can get it into. I urge you to
try landing this as soon as possible. The minimum reproducible example
in the test case is very minimal and I imagine all real world databases
are going to trigger this.

> I would have thought to mention the seeks() ; but it's true that the read()s now
> grow quadratically. I did run a test, but I don't know how many objects would
> be unreasonable or how many it'd take to show a problem.

And I misunderstood how bad it was. I thought it was reading little
header structs off the disk but it's actually reading the entire table
(see _skipData). So you're quadratically rereading entire tables and
thrashing your cache. Oops.

> Maybe we should avoid fseeko(0, SEEK_SET) unless we need to wrap around after
> EOF - I'm not sure.

The seek location is already the location of the end of the last good
object so just adding wraparound gives the good algorithmic performance
from the technique in commit 42f70cd9c. I’ve gone ahead and implemented
this.

> Is it possible to dump to stdout (or pipe to cat or dd) to avoid a new option ?

The underlying IPC::Run code seems to support piping in a cross-platform
way. I am not a Perl master though and after spending an evening trying
to get it to work I went with this approach. If you can put me in touch
with anyone to help me out here I'd appreciate it.

--
David Gilman :DG<
https://gilslotd.com

Attachment Content-Type Size
0001-Remove-unused-seek-check-in-tar-dump-format.patch text/x-diff 1.1 KB
0002-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch text/x-diff 3.7 KB
0003-Add-integration-test-for-out-of-order-TOC-requests-i.patch text/x-diff 11.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2020-05-25 19:09:52 Re: some grammar refactoring
Previous Message Jeff Davis 2020-05-25 18:36:42 Re: Trouble with hashagg spill I/O pattern and costing