Re: Warn when parallel restoring a custom dump without data offsets

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: David Gilman <davidgilman1(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Warn when parallel restoring a custom dump without data offsets
Date: 2020-05-23 22:47:51
Message-ID: 20200523224751.GP4472@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, May 23, 2020 at 03:54:30PM -0400, David Gilman wrote:
> I've rounded this patch out with a test and I've set up the commitfest
> website for this thread. The latest patches are attached and I think
> they are ready for review.

Thanks. https://commitfest.postgresql.org/28/2568/
I'm not sure this will be considered a bugfix, since the behavior is known.
Maybe eligible for backpatch though (?)

Your patch was encoded, so this is failing:
http://cfbot.cputube.org/david-gilman.html

Ideally CFBOT would deal with that (maybe by using git-am - adding Thomas), but
I think you sent using gmail web interface, which also reordered the patches.
(CFBOT *does* sort them, but it's a known annoyance).

> dump file was written with data offsets pg_restore can seek directly to

offsets COMMA

> pg_restore would only find the TOC if it happened to be immediately

"immediately" is wrong, no ? I thought the problem was if we seeked to D and
then looked for C, we wouldn't attempt to go backwards.

> read request only when restoring a custom dump file without data offsets.

remove "only"

> of a bunch of extra tiny reads when pg_restore starts up.

I would have thought to mention the seeks() ; but it's true that the read()s now
grow quadratically. I did run a test, but I don't know how many objects would
be unreasonable or how many it'd take to show a problem.

Maybe we should avoid fseeko(0, SEEK_SET) unless we need to wrap around after
EOF - I'm not sure.

Maybe the cleanest way would be to pre-populate a structure with all the TOC
data and loop around that instead of seeking around the file ? Can we use the
same structure as pg_dump ?

Otherwise, that makes me think of commit 42f70cd9c. Make it's not a good
parallel or example for this case, though.

+ The custom archive format may not work with the <option>-j</option>
+ option if the archive was originally created by writing the archive
+ to an unseekable output file. For the best concurrent restoration

Can I suggest something like: pg_restore with parallel jobs may fail if the
archive dump was written to an unseekable output stream, like stdout.

+ * If the input file can't be seeked we're at the mercy of the

seeked COMMA

>Subject: [PATCH 3/3] Add integration test for out-of-order TOC requests in pg_restore

Well done - thanks for that.

>Also add undocumented --disable-seeking argument to pg_dump to emulate
>writing to an unseekable output file.

Remove "also".

Is it possible to dump to stdout (or pipe to cat or dd) to avoid a new option ?

Maybe that would involve changing the test process to use the shell (system() vs
execve()), or maybe you could write:

/* sh handles output redirection and arg splitting */
'sh', '-c', 'pg_dump -Fc -Z6 --no-sync --disable-seeking postgres > $tempdir/defaults_custom_format_no_seek_parallel_restore.dump',

But I think that would need to then separately handle WIN32, so maybe it's not
worth it.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-05-24 02:31:55 Re: Adding missing object access hook invocations
Previous Message David Gilman 2020-05-23 19:54:30 Re: Warn when parallel restoring a custom dump without data offsets