Re: patch for parallel pg_dump

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch for parallel pg_dump
Date: 2012-01-30 17:20:11
Message-ID: CA+Tgmoa8TQAu4pTL5FcEHETSOA3xL5ER+5vY9sNWCeVKYYL1rA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 29, 2012 at 12:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Joachim Wieland <joe(at)mcknight(dot)de> writes:
>> I know that you took back some of your comments, but I'm with you
>> here. Archive is allocated as an ArchiveHandle and then casted back to
>> Archive*, so you always know that an Archive is an ArchiveHandle. I'm
>> all for getting rid of Archive and just using ArchiveHandle throughout
>> pg_dump which would get rid of these useless casts.
>
> I'd like to see a more thoroughgoing look at the basic structure of
> pg_dump.  Everybody who's ever looked at that code has found it
> confusing, with the possible exception of the original author who is
> long gone from the project anyway.  I don't know exactly what would make
> it better, but the useless distinction between Archive and ArchiveHandle
> seems like a minor annoyance, not the core disease.
>
> Not that there'd be anything wrong with starting with that.

After some study, I'm reluctant to completely abandon the
Archive/ArchiveHandle distinction because it seems to me that it does
serve some purpose: right now, nothing in pg_dump.c - where all the
code to actually dump stuff lives - knows anything about what's inside
the ArchiveHandle, just the Archive. So having two data structures
really does serve a purpose: if it's part of the Archive, you need it
in order to query the system catalogs and generate SQL. If it isn't,
you don't. Considering how much more crap there is inside an
ArchiveHandle than an Archive, I don't think we should lightly abandon
that distinction.

Now, that having been said, there are probably better ways of making
that distinction than what we have here; Archive and ArchiveHandle
could be better named, perhaps, and we could have pointers from one
structure to the other instead of magically embedding them inside each
other. All the function pointers that are part of the ArchiveHandle
could be separated out into a static, constant structure with a name
like ArchiveMethod, and we could keep a pointer to the appropriate
ArchiveMethod inside each ArchiveHandle instead of copying all the
pointers into it. I think that'd be a lot less confusing.

But the immediate problem is that pg_dump.c is heavily reliant on
global variables, which isn't going to fly if we want this code to use
threads on Windows (or anywhere else). It's also bad style. So I
suggest that we refactor pg_dump.c to get rid of g_conn and g_fout.
Getting rid of g_fout should be fairly easy: in many places, we're
already passing Archive *fout as a parameter. If we pass it as a
parameter to the functions that need it but aren't yet getting it as a
parameter, then it can cease to exist as a global variable and become
local to main().

Getting rid of g_conn is a little harder. Joachim's patch takes the
view that we can safely overload the existing ArchiveHandle.connection
member. Currently, that member is the connection to which we are
doing a direct to database restore; what he's proposing to do is also
use it to store the connection from which are doing the dump. But it
seems possible that we might someday want to dump from one database
and restore into another database at the same time, so maybe we ought
to play it safe and use different variables for those things. So I'm
inclined to think we could just add a "PGconn *remote_connection"
argument to the Archive structure (to go with all of the similarly
named "remote" fields, all of which describe the DB to be dumped) and
then that would be available everywhere that the Archive itself is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message hubert depesz lubaczewski 2012-01-30 17:23:15 Re: pg_dump -s dumps data?!
Previous Message Adrian Klaver 2012-01-30 17:03:47 Re: pg_dump -s dumps data?!