Re: pg_dump --split patch

From: Joel Jacobson <joel(at)gluefinance(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: David Wilson <david(dot)t(dot)wilson(at)gmail(dot)com>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_dump --split patch
Date: 2010-12-29 01:18:46
Message-ID: AANLkTim+sFO7N539V5C+yZFx7_fTFQxdHxtCUyhPD-3V@mail.gmail.com
Lists: pgsql-hackers

2010/12/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>

>
> If you've solved the deterministic-ordering problem, then this entire
> patch is quite useless. You can just run a normal dump and diff it.
>
>
No, that's only half true.

Diff will do a good job minimizing the "size" of the diff output, yes, but
such a diff is still quite useless if you want to quickly grasp the context
of the change.

If you have hundreds of functions, just looking at the changed source code
is not enough to figure out which functions were modified, unless you have
the brain power to memorize every single line of code and are able to figure
out the function name just by looking at the old and new lines of code.

To understand a change to my database functions, I would start by looking at
the top level, focusing only on the names of the functions
modified/added/removed.
At this stage, you want as little information as possible about each change,
such as only the names of the functions.
To get such a list of changed functions, you cannot compare two full
schema plain text dumps using diff, as it would only reveal the lines
changed, not the names of the functions, unless you are lucky enough to get
the name of the function within the (by default) 3 lines of copied context.

While you could increase the number of copied lines of context to a value
which would ensure you would see the name of the function in the diff, that
is not feasible if you want to quickly "get a picture" of the code areas
modified, since you would then need to read through even more lines of diff
output.
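
To make the context problem concrete, here is a small sketch using Python's
difflib rather than diff itself. The function name process_payment and its
20-line body are made up for illustration; the point is only that a change
deep inside a long function does not carry the function name into a diff
with 3 lines of context, while a context large enough to include the name
makes the output much longer.

```python
import difflib

# Hypothetical stand-in for one function in a plain-text schema dump.
def make_function(name, changed):
    body = [f"    PERFORM step_{i}();" for i in range(20)]
    if changed:
        # Change a single line deep inside the function body.
        body[15] = "    PERFORM step_15_fixed();"
    return [
        f"CREATE FUNCTION {name}() RETURNS void AS $$",
        "BEGIN",
        *body,
        "END;",
        "$$ LANGUAGE plpgsql;",
    ]

old = make_function("process_payment", changed=False)
new = make_function("process_payment", changed=True)

def diff_lines(context):
    # n is the number of context lines copied around each change.
    return list(difflib.unified_diff(old, new, "old.sql", "new.sql",
                                     n=context, lineterm=""))

default_diff = diff_lines(3)   # the usual default context
wide_diff = diff_lines(20)     # enough context to reach the CREATE line

name_in_default = any("process_payment" in line for line in default_diff)
name_in_wide = any("process_payment" in line for line in wide_diff)

print("name visible with 3 lines of context: ", name_in_default)
print("name visible with 20 lines of context:", name_in_wide)
print("diff lengths:", len(default_diff), "vs", len(wide_diff))
```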

For a less database-centric system where you don't have hundreds of stored
procedures, I would agree it's not an issue to keep track of changes by
diffing entire schema files, but for extremely database-centric systems,
such as the one we have developed at my company, it's not possible to "get
the whole picture" of a change by analyzing diffs of entire schema dumps.

The patch has been updated:

*) Only split objects whose namespace (schema) is not null
*) Append all objects with the same tag (name), the same type (desc) and the
same namespace (schema) to the same file (i.e., do not append -2, -3, like
before) (Suggested by David Wilson, thanks.)
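
The two rules above can be sketched roughly as follows. This is not the
patch code (the patch is C inside pg_dump); it is a Python illustration, and
the directory layout <root>/<schema>/<type>/<name>.sql, the function names
split_path and write_split, and the sample objects are all assumptions made
for the example.

```python
import os
import tempfile

def split_path(root, schema, objtype, name):
    # Hypothetical layout: <root>/<schema>/<type>/<name>.sql
    return os.path.join(root, schema, objtype, name + ".sql")

def write_split(root, objects):
    """objects is an iterable of (schema, objtype, name, ddl) tuples."""
    for schema, objtype, name, ddl in objects:
        if schema is None:
            # Rule 1: objects without a namespace are not split out.
            continue
        path = split_path(root, schema, objtype, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # Rule 2: append, so overloaded functions with the same name
        # land in the same file instead of name-2.sql, name-3.sql.
        with open(path, "a", encoding="utf-8") as f:
            f.write(ddl + "\n")

# Made-up sample objects: two overloads of add() and one object
# without a schema.
sample = [
    ("public", "FUNCTION", "add", "CREATE FUNCTION add(int) ..."),
    ("public", "FUNCTION", "add", "CREATE FUNCTION add(int, int) ..."),
    (None, "COMMENT", "misc", "COMMENT ON ..."),
]

with tempfile.TemporaryDirectory() as root:
    write_split(root, sample)
    with open(split_path(root, "public", "FUNCTION", "add")) as f:
        print(f.read())
```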

I also tried adding "ORDER BY pronargs" and "ORDER BY pronargs DESC" to the
queries in getFuncs() in pg_dump.c, but it had no effect on the order in
which functions with the same name but a different number of arguments were
dumped.
Perhaps the functions are already sorted?
Anyway, it doesn't matter that much; keeping all functions of the same name
in the same file is a fair trade-off, I think. The main advantage is the
ability to quickly get a picture of the names of all changed functions;
optimizing the actual diff output is secondary.
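
As an illustration of that main advantage, once each object lives in its own
file, a list of added/removed/modified objects falls out of a plain
comparison of two dump directories. The sketch below is Python, not part of
the patch; read_tree and summarize are hypothetical helper names, and the
two directories are assumed to be split dumps taken before and after a
change.

```python
import os

def read_tree(root):
    """Map each file's path (relative to root) to its raw contents."""
    files = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                files[os.path.relpath(full, root)] = f.read()
    return files

def summarize(old_root, new_root):
    """Return only the names of changed objects, not the changes."""
    old, new = read_tree(old_root), read_tree(new_root)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in old.keys() & new.keys()
                           if old[p] != new[p]),
    }
```

Calling summarize("dump_before", "dump_after") (directory names made up)
gives the top-level picture first; an ordinary diff of one of the listed
files then shows the details.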

--
Best regards,

Joel Jacobson
Glue Finance

E: jj(at)gluefinance(dot)com
T: +46 70 360 38 01

Postal address:
Glue Finance AB
Box 549
114 11 Stockholm
Sweden

Visiting address:
Glue Finance AB
Birger Jarlsgatan 14
114 34 Stockholm
Sweden

Attachment Content-Type Size
pg-dump-split-plain-text-files-9.1devel.patch application/octet-stream 5.1 KB
pg-dump-split-plain-text-files-9.1alpha2.patch application/octet-stream 5.5 KB
pg-dump-split-plain-text-files-8.4.6.patch application/octet-stream 5.4 KB
