From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Joel Jacobson <joel(at)trustly(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Schema version management
Date: 2012-07-11 18:01:54
Message-ID: CAHyXU0xD82sRH+djGP9FcUxBFody_JDAqZUjdjLZ40nN=ED6Xw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 5:24 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sun, May 20, 2012 at 2:41 PM, Joel Jacobson <joel(at)trustly(dot)com> wrote:
>> Hi,
>>
>> I just read a very interesting post about "schema version management".
>>
>> Quote: "You could set it up so that every developer gets their own
>> test database, sets up the schema there, takes a dump, and checks that
>> in. There are going to be problems with that, including that dumps
>> produced by pg_dump are ugly and optimized for restoring, not for
>> developing with, and they don't have a deterministic output order." (
>> http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
>> )
>>
>> Back in December 2010, I suggested a new option to pg_dump, --split,
>> which would write the schema definition of each object in separate
>> files:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php
>>
>> Instead of a huge plain-text schema file, impossible to version
>> control, all tables/sequences/views/functions are written to separate
>> files, allowing the use of a version control system, such as git, to
>> do proper version control.
>>
>> The "deterministic output order" problem mentioned in the post above,
>> is not a problem if each object (table/sequence/view/function/etc) is
>> written to the same filename everytime.
>> No matter the order, the tree of files and their content will be
>> identical, no matter the order in which they are dumped.
>>
>> I remember a lot of hackers were very positive about this option, but
>> we somehow failed to agree on the naming of files in the tree
>> structure. I'm sure we can work that out though.
>>
>> I use this feature in production: I have a cronjob which dumps the
>> schema every hour, committing any changes to a separate git branch
>> for each database installation, such as production, development and
>> test. If no changes to the schema have been made, nothing is
>> committed to git since none of the files have changed.
>>
>> It is then drop-dead simple to diff two branches of the database
>> schema, such as development and production, or to diff different
>> revisions for point-in-time comparison of the schema.
>>
>> This is an example of the output of git log --summary for one of the
>> automatic commits to our production database's git repo:
>>
>> --
>> commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
>> Author: Production Database <production(dot)database(at)trustly(dot)com>
>> Date: Fri May 4 15:00:04 2012 +0200
>>
>> Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
>> 19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200
>>
>> create mode 100644
>> gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
>> create mode 100644
>> gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
>> create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
>> create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
>> --
>>
>> Here we can see we apparently deployed a new table,
>> "openingclosingbalances", around Fri May 4 15:00:04.
>>
>> Without any manual work, I'm able to follow all changes actually
>> _deployed_ in each database.
>>
>> At my company, a highly database-centric, stored-procedure-intensive
>> business dealing with mission-critical monetary transactions, we've
>> been successfully doing schema version management without any hassle
>> for the last two years.
>>
>> Hopefully this can add to the list of various possible _useful_ schema
>> version management methods.
>
> What does your patch do that you can't already do with pg_restore?
>
> create function foo(a int, b int, c text) returns int as $$ select 0;
> $$ language sql;
> CREATE FUNCTION
>
> pg_dump -Fc postgres -s > postgres.dump
> pg_restore -l postgres.dump | grep FUNCTION
> 196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin
>
> pg_restore -P "foo(integer, integer, text)" postgres.dump
> <function body follows>
>
> It's fairly easy to wrap pg_restore with a small script that extracts
> function bodies and writes them out to files. This is a great
> and underused feature, so I'd argue that if you wanted to formalize
> per-object file extraction you should be looking at expanding
> pg_restore, not pg_dump.

After extensive off-list discussion with Joel it became clear that
per-object dumping ability really belongs in pg_restore. It already
has some machinery for that, and has the nice property that you can
pull objects out of dumps long after the fact, not just when the dump
happens. It then remains to be worked out whether pg_restore should be
given the ability to write directly to files, as Joel was gunning for,
or simply be extended to improve upon the current TOC-based facilities,
or both.
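
Just to make the status quo concrete, the kind of wrapper script I
mentioned above might look roughly like this (the dump file name, the
output directory, and the signature-to-filename flattening are all just
placeholders, and the TOC parsing is intentionally naive):

#!/bin/sh
# Sketch: write each function body from a schema-only custom-format
# dump into its own file, using only existing pg_restore options.
# Assumes the dump was made with: pg_dump -Fc -s mydb > schema.dump
mkdir -p funcs
pg_restore -l schema.dump | grep ' FUNCTION ' |
while read -r line; do
    # TOC lines look like:
    #   196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin
    sig=$(echo "$line" | sed 's/^.* FUNCTION [^ ]* \(.*\) [^ ]*$/\1/')
    # Naive, lossy flattening of the signature into a filename
    fname=$(echo "$sig" | tr ' (,)' '____')
    pg_restore -P "$sig" schema.dump > "funcs/$fname.sql"
done

That last flattening step is exactly where the naming trouble starts.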

As noted, choosing a reversible, unambiguous filename based on the
database object name is nastiness of the first order. For example,
besides the mentioned issues, some filesystems (Windows) use
case-insensitive entries. What do you do about that? Given that all
the reasonable escape characters are reserved or unusable for some
reason, pretty soon you'll arrive at the point of view that you need
some sort of proxy identifier in the filename to give uniqueness so
that you can ditch all the unpleasantness, just as is done with
relfilenode (maybe using the oid, I don't know). I've worked on the
exact same problem as Joel, but in my case I was able to leverage
known naming conventions in the code; an in-core solution should not
do that, especially when it comes to backup/restore.

So, just thinking out loud here, maybe the way forward is to try to
tackle things in order of controversy:
1) add more options to pg_restore to dump other stuff besides tables
and functions
2) add smarter object selectors for extraction to pg_restore ('all
functions', 'objects in list <list>', 'all schema objects') -- thereby
eliminating some of the necessary surrounding scripting, especially if
you can direct output to a program which handles the writing out, for
example something along the lines of
pg_restore --objects=ALL --handler="dumper.sh" schema.dump

Here dumper.sh is a user-supplied program that consumes the output and
takes the object name, class, etc. as arguments. Again, just thinking
out loud, but maybe that's cleaner than hacking specific filename
conventions directly into pg_restore -- this punts the absolute
minimum to userland (what the filename is and where the file is to be
stored).
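
To make the division of labor concrete, a handler under that
hypothetical interface could be as trivial as the following (the
calling convention -- object class, schema and name as arguments, the
object's definition on stdin -- is purely illustrative; none of this
exists today):

#!/bin/sh
# dumper.sh -- hypothetical per-object handler. pg_restore has no
# --objects/--handler options today; this only sketches what the
# user-supplied side could look like.
# Illustrative calling convention: dumper.sh <class> <schema> <name>,
# with the object's SQL definition supplied on stdin.
class="$1"; schema="$2"; name="$3"
dir="schema/$schema/$class"
mkdir -p "$dir"
# The user, not pg_restore, decides how the name maps to a filename.
cat > "$dir/$name.sql"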

3) figure out a way to dump those selected objects to filenames not
supplied by the user. It's tempting to relax #3 so that filenames
are only guaranteed unique within a particular extraction, but I think
that would foil SCM interaction.

Maybe #3 is not really solvable and a hybrid solution needs to be
worked out -- for example hashing the object signature for the
filename and dumping a TOC file along with the extraction:

__restore_toc.sql:
function foo(a int, b int) -> ABCDEF.sql
function "Foo"(a int, b int) -> 132456.sql

etc. Not very pleasant, but at least internally consistent and
SCM-friendly.
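
A sketch of what such a hybrid extraction could look like, built only
on what pg_restore can do today (the hashing scheme and layout are just
one possibility; md5sum is the GNU coreutils spelling):

#!/bin/sh
# Sketch: stable, collision-free filenames derived from a hash of the
# object signature, plus a TOC file mapping signatures to filenames.
mkdir -p objects
: > objects/__restore_toc.sql
pg_restore -l schema.dump | grep ' FUNCTION ' |
while read -r line; do
    sig=$(echo "$line" | sed 's/^.* FUNCTION [^ ]* \(.*\) [^ ]*$/\1/')
    hash=$(printf '%s' "$sig" | md5sum | cut -c1-12)
    pg_restore -P "$sig" schema.dump > "objects/$hash.sql"
    echo "function $sig -> $hash.sql" >> objects/__restore_toc.sql
done

As long as the signature doesn't change the filename doesn't change,
so an SCM sees edits to a function body as changes to the same file.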

merlin
