Re: git: uh-oh

From: Michael Haggerty <mhagger(at)alum(dot)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: git: uh-oh
Date: 2010-08-18 06:25:45
Message-ID: 4C6B7CE9.3000701@alum.mit.edu
Lists: pgsql-hackers

Tom Lane wrote:
> I lack git-fu pretty completely, but I do have the CVS logs ;-).
> It looks like some of these commits that are being ascribed to the
> REL8_3_STABLE branch were actually only committed on HEAD. For
> instance my commit in contrib/xml2 on 28 Feb 2010 21:31:57 was
> only in HEAD. It was back-patched a few hours later (1 Mar 3:41),
> and that's also shown here, but the HEAD commit shouldn't be.
>
> I wonder whether the repository is completely OK and the problem
> is that this webpage isn't filtering the commits correctly.

Please don't panic :-)

The problem is that it is *impossible* to faithfully represent a CVS or
Subversion history with its ancestry information in a git repository (or
AFAIK any of the DVCS repositories). The reason is that CVS
fundamentally records the history of single files, and each file can
have a branching history that is incompatible with those of other files.
For example, in CVS, a file can be added to a branch after the branch
already exists, different files can be added to a branch from multiple
parent branches, and even more perverse things are allowed. The CVS
history can record this mish-mash (albeit with much ambiguity).
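As a rough illustration of that per-file mismatch, the sketch below records, for each file on a branch, which trunk state the file was taken from, and then checks whether the branch has one consistent branch point. The file names, the trunk labels, and the dictionary layout are hypothetical; this is not how CVS or cvs2git store anything.

```python
# Hypothetical per-file metadata: for each file on a branch, the trunk
# state from which its branch copy was taken. All names are illustrative.
branch_origin = {
    "old_file.c": "T1",  # already present when the branch was created
    "new_file.c": "T3",  # added to the branch later, from a newer trunk state
}

def branch_points(origins):
    """Return the distinct trunk states that the branch's files came from."""
    return set(origins.values())

def has_single_branch_point(origins):
    """True if every file on the branch sprouted from the same trunk state."""
    return len(branch_points(origins)) == 1

print(sorted(branch_points(branch_origin)))    # ['T1', 'T3']
print(has_single_branch_point(branch_origin))  # False
```

A branch whose files report more than one origin has no single tree-wide branch point, which is exactly the situation a whole-tree history cannot express directly.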

Git, on the other hand, fundamentally only records a single history that
is considered to apply to the entire source tree. If a commit is
created with more than one parent, git treats it as a merge and
implicitly assumes that all of the contents of all of the ancestor
commits of all of the parents have been merged into the new version of
the source tree.

See [1] for more discussion of the impedance mismatch between the
branching model of CVS/Subversion vs. that of the DVCSs.

So let's take the simplest example: a branch BRANCH1 is created from
trunk commit T1, then some time later a file FILE1 from trunk commit T3
is added to BRANCH1 in commit B4. How should this series of events be
represented in a git repository?

The "inclusive" possibility is to say that some content was merged from
trunk to BRANCH1, and therefore to treat B4 as a merge commit:

T0 -- T1 -- T2 -------- T3 -- T4        TRUNK
       \                 \
        B1 -- B2 -- B3 -- B4            BRANCH1

This is wrong because there might be other changes in T2 and T3 (besides
the addition of FILE1) that were *not* merged to BRANCH1.

The "exclusive" possibility is to ignore the fact that some of the
content of B4 came from trunk and to pretend that FILE1 just appeared
out of nowhere in commit B4 independent of the FILE1 in TRUNK:

T0 -- T1 -- T2 -------- T3 -- T4        TRUNK
       \
        B1 -- B2 -- B3 -- B4            BRANCH1

This is also wrong, because it doesn't reflect the true lineage of FILE1.
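The difference between the two candidate histories can be made concrete by writing each one as a map of parent links and listing the ancestors of B4. This is a minimal sketch: the commit names come from the diagrams above, while the dictionary representation and the `ancestors` helper are assumptions made for illustration.

```python
# Parent links for the example history. Commit names follow the diagrams;
# the data layout itself is an illustrative assumption.
TRUNK = {"T0": [], "T1": ["T0"], "T2": ["T1"], "T3": ["T2"], "T4": ["T3"]}
BRANCH = {"B1": ["T1"], "B2": ["B1"], "B3": ["B2"]}

inclusive = {**TRUNK, **BRANCH, "B4": ["B3", "T3"]}  # B4 is a merge commit
exclusive = {**TRUNK, **BRANCH, "B4": ["B3"]}        # B4 has a single parent

def ancestors(parents, commit):
    """Return every commit reachable from `commit` through parent links."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

print(sorted(ancestors(inclusive, "B4")))
# ['B1', 'B2', 'B3', 'T0', 'T1', 'T2', 'T3']
print(sorted(ancestors(exclusive, "B4")))
# ['B1', 'B2', 'B3', 'T0', 'T1']
```

The only difference is T2 and T3: the inclusive form claims them as ancestors of B4, and the exclusive form drops every link between B4 and the trunk commit that FILE1 came from.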

Given the choice between two wrong histories, cvs2git uses the
"inclusive" style. The result is that the ancestors of B4 include not
only T0, T1, B1, B2, and B3 (as might be expected), but also T2 and T3.
The display on the website that was quoted [2] seems to mash all of the
ancestors together without showing the topology of the history, making
the result quite confusing. The true history looks more like this:

$ git log --oneline --graph REL8_3_10 master
[...]
| * 2a91f07 tag 8.3.10
| * eb1b49f Preliminary release notes for releases 8.4.3, 8.3
| * dcf9673 Use SvROK(sv) rather than directly checking SvTYP
| * 1194fb9 Update time zone data files to tzdata release 201
| * fdfd1ec Return proper exit code (3) from psql when ON_ERR
| * 77524a1 Backport fix from HEAD that makes ecpglib give th
| * 55391af Add missing space in example.
| * 982aa23 Require hostname to be set when using GSSAPI auth
| * cb58615 Update time zone data files to tzdata release 201
| * ebe1e29 When reading pg_hba.conf and similar files, do no
| * 5a401e6 Fix a couple of places that would loop forever if
| * 5537492 Make contrib/xml2 use core xml.c's error handler,
| * c720f38 Export xml.c's libxml-error-handling support so t
| * 42ac390 Make iconv work like other optional libraries for
| * b03d523 pgindent run on xml.c in 8.3 branch, per request
| * 7efcdaa Add missing library and include dir for XSLT in M
| * 6ab1407 Do not run regression tests for contrib/xml2 on M
| * fff18e6 Backpatch MSVC build fix for XSLT
| * 7ae09ef Fix numericlocale psql option when used with a nu
| * de92a3d Fix contrib/xml2 so regression test still works w
| * 80f81c3 This commit was manufactured by cvs2svn to crea
| |\
| |/
|/|
* | a08b04f Fix contrib/xml2 so regression test still works w
* | 0d69e0f It's clearly now pointless to do backwards compat
* | 4ad348c Buildfarm still unhappy, so I'll bet it's EACCES
* | 6e96e1b Remove xmlCleanupParser calls from contrib/xml2.
* | 5b65b67 add EPERM to the list of return codes to expect f
| * a4067b3 Remove xmlCleanupParser calls from contrib/xml2.
| * 91b76a4 Back-patch today's memory management fixups in co
| * 5e74f21 Back-patch changes of 2009-05-13 in xml.c's memor
| * 043041e This commit was manufactured by cvs2svn to crea
| |\
| |/
|/|
* | 98cc16f Fix up memory management problems in contrib/xml2
* | 17e1420 Second try at fsyncing directories in CREATE DATA
* | a350f70 Assorted code cleanup for contrib/xml2. No chang
* | 3524149 Update complex locale example in the documentatio
[...]

The left branch is master, the right branch is the one leading to
REL8_3_10. You can see that there are multiple merges from master to
the branch, presumably when new files from trunk were ported to the
branch. This is even easier to see using a graphical history browser
like gitk.

There are good arguments for both the "inclusive" and the "exclusive"
representation of history. The ideal would require a lot more
intelligence and better heuristics (and slow down the conversion
dramatically). But even the smartest conversion would still be wrong,
because git is simply incapable of representing an arbitrary CVS
history. The main practical result of the impedance mismatch is that it
will be more difficult to merge between branches that originated in CVS
(but that is no surprise!)
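One way to see why later merges are affected is to look at the best common ancestor of the two branch tips, which is what `git merge-base` reports and what a three-way merge starts from. The sketch below is a simplified model of that search applied to the T/B example from earlier in this message; the data layout and helper names are assumptions, not git internals.

```python
# Simplified model of a best-common-ancestor search. Commit names follow
# the T/B example; the data layout is an illustrative assumption.
TRUNK = {"T0": [], "T1": ["T0"], "T2": ["T1"], "T3": ["T2"], "T4": ["T3"]}
BRANCH = {"B1": ["T1"], "B2": ["B1"], "B3": ["B2"]}

inclusive = {**TRUNK, **BRANCH, "B4": ["B3", "T3"]}  # B4 recorded as a merge
exclusive = {**TRUNK, **BRANCH, "B4": ["B3"]}        # B4 recorded as a plain commit

def reachable(parents, commit):
    """Return `commit` plus every commit reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = reachable(parents, a) & reachable(parents, b)
    return {c for c in common
            if not any(c != d and c in reachable(parents, d) for d in common)}

print(merge_bases(inclusive, "T4", "B4"))  # {'T3'}
print(merge_bases(exclusive, "T4", "B4"))  # {'T1'}
```

Under this model, a later merge of trunk into BRANCH1 would start from T3 in the inclusive history, so any T2 or T3 changes that were never actually ported would be treated as already merged. In the exclusive history the merge would start from T1 and would see FILE1 as added independently on both sides.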

Michael
the cvs2svn/cvs2git maintainer

[1] http://softwareswirl.blogspot.com/2009/08/git-mercurial-and-bazaarsimplicity.html
