Re: git: uh-oh

From: Michael Haggerty <mhagger(at)alum(dot)mit(dot)edu>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: git: uh-oh
Date: 2010-08-18 09:01:29
Message-ID: 4C6BA169.2040005@alum.mit.edu
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Wed, Aug 18, 2010 at 08:25:45AM +0200, Michael Haggerty wrote:
>> So let's take the simplest example: a branch BRANCH1 is created from
>> trunk commit T1, then some time later another FILE1 from trunk commit T3
>> is added to BRANCH1 in commit B4. How should this series of events be
>> represented in a git repository?
>
> <snip>
>
>> The "exclusive" possibility is to ignore the fact that some of the
>> content of B4 came from trunk and to pretend that FILE1 just appeared
>> out of nowhere in commit B4 independent of the FILE1 in TRUNK:
>>
>> T0 -- T1 -- T2 -------- T3 -- T4        TRUNK
>>        \
>>         B1 -- B2 -- B3 -- B4            BRANCH1
>>
>> This is also wrong, because it doesn't reflect the true lineage of FILE1.
>
> But the "true lineage" is not stored anywhere in CVS so I don't see why
> you need to fabricate it for git. Sure, it would be really nice if you
> could, but if you can't do it reliably, you may as well not do it at
> all. What's the loss?

CVS does record (albeit somewhat ambiguously) the branch from which a
new branch sprouted. The history above might result from commands like

    cvs update -A
    cvs tag -b BRANCH1
    <hack hack>                 cvs update -r BRANCH1
    cvs commit -m T2            <hack hack>
    touch FILE1                 cvs commit -m B1
    cvs add FILE1               <hack hack>
    cvs commit -m T3            cvs commit -m B2
                                <hack hack>
                                cvs commit -m B3
    cvs tag -b BRANCH1 FILE1

or the last step might have been an explicit merge into BRANCH1:

cvs update -j T1 -j T3
cvs commit -m B4

Either way, the CVS history indicates relatively clearly that content
was ported from TRUNK to BRANCH1. Without more information or more
intelligence, there is no way to tell whether it was a cherry-pick
(not recordable in git's history) or a full merge.

Magnus Hagander wrote:
> Our requirements are simple: our cvs history is linear, the git
> history should be linear. It is *not* the same commit that's on head
> and the branch. They are two different commits, that happen to have
> the same commit message and mostly the same content.

I don't think this is at all an issue of cvs2svn merging commits that
happen to have the same commit message and/or commit time. The merge
commits are all manufactured by cvs2svn to do two things:

1. Add content that needs to be on the branch, because a file was added
to the branch after the branch's creation. This *needs* to be done to
ensure that the branch has the correct content.

2. Indicate the origin of the new branch content. This goal is debatable.

> Bottom line is, we want zero merge commits in the git repository. We
> may start using that sometime in the future (but for now, we've
> decided we don't want that even in the future), but we most
> *definitely* don't want it in the past. We don't care about
> "representing the proper heritage of FILE1" in git, because we never
> did in cvs.
>
> Is there some way to make cvs2git work this way, and just not bother
> even trying to create merge commits, or is that fundamentally
> impossible and we need to look at another tool?

A merge is just a special case of content being taken from one branch
and added to another. Logically, the same thing happens when a branch
is created, and some of the same problems can occur in that situation.
A branch can be created using content from multiple source branches,
which cvs2git currently also represents as a merge.

Assuming that you don't want to discard all record of where a branch
sprouted from, it is therefore necessary to choose a single parent
branch for each branch creation. To be sure, this choice can be
incorrect the same way as the merge commits discussed above are
incorrect. But one reasonable "mostly-exclusive" approach would be to
choose the most likely parent as the source of the branch and ignore all
others.
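
To make the "mostly-exclusive" idea concrete, here is a rough sketch of one
possible heuristic: pick the source branch that contributed the most files
and ignore the rest. This is not what cvs2git does today; the function name
choose_parent and the file-to-branch mapping are hypothetical, chosen only
for illustration.

```python
from collections import Counter


def choose_parent(file_sources):
    """Return the most likely parent branch for a branch creation.

    file_sources maps each file path on the new branch to the name of the
    branch its content was taken from (a hypothetical input format).
    """
    counts = Counter(file_sources.values())
    # Highest file count wins; ties are broken by branch name so the
    # result is deterministic.
    name, _count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return name


sources = {"a.c": "TRUNK", "b.c": "TRUNK", "FILE1": "OTHER_BRANCH"}
print(choose_parent(sources))  # prints TRUNK
```

A real implementation would probably want to weigh more than file counts,
but the shape of the decision is the same: one parent in, all others dropped.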

cvs2git doesn't currently have this option. I'm not sure how much work
it would be to implement; probably a few days'. Alternatively, you
could write a tool that would rewrite the ancestry information in the
repository *after* the cvs2git conversion using .git/info/grafts (see
git-filter-branch(1)). Such rewriting would have to occur before the
repository is published, because the rewriting will change the hashes of
most commits.
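
As a minimal sketch of such a tool, the following turns the output of
`git rev-list --parents --all` (one line per commit: the commit hash
followed by its parent hashes) into lines for .git/info/grafts, which uses
the same "commit parent..." layout. It assumes that the first parent of each
merge commit is the one to keep; that assumption would need to be checked
against what cvs2git actually emits. The function names and the short
placeholder hashes are made up for the example.

```python
def parse_rev_list(text):
    """Parse `git rev-list --parents` output into {commit: [parents...]}."""
    parents_by_commit = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            parents_by_commit[fields[0]] = fields[1:]
    return parents_by_commit


def graft_lines(parents_by_commit):
    """Build grafts lines that reduce every merge commit to one parent.

    Assumption: the first listed parent is the branch's own predecessor.
    Commits with zero or one parent need no graft and are skipped.
    """
    lines = []
    for commit, parents in parents_by_commit.items():
        if len(parents) > 1:
            lines.append(f"{commit} {parents[0]}")
    return lines


# Placeholder hashes; real input would contain full 40-character hashes.
sample = "b4 b3 t3\nb3 b2\nt3 t2\n"
for line in graft_lines(parse_rev_list(sample)):
    print(line)  # prints: b4 b3
```

The resulting lines would go into .git/info/grafts, after which
git-filter-branch can make the grafted ancestry permanent.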

Michael
