Report: removing the inconsistencies in our CVS->git conversion

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Michael Haggerty <mhagger(at)alum(dot)mit(dot)edu>, Max Bowsher <maxb(at)f2s(dot)com>
Subject: Report: removing the inconsistencies in our CVS->git conversion
Date: 2010-09-13 03:03:01
Message-ID: 20377.1284346981@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-www

I've spent much of the weekend examining the discrepancies between our CVS
repository and the tarballs available from our FTP archives, and after
that trying to remove infelicities in the cvs2git output. There are a
couple of remaining oddities that I would classify as probable cvs2git
bugs, but most of the trouble is inconsistencies in the CVS repository
itself, some of which I can explain and some of which I can't. Read on for
many boring details.

One thing that only old-timers will recall is that originally the PG code
base was divided into multiple repositories. There was one for the server
code and one for the client interfaces, and I believe that at the very
beginning much of the documentation was in yet a third place. The oldest
stuff that's now in src/interfaces/ was in the client repository. It
looks to me like when the earliest tarballs were made up, the
subdirectories that were in the client repository were dumped directly
under src/ instead of src/interfaces; that is, the directory layout of
those tarballs does not exactly match the current CVS repository layout.

I also found out that somebody seems to have manually moved the RCS file
for src/backend/commands/version.c into src/backend/commands/_deadcode,
and that a couple of subdirectories apparently were manually renamed
somewhere along the line.

The upshot of all this is that if you want to match the old tarballs to
current CVS contents, you need to make these hacks:

# hacks to make certain old versions diff successfully

# earliest tarballs: client-interface directories sit directly under src/
if ((-d "postgresql-v$tag/src" and
     not -d "postgresql-v$tag/src/interfaces") or
    -d "postgres95/src") {
    print "moving src/interfaces for $tag\n";
    system("mv cvsout/src/interfaces/* cvsout/src") == 0 || die "mv failed: $?";
    system("rmdir cvsout/src/interfaces") == 0 || die "rmdir failed: $?";
}
# subdirectory that was later renamed to perl5
if (-d "postgresql-v$tag/src/pgsql_perl5") {
    print "moving perl5 for $tag\n";
    system("mv cvsout/src/perl5 cvsout/src/pgsql_perl5") == 0 || die "mv failed: $?";
}
# version.c's RCS file was manually moved into _deadcode
if (-f "postgresql-$tag/src/backend/commands/version.c" or
    -f "postgresql-v$tag/src/backend/commands/version.c" or
    -f "postgres95/src/backend/commands/version.c") {
    print "moving version.c for $tag\n";
    system("mv cvsout/src/backend/commands/_deadcode/version.c cvsout/src/backend/commands") == 0 || die "mv failed: $?";
    # rmdir result is not checked here, unlike the mv above
    system("rmdir cvsout/src/backend/commands/_deadcode 2>/dev/null");
}
# subdirectory that was later renamed to gr_GR.ISO8859-7
if (-d "postgresql-$tag/src/test/locale/ISO8859-7") {
    print "moving ISO8859-7 for $tag\n";
    system("mv cvsout/src/test/locale/gr_GR.ISO8859-7 cvsout/src/test/locale/ISO8859-7") == 0 || die "mv failed: $?";
}

Just for the record, these are the versions for which these tests hit:

moving src/interfaces for 1.08
moving version.c for 1.08
moving src/interfaces for 1.09
moving version.c for 1.09
moving src/interfaces for 6.1
moving perl5 for 6.1
moving version.c for 6.1
moving src/interfaces for 6.1.1
moving perl5 for 6.1.1
moving version.c for 6.1.1
moving version.c for 6.2
moving version.c for 6.2.1
moving version.c for 6.3.2
moving ISO8859-7 for 6.5
moving ISO8859-7 for 6.5.1
moving ISO8859-7 for 6.5.2
moving ISO8859-7 for 6.5.3

With those changes, I am able to match all the available archival tarballs
to various places in the CVS history. The exact spots where they match
are detailed in the attached "matches" file. The file also shows the
cvsroot path and CVS module name that were in use at each time; you need
to duplicate them if you want $Header$ lines to match what's in the
tarballs. (I set up symlinks to the base repository on my machine so that
CVS could check out successfully for each of these scenarios.)
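
For readers who want to repeat the comparison without the attached Perl
script, here is a minimal Python sketch of the basic idea: walk a CVS
checkout and an unpacked tarball, and report files that exist on only one
side or whose contents differ. This is not the attached script; the
function names and the directory arguments are assumptions made for
illustration, and the directory-layout hacks shown earlier would still have
to be applied to the checkout first.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # CVS bookkeeping directories exist only in the checkout, so skip them.
        dirnames[:] = [d for d in dirnames if d != "CVS"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_trees(checkout, tarball):
    """Return (only_in_checkout, only_in_tarball, differing) as sorted lists."""
    in_checkout = list_files(checkout)
    in_tarball = list_files(tarball)
    differing = [
        rel
        for rel in sorted(in_checkout & in_tarball)
        # shallow=False forces a byte-for-byte comparison of the contents.
        if not filecmp.cmp(os.path.join(checkout, rel),
                           os.path.join(tarball, rel),
                           shallow=False)
    ]
    return (sorted(in_checkout - in_tarball),
            sorted(in_tarball - in_checkout),
            differing)
```

Calling something like compare_trees("cvsout", "postgresql-6.5") returns
three lists; a tarball matches a given tag or checkout date when the
"differing" list is empty and the one-sided lists contain nothing
unexpected. Because the comparison is byte-for-byte, expanded $Header$
lines only match when the checkout used the same cvsroot path and module
name as the original.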

There are still a couple of unexplainable discrepancies, though.
In particular, the 1.08 and 1.09 tarballs contain this fix:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/libpgtcl/Attic/pgtclCmds.c.diff?r1=1.10;r2=1.11
which is odd because it wasn't applied to CVS till months after those
tarballs were made. Even odder, the file timestamp on pgtclCmds.c in
the tarballs agrees with CVS revision 1.2, which is what ought to be in
those tarballs according to CVS. It may be that this fix was made in the
separate client-code repository and not propagated to the core till later;
but that theory doesn't explain the exact timestamp match.

Anyway, the distressing thing about what the "matches" file shows is that
we do not have CVS tags for a lot of the older tarballs. Even worse,
there are a couple of CVS tags that look like they ought to match released
tarballs, but do not: the tags were evidently applied a few commits before
the tarball was actually made. In particular, the tags REL6_5, REL7_1,
and REL7_1_2 don't match the tarballs they ought to. I don't have a whole
lot of faith in some of the other early tags either, because we don't seem
to have an archived tarball to compare them to.

Having completed that comparison, I then moved on to trying to get rid of
the discrepancies in the git conversion; particularly, trying to get rid
of the "manufactured commits". I didn't have much success in that for the
cases where the manufactured commit was caused by a back-branch file
addition. The case I showed before where things cleaned up nicely (for
pg_dump's it.po) depended on the fact that the place where the branch
would naturally sprout off happened to be a "dead" revision on HEAD.
That's not the case anywhere else, so I gave up on the complicated patch
for it.po. The patches I'm using instead just inject a dead ".0" revision
immediately after the branch point, and are pretty small and easy to
verify. I only bothered to do this for the cases where the back-branch
addition happened significantly later than the main-branch addition. If
they were done in a group of related commits with nothing else in between,
I left well enough alone. We still have "manufactured" commits either
way, but they are just cosmetic so I guess we should live with them.

I also found numerous places where we'd been sloppy about placing tags.
That explains some of the weird things cvs2git did. In particular:

* We had the already-known problem that gram.c and some other derived
files had commits made after they should have been dead.

* Bruce had transiently added those files on the WIN32_DEV branch as
well, to general disapproval, and this seemed to also give cvs2git
indigestion. The attached proposed fixup script deals with this by
deleting those revisions altogether. This is a loss of history, but
not one that I care about.

* The HISTORY and INSTALL files have REL7_3_10 tags and should not.
As mentioned earlier, I think this is because they were deleted after the
original placement of that tag, and weren't correctly fixed when the
tag was moved up to branch end a few days later.

* The regression test files recently added to contrib/xml2 have REL8_0_23
tags. I have no idea how that happened, because they certainly didn't
exist when 8.0.23 was released.

* There are a bunch of files that should have REL7_3_5 tags and lack them.
They are in just a few subdirectories, so probably what happened was that
the "cvs tag" operation was issued in an incomplete checkout tree.

* Similarly, gram.c should have a release-6-3 tag and lacks it.

* There are a bunch of files that have REL7_1 tags when what they should
have are REL7_1_BETA tags. These appear to be exactly the files that were
deleted between the initial placement of the REL7_1 tag and Marc's later
ex-post-facto renaming of the tag to REL7_1_BETA. I'm guessing another
case of "cvs tag" missing files that weren't in the checkout.

* There are a number of files that lack the REL2_0 tag and REL2_0B branch,
though they should have them according to file dates. These appear to be
exactly the files that were in the separate documentation repository at
the time, so that probably tells us the mechanism for missing them.
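
Missing or stray tags like the ones listed above can be spotted by reading
the "symbols" list in the header of each RCS ",v" file, which maps tag and
branch names to revision numbers. The Python sketch below is only an
illustration of that check, not how cvs2git or the attached fixup script
works; the function names are hypothetical, and it assumes the simple case
where "symbols" starts a line in the admin header, as it does in files
written by CVS.

```python
import re

# In the RCS admin header the tag list looks like:
#   symbols
#       REL7_3_5:1.2
#       REL7_3_STABLE:1.1.0.2;
_SYMBOLS = re.compile(r"^symbols\b([^;]*);", re.MULTILINE)


def rcs_symbols(rcs_text):
    """Return {tag_name: revision} parsed from the text of an RCS ,v file."""
    match = _SYMBOLS.search(rcs_text)
    if match is None:
        return {}
    symbols = {}
    for item in match.group(1).split():
        if ":" in item:
            name, rev = item.split(":", 1)
            symbols[name] = rev
    return symbols


def files_lacking_tag(rcs_texts, tag):
    """Given {path: ,v file text}, return the sorted paths without `tag`."""
    return sorted(path for path, text in rcs_texts.items()
                  if tag not in rcs_symbols(text))
```

This only reports which files carry a tag. Deciding whether a file should
carry it still requires comparing revision dates against the release date,
which the sketch does not attempt.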

After fixing all the above items using the attached script, I have what
seems to be a reasonably clean conversion. I still have the three
oddities alluded to over in the "uh-oh" thread, but I'm not sure any of
them should be considered blockers for making the conversion. There are
also some cosmetic issues remaining, like what committer to blame the
various inserted commits on and whether we want to keep partial tags.
But this message is long enough already so I'll get to those issues
separately.

Attached are an updated version of Max's README file about how to perform
the conversion, the repository fixup script needed for that, the Perl
script I used for comparing CVS to tarballs, and the input file for the
Perl script, which shows which CVS tag or checkout date to compare against
each of the available tarballs.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 1.4 KB
unknown_filename text/plain 42.1 KB
unknown_filename text/plain 1.8 KB
unknown_filename text/plain 9.1 KB
