Re: PostgreSQL Developer meeting minutes up

From: "Markus Wanner" <markus(at)bluegap(dot)ch>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Magnus Hagander" <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL Developer meeting minutes up
Date: 2009-05-29 06:41:09
Message-ID: 20090529084109.14871sskioiu9gud@mail.bluegap.ch
Lists: pgsql-hackers

Hi,

Quoting "Robert Haas" <robertmhaas(at)gmail(dot)com>:
> That's not the best news I've had today...

Sorry :-(

> To me they sound complex and inconvenient. I guess I'm kind of
> mystified by why we can't make this work reliably. Other than the
> "broken tags" issue we've discussed, it seems like the only real issue
> should be how to group changes to different files into a single
> commit. Once you do that, you should be able to construct a
> well-defined, total function f : <cvs-file, cvs-revision> -> <git
> commit> which is surjective on the space of git commits. In fact it
> might be a good idea to explicitly construct this mapping and drop it
> into a database table somewhere so that people can sanity check it as
> much as they wish. Why is this harder than I think it is?

Well, as CVS doesn't guarantee any consistency between files, you end
up with silly situations more often than you think. One of the
simplest possible examples is something like:

commit 1: fileA @ 1.1, fileB @ 1.2
commit 2: fileA @ 1.2, fileB @ 1.1

Seen from fileA, it's obvious that commit 1 (@1.1) comes before commit
2 (@1.2), but seen from fileB it's the exact opposite. The most
promising approach to solving these problems seems to be based on graph
theory, where you work with a graph of dependencies between revisions,
e.g. from fileA @ 1.1 to fileA @ 1.2.
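
As a rough illustration only (not the code used by monotone's cvs_import
or by cvs2svn), the conflict above can be reproduced by turning the
per-file revision order into edges between candidate commits and then
looking for a cycle. The names blobs, build_edges and has_cycle are made
up for this sketch, and it assumes trunk-only revision numbers that can
be compared numerically:

```python
from collections import defaultdict

# Each candidate commit ("blob") maps a file name to the CVS revision it holds.
blobs = {
    "commit 1": {"fileA": "1.1", "fileB": "1.2"},
    "commit 2": {"fileA": "1.2", "fileB": "1.1"},
}


def rev_key(rev):
    """Turn '1.10' into (1, 10) so revisions compare numerically."""
    return tuple(int(part) for part in rev.split("."))


def build_edges(blobs):
    """Add an edge X -> Y when some file is older in X than in Y."""
    per_file = defaultdict(list)
    for name, files in blobs.items():
        for path, rev in files.items():
            per_file[path].append((rev_key(rev), name))
    edges = defaultdict(set)
    for revisions in per_file.values():
        revisions.sort()
        for (_, older), (_, newer) in zip(revisions, revisions[1:]):
            if older != newer:
                edges[older].add(newer)
    return edges


def has_cycle(blobs, edges):
    """Depth-first search; reaching a node still on the stack means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {name: WHITE for name in blobs}

    def visit(node):
        state[node] = GREY
        for nxt in edges.get(node, ()):
            if state[nxt] == GREY:
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[name] == WHITE and visit(name) for name in blobs)


edges = build_edges(blobs)
print(has_cycle(blobs, edges))  # True: fileA and fileB disagree on the order
```

fileA contributes the edge commit 1 -> commit 2 and fileB contributes
commit 2 -> commit 1, so no linear order of the two commits exists.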

To resolve the above situation, you'd have to "split" a blob of
single-file commits into two end-result commits (for monotone / git).
In the above example, you'd have two options to resolve the conflict:

commit 1a: fileA @ 1.1
commit 2: fileA @ 1.2, fileB @ 1.1
commit 1b: fileB @ 1.2

Or:

commit 2a: fileB @ 1.1
commit 1: fileA @ 1.1, fileB @ 1.2
commit 2b: fileA @ 1.2

(Note that often enough, these have actually been separate commits in
CVS as well; there's just no way to represent that. And no, timestamps
are simply not reliable enough.)
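
A minimal sketch of such a split, under the same trunk-only assumption
and with made-up helper names (predecessors, order, split_blob). It
splits one blob into an "a" and a "b" part and then checks that a
linear order exists, using Python's standard graphlib module
(Python 3.9 or later). Choosing which blob to split, and where, is the
hard part and is left to the caller here; this is not the heuristic
used by any real converter:

```python
from graphlib import CycleError, TopologicalSorter


def rev_key(rev):
    """Turn '1.10' into (1, 10) so revisions compare numerically."""
    return tuple(int(part) for part in rev.split("."))


def predecessors(blobs):
    """Map each blob to the set of blobs that must come before it."""
    preds = {name: set() for name in blobs}
    for a, files_a in blobs.items():
        for b, files_b in blobs.items():
            if a == b:
                continue
            for path in files_a.keys() & files_b.keys():
                if rev_key(files_a[path]) < rev_key(files_b[path]):
                    preds[b].add(a)
    return preds


def order(blobs):
    """Return a valid commit order, or None if the blobs form a cycle."""
    try:
        return list(TopologicalSorter(predecessors(blobs)).static_order())
    except CycleError:
        return None


def split_blob(blobs, name, first_files):
    """Split blob `name`: files in first_files go to '<name>a', the rest to '<name>b'."""
    result = {k: dict(v) for k, v in blobs.items() if k != name}
    files = blobs[name]
    result[name + "a"] = {p: r for p, r in files.items() if p in first_files}
    result[name + "b"] = {p: r for p, r in files.items() if p not in first_files}
    return result


blobs = {
    "commit 1": {"fileA": "1.1", "fileB": "1.2"},
    "commit 2": {"fileA": "1.2", "fileB": "1.1"},
}

print(order(blobs))                                      # None (cyclic)
print(order(split_blob(blobs, "commit 1", {"fileA"})))   # ['commit 1a', 'commit 2', 'commit 1b']
print(order(split_blob(blobs, "commit 2", {"fileB"})))   # ['commit 2a', 'commit 1', 'commit 2b']
```

Either split breaks the cycle; which one matches what really happened
in CVS cannot be decided from the revision numbers alone.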

Now add tags, branches and cyclic dependencies involving many files
and many hundreds of commits to the example above and you start to get
an idea of the complexity of the problem in general.

See my description and diagrams of the steps used for cvs_import in
monotone at [1] or follow descriptions of how cvs2svn works internally.

A few numbers about a conversion I'm trying for testing my algorithm
and heuristics. It's converting a pretty recent snapshot of the
Postgres repository:

* running at 100% CPU time since: April 17
* total number of files involved: 6'847
* total number of blobs (before splitting): 28'010
* blobs split due to cyclic dependencies: 12'801

Admittedly, my algorithm isn't optimized at all. However, I'm focusing
on good results rather than speed of conversion.

Also note that monotone uses SQLite, so it actually stores the
results of this conversion in an SQL database, as you proposed.
Recently, a git_export command has been added, so that's definitely
worth a try for converting CVS to git. However, I fear cvs2git is more
mature.
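
To make the sanity-check idea concrete, here is a small sketch of a
<cvs-file, cvs-revision> -> <git commit> mapping table using Python's
sqlite3 module. The schema, the table and column names and the commit
ids are all invented for illustration; this is not monotone's actual
database layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE git_commit (id TEXT PRIMARY KEY);
    CREATE TABLE cvs_revision (
        file      TEXT NOT NULL,
        revision  TEXT NOT NULL,
        commit_id TEXT NOT NULL REFERENCES git_commit(id),
        PRIMARY KEY (file, revision)
    );
""")

# Hypothetical result of the first split option from the example above.
conn.executemany("INSERT INTO git_commit VALUES (?)",
                 [("c1a",), ("c2",), ("c1b",)])
conn.executemany(
    "INSERT INTO cvs_revision VALUES (?, ?, ?)",
    [("fileA", "1.1", "c1a"), ("fileA", "1.2", "c2"),
     ("fileB", "1.1", "c2"), ("fileB", "1.2", "c1b")],
)


def commit_for(conn, file, revision):
    """Look up the commit a given file revision ended up in, or None."""
    row = conn.execute(
        "SELECT commit_id FROM cvs_revision WHERE file = ? AND revision = ?",
        (file, revision),
    ).fetchone()
    return row[0] if row else None


def unmapped_commits(conn):
    """Commits that no file revision maps to; empty means the mapping is surjective."""
    rows = conn.execute(
        "SELECT id FROM git_commit "
        "WHERE id NOT IN (SELECT commit_id FROM cvs_revision) ORDER BY id"
    )
    return [row[0] for row in rows]


print(commit_for(conn, "fileB", "1.1"))  # c2
print(unmapped_commits(conn))            # []
```

The primary key ensures each file revision maps to exactly one commit.
Checking that the mapping is total would additionally require comparing
the table against the full list of revisions in the CVS repository.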

Regards

Markus Wanner

[1]: a description of the various steps in conversion from CVS to monotone:
http://www.monotone.ca/wiki/CvsImport/
