Re: Using database to find file doublettes in my computer

From: Eus <eus(at)member(dot)fsf(dot)org>
To: Lothar Behrens <lothar(dot)behrens(at)lollisoft(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Using database to find file doublettes in my computer
Date: 2008-11-18 03:48:10
Message-ID: 849157.43436.qm@web37603.mail.mud.yahoo.com
Lists: pgsql-general

Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar(dot)behrens(at)lollisoft(dot)de> wrote:

> Hi,
>
> I have a problem to find as fast as possible files that are
> double or
> in other words, identical.
> Also identifying those files that are not identical.
>
> My approach was to use dir /s and an awk script to convert
> it to a sql
> script to be imported into a table.
> That done, I could start issuing queries.
>
> But how to query for files to display a 'left / right
> view' for each
> file that is in multiple places?
>
> I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but have only this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
>
> The reason for this is because I am faced with the problem
> of ECAD
> projects that are copied around
> many times and I have to identify what files are here
> missing and what
> files are there.
>
> So a manual approach is as follows:
>
> 1) Identify one file (schematic1.sch) and see, where are
> copies of
> it.
> 2) Compare the files of both directories and make a
> decision about
> what files to use further.
> 3) Determine conflicts, thus these files can't be
> copied together
> for a cleanup.
>
> Are there any approaches or help ?

I have been in this situation before, although I work under GNU/Linux, as always.

1. At that time, I used `md5sum' to generate the fingerprint of all files in a given directory to be cleaned up.

2. Later, I created a simple Java program to group the names of all files that had the same fingerprint (i.e., MD5 hash).

3. For each group of files sharing an MD5 hash, I deleted all but one, keeping the copy with a good filename (in my case, filenames couldn't be relied on for comparison, since copies differed by small additions like a date, the author's name, and the like).

4. After that, I used my brain to find related files based on the filenames (e.g., `[2006-05-23] Jeff - x.txt' should be the same as `Jenny - x.txt'). Of course, the Java program also helped me in grouping the files that I thought to be related.

5. Next, I perused the related files to see whether most of their contents were the same. If so, I kept the most recent one according to its modification time.
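
Steps 1 to 3 can be sketched in a few lines of Python instead of `md5sum' plus a separate Java program. This is only an illustration, not the program I used back then: the function names (md5_of_file, group_by_fingerprint, duplicate_pairs) are made up for this example, and MD5 is used only because that is what the steps above describe. The output is in the "This File;Also here" layout from the question.

```python
import hashlib
import sys
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_fingerprint(root):
    """Map each MD5 digest to the sorted list of files under root having it."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[md5_of_file(path)].append(str(path))
    return {digest: sorted(paths) for digest, paths in groups.items()}


def duplicate_pairs(groups):
    """Yield (this_file, also_here) for every digest shared by several files."""
    for paths in groups.values():
        first, *others = paths  # the first path stands in for the group
        for other in others:
            yield first, other


if __name__ == "__main__":
    # Hypothetical usage: python find_dupes.py <directory>
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print("This File;Also here")
    for left, right in sorted(duplicate_pairs(group_by_fingerprint(root))):
        print(f"{left};{right}")
```

Files that appear in no pair are unique under the scanned directory. Deciding which copy of a group to keep (step 3) and spotting related files with differing contents (steps 4 and 5) still needs a human.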

> This is a very time consuming job and I am searching for
> any solution
> that helps me save time :-)

Well, I think this approach saved me a lot of time: I was able to eliminate about 7,000 out of 15,000 files in about two weeks.

> I know that those problems did not arise when the projects
> are well
> structured and in a version management system. But that
> isn't here :-)

I hope you employ such a system ASAP :-)

> Thanks
>
> Lothar

Best regards,

Eus (FSF member #4445)

In this digital era, where computing technology is pervasive,

your freedom depends on the software controlling those computing devices.

Join free software movement today!

It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445
