Re: data import - duplicates

From: "Webb Sprague" <webb(dot)sprague(at)gmail(dot)com>
To: brian <brian(at)zijn-digital(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: data import - duplicates
Date: 2008-03-09 06:51:14
Message-ID: b11ea23c0803082251l607fe7a2y657ff589df32332f@mail.gmail.com
Lists: pgsql-general

I haven't tested this, but here is what I would do (it uses arrays, which
are handy when you need them), with the names changed to protect the
innocent:

begin;
-- create a temp table with some duplicates in one of the columns (y plays the role of ck)
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x;
SELECT

wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table keyed by the duplicated column (ck), holding the array of
-- pks that share each value and the length of that array
select y, x_list, array_upper(x_list, 1) as x_len
into baz
from (select y, array_accum(x) as x_list from fbar group by y) a;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)
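
Side note: array_accum isn't built in; it's the example custom aggregate
from the PostgreSQL docs.  If your server is new enough to have the
built-in array_agg (and to accept an ORDER BY inside the aggregate call),
this untested variant does the same thing with a deterministic element
order:

-- same result as baz above, built-in aggregate, elements ordered by pk
select y, array_agg(x order by x) as x_list, count(*) as x_len
into baz
from fbar
group by y;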

-- delete every row whose pk (x) is not the first element of its group's pk list
wsprague=# delete from fbar where not exists
           (select 1 from baz where fbar.x = baz.x_list[1]);
DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)
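
Before committing you could also sanity-check that the duplicates really
are gone; something along these lines should return zero rows:

-- any y left with more than one row means the dedup missed something
select y, count(*) from fbar group by y having count(*) > 1;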

commit;

I hope that makes sense. It should be easy to turn into a function
(something like clean_table(table_name text, pk_name text, dup_key_name text)).
I don't know how well it will work with indexes. You could probably
even write a function to do the entire import inside postgres,
starting with a copy and moving to merging tables and ending with some
consistency checks, and thus benefit from transaction isolation of
the whole process....
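
Off the top of my head (completely untested), such a function might look
something like the following.  Note that it keeps the row with the
smallest pk in each duplicate group instead of going through the array,
and it builds the statement with quote_ident() since the names arrive
as text:

-- untested sketch
create or replace function clean_table(table_name text,
                                        pk_name text,
                                        dup_key_name text)
returns void as $$
begin
    -- for each value of the duplicate key, keep only the row with the
    -- smallest primary key and delete the rest
    execute 'delete from ' || quote_ident(table_name)
         || ' where ' || quote_ident(pk_name) || ' not in ('
         ||     'select min(' || quote_ident(pk_name) || ')'
         ||     ' from ' || quote_ident(table_name)
         ||     ' group by ' || quote_ident(dup_key_name)
         || ')';
end;
$$ language plpgsql;

-- e.g.: select clean_table('fbar', 'x', 'y');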

HTH

On Sat, Mar 8, 2008 at 9:42 PM, brian <brian(at)zijn-digital(dot)com> wrote:
> I have ~350K rows of sample data that has come to me in 64 text files
> (CSV) that I'd like to import into a new database. Each file can be
> considered its own category and is so named. That is, each file name
> will be inserted into a separate categories table. I'd like to relate
> each row to its category. However, while the rows are unique in each
> file, some rows are found in more than one file.
>
> I also must parse some of the fields into separate columns, but that's
> no big deal. But it means that I must do some pre-processing on these
> files, in any case.
>
> After some thought, I thought I might brute-force the problem with Perl
> by reading each line of each file into an assoc. array unless it's
> already in it, in which case I'd append the "key" based on the list of
> categories that line is found in (I'd still need to parse out the keys
> later but I'd get rid of the dupes). Each array row would be like so:
>
> 'key' => '1,4,etc.', 'text' => 'a line'
>
> Aside from the fact that the array would become ridiculously large (and
> lookups slow) after just a couple of files, I realised that this is a
> non-starter for the simple fact that this data comprises less than 25%
> of the eventual total. So refactoring it in this way would be a waste of
> time (there will probably be dupes in the files to come).
>
> So, I'd like to instead parse out my columns properly and write each
> line (along with its category key) to a new, single file to be copied
> into a working table. Once I've done so, is there an efficient method I
> can use to select all duplicates (save for the category key) into a set
> from which I could then select into the final table (and insert the keys
> into the category join table)?
>
> For example (pk is the PK from the working table and ck is the category
> key), my dupes query on the working table would give the following set:
>
> pk  ck  (data)
>  1   1  a a a a
>  2   3  a a a a
>  3   3  b b b b
>  4   7  b b b b
>  5   4  a a a a
>
> I would then want to insert just the unique rows into the final table
> yet add all of the PKs and CKs to the category join table. After
> that was done, I'd delete all of these from the working table and then
> move the unique rows that are left to the final table (and insert the
> keys into the join table).
>
> I hope that makes sense. I'm not looking for anyone to do my homework
> for me; I'm sure I could fix up a tasty function for this (the data is
> destined for MySQL, alas, but I'll be damned if I won't use PG for the
> heavy lifting). What I'm really looking for is a handy way to grab all
> of those dupes.
>
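
As for a handy way to just grab the dupes once everything is in one
working table: it's a GROUP BY over the data columns.  Untested sketch,
with made-up names (working, col1..col3) standing in for your table and
parsed columns:

-- working, col1..col3 are placeholder names
-- one row per duplicated line of data, with all pk/ck pairs collected
select array_accum(pk) as pks, array_accum(ck) as cks, count(*) as copies
from working
group by col1, col2, col3
having count(*) > 1;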
