Re: data import - duplicates

From: brian <brian(at)zijn-digital(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: data import - duplicates
Date: 2008-03-09 19:47:27
Message-ID: 47D43ECF.4060101@zijn-digital.com
Lists: pgsql-general

Webb Sprague wrote:
> I haven't tested but this is what I would do (uses arrays, which are
> handy when you need them), with the names changed to protect the
> innocent:
>
> begin;
> -- create a table with some duplicates in one of the columns (y is ck);
> wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(x);
> SELECT
>
> wsprague=# select * from fbar;
> x | y
> ----+---
> 1 | 1
> 2 | 2
> 3 | 3
> 4 | 0
> 5 | 1
> 6 | 2
> 7 | 3
> 8 | 0
> 9 | 1
> 10 | 2
> (10 rows)
>
> -- create a table with the pk, an array of the duplicate keys, and the
> length of that array
> select y, x_list, array_upper(x_list, 1) as x_len into baz from
> (select y, array_accum(x) as x_list from fbar group by y ) a ;
>
> wsprague=# select * from baz;
> y | x_list | x_len
> ---+----------+-------
> 3 | {3,7} | 2
> 2 | {2,6,10} | 3
> 1 | {1,5,9} | 3
> 0 | {4,8} | 2
> (4 rows)
>
> -- delete all rows that don't have ck in the first element of the pk list
> wsprague=# delete from fbar where not exists (select 1 from baz where
> fbar.x=baz.x_list[1]);
> DELETE 6
> wsprague=# select * from fbar;
> x | y
> ---+---
> 1 | 1
> 2 | 2
> 3 | 3
> 4 | 0
> (4 rows)
>
> commit;
>
> I hope that makes sense. It should be easy to make into a function
> (like clean_table(table_name text, pk_name text, dup_key_name text)).
> I don't know how well it will work with indexes. You could probably
> even write a function to do the entire import inside postgres,
> starting with a copy and moving to merging tables and ending with some
> consistency checks, and thus benefit from transaction isolation of
> the whole process....
>
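
(For the record, that clean_table() idea might be sketched like this. Untested, and it keeps the smallest pk per duplicate-key value rather than the first element of an array, which amounts to the same "keep one row per key" result:)

```sql
-- Untested sketch of the suggested clean_table(); keeps, for each value
-- of the duplicate-key column, only the row with the smallest pk.
CREATE OR REPLACE FUNCTION clean_table(t_name text, pk_name text,
                                       dup_key_name text)
RETURNS void AS $$
BEGIN
    EXECUTE 'DELETE FROM ' || quote_ident(t_name)
         || ' WHERE ' || quote_ident(pk_name) || ' NOT IN '
         || '(SELECT min(' || quote_ident(pk_name) || ') FROM '
         || quote_ident(t_name)
         || ' GROUP BY ' || quote_ident(dup_key_name) || ')';
END;
$$ LANGUAGE plpgsql;
```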

Hey, neat idea. I changed things a little:

test=# SELECT * FROM table1;
id | ck | content
----+----+---------
1 | 1 | string1
2 | 2 | string2
3 | 2 | string3
4 | 4 | string1
5 | 5 | string1
6 | 1 | string3
7 | 2 | string1
(7 rows)
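
(A note for anyone following along: array_accum() is not built in; it's the example aggregate from the PostgreSQL documentation, which you have to create yourself:)

```sql
-- Example user-defined aggregate from the PostgreSQL docs:
-- collects all input values into an array.
CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
```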

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_accum(ck) AS key_list, content FROM table1 GROUP BY content
test-# ) a;
SELECT

test=# SELECT * FROM table2;
key_list | key_num | content
-----------+---------+---------
{2,1} | 2 | string3
{1,4,5,2} | 4 | string1
{2} | 1 | string2
(3 rows)

I had to modify your DELETE suggestion a bit because table1.ck (your x)
is non-unique (although it must be unique within a given key_list). It'll
be a performance hog, though, because it compares each string.

test=# DELETE FROM table1 WHERE NOT EXISTS (
test-# SELECT 1 FROM table2
test-# WHERE table1.content = table2.content
test-# AND table1.ck = table2.key_list[1]
test-# );
DELETE 4

test=# SELECT * FROM table1;
id | ck | content
----+----+---------
1 | 1 | string1
2 | 2 | string2
3 | 2 | string3
(3 rows)

From here I can drop the table1.ck column. But in order to place the
table2.key_list ints into a join table, I need the PK from table1. I
could use the content column to fetch it from table1, but that hardly
seems efficient. So, after dropping the earlier table2, I modified your
idea for gathering the dupe CKs a little:

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_cat(ARRAY[array_accum(id)], ARRAY[array_accum(ck)])
test-#   AS key_list, content
test-# FROM table1 GROUP BY content) a;
SELECT

test=# select * from table2;
key_list | key_num | content
-----------------------+---------+---------
{{3,6},{2,1}} | 2 | string3
{{1,4,5,7},{1,4,5,2}} | 2 | string1
{{2},{2}} | 2 | string2
(3 rows)

Huzzah! IOW, I accumulate both the PKs and the CKs from table1 for each
string. Then I can ditch all but the first PK, which should make
transferring these values into the join table a snap.
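
(Something like this, say. Untested, and the join table and its column names are made up; the generate_series() in the select list walks the second row of key_list, which holds the CKs:)

```sql
-- Hypothetical sketch: pair the surviving PK (key_list[1][1]) with
-- every ck collected for that string (the second row of key_list).
INSERT INTO join_table (table1_id, ck)
SELECT key_list[1][1],
       key_list[2][generate_series(1, array_upper(key_list, 2))]
FROM table2;
```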

# lightbulb comes on ...

Actually, what I'll do is remove the extraneous PKs and use them to
DELETE FROM table1. Much more efficient, that.
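
(Roughly like so, I think. Untested sketch; the first row of key_list holds the PKs, and positions 2 onward are the extraneous ones. generate_series(2, 1) yields no rows, so singleton lists are left alone:)

```sql
-- Delete every row whose id sits in a key_list PK slot other than
-- the first, i.e. all the duplicates.
DELETE FROM table1 WHERE id IN (
    SELECT key_list[1][generate_series(2, array_upper(key_list, 2))]
    FROM table2
);
```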

Thanks for your ideas. And if anyone else has any suggestions, please
feel free to chime in. When I get the rest of the data I'll post back
here on how it went, in case Google points someone here in the future.

b
