From: Ron <ronljohnsonjr@gmail.com>
To: pgsql-general@lists.postgresql.org
Subject: Re: Need efficient way to remove (almost) duplicate rows from a table
Date: 2023-10-17 23:06:41
Message-ID: 53c6f645-6b24-4594-b1c9-fa6cf8b514da@gmail.com
Lists: pgsql-general
https://dba.stackexchange.com/a/138327/63913
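
That answer covers the usual pattern. Against your table, a single set-based
delete would be something like this (an untested sketch, assuming a "duplicate"
means the same limb_id, branch_id and surfarea; not necessarily the exact
statement from the linked answer):

delete from twigs a
using  twigs b
where  a.limb_id   = b.limb_id
  and  a.branch_id = b.branch_id
  and  a.surfarea  = b.surfarea
  and  a.twig_id   > b.twig_id;  -- keeps the lowest twig_id in each group

One statement, no client round trips; which twig_id survives is simply the
lowest one in each group.
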
On 10/17/23 17:48, David Gauthier wrote:
> v11.5 on linux (I know, I know, IT is working on the upgrade to 15.2)
>
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way. Best to
> illustrate with a simple example...
>
> We'll talk about deleting leaves on a tree where each twig can have many
> leaves, but never 2 or more leaves of the exact same surface area. What I
> have now are a lot of twigs with leaves having the same surface area (and
> some different ones too) and I want to get rid of the duplicates for that twig.
>
> create table twigs (limb_id int, branch_id int, twig_id int, surfarea float);
> insert into twigs (limb_id,branch_id,twig_id,surfarea) values
> (1,1,1,100.0),
> *(1,1,2,103.7),*
> *(1,1,3,103.7),*
> (1,1,4,110.4),
>
> (1,2,1,120.0),
> (1,2,2,123.6),
> *(1,2,3,128.1),*
> *(1,2,4,128.1),*
> *(1,2,5,128.1),*
>
> (2,1,1,100.0),
> (2,1,3,104.4),
> (2,1,4,131.9);
>
> You can see the duplicates (marked with asterisks above). I want to get rid of all but one of
> the dups. Which "twig_id" that's left behind doesn't matter.
>
> This would do it...
> delete from twigs where limb_id=1 and branch_id=1 and twig_id=3;
> delete from twigs where limb_id=1 and branch_id=2 and twig_id in (4,5);
>
> But there are millions of these duplicates and it'll take forever like this.
>
> I was going to approach this with a perl/DBI script, shoving the duplicate
> record identifiers (limb_id, branch_id, twig_id) into perl arrays and then
> submitting the delete command in a prepared statement that accepts the
> arrays as values ... (a-la... my $cnt = $sth->execute_array(
>     {ArrayTupleStatus => \my @tuple_status},
>     \@limb_id_arr, \@branch_id_arr, \@twig_id_arr)
>   or die "-F- Failed to execute '$sqf'\n";)  But I'm not sure that'll be faster.
> Depends on how perl/DBI handles it I suppose.
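
As far as I know DBD::Pg doesn't do native array binding, so execute_array
ends up executing once per tuple anyway -- I wouldn't expect it to be much
faster. If you do stay in Perl, a hypothetical alternative is to bind the
three id arrays once and let the server join against them (untested sketch;
$1..$3 are the bound array parameters):

delete from twigs t
using  unnest($1::int[], $2::int[], $3::int[])
         as d(limb_id, branch_id, twig_id)  -- three parallel arrays of ids to delete
where  t.limb_id   = d.limb_id
  and  t.branch_id = d.branch_id
  and  t.twig_id   = d.twig_id;

Though the set-based delete above makes the client-side list unnecessary in
the first place.
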
>
> Then I was thinking it might just be better to create a parallel table and
> insert records in there by copying from the table with the dups, taking
> care to avoid inserting duplicates. Once done, delete the original table
> and rename the copy to the original's name. If that's a good approach,
> then how exactly do I articulate the insert statement to avoid the
> duplicates ? This approach might also be good in that I can do it outside
> a transaction. If the target table gets goofed up, the original data is
> still OK, no harm done, can try again.
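
For the copy-and-swap route, DISTINCT ON is the easiest way to "avoid
inserting duplicates". Rough, untested sketch (the twigs_dedup name and the
swap steps are just illustrative):

create table twigs_dedup (like twigs including all);

insert into twigs_dedup
select distinct on (limb_id, branch_id, surfarea) *
from   twigs
order  by limb_id, branch_id, surfarea, twig_id;  -- keeps the lowest twig_id per group

begin;
alter table twigs       rename to twigs_old;
alter table twigs_dedup rename to twigs;
commit;
-- drop table twigs_old;  -- once you've verified the new table

And as you say, the original data stays untouched until the rename, so a
botched run costs nothing.
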
>
> Any other ideas ?
>
> Again, millions of duplicates and I don't want to overload any PG system
> stuff in the process.
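
If the worry is bloat and long-running transactions, you can also batch the
delete and vacuum between rounds. Untested sketch (the 50000 batch size is
arbitrary); rerun it until it reports DELETE 0:

delete from twigs
where ctid in (
    select ctid
    from  (select ctid,
                  row_number() over (partition by limb_id, branch_id, surfarea
                                     order by twig_id) as rn
           from   twigs) s
    where  rn > 1
    limit  50000);

A "vacuum twigs;" between batches keeps the dead tuples from piling up.
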
>
> Thanks in Advance !
--
Born in Arizona, moved to Babylonia.