Re: Need efficient way to remove (almost) duplicate rows from a table

From: Ron <ronljohnsonjr@gmail.com>
To: pgsql-general@lists.postgresql.org
Subject: Re: Need efficient way to remove (almost) duplicate rows from a table
Date: 2023-10-17 23:06:41
Message-ID: 53c6f645-6b24-4594-b1c9-fa6cf8b514da@gmail.com
Lists: pgsql-general

https://dba.stackexchange.com/a/138327/63913
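
That answer covers the set-based approach. The general shape, applied to your
table (a sketch, not necessarily the exact SQL from that answer; it joins on
ctid, so it assumes nothing else is writing to the table while it runs):

delete from twigs t
using (select ctid,
              row_number() over (partition by limb_id, branch_id, surfarea
                                 order by twig_id) as rn
       from twigs) d
where t.ctid = d.ctid
  and d.rn > 1;

One pass over the table, no per-row round trips from the client.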

On 10/17/23 17:48, David Gauthier wrote:
> v11.5 on Linux (I know, I know, IT is working on the upgrade to 15.2)
>
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way.  Best to
> illustrate with a simple example...
>
> We'll talk about deleting leaves on a tree where each twig can have many
> leaves, but never 2 or more leaves of the exact same surface area.  What I
> have now are a lot of twigs with leaves having the same surface area (and
> some different ones too), and I want to get rid of the duplicates for that twig.
>
> create table twigs (limb_id int, branch_id int, twig_id int, surfarea float);
> insert into twigs (limb_id,branch_id,twig_id,surfarea) values
> (1,1,1,100.0),
> *(1,1,2,103.7),*
> *(1,1,3,103.7),*
> (1,1,4,110.4),
>
> (1,2,1,120.0),
> (1,2,2,123.6),
> *(1,2,3,128.1),*
> *(1,2,4,128.1),*
> *(1,2,5,128.1),*
>
> (2,1,1,100.0),
> (2,1,3,104.4),
> (2,1,4,131.9);
>
> You can see the duplicates marked with asterisks.  I want to get rid of all
> but one of the dups.  Which "twig_id" gets left behind doesn't matter.
>
> This would do it...
> delete from twigs where limb_id=1 and branch_id=1 and twig_id=3;
> delete from twigs where limb_id=1 and branch_id=2 and twig_id in (4,5);
>
> But there are millions of these duplicates and it'll take forever like this.
>
> I was going to approach this with a perl/DBI script, shoving the duplicate
> record identifiers (limb_id, branch_id, twig_id) into perl arrays and then
> submitting the delete command in a prepared statement that accepts the
> arrays as values ... (a-la... my $cnt = $sth->execute_array({
> ArrayTupleStatus => \my
> @tuple_status}, \@limb_id_arr, \@branch_id_arr, \@twig_id_arr) or die "-F-
> Failed to execute '$sqf'\n";)   But I'm not sure that'll be faster. 
> Depends on how perl/DBI handles it I suppose.
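
If you stay with the array idea, you can also do it in one statement instead
of one execute per tuple: hand all three arrays to the server and join against
unnest(). A sketch with literal arrays where the DBI placeholders would go
(values taken from your example):

delete from twigs t
using unnest(array[1,1,1]::int[],
             array[1,2,2]::int[],
             array[3,4,5]::int[]) as d(limb_id, branch_id, twig_id)
where t.limb_id = d.limb_id
  and t.branch_id = d.branch_id
  and t.twig_id = d.twig_id;

That's a single server-side delete no matter how many identifiers are in the
arrays.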
>
> Then I was thinking it might just be better to create a parallel table and
> insert records in there by copying from the table with the dups, taking
> care to avoid inserting duplicates.  Once done, delete the original table
> and rename the copy to the original's name.  If that's a good approach,
> then how exactly do I articulate the insert statement to avoid the
> duplicates?  This approach might also be good in that I can do it outside
> a transaction.  If the target table gets goofed up, the original data is
> still OK, no harm done, can try again.
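
For the copy-and-swap route, DISTINCT ON is the usual PostgreSQL idiom for
"keep one arbitrary row per group". A sketch (twigs_dedup is just a name I
made up; it assumes the dups are exact float matches, as in your example):

create table twigs_dedup as
select distinct on (limb_id, branch_id, surfarea)
       limb_id, branch_id, twig_id, surfarea
from twigs
order by limb_id, branch_id, surfarea, twig_id;

begin;
drop table twigs;
alter table twigs_dedup rename to twigs;
commit;

Note that create table as doesn't carry over indexes, constraints, or grants;
you'd recreate those on the new table afterward.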
>
> Any other ideas ?
>
> Again, millions of duplicates and I don't want to overload any PG system
> stuff in the process.
>
> Thanks in Advance!

--
Born in Arizona, moved to Babylonia.
