Re: Need efficient way to remove (almost) duplicate rows from a table

From: Ron <ronljohnsonjr@gmail.com>
To: pgsql-general@lists.postgresql.org
Subject: Re: Need efficient way to remove (almost) duplicate rows from a table
Date: 2023-10-17 23:06:41
Message-ID: 53c6f645-6b24-4594-b1c9-fa6cf8b514da@gmail.com
Lists: pgsql-general

https://dba.stackexchange.com/a/138327/63913
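
That answer covers the set-based approach. The general shape, applied to your
table (a sketch, not necessarily the exact SQL from that answer; it joins on
ctid, so it assumes nothing else is writing to the table while it runs):

delete from twigs t
using (select ctid,
              row_number() over (partition by limb_id, branch_id, surfarea
                                 order by twig_id) as rn
       from twigs) d
where t.ctid = d.ctid
  and d.rn > 1;

One pass over the table, no per-row round trips from the client.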

On 10/17/23 17:48, David Gauthier wrote:
> v11.5 on Linux (I know, I know, IT is working on the upgrade to 15.2)
>
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way.  Best to
> illustrate with a simple example...
>
> We'll talk about deleting leaves on a tree where each twig can have many
> leaves, but never 2 or more leaves of the exact same surface area.  What I
> have now are a lot of twigs with leaves having the same surface area (and
> some different ones too), and I want to get rid of the duplicates for that twig.
>
> create table twigs (limb_id int, branch_id int, twig_id int, surfarea float);
> insert into twigs (limb_id,branch_id,twig_id,surfarea) values
> (1,1,1,100.0),
> *(1,1,2,103.7),*
> *(1,1,3,103.7),*
> (1,1,4,110.4),
>
> (1,2,1,120.0),
> (1,2,2,123.6),
> *(1,2,3,128.1),*
> *(1,2,4,128.1),*
> *(1,2,5,128.1),*
>
> (2,1,1,100.0),
> (2,1,3,104.4),
> (2,1,4,131.9);
>
> You can see the duplicates marked with asterisks.  I want to get rid of all
> but one of the dups.  Which "twig_id" gets left behind doesn't matter.
>
> This would do it...
> delete from twigs where limb_id=1 and branch_id=1 and twig_id=3;
> delete from twigs where limb_id=1 and branch_id=2 and twig_id in (4,5);
>
> But there are millions of these duplicates and it'll take forever like this.
>
> I was going to approach this with a perl/DBI script, shoving the duplicate
> record identifiers (limb_id, branch_id, twig_id) into perl arrays and then
> submitting the delete command in a prepared statement that accepts the
> arrays as values ... (a-la... my $cnt = $sth->execute_array({
> ArrayTupleStatus => \my
> @tuple_status}, \@limb_id_arr, \@branch_id_arr, \@twig_id_arr) or die "-F-
> Failed to execute '$sqf'\n";)   But I'm not sure that'll be faster. 
> Depends on how perl/DBI handles it I suppose.
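
If you stay with the array idea, you can also do it in one statement instead
of one execute per tuple: hand all three arrays to the server and join against
unnest(). A sketch with literal arrays where the DBI placeholders would go
(values taken from your example):

delete from twigs t
using unnest(array[1,1,1]::int[],
             array[1,2,2]::int[],
             array[3,4,5]::int[]) as d(limb_id, branch_id, twig_id)
where t.limb_id = d.limb_id
  and t.branch_id = d.branch_id
  and t.twig_id = d.twig_id;

That's a single server-side delete no matter how many identifiers are in the
arrays.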
>
> Then I was thinking it might just be better to create a parallel table and
> insert records in there by copying from the table with the dups, taking
> care to avoid inserting duplicates.  Once done, delete the original table
> and rename the copy to the original's name.  If that's a good approach,
> then how exactly do I articulate the insert statement to avoid the
> duplicates?  This approach might also be good in that I can do it outside
> a transaction.  If the target table gets goofed up, the original data is
> still OK, no harm done, can try again.
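
For the copy-and-swap route, DISTINCT ON is the usual PostgreSQL idiom for
"keep one arbitrary row per group". A sketch (twigs_dedup is just a name I
made up; it assumes the dups are exact float matches, as in your example):

create table twigs_dedup as
select distinct on (limb_id, branch_id, surfarea)
       limb_id, branch_id, twig_id, surfarea
from twigs
order by limb_id, branch_id, surfarea, twig_id;

begin;
drop table twigs;
alter table twigs_dedup rename to twigs;
commit;

Note that create table as doesn't carry over indexes, constraints, or grants;
you'd recreate those on the new table afterward.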
>
> Any other ideas ?
>
> Again, millions of duplicates and I don't want to overload any PG system
> stuff in the process.
>
> Thanks in Advance!

--
Born in Arizona, moved to Babylonia.
