Skip site navigation (1) Skip section navigation (2)

Re: identifying duplicates in table with redundancies

From: "Tarlika Elisabeth Schmitz" <postgresql2(at)numerixtechnology(dot)de>
To:
Cc: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: identifying duplicates in table with redundancies
Date: 2010-09-28 20:36:00
Message-ID: 20100928213600.4db60ba4@dick.coachhouse (view raw or flat)
Thread:
Lists: pgsql-sql
On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros(dot)cristina(at)marktest(dot)pt> wrote:

>----- Original Message ----- 
>From: "Tarlika Elisabeth Schmitz" <postgresql(at)numerixtechnology(dot)de>
>To: <pgsql-sql(at)postgresql(dot)org>
>Sent: Monday, September 27, 2010 5:54 PM
>Subject: Re: [SQL] identifying duplicates in table with redundancies
>
>
>> On Fri, 24 Sep 2010 18:12:18 +0100
>> Oliver d'Azevedo Christina <oliveiros(dot)cristina(at)gmail(dot)com> wrote:
>>
>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>> FROM (
>>>>> SELECT trainer_name   -- The field you want to test for duplicates
>>>>> FROM (
>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>> FROM student
>>>>> ) x
>>>>> GROUP BY "trainer_name"    -- the field you want to test for
>>>>> duplicates
>>>>> HAVING (COUNT(*) > 1)
>>>>> ) z
>>>>> NATURAL JOIN student y
>>
>>
>>
>> What indices would you recommend for this operation?
>
>But, on this query in particular I would recomend an indice on
>trainer_name, as this field will be used on the join and on the group
>by. For the other query, the one you get by substituting trainer_name
>by trainer_id, place an index on trainer_id.
>Also, these indexes may help speed up the order by clause, if you use
>one.
>
>If you have a table with lots of data you can try them around and see
>how performance varies (and don't forget there's also EXPLAIN ANALYZE)


Strangely, these indices did not do anything.
Without, the query took about 8500ms. Same with index.

The table has 250000 records. 11000 have trainer_name = null. Only
13000 unique trainer_names.

It is not hugely important as these queries are not time-critical.
This is only a helper table, which I use to analyze the date prior to
populating the destination tables with the data.

Regards,
Tarlika

In response to

Responses

pgsql-sql by date

Next:From: Thomas AndresDate: 2010-09-29 06:42:42
Subject: Shema with template
Previous:From: msi77Date: 2010-09-28 14:00:45
Subject: Re: is there a distinct function for comma lists ?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group