Re: identifying duplicates in table with redundancies

From: "Tarlika Elisabeth Schmitz" <postgresql2(at)numerixtechnology(dot)de>
To:
Cc: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: identifying duplicates in table with redundancies
Date: 2010-09-28 20:36:00
Message-ID: 20100928213600.4db60ba4@dick.coachhouse
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina" <oliveiros(dot)cristina(at)marktest(dot)pt> wrote:

>----- Original Message -----
>From: "Tarlika Elisabeth Schmitz" <postgresql(at)numerixtechnology(dot)de>
>To: <pgsql-sql(at)postgresql(dot)org>
>Sent: Monday, September 27, 2010 5:54 PM
>Subject: Re: [SQL] identifying duplicates in table with redundancies
>
>
>> On Fri, 24 Sep 2010 18:12:18 +0100
>> Oliver d'Azevedo Christina <oliveiros(dot)cristina(at)gmail(dot)com> wrote:
>>
>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>> FROM (
>>>>> SELECT trainer_name -- The field you want to test for duplicates
>>>>> FROM (
>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>> FROM student
>>>>> ) x
>>>>> GROUP BY "trainer_name" -- the field you want to test for
>>>>> duplicates
>>>>> HAVING (COUNT(*) > 1)
>>>>> ) z
>>>>> NATURAL JOIN student y
>>
>>
>>
>> What indices would you recommend for this operation?
>
>But, on this query in particular I would recomend an indice on
>trainer_name, as this field will be used on the join and on the group
>by. For the other query, the one you get by substituting trainer_name
>by trainer_id, place an index on trainer_id.
>Also, these indexes may help speed up the order by clause, if you use
>one.
>
>If you have a table with lots of data you can try them around and see
>how performance varies (and don't forget there's also EXPLAIN ANALYZE)

Strangely, these indices did not do anything.
Without, the query took about 8500ms. Same with index.

The table has 250000 records. 11000 have trainer_name = null. Only
13000 unique trainer_names.

It is not hugely important as these queries are not time-critical.
This is only a helper table, which I use to analyze the date prior to
populating the destination tables with the data.

Regards,
Tarlika

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Thomas Andres 2010-09-29 06:42:42 Shema with template
Previous Message msi77 2010-09-28 14:00:45 Re: is there a distinct function for comma lists ?