Re: identifying duplicates in table with redundancies

From: Andreas Schmitz <mailinglist(at)longimanus(dot)net>
To: Tarlika Elisabeth Schmitz <postgresql2(at)numerixtechnology(dot)de>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: identifying duplicates in table with redundancies
Date: 2010-09-29 08:40:03
Message-ID: 4CA2FB63.2030102@longimanus.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
> On Tue, 28 Sep 2010 11:34:31 +0100
> "Oliveiros d'Azevedo Cristina"<oliveiros(dot)cristina(at)marktest(dot)pt> wrote:
>
>> ----- Original Message -----
>> From: "Tarlika Elisabeth Schmitz"<postgresql(at)numerixtechnology(dot)de>
>> To:<pgsql-sql(at)postgresql(dot)org>
>> Sent: Monday, September 27, 2010 5:54 PM
>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>
>>
>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>> Oliver d'Azevedo Christina<oliveiros(dot)cristina(at)gmail(dot)com> wrote:
>>>
>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>> FROM (
>>>>>> SELECT trainer_name -- The field you want to test for duplicates
>>>>>> FROM (
>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>> FROM student
>>>>>> ) x
>>>>>> GROUP BY "trainer_name" -- the field you want to test for
>>>>>> duplicates
>>>>>> HAVING (COUNT(*)> 1)
>>>>>> ) z
>>>>>> NATURAL JOIN student y
>>>
>>>
>>> What indices would you recommend for this operation?
>> But, on this query in particular I would recomend an indice on
>> trainer_name, as this field will be used on the join and on the group
>> by. For the other query, the one you get by substituting trainer_name
>> by trainer_id, place an index on trainer_id.
>> Also, these indexes may help speed up the order by clause, if you use
>> one.
>>
>> If you have a table with lots of data you can try them around and see
>> how performance varies (and don't forget there's also EXPLAIN ANALYZE)
>
> Strangely, these indices did not do anything.
> Without, the query took about 8500ms. Same with index.
>
> The table has 250000 records. 11000 have trainer_name = null. Only
> 13000 unique trainer_names.
>
> It is not hugely important as these queries are not time-critical.
> This is only a helper table, which I use to analyze the date prior to
> populating the destination tables with the data.
>
> Regards,
> Tarlika
>

I guess explain analyze shows up a seq scan. try avoiding to use
distinct. use group by instead.

regards

Andreas

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Tarlika Elisabeth Schmitz 2010-09-29 11:41:07 Re: identifying duplicates in table with redundancies
Previous Message Thomas Andres 2010-09-29 06:42:42 Shema with template