Re: identifying duplicates in table with redundancies

From: Tarlika Elisabeth Schmitz <postgresql2(at)numerixtechnology(dot)de>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: identifying duplicates in table with redundancies
Date: 2010-09-29 11:41:07
Message-ID: 20100929124107.245b1314@dick.coachhouse
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz <mailinglist(at)longimanus(dot)net> wrote:

> On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina"<oliveiros(dot)cristina(at)marktest(dot)pt>
>> wrote:
>>
>>> ----- Original Message -----
>>> From: "Tarlika Elisabeth Schmitz"<postgresql(at)numerixtechnology(dot)de>
>>> To:<pgsql-sql(at)postgresql(dot)org>
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>
>>>> On Fri, 24 Sep 2010 18:12:18 +0100
>>>> Oliver d'Azevedo Christina<oliveiros(dot)cristina(at)gmail(dot)com> wrote:
>>>>
>>>>>>> SELECT DISTINCT trainer_id,trainer_name
>>>>>>> FROM (
>>>>>>> SELECT trainer_name -- The field you want to test for
>>>>>>> duplicates FROM (
>>>>>>> SELECT DISTINCT "trainer_id","trainer_name"
>>>>>>> FROM student
>>>>>>> ) x
>>>>>>> GROUP BY "trainer_name" -- the field you want to test for
>>>>>>> duplicates
>>>>>>> HAVING (COUNT(*)> 1)
>>>>>>> ) z
>>>>>>> NATURAL JOIN student y
>>>>
>>>>
>>>> What indices would you recommend for this operation?
>>> But, on this query in particular I would recomend an indice on
>>> trainer_name,[...]
>> Strangely, these indices did not do anything.
>> Without, the query took about 8500ms. Same with index.
>>
>> The table has 250000 records. 11000 have trainer_name = null. Only
>> 13000 unique trainer_names.
>
>I guess explain analyze shows up a seq scan. try avoiding to use
>distinct. use group by instead.
>
>regards, Andreas

Hallo Andreas,
I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan. Execution time 7500ms.

2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan @ 6750ms

4) I filtered out NULL trainer_ids
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which only took
1300ms!!

Please, explain (pun not intended)! How can this be. Only 11000/250000
rows have a null trainer_id.

Thanks for the GROUP BY tip!

--

Best Regards,
Tarlika Elisabeth Schmitz

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Oliveiros d'Azevedo Cristina 2010-09-29 13:00:26 Re: identifying duplicates in table with redundancies
Previous Message Andreas Schmitz 2010-09-29 08:40:03 Re: identifying duplicates in table with redundancies