From: | Don Isgitt <djisgitt(at)soundenergy(dot)com> |
---|---|
To: | postgresql general <pgsql-general(at)postgresql(dot)org> |
Subject: | limited field duplicates |
Date: | 2003-01-31 16:17:10 |
Message-ID: | 3E3AA186.3080800@soundenergy.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello,
I am trying to find rows in a table that are duplicates only in the
sense of certain columns; i.e, not all columns are the same, and then
for each duplicate entry, I want to retrieve one of the columns that is
not part of the duplication criteria. Ok, enough English description,
here is the real information (simplified).
Table "master" --~925000 records
Column | Type | Modifiers
-------------------+--------------+-----------
apinum | text | not null
state | character(2) | not null
county | text | not null
township_n | integer |
township_d | character(1) |
range_n | integer |
range_d | character(1) |
section | integer |
location | text | not null
(Indexes on state,county,township,range,apinum)
Find duplicates based on
state,county,township_n||township_d,range_n||range_d,section
select state,county,township_n||township_d as township,range_n||range_d
as range,section,location,count(*) from master where state='OK' and
county='MAJOR' group by state,county,township,range,section,location
having count(*)>1;
Works great, is efficient: now, how do I get the apinum associated with
each of the limited duplicate rows (i.e., those returned in the above
query). I need the apinum because another system involved only knows
about apinum.
It must be simple, but I have failed miserably in finding the answer.
The following horrible query works, but it takes forever for even the
small sample above.
explain select m.apinum,m.township_n||m.township_d as
township,m.range_n||m.range_d as range,m.section,m.location from master
m where (select count(*) from master m1 where
m.township_n||m.township_d=m1.township_n||m1.township_d and
m.range_n||m.range_d=m1.range_n||m1.range_d and m.section=m1.section and
m.location=m1.location and m1.state='OK' and m1.county='MAJOR') > 1 and
m.state='OK' and m.county='MAJOR' order by township,range,section;
NOTICE: QUERY PLAN:
Sort (cost=7040332.16..7040332.16 rows=450 width=55)
-> Index Scan using mstcty on master m (cost=0.00..7040312.32
rows=450 width=55)
SubPlan
-> Aggregate (cost=5208.86..5208.86 rows=1 width=0)
-> Index Scan using mstcty on master m1
(cost=0.00..5208.85 rows=1 width=0)
EXPLAIN
So, somebody please enlighten me. All help is appreciated.
Don
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Copeland | 2003-01-31 16:24:39 | Re: serialization errors |
Previous Message | Arjen van der Meijden | 2003-01-31 16:13:49 | Re: Weird query plans for my queries, |