Re: Finding uniques across a big join

From: "John D(dot) Burger" <john(at)mitre(dot)org>
To: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Finding uniques across a big join
Date: 2005-11-30 18:20:19
Message-ID: fa649e52f7c458b6eca83789d7648ad2@mitre.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote:

> On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:
>> I could use some help with the following:
>>
>> I have a database of geographic entities with attributes spread across
>> several tables. I need to determine which entities are unique with
>> respect to some of those attributes. I'm using the following query:
>
> <snip>
>
> If you put the gazPlaceID as a result of the uniqs subquery, that would
> avoid the second lookup, right? Whether it's much faster is the
> question. So something like:
>
> select p1.gazPlaceID
> from gazPlaces as p1
> join gazNamings as n1 using (gazPlaceID)
> join gazContainers as c1 using (gazPlaceID)
> group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
> c1.containerID
> having count(*) = 1

The problem is that then every row is unique, because gazPlaceID is a
primary key. As far as I can see, I need to group on just the other
three columns - they're the ones for which I'm interested in
uniqueness.

> Secondly, what does the plan look like? Is it materialising or sorting
> at any stage?
> Finally, what version of postgres?

Version is 7.4.8 for Solaris. Below is (a version of) the query again,
as well as the plan. No materialization, I think, but it appears to be
sorting the first three-way join to do the counts, then sorting the
second one to merge. Cost estimates are way off, as the final result
has almost 10M rows, but everything is analyzed, with indexes on every
column, although none of them get used.

Again, any suggestions on tweaking this query, or a completely
different approach to finding the entities with unique combinations,
would be much appreciated.

- John Burger
MITRE

select p2.gazPlaceID, u.*
into table tempCanonical_nameMatchEquiver_3435_1
from
(select p1.featureType, n1.placeNameID, c1.containerID
from gazPlaces as p1
join gazNamings as n1 using (gazPlaceID)
join gazContainers as c1 using (gazPlaceID)
group by p1.featureType, n1.placeNameID, c1.containerID
having count(*) = 1) as u,
gazPlaces as p2
join gazNamings as n2 using (gazPlaceID)
join gazContainers as c2 using (gazPlaceID)
where u.featureType = p2.featureType
and u.placeNameID = n2.placeNameID
and u.containerID = c2.containerID;

Hash Join (cost=4009535.96..4316595.92 rows=306 width=16)
Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND
("outer".containerid = "inner".containerid))
-> Seq Scan on gazcontainers c2 (cost=0.00..141636.45 rows=9193945
width=8)
-> Hash (cost=4006472.81..4006472.81 rows=282029 width=20)
-> Merge Join (cost=3777226.54..4006472.81 rows=282029
width=20)
Merge Cond: (("outer".featuretype = "inner".featuretype)
AND ("outer".placenameid = "inner".placenameid))
-> Subquery Scan u (cost=2107001.20..2259698.67
rows=5552635 width=12)
-> GroupAggregate (cost=2107001.20..2204172.32
rows=5552635 width=12)
Filter: (count(*) = 1)
-> Sort (cost=2107001.20..2120882.79
rows=5552635 width=12)
Sort Key: p1.featuretype,
n1.placenameid, c1.containerid
-> Hash Join
(cost=688064.17..1217844.46 rows=5552635 width=12)
Hash Cond: ("outer".gazplaceid =
"inner".gazplaceid)
-> Seq Scan on gaznamings n1
(cost=0.00..156331.05 rows=10147805 width=8)
-> Hash
(cost=642816.39..642816.39 rows=6128713 width=16)
-> Hash Join
(cost=160244.91..642816.39 rows=6128713 width=16)
Hash Cond:
("outer".gazplaceid = "inner".gazplaceid)
-> Seq Scan on
gazcontainers c1 (cost=0.00..141636.45 rows=9193945 width=8)
-> Hash
(cost=120982.13..120982.13 rows=6128713 width=8)
-> Seq Scan
on gazplaces p1 (cost=0.00..120982.13 rows=6128713 width=8)
-> Sort (cost=1670225.33..1685547.11 rows=6128713
width=16)
Sort Key: p2.featuretype, n2.placenameid
-> Hash Join (cost=160244.91..684040.19
rows=6128713 width=16)
Hash Cond: ("outer".gazplaceid =
"inner".gazplaceid)
-> Seq Scan on gaznamings n2
(cost=0.00..156331.05 rows=10147805 width=8)
-> Hash (cost=120982.13..120982.13
rows=6128713 width=8)
-> Seq Scan on gazplaces p2
(cost=0.00..120982.13 rows=6128713 width=8)

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jim C. Nasby 2005-11-30 18:22:00 Re: [GENERAL] What is the deal with mailing lists?
Previous Message frank church 2005-11-30 18:18:01 How to check options PostgreSQL was started with