Re: find close (duplicate) points + create index

From: <ghaverla(at)freenet(dot)edmonton(dot)ab(dot)ca>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: find close (duplicate) points + create index
Date: 2004-03-10 12:27:43
Message-ID: Pine.A41.3.95.1040310051803.20242D-100000@fn2.freenet.edmonton.ab.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

On Wed, 10 Mar 2004, Tom Lane wrote:
> Elinor Medezinski <elinor(at)bellatrix(dot)tau(dot)ac(dot)il> writes:

> > I'm trying to find duplicate entries, where two entries are considered
> > duplicates if they're within a radius of 1, meaning something like
> > "select point from pointtable where distance between points <=1".
> > Obviously this is not SQL syntax.
>
> Well, it is if you do a self-join:
>
> select * from pointtable a, pointtable b
> where distance(a.point, b.point) <= 1;
>
> Postgres spells the "distance" operator as "<->", so this becomes
>
> select * from pointtable a, pointtable b
> where (a.point <-> b.point) <= 1;
>
> Making it fast is a more difficult problem :-( ... if you write the
> above query as-is then the system will sit there and compare each row of
> pointtable to each other row, looking for pairs of rows that match the
> where-clause. Okay if you just have some thousands of rows, but on a
> big table this will take longer than you want to wait.

I'm guessing distance is defined to be:
sqrt( (x1-x0)^2 + (y1-y0)^2 )
or
sqrt( (x1-x0)^2 + (y1-y0)^2 + (z1-z0)^2 )

You obviously need to find the difference in the x,y (and possibly
z if used) coordinates, as is multiplying that difference by
itself and summing them. However, you don't need to take the
square root, which is often a computationally expensive thing.
Two points which are close in terms of distance, are also close in
terms of distance squared.

If you are only interested in points that are within a distance of
1 (or a distance squared of 1*1), then you can stop calculating
the distance when any of the differences is bigger than 1. Since
you don't care, other than knowing it is too big, just stop the
calculation and return some number bigger than 1. That should
help in speeding things up a bit. How you do this in PostgreSQL,
I can't help you. :-)

> > Also, I also tried to build an index on that column, but
> > there's no operator class for type point. How can I do that?
>
> A btree index on a point column would be quite useless, since btree
> understands only a one-dimensional continuum with less-than, equal,
> greater-than relationships. But I think you might be able to do
> something with an rtree index. I'd look at making an rtree index on
> the unit box around each point, and then using an "overlaps" test as
> an indexable coarse filter before the exact distance check.

You probably want to look a little at how GIS (Geographical
Information Systems) works. There are lots of different
techniques they use to search and partition things. Quadtree
and oct-tree come to mind.

Gord

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Mohan 2004-03-10 13:08:40 Re: local usage was: Re: using pgsql on my comp only without
Previous Message ghaverla 2004-03-10 12:17:26 local usage was: Re: using pgsql on my comp only without tcp