Quick Links

Re: find close (duplicate) points + create index

From:	Elinor <elinor(at)wise(dot)tau(dot)ac(dot)il>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-novice(at)postgresql(dot)org, elinor(at)wise(dot)tau(dot)ac(dot)il
Subject:	Re: find close (duplicate) points + create index
Date:	2005-02-13 15:41:18
Message-ID:	1108309277.9368.15.camel@electra.tau.ac.il
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-novice

Hi Tom,
Following your advice quite a long time ago, I built an index on
my very large table called object:

myDB=# select count(*) from object;
count
---------
2797036
(1 row)
Time: 8354.25 ms

using RTREE INDEX on a column of type point:

myDB=# create index object_point_a on object using rtree
(box(point_a,point_a));
CREATE INDEX
Time: 220546.54 ms

Then I tried the query you suggested to find points closer to
each other than 1.5:
SELECT * FROM object a,object b
WHERE box(a.point_a,a.point_a)&&box(circle(b.point_a,sqrt(2)))
AND (a.point_a<->b.point_a)<1;

After many many minutes, it started giving these lines:
server sent data ("D" message) without prior row description
("T" message)
server sent data ("D" message) without prior row description
("T" message)
server sent data ("D" message) without prior row description
("T" message)
.....

Eventually I lost patience and killed it. I am pretty sure I
tried it once, but perhaps the table wasn't as big.
Any suggestions?

Thanks,
Elinor

Elinor Medezinski <elinor ( at ) bellatrix ( dot ) tau ( dot )
ac ( dot ) il> writes:

> And then I found out that in postgres the only operator classes defined for
> rtree indexes are: bigbox_ops, box_ops and poly_ops. Neither of which works
> with points, only with type box and polygon. Therefore I also have to create
> an operator class.

No you don't. What you want is a functional index built on a box or polygon
surrounding the point. For instance, given

regression=# create table p1 (point_a point);
CREATE TABLE
regression=# create index p1i on p1 using rtree (box(point_a, point_a));
CREATE INDEX

you could do searches for points enclosed in a specific box like this:

regression=# explain select * from p1 where box(point_a, point_a) && '(0,1),(0,1)'::box;
QUERY PLAN
----------------------------------------------------------------
Index Scan using p1i on p1 (cost=0.00..17.07 rows=5 width=16)
Index Cond: (box(point_a, point_a) && '(0,1),(0,1)'::box)
(2 rows)

since box-overlap (&&) is one of the rtree-indexable operators.

The most useful way to solve your original problem seems to be

regression=# explain select * from p1 a, p1 b where
regression-# box(a.point_a, a.point_a) && box(circle(b.point_a,sqrt(2)))
regression-# and (a.point_a <-> b.point_a) <= 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.01..17220.00 rows=1667 width=32)
Join Filter: (("inner".point_a <-> "outer".point_a) <= 1::double precision)
-> Seq Scan on p1 b (cost=0.00..20.00 rows=1000 width=16)
-> Index Scan using p1i on p1 a (cost=0.01..17.07 rows=5 width=16)
Index Cond: (box(a.point_a, a.point_a) && box(circle("outer".point_a, 1.4142135623731::double precision)))
(5 rows)

The indexable condition finds "a" rows that are within the bounding box
of a circle surrounding the "b" row, and then we only need to apply the
exact distance check to those rows.

(If you're wondering about the sqrt(2), there's an oddity in the
built-in circle-to-box function: it divides the circle radius
by sqrt(2). I think this is a bug and will propose changing it
for 7.5.)

regards, tom lane

On Wed, 2004-03-10 at 07:27, Tom Lane wrote:

> Elinor Medezinski <elinor(at)bellatrix(dot)tau(dot)ac(dot)il> writes:
> > I'm trying to find duplicate entries, where two entries are considered
> > duplicates if they're within a radius of 1, meaning something like
> > "select point from pointtable where distance between points <=1".
> > Obviously this is not SQL syntax.
>
> Well, it is if you do a self-join:
>
> select * from pointtable a, pointtable b
> where distance(a.point, b.point) <= 1;
>
> Postgres spells the "distance" operator as "<->", so this becomes
>
> select * from pointtable a, pointtable b
> where (a.point <-> b.point) <= 1;
>
> Making it fast is a more difficult problem :-( ... if you write the
> above query as-is then the system will sit there and compare each row of
> pointtable to each other row, looking for pairs of rows that match the
> where-clause. Okay if you just have some thousands of rows, but on a
> big table this will take longer than you want to wait.
>
> > Also, I also tried to build an index on that column, but there's no operator
> > class for type point. How can I do that?
>
> A btree index on a point column would be quite useless, since btree
> understands only a one-dimensional continuum with less-than, equal,
> greater-than relationships. But I think you might be able to do
> something with an rtree index. I'd look at making an rtree index on
> the unit box around each point, and then using an "overlaps" test as
> an indexable coarse filter before the exact distance check.
>
> regards, tom lane
>
> +++++++++++++++++++++++++++++++++++++++++++
> This Mail Was Scanned By Mail-seCure System
> at the Tel-Aviv University CC.

In response to

Re: find close (duplicate) points + create index at 2004-03-10 05:27:27 from Tom Lane

Responses

Re: find close (duplicate) points + create index at 2005-02-16 20:08:48 from Tom Lane

Browse pgsql-novice by date

	From	Date	Subject
Next Message	Les Carter	2005-02-14 06:54:51	export/import from one schema to another?
Previous Message	dim	2005-02-12 10:01:27	Re: plpgsql