Re: application of KNN code to US zipcode searches?

From: Mark Stosberg <mark(at)summersault(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: application of KNN code to US zipcode searches?
Date: 2011-02-17 15:20:58
Message-ID: ijjecs$e8c$1@dough.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


> I thought the benefit of KNN was that you could retrieve the rows in
> distance order, so that a query for the closest 20 locations (for
> example) would be very fast. I wouldn't have expected it to be
> helpful when you're selecting all the rows regardless of distance.

Kevin,

Thanks for the feedback. You are right that my "reduced test case"
wasn't a good approximation. I added a limit, to simulate finding the
100 zipcodes closest to 90210.

Below I compare 4 approaches to the same query:

1. Cube search
2. Earth Distance Search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks to be almost 100x faster! That's very promising.
Then there's only the issue that simple point distance is not expected
to be a good enough approximation of earth-distances. Perhaps that can
be solved by pre-computing coordinates based on the lat/long pairs....
much like the map projections used to present a curved surface on a flat
map? Given that's OK to be be a few miles off, it seems we have some
leeway here.

Recommendations?

Mark

EXPLAIN ANALYZE
SELECT zipcode,
cube_distance( '(-2513120.64361786, -4645511.0460328,
3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius
FROM zipcodes ORDER BY radius LIMIT 100;

---------------------------------------------------------------
Limit (cost=2946.70..2946.95 rows=100 width=62) (actual
time=167.650..168.064 rows=100 loops=1)
-> Sort (cost=2946.70..3050.40 rows=41483 width=62) (actual
time=167.644..167.829 rows=100 loops=1)
Sort Key: ((cube_distance('(-2513120.64361786,
-4645511.0460328, 3575538.9507084)'::cube, earth_coords) /
1609.344::double precision))
Sort Method: top-N heapsort Memory: 20kB
-> Seq Scan on zipcodes (cost=0.00..1361.24 rows=41483
width=62) (actual time=0.030..90.807 rows=41483 loops=1)
Total runtime: 168.300 ms

############################################################3

-- Using Earthdistance
EXPLAIN ANALYZE SELECT zipcode,
lon_lat <@> '(-118.412426,34.096629)' As radius
FROM zipcodes
ORDER BY lon_lat <@> '(-118.412426,34.096629)'
LIMIT 100;

------------------------------------------------------------
Limit (cost=2842.99..2843.24 rows=100 width=22) (actual
time=187.995..188.451 rows=100 loops=1)
-> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=187.989..188.149 rows=100 loops=1)
Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
Sort Method: top-N heapsort Memory: 20kB
-> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.033..108.203 rows=41483 loops=1)
Total runtime: 188.660 ms

##########################################

Using simple point distance, but with no Gist Index:

EXPLAIN ANALYZE SELECT zipcode,
lon_lat <-> '(-118.412426,34.096629)' As radius
FROM zipcodes
ORDER BY lon_lat <-> '(-118.412426,34.096629)'
LIMIT 100;

--------------------------------------------------------
Limit (cost=2842.99..2843.24 rows=100 width=22) (actual
time=160.574..161.057 rows=100 loops=1)
-> Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=160.568..160.691 rows=100 loops=1)
Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
Sort Method: top-N heapsort Memory: 20kB
-> Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.027..84.610 rows=41483 loops=1)
Total runtime: 161.226 ms

#########################################

-- Using KNN-GIST index
EXPLAIN ANALYZE SELECT zipcode,
lon_lat <-> '(-118.412426,34.096629)' As radius
FROM zipcodes
ORDER BY lon_lat <-> '(-118.412426,34.096629)'
LIMIT 100;
------------------------------------------------------------------
Limit (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892
rows=100 loops=1)
-> Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93
rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
Total runtime: 2.198 ms

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2011-02-17 15:22:17 Re: Really really slow select count(*)
Previous Message Kevin Grittner 2011-02-17 14:49:28 Re: application of KNN code to US zipcode searches?