From: | Paul Ramsey <pramsey(at)cleverelephant(dot)ca> |
---|---|
To: | Israel Brewster <israel(at)ravnalaska(dot)net> |
Cc: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Improve PostGIS performance with 62 million rows? |
Date: | 2017-01-05 22:09:03 |
Message-ID: | CACowWR2m1xp9N7FTsrLjCNsYXEtRX1_bGWh_vTTA7O2v5R7+sw@mail.gmail.com |
Lists: | pgsql-general |
Varying the segment length upwards might have a salutary effect for a
while, as the efficiency improvement of fewer inner loops battles with the
inefficiency of having more points selected by the index filter. Worth an
experiment.
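For example (a sketch; the only change from the query quoted below is the
ST_Segmentize length, while the ST_DWithin radius stays at 600):

WITH dump AS (
    SELECT ST_DumpPoints(
        ST_Segmentize(
            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
            1200  -- try 1200, 2400, 4800 ... against the original 600
        )::geometry
    ) AS pt
)
-- ... rest of the query unchanged, still joining on
-- ST_DWithin(location, segments.short_line, 600)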
P
On Thu, Jan 5, 2017 at 1:00 PM, Israel Brewster <israel(at)ravnalaska(dot)net>
wrote:
>
> On Jan 5, 2017, at 10:38 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
> wrote:
>
> Yes, you did. You want a query that spits out a tupleset of geometries
> (one for each wee segment), and then you can join that set to your
> main table using st_dwithin() as the join clause.
> So start by ditching the main table and just work on a query that
> generates a pile of wee segments.
>
>
> Ahhh, I see you've done this sort of thing before
> (http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html) :-)
>
> So following that advice I came up with the following query:
>
> WITH dump AS (
>     SELECT ST_DumpPoints(
>         ST_Segmentize(
>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>             600
>         )::geometry
>     ) AS pt
> ),
> pts AS (
>     SELECT (pt).geom, (pt).path[1] AS vert FROM dump
> )
> SELECT elevation
> FROM data
> INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
>             FROM pts a
>             INNER JOIN pts b
>             ON a.vert = b.vert - 1 AND b.vert > 1) segments
> ON ST_DWithin(location, segments.short_line, 600)
> ORDER BY elevation DESC LIMIT 1;
>
> Which yields the following EXPLAIN ANALYZE
> (https://explain.depesz.com/s/RsTD):
>
>                                                        QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Limit (cost=11611706.90..11611706.91 rows=1 width=4) (actual time=1171.814..1171.814 rows=1 loops=1)
>   CTE dump
>     -> Result (cost=0.00..5.25 rows=1000 width=32) (actual time=0.024..1.989 rows=1939 loops=1)
>   CTE pts
>     -> CTE Scan on dump (cost=0.00..20.00 rows=1000 width=36) (actual time=0.032..4.071 rows=1939 loops=1)
>   -> Sort (cost=11611681.65..11611768.65 rows=34800 width=4) (actual time=1171.813..1171.813 rows=1 loops=1)
>        Sort Key: data.elevation DESC
>        Sort Method: top-N heapsort Memory: 25kB
>        -> Nested Loop (cost=0.55..11611507.65 rows=34800 width=4) (actual time=0.590..1167.615 rows=28408 loops=1)
>             -> Nested Loop (cost=0.00..8357.50 rows=1665 width=64) (actual time=0.046..663.475 rows=1938 loops=1)
>                  Join Filter: (a.vert = (b.vert - 1))
>                  Rows Removed by Join Filter: 3755844
>                  -> CTE Scan on pts b (cost=0.00..22.50 rows=333 width=36) (actual time=0.042..0.433 rows=1938 loops=1)
>                       Filter: (vert > 1)
>                       Rows Removed by Filter: 1
>                  -> CTE Scan on pts a (cost=0.00..20.00 rows=1000 width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
>             -> Index Scan using location_gix on data (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256 rows=15 loops=1938)
>                  Index Cond: (location && _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision))
>                  Filter: (((st_makeline(ARRAY[a.geom, b.geom]))::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, (st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision, true))
>                  Rows Removed by Filter: 7
> Planning time: 4.318 ms
> Execution time: 1171.994 ms
> (22 rows)
>
> So not bad. Went from 20+ seconds to a little over 1 second. Still
> noticeable for an end user, but definitely usable - and as mentioned,
> that's a worst-case scenario query. Thanks!
>
> Of course, if you have any suggestions for further improvement, I'm all
> ears :-)
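>
> For the real-time display I'll probably end up wrapping this in a
> function, something along these lines (an untested sketch; the function
> name and parameter are just illustrative):
>
> CREATE OR REPLACE FUNCTION max_elevation_along(path_ewkt text)
> RETURNS integer AS $$
>     -- densify the route, dump the points, number them along the path
>     WITH dump AS (
>         SELECT ST_DumpPoints(
>             ST_Segmentize(ST_GeographyFromText(path_ewkt), 600)::geometry
>         ) AS pt
>     ),
>     pts AS (
>         SELECT (pt).geom, (pt).path[1] AS vert FROM dump
>     )
>     -- join each short segment to the elevation table and keep the max
>     SELECT elevation
>     FROM data
>     INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
>                 FROM pts a
>                 INNER JOIN pts b ON a.vert = b.vert - 1 AND b.vert > 1) segments
>     ON ST_DWithin(location, segments.short_line, 600)
>     ORDER BY elevation DESC LIMIT 1;
> $$ LANGUAGE sql STABLE;
>
> -- e.g.: SELECT max_elevation_along('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)');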
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------
>
>
> On Thu, Jan 5, 2017 at 11:36 AM, Israel Brewster <israel(at)ravnalaska(dot)net>
> wrote:
>
>> On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
>> wrote:
>>
>>
>> The index filters using bounding boxes. A long, diagonal route will have
>> a large bounding box relative to the area you actually care about (a
>> narrow strip along the route). Use ST_Segmentize() to add points to your
>> route, ST_DumpPoints() to dump those out as points, and ST_MakeLine() to
>> generate new lines from those points, each line very short. The maximum
>> index effectiveness will come when your line length is close to your
>> buffer width.
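>>
>> For example, the segment-generation step alone might look something like
>> this (an untested sketch; it uses ST_DumpPoints as a set-returning
>> function in FROM, and a window function to pair each point with its
>> successor):
>>
>> WITH pts AS (
>>     SELECT dp.geom, dp.path[1] AS vert
>>     FROM ST_DumpPoints(
>>         ST_Segmentize(
>>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>             600
>>         )::geometry
>>     ) AS dp
>> )
>> SELECT ST_MakeLine(geom, next_geom) AS short_line
>> FROM (SELECT geom, lead(geom) OVER (ORDER BY vert) AS next_geom
>>       FROM pts) p
>> WHERE next_geom IS NOT NULL;  -- the last point has no successor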
>>
>> P
>>
>>
>> Ok, I think I understand the concept. So attempting to follow your
>> advice, I modified the query to be:
>>
>> SELECT elevation
>> FROM data
>> WHERE ST_DWithin(
>>     location,
>>     (SELECT ST_MakeLine(geom)::geography AS split_line
>>      FROM (SELECT (ST_DumpPoints(
>>                        ST_Segmentize(
>>                            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>                            600
>>                        )::geometry
>>                    )).geom
>>           ) s1),
>>     600
>> )
>> ORDER BY elevation DESC LIMIT 1;
>>
>> It took some fiddling to find a syntax that PostgreSQL would accept, but
>> eventually that's what I came up with. Unfortunately, far from improving
>> performance, it killed it - in running the query, it went from 22 seconds
>> to several minutes (EXPLAIN ANALYZE has yet to return a result). Looking
>> at the query execution plan shows, at least partially, why:
>>
>>                                 QUERY PLAN
>> --------------------------------------------------------------------------------
>> Limit (cost=17119748.98..17119748.98 rows=1 width=4)
>>   InitPlan 1 (returns $0)
>>     -> Aggregate (cost=17.76..17.77 rows=1 width=32)
>>          -> Result (cost=0.00..5.25 rows=1000 width=32)
>>   -> Sort (cost=17119731.21..17171983.43 rows=20900890 width=4)
>>        Sort Key: data.elevation DESC
>>        -> Seq Scan on data (cost=0.00..17015226.76 rows=20900890 width=4)
>>             Filter: st_dwithin(location, $0, '600'::double precision)
>> (8 rows)
>>
>> So apparently it is now doing a sequential scan on data rather than using
>> the index. And, of course, sorting 20 million rows is not trivial either.
>> Did I do something wrong with forming the query?
>>
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
>>
>>
>> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <israel(at)ravnalaska(dot)net>
>> wrote:
>>
>>> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of
>>> latitude (numeric), longitude (numeric), elevation (integer) data, along
>>> with a PostGIS (2.3.0) geometry column (location), running on a CentOS 6.8
>>> box with 64GB RAM and a RAID10 SSD data drive. I'm trying to get the
>>> maximum elevation along a path, for which purpose I've come up with the
>>> following query (for one particular path example):
>>>
>>> SELECT elevation
>>> FROM data
>>> WHERE ST_DWithin(
>>>     location,
>>>     ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>>     600
>>> )
>>> ORDER BY elevation DESC LIMIT 1;
>>>
>>> The EXPLAIN ANALYZE output of this particular query
>>> (https://explain.depesz.com/s/heZ) shows:
>>>
>>>                                              QUERY PLAN
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Limit (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 rows=1 loops=1)
>>>   -> Sort (cost=4.83..4.83 rows=1 width=4) (actual time=22653.837..22653.837 rows=1 loops=1)
>>>        Sort Key: elevation DESC
>>>        Sort Method: top-N heapsort Memory: 25kB
>>>        -> Index Scan using location_gix on data (cost=0.42..4.82 rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>>>             Index Cond: (location && '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
>>>             Filter: (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography, '600'::double precision, true))
>>>             Rows Removed by Filter: 4934534
>>> Planning time: 0.741 ms
>>> Execution time: 22653.906 ms
>>> (10 rows)
>>>
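>>> (For reference, location_gix is the spatial index on the location
>>> column, presumably created along the lines of:
>>>
>>> CREATE INDEX location_gix ON data USING GIST (location);
>>>
>>> since the plan's && / _st_expand filters are the usual GiST pattern.)
>>>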
>>> So it is using the index properly, but still takes a good 22 seconds to
>>> run, most of which appears to be in the Index Scan.
>>>
>>> Is there any way to improve this, or is this going to be about as good
>>> as it gets with the number of rows being dealt with? I was planning to use
>>> this for a real-time display - punch in a couple of points, get some
>>> information about the route between, including maximum elevation - but with
>>> it taking 22 seconds for the longer routes at least, that doesn't make for
>>> the best user experience.
>>>
>>> It's perhaps worth noting that the example above is most likely a
>>> worst-case scenario. I would expect the vast majority of routes to be
>>> significantly shorter, and I want to say the shorter routes query much
>>> faster [testing needed]. That said, the faster the better, even for short
>>> routes :-)
>>> -----------------------------------------------
>>> Israel Brewster
>>> Systems Analyst II
>>> Ravn Alaska
>>> 5245 Airport Industrial Rd
>>> Fairbanks, AK 99709
>>> (907) 450-7293
>>> -----------------------------------------------