Re: Fwd: Raster performance

From: David Haynes II <dahaynes(at)umn(dot)edu>
To: PT <wmoran(at)potentialtech(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fwd: Raster performance
Date: 2015-05-29 18:37:24
Message-ID: CAKNXh=ohRvRSAbJ60aJGjVrvS+Bagm9oDE4yregV5ti9T0q7bw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

Let me explain this a bit more clearly. The dataset (table) that we are
using in this analysis is produced from a satellite image (tif) called
MODIS. This image can be placed into PostgreSQL using the raster2pgsql
command (http://postgis.net/docs/using_raster_dataman.html). The
raster2pgsql command reads the tif image and creates a table that can be
used with the PostGIS extensions. There is no difference between the
modis_noout and modis, what has happened is that we have used the
raster2pgsl -R flag in the conversion process. The -R flag creates metadata
for the raster file. For example, the modis dataset as tif is 46 GB, the
same dataset stored as table within the database is 1.74 GB and the using
outdb the metadata table is 52 MB.

We have seen a decrease in performance time when using OutDB, keep in mind
that the schema and indices on the tables are exactly the same.

CREATE TABLE modis (
rid integer NOT NULL,
rast public.raster,
filename text,
CONSTRAINT enforce_height_rast CHECK ((public.st_height(rast) = 250)),
CONSTRAINT enforce_nodata_values_rast CHECK
(((public._raster_constraint_nodata_values(rast))::numeric(16,10)[] =
'{NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL}'::numeric(16,10)[])),
CONSTRAINT enforce_num_bands_rast CHECK ((public.st_numbands(rast) =
12)),
CONSTRAINT enforce_out_db_rast CHECK
((public._raster_constraint_out_db(rast) =
'{t,t,t,t,t,t,t,t,t,t,t,t}'::boolean[])),
CONSTRAINT enforce_pixel_types_rast CHECK
((public._raster_constraint_pixel_types(rast) =
'{8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI}'::text[])),
CONSTRAINT enforce_same_alignment_rast CHECK
(public.st_samealignment(rast,
'01000000002EC50BE300F57C4026C50BE300F57CC0E7FBA955801673C13202AA558016634100000000000000000000000000000000BA1A000001000100'::public.raster)
),
CONSTRAINT enforce_scalex_rast CHECK
(((public.st_scalex(rast))::numeric(16,10) =
463.312716527917::numeric(16,10))),
CONSTRAINT enforce_scaley_rast CHECK
(((public.st_scaley(rast))::numeric(16,10) =
(-463.312716527917)::numeric(16,10))),
CONSTRAINT enforce_srid_rast CHECK ((public.st_srid(rast) = 6842)),
CONSTRAINT enforce_width_rast CHECK ((public.st_width(rast) = 250))
);
CREATE SEQUENCE modis_rid_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
ALTER TABLE ONLY modis ALTER COLUMN rid SET DEFAULT
nextval('modis_rid_seq'::regclass);
ALTER TABLE ONLY modis ADD CONSTRAINT modis_pkey PRIMARY KEY (rid);
CREATE INDEX modis_rast_gist ON modis USING gist
(public.st_convexhull(rast));

Here is the explain analyze

InDB
Sort (cost=69547.29..69562.80 rows=6204 width=254) (actual
time=131042.478..131042.877 rows=7612 loops=1)
Sort Key: r.id, r.name
Sort Method: quicksort Memory: 787kB
CTE poly
-> Seq Scan on us_counties (cost=0.00..112.86 rows=3109 width=62247)
(actual time=3.008..8135.679 rows=3109 loops=1)
CTE rast_select
-> Nested Loop (cost=0.28..51767.41 rows=62033 width=272) (actual
time=170.799..128606.266 rows=7677 loops=1)
-> CTE Scan on poly s (cost=0.00..62.18 rows=3109 width=250)
(actual time=3.058..8212.030 rows=3109 loops=1)
-> Index Scan using modis_noout_rast_gist on modis_noout r_1
(cost=0.28..16.56 rows=2 width=22) (actual time=4.329..11.244 rows=2
loops=3109)
Index Cond: ((rast)::geometry && s.geom)
Filter: _st_intersects(s.geom, rast, NULL::integer)
Rows Removed by Filter: 0
-> HashAggregate (cost=17214.16..17276.20 rows=6204 width=254) (actual
time=131033.902..131036.105 rows=7612 loops=1)
-> CTE Scan on rast_select r (cost=0.00..1240.66 rows=62033
width=254) (actual time=170.814..128659.842 rows=7677 loops=1)
Total runtime: 131414.752 ms

OutDB
Sort (cost=93911.29..93926.80 rows=6204 width=254) (actual
time=866326.762..866327.148 rows=7612 loops=1)
Sort Key: r.id, r.name
Sort Method: quicksort Memory: 787kB
CTE poly
-> Seq Scan on us_counties (cost=0.00..112.86 rows=3109 width=62247)
(actual time=1.327..6954.020 rows=3109 loops=1)
CTE rast_select
-> Nested Loop (cost=0.28..76131.41 rows=62033 width=1086) (actual
time=257.610..863474.778 rows=7677 loops=1)
-> CTE Scan on poly s (cost=0.00..62.18 rows=3109 width=250)
(actual time=1.341..7030.138 rows=3109 loops=1)
-> Index Scan using modis_rast_gist on modis r_1
(cost=0.28..24.40 rows=2 width=836) (actual time=1.481..3.952 rows=2
loops=3109)
Index Cond: ((rast)::geometry && s.geom)
Filter: _st_intersects(s.geom, rast, NULL::integer)
Rows Removed by Filter: 0
-> HashAggregate (cost=17214.16..17276.20 rows=6204 width=254) (actual
time=866317.923..866320.316 rows=7612 loops=1)
-> CTE Scan on rast_select r (cost=0.00..1240.66 rows=62033
width=254) (actual time=257.625..863555.082 rows=7677 loops=1)
Total runtime: 866691.113 ms

On Fri, May 29, 2015 at 6:40 AM, PT <wmoran(at)potentialtech(dot)com> wrote:

> On Thu, 28 May 2015 10:06:24 -0500
> David Haynes II <dahaynes(at)umn(dot)edu> wrote:
> >
> > The query run times are significantly slower on outdb as that using indb
> > here are the run times on 2 queries.
> >
> > ST_Count/ST_Clip(select single band here)/Inner Join/ST_Transform (US
> > Counties)
> > OutDB: 873.564s (14 minutes 33s) InDB: 127.36s (2 minutes 7s)
> >
> > ST_Count(select single band here)/ST_Clip(on all bands)/Inner
> > Join/ST_Transform (US Counties)
> > OutDB: 9537.371s (2 hours 38minutes) InDB: 310s (5 minutes 10 seconds)
> >
> > In the query planner it shows a large change in the number of columns
> > (width) that are picked up in the CTE_rast_select.
> > These extra columns slow down the ability to process the data.
>
> As I mentioned before, that's not how that works. Those are estimates. If
> those rows are actually different between the two tables, then your data
> is different between the two databases and you can't expect the performance
> to be the same.
>
> Additionally, the part you snip out below isn't the part that's different.
> In particular, the difference is coming from the fact that one of the
> plans uses modis and the other uses modis_noout.
>
> Does modis exist in indb? Does modis_noout exist on outdb? What is the
> difference between these two tables? Because _that_ is where the time
> difference is most likely happening (based on the explain output).
>
> Additionally, run EXPLAIN ANALYZE on these queries to get the actual
> times in addition to the estimates. Furthermore, given that a lot of
> the confusion in this question is due to a lack of iformation, it would
> be a good idea to include the table definitions.
>
> > OUT DB
> > CTE rast_select
> > > -> Nested Loop (cost=0.28..76131.41 rows=62033 *width=1086)*
> > > -> CTE Scan on poly s (cost=0.00..62.18 rows=3109
> width=250)
> >
> >
> > In DB
> > Nested Loop (cost=0.28..51767.41 rows=62033 *width=272*)
> > > -> CTE Scan on poly s (cost=0.00..62.18 rows=3109
> width=250)
> >
> > On Wed, May 27, 2015 at 4:31 PM, PT <wmoran(at)potentialtech(dot)com> wrote:
> >
> > > On Tue, 26 May 2015 12:52:24 -0500
> > > David Haynes II <dahaynes(at)umn(dot)edu> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a question about the query optimizer and its performance on
> > > spatial
> > > > datasets, specifically rasters. My use case is rather unique, the
> > > > application that I am developing allows users to request
> summarizations
> > > of
> > > > various geographic boundaries around the world. Therefore our raster
> > > > datasets are global. We are in the process of conducting some
> benchmarks
> > > > for our system and we noticed something unexpected.
> > > >
> > > > The query is the same except the first is run on a raster (46gigs)
> in out
> > > > of database (outdb) and the second is the same raster (46gigs)
> stored in
> > > > database (indb). The raster is multibanded (13), with each band
> > > > representing one entire MODIS global scene. A single year of MODIS is
> > > > approximately 3.6 gigs.
> > > >
> > > > The outdb is being out performed by indb, because the query optimizer
> > > gets
> > > > smarter. But what is also interesting is all the extra pieces that
> are
> > > > brought in with outdb.
> > > >
> > > > with poly as
> > > > ( SELECT gid, label as name, ST_Transform(geom, 6842) as geom FROM
> > > > us_counties )
> > > > , rast_select as
> > > > ( SELECT r.rid as id, s.name , ST_CLIP(r.rast, s.geom) as rast from
> > > > rasters.glc2000 as r inner join poly as s on ST_Intersects(r.rast,
> > > s.geom) )
> > > > select r.id, r.name, ST_Count(r.rast, 1, True)
> > > >
> > > >
> > > > QUERY PLAN With Outdb
> > > >
> > >
> --------------------------------------------------------------------------------------------------
> > > > Sort (cost=93911.29..93926.80 rows=6204 width=254)
> > > > Sort Key: r.id, r.name
> > > > CTE poly
> > > > -> Seq Scan on us_counties (cost=0.00..112.86 rows=3109
> > > width=62247)
> > > > CTE rast_select
> > > > -> Nested Loop (cost=0.28..76131.41 rows=62033 width=1086)
> > > > -> CTE Scan on poly s (cost=0.00..62.18 rows=3109
> width=250)
> > > > -> Index Scan using modis_rast_gist on modis r_1
> > > > (cost=0.28..24.40 rows=2 width=836)
> > > > Index Cond: ((rast)::geometry && s.geom)
> > > > Filter: _st_intersects(s.geom, rast, NULL::integer)
> > > > -> HashAggregate (cost=17214.16..17276.20 rows=6204 width=254)
> > > > -> CTE Scan on rast_select r (cost=0.00..1240.66 rows=62033
> > > > width=254)
> > > >
> > > > QUERY PLAN With Indb
> > > >
> > > >
> > >
> -------------------------------------------------------------------------------------------------------------
> > > > Sort (cost=69547.29..69562.80 rows=6204 width=254)
> > > > Sort Key: r.id, r.name
> > > > CTE poly
> > > > -> Seq Scan on us_counties (cost=0.00..112.86 rows=3109
> > > width=62247)
> > > > CTE rast_select
> > > > -> Nested Loop (cost=0.28..51767.41 rows=62033 width=272)
> > > > -> CTE Scan on poly s (cost=0.00..62.18 rows=3109
> width=250)
> > > > -> Index Scan using modis_noout_rast_gist on modis_noout
> r_1
> > > > (cost=0.28..16.56 rows=2 width=22)
> > > > Index Cond: ((rast)::geometry && s.geom)
> > > > Filter: _st_intersects(s.geom, rast, NULL::integer)
> > > > -> HashAggregate (cost=17214.16..17276.20 rows=6204 width=254)
> > > > -> CTE Scan on rast_select r (cost=0.00..1240.66 rows=62033
> > > > width=254)
> > >
> > > I could be missing something here, but I don't see how long the queries
> > > actually take to run. Have you actually run the queries and timed them?
> > > Keep in mind that analyze does not actually run the query, it only
> plans
> > > it, so the actual run time is unknown if all you do is analyze.
> > >
> > > The query plans appear to be equal, assuming there are slight variances
> > > in the names of tables from one DB to another (and I assume that your
> > > description of indb and outdb reflects the fact that there are (for
> > > reasons unknown) two copies of the data).
> > >
> > > The only purpose to those estimates is to choose a good plan. If the
> > > plan is bad for one database and both databases have the same data,
> then
> > > the plan will be bad for both.
> > >
> > > Since there have been no other responses, I'm guessing that others are
> > > confused by your question as well. Can you describe the actual problem
> > > that you're seeing?
> > >
> > > --
> > > Bill Moran <wmoran(at)potentialtech(dot)com>
> > >
> > >
> > > --
> > > Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-general
> > >
> >
> >
> >
> > --
> > David Haynes, Ph.D.
> > Research Associate Terra Populus
> > Minnesota Population Center
> > www.terrapop.org
>
>
> --
> Bill Moran <wmoran(at)potentialtech(dot)com>
>

--
David Haynes, Ph.D.
Research Associate Terra Populus
Minnesota Population Center
www.terrapop.org

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Robert Haas 2015-05-29 19:08:11 Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Previous Message John McKown 2015-05-29 18:10:49 Re: Between with a list of ranges possible ?