Query optimization using order by and limit

From: Michael Viscuso <michael(dot)viscuso(at)getcarbonblack(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query optimization using order by and limit
Date: 2011-09-21 23:14:09
Message-ID: CA+Z27QQDRQ1h_tD1Tm_+T+-cBdt00hLWYRnvXygxLf_JWNTxZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

First of all, thank you for taking the time to review my question. After
attending the PostgresOpen conference in Chicago last week, I've been
pouring over explain logs for hours on end and although my system is MUCH
better, I still can't resolve a few issues. Luckily my data is pretty well
structured so solving one issue will likely solve many more so I'll start
with this one.

Version: PostgreSQL 9.1rc1, compiled by Visual C++ build 1500, 64-bit
OS: Windows 7 64-bit
ORM: SQLAlchemy
Postgres table structure: I have daily partitioned tables for each of 4
"core tables" (the tables with the majority of my application's data). Each
daily table inherits from its parent. I do not explicitly define a
REFERENCE between these tables because I cannot guarantee the order in which
the events are inserted into the database, but where there are references,
the referenced row should exist in the other's daily table. The reason I
partitioned the data in this manner is to increase query speed and make it
easy to archive old data. (I'm new to high-end Postgres performance so
there's likely several fundamental flaws in my assumptions. I won't turn
down any recommendation.)

An example of a daily partitioned table follows:

cb=# \d osmoduleloads_2011_09_14;
Table "public.osmoduleloads_2011_09_14"
Column | Type |
Modifiers
-----------------------+-----------------------------+------------------------------------------------------------
guid | numeric(20,0) | not null
osprocess_guid | numeric(20,0) | not null
filepath_guid | numeric(20,0) | not null
firstloadtime | numeric(20,0) | not null
md5hash | bytea | not null
host_guid | numeric(20,0) | default NULL::numeric
process_create_time | numeric(20,0) | default NULL::numeric
process_filepath_guid | numeric(20,0) | default NULL::numeric
event_time | timestamp without time zone | default '2011-09-14
00:00:00'::timestamp without time zone
Indexes:
"osmoduleloads_2011_09_14_pkey" PRIMARY KEY, btree (guid)
"idx_osmoduleloads_2011_09_14_filepath_guid" btree (filepath_guid)
"idx_osmoduleloads_2011_09_14_firstload_time" btree (firstloadtime)
"idx_osmoduleloads_2011_09_14_host_guid" btree (host_guid)
"idx_osmoduleloads_2011_09_14_md5hash" btree (md5hash)
"idx_osmoduleloads_2011_09_14_osprocess_guid" btree (osprocess_guid)
Check constraints:
"osmoduleloads_2011_09_14_event_time_check" CHECK (event_time =
'2011-09-14 00:00:00'::timestamp without time zone)
"osmoduleloads_2011_09_14_firstloadtime_check" CHECK (firstloadtime >=
129604464000000000::bigint::numeric AND firstloadtime <
129605328000000000::bigint::numeric)
Inherits: osmoduleloads

Objective: The firstloadtime check constraint ensures that the record is
applicable to that daily table. (In case you were wondering, the large
numerics correspond to the Windows 100-nanosecond since the Epoch.) I'm
inserting millions of records into each daily table so "query slowness" is
quite easy to spot. Given that there is so much data per daily table, I was
hoping to use the order by and limit clauses to "stop out" a query once it
sufficed the limit clause and not be forced to visit each daily table.
However, I'm spending way too much time in the older tables than I'd like -
which leads me to believe that I;m doing something wrong. For ease of
viewing, my explain analyze can be found at http://explain.depesz.com/s/tot

I'm still very new to this so I'm not sure if explain.depesz.com saves the
original query. It wasn't readily apparent that it did so here is the
original query:

SELECT osm_1.*, storefiles_1.*, filepaths_1.*, filepaths_2.* FROM (SELECT *
FROM osmoduleloads JOIN hosts ON hosts.guid = osmoduleloads.host_guid WHERE
hosts.guid = '2007075705813916178' AND osmoduleloads.firstloadtime >=
129604320000000000 AND osmoduleloads.firstloadtime < 129610367990000000 AND
hosts.enabled = true AND hosts.user_id = 111 ORDER BY
osmoduleloads.firstloadtime DESC LIMIT 251) AS osm_1 LEFT OUTER JOIN
storefiles AS storefiles_1 ON osm_1.md5hash = storefiles_1.md5hash LEFT
OUTER JOIN filepaths AS filepaths_1 ON osm_1.process_filepath_guid =
filepaths_1.guid AND osm_1.event_time = filepaths_1.event_time LEFT OUTER
JOIN filepaths AS filepaths_2 ON osm_1.filepath_guid = filepaths_2.guid AND
osm_1.event_time= filepaths_2.event_time ORDER BY osm_1.firstloadtime DESC;

Hopefully my assumptions about order by and limit are correct and this query
can be optimized.

Again, appreciate any help you can lend. Thanks in advance.

Mike

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Smith 2011-09-22 01:48:41 Re: Query optimization using order by and limit
Previous Message Kevin Grittner 2011-09-21 18:41:17 Re: : Performance Improvement Strategy