Re: A very long running query....

From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 19:16:36
Message-ID: 500B0014.5070600@anatec.com
Lists: pgsql-novice pgsql-performance

On 21/07/2012 00:10, Tom Lane wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>> Looking at this:
>> " -> Index Scan using
>> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
>> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
>> time=62.124..5486270.845 rows=387524 loops=1)"
>> " Index Cond: ((date_trunc('day'::text,
>> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
>> AND (src_id = 1))"
>> " Filter: ((date_part('day'::text,
>> msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
>> || (pos_georef2)::text) || (pos_georef3)::text) ||
>> (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
>> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
>> <> ''::text))"
> I think the real problem is that the planner has no hope of doing
> anything very accurate with such an unwieldy filter condition. I'd look
> at ways of making the filter conditions simpler, perhaps by recasting
> the data representation. In particular, that's a horridly bad way of
> asking whether some columns are empty, which I gather is the intent.
> If you really want to do it just like that, creating an index on the
> concatenation expression would guide ANALYZE to collect some stats about
> it, but it would probably be a lot more efficient to put together an AND
> or OR of tests on the individual columns.
>
> regards, tom lane
OK, regarding the index: I followed your second piece of advice (tests on
the individual columns, for efficiency) and changed it to:

CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
ON feed_all_y2012m07.message_copies_wk2
USING btree
(date_trunc('day'::text, msg_date_rec), src_id, pos_georef1,
pos_georef2, pos_georef3, pos_georef4)
TABLESPACE "index"
WHERE
pos_georef1 IS NOT NULL
AND NOT pos_georef1::text = ''::text
AND pos_georef2 IS NOT NULL
AND NOT pos_georef2::text = ''::text
AND pos_georef3 IS NOT NULL
AND NOT pos_georef3::text = ''::text
AND pos_georef4 IS NOT NULL
AND NOT pos_georef4::text = ''::text;
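(For comparison, Tom's other suggestion, an index on the concatenation
expression itself so that ANALYZE collects statistics on it, would look
roughly like the sketch below; the index name is made up for illustration
and this is untested against the actual schema:)

```sql
-- Sketch of the expression-index alternative: indexing the same
-- concatenation the original filter used, so ANALYZE gathers stats on it.
CREATE INDEX idx_message_copies_wk2_georef_concat
    ON feed_all_y2012m07.message_copies_wk2
    ((pos_georef1::text || pos_georef2::text ||
      pos_georef3::text || pos_georef4::text));
```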

The query has been changed as well, as follows:

SELECT
src_id,
date_trunc('day', message_copies.msg_date_rec) as date_count,
message_copies.pos_georef1,
message_copies.pos_georef2,
message_copies.pos_georef3,
message_copies.pos_georef4,
ais_server.array_accum(CASE WHEN msg_type BETWEEN 1 and 3
THEN message_copies.msg_id END) as msgA_array,
ais_server.array_accum(CASE WHEN msg_type = 18 THEN
message_copies.msg_id END) as msgB_std_array,
ais_server.array_accum(CASE WHEN msg_type = 19 THEN
message_copies.msg_id END) as msgB_ext_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A'
THEN obj_mmsi END)
) as mmsi_type_A_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B'
THEN obj_mmsi END)
) as mmsi_type_B_array,
avg(ship_speed) / 10.0 as avg_speed,
avg(ship_heading) as avg_heading,
avg(ship_course) / 10.0 as avg_course,
ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom
from
feed_all_y2012m07.message_copies join
(feed_all_y2012m07.ship_pos_messages join
ais_server.ship_objects on (ship_pos_messages.obj_id =
ship_objects.obj_id))
on (message_copies.msg_id = ship_pos_messages.msg_id)
where
extract('day' from message_copies.msg_date_rec) = 17
and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
and message_copies.src_id = 5
and not message_copies.pos_georef1 = '' and not
message_copies.pos_georef2 = '' and not message_copies.pos_georef3 = ''
and not message_copies.pos_georef4 = ''
and message_copies.pos_georef1 is not null and
message_copies.pos_georef2 is not null and message_copies.pos_georef3 is
not null and message_copies.pos_georef4 is not null
and extract('day' from ship_pos_messages.msg_date_rec) = 17
group by src_id, date_count, message_copies.pos_georef1,
message_copies.pos_georef2, message_copies.pos_georef3,
message_copies.pos_georef4;

I am not sure that I can see an improvement. At least for src_id values
that have lots of msg_id rows per day, the query never returned, even
five hours after running "explain analyze". For smaller src_id values
(message-wise) there might be some improvement, or it may just have been
the ANALYZE that I ran. As I said, the statistics quickly go out of date
because of the large number of updates. So it looks like the problem is
neither the "funny" concatenation in the "where" clause nor some kind of
index construction issue, which brings us back to the per-column
"statistics_target". My problem is that, given the query plan I sent you
yesterday, I am not sure which columns' statistics_target to touch and
what sort of number to set. Is there any rule of thumb?
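For what it's worth, the kind of per-column change I have in mind would
look something like the sketch below; the columns are the ones from the
plan's index condition, and 1000 is just an illustrative value (the
default is 100), not a recommendation:

```sql
-- Raise the statistics target for the columns used in the selective
-- predicates; the new target only takes effect at the next ANALYZE.
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN src_id SET STATISTICS 1000;
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN pos_georef1 SET STATISTICS 1000;

ANALYZE feed_all_y2012m07.message_copies_wk2;
```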

Kind regards
Yiannis

Browse pgsql-novice by date

  From Date Subject
Next Message Claudio Freire 2012-07-21 19:19:20 Re: A very long running query....
Previous Message Ioannis Anagnostopoulos 2012-07-21 17:42:31 Re: A very long running query....
