Re: Non-deterministic 100% CPU hang on postgres 9.3

From: Fenn Bailey <fenn(dot)bailey(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Non-deterministic 100% CPU hang on postgres 9.3
Date: 2014-04-16 04:25:01
Message-ID: CAKnSMKnUo5mJC4FPgNQokF5SvokZRXnN=JEPCAqPQVnKwY6zzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Some more information on this:

>
> > I say "hang", where really I've given up after ~12 hours execution. The
> > exact same query can then be terminated and run in <90 seconds, with none
> > of the underlying data changing.
>
> But does the plan change?
>
>
As it turns out it does. Some further examination shows that the plan
changes, at least partially due to the fact that I now see this query
attempts to modify the data that it's querying (ie: if any updates have
occurred, the subsequent query is not the same).

That said, all queries were performed in a transaction, so when they're
aborted, no modification should be made to the underlying data.

This makes it even more odd that on a subsequent run of the same query
(after the rollback) it will complete in ~140 seconds (vs the several hours
it was taking on the first run).

Is it possible that the initial run is changing the statistics available on
the table, hence providing a different query plan (even on the same data)?

Query plan on first/second run follows -

First run (failure):
---------------------------
Update on ad_events e (cost=1426714795.64..1426749795.64 rows=1000000
width=237)
-> Hash Join (cost=1426714795.64..1426749795.64 rows=1000000 width=237)
Hash Cond: (x.ad_event_id = e.ad_event_id)
-> Subquery Scan on x (cost=1426638856.33..1426651356.33
rows=1000000 width=144)
-> Sort (cost=1426638856.33..1426641356.33 rows=1000000
width=108)
Sort Key: s.ad_user_id, s.ad_session_id, t."timestamp"
-> Nested Loop Left Join
(cost=12046.09..1426539198.49 rows=1000000 width=108)
Join Filter: ((t."timestamp" >= s.session_start)
AND (t."timestamp" <= s.session_end) AND ((s.ad_user_id = u.ad_user_id) OR
(s.ad_user_id = u.merged_id)))
-> Hash Join (cost=12046.09..203878.09
rows=1000000 width=92)
Hash Cond: (t.user_id = u.orig_distinct_id)
-> Seq Scan on ad_events_mv t
(cost=0.00..173082.00 rows=1000000 width=112)
Filter: (NOT processed)
-> Hash (cost=7932.15..7932.15
rows=329115 width=67)
-> Seq Scan on ad_users u
(cost=0.00..7932.15 rows=329115 width=67)
-> Materialize (cost=0.00..5478.88 rows=63392
width=32)
-> Seq Scan on ad_sessions s
(cost=0.00..5161.92 rows=63392 width=32)
-> Hash (cost=50938.58..50938.58 rows=2000058 width=101)
-> Seq Scan on ad_events e (cost=0.00..50938.58
rows=2000058 width=101)

Second run (completed in ~148 seconds):
------------------------------
Update on ad_events e (cost=39730372.76..39765372.76 rows=1000000
width=237)
-> Hash Join (cost=39730372.76..39765372.76 rows=1000000 width=237)
Hash Cond: (x.ad_event_id = e.ad_event_id)
-> Subquery Scan on x (cost=39654433.45..39666933.45
rows=1000000 width=144)
-> Sort (cost=39654433.45..39656933.45 rows=1000000
width=108)
Sort Key: s.ad_user_id, s.ad_session_id, t."timestamp"
-> Nested Loop Left Join (cost=12054.20..39554775.61
rows=1000000 width=108)
Join Filter: ((t."timestamp" >= s.session_start)
AND (t."timestamp" <= s.session_end))
-> Hash Join (cost=12046.09..203878.09
rows=1000000 width=92)
Hash Cond: (t.user_id = u.orig_distinct_id)
-> Seq Scan on ad_events_mv t
(cost=0.00..173082.00 rows=1000000 width=112)
Filter: (NOT processed)
-> Hash (cost=7932.15..7932.15
rows=329115 width=67)
-> Seq Scan on ad_users u
(cost=0.00..7932.15 rows=329115 width=67)
-> Bitmap Heap Scan on ad_sessions s
(cost=8.11..39.22 rows=8 width=32)
Recheck Cond: ((ad_user_id = u.ad_user_id)
OR (ad_user_id = u.merged_id))
-> BitmapOr (cost=8.11..8.11 rows=8
width=0)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0)
Index Cond: (ad_user_id =
u.ad_user_id)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0)
Index Cond: (ad_user_id =
u.merged_id)
-> Hash (cost=50938.58..50938.58 rows=2000058 width=101)
-> Seq Scan on ad_events e (cost=0.00..50938.58
rows=2000058 width=101)
---------------

From a quick look, the plans diverge at:
-> Seq Scan on ad_sessions s (cost=0.00..5161.92 rows=63392 width=32)
vs
-> Bitmap Heap Scan on ad_sessions s (cost=8.11..39.22 rows=8 width=32)

I will continue to look further into this in an effort to "fix" the queries
but it would be great if someone could give me insight as to what could be
causing the inconsistency in plan.

Cheers!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alberto Cabello Sánchez 2014-04-16 06:52:25 Re: Trouble installing Slony 2.0
Previous Message Gavin Flower 2014-04-16 01:27:25 Re: Approach to Data Summary and Analysis