Re: Non-deterministic 100% CPU hang on postgres 9.3

From: Fenn Bailey <fenn(dot)bailey(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Non-deterministic 100% CPU hang on postgres 9.3
Date: 2014-04-21 03:19:44
Message-ID: CAKnSMKmXrbHkSkFhbszf9FbG8MDvpFbbHXfPjjgZLaxxK23CDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

I was wondering if anyone had any further insight as to why the query plan
would have changed so dramatically between queries, even when isolated by a
transaction?

Is it possible for database statistics to change after a query has been run
(independent of transactions)?

Thanks in advance,

Fenn.

On Wed, Apr 16, 2014 at 2:25 PM, Fenn Bailey <fenn(dot)bailey(at)gmail(dot)com> wrote:

> Some more information on this:
>
>>
>> > I say "hang", where really I've given up after ~12 hours execution. The
>> > exact same query can then be terminated and run in <90 seconds, with
>> none
>> > of the underlying data changing.
>>
>> But does the plan change?
>>
>>
> As it turns out it does. Some further examination shows that the plan
> changes, at least partially due to the fact that I now see this query
> attempts to modify the data that it's querying (ie: if any updates have
> occurred, the subsequent query is not the same).
>
> That said, all queries were performed in a transaction, so when they're
> aborted, no modification should be made to the underlying data.
>
> This makes it even more odd that on a subsequent run of the same query
> (after the rollback) it will complete in ~140 seconds (vs the several hours
> it was taking on the first run).
>
> Is it possible that the initial run is changing the statistics available
> on the table, hence providing a different query plan (even on the same
> data)?
>
> Query plan on first/second run follows -
>
> First run (failure):
> ---------------------------
> Update on ad_events e (cost=1426714795.64..1426749795.64 rows=1000000
> width=237)
> -> Hash Join (cost=1426714795.64..1426749795.64 rows=1000000
> width=237)
> Hash Cond: (x.ad_event_id = e.ad_event_id)
> -> Subquery Scan on x (cost=1426638856.33..1426651356.33
> rows=1000000 width=144)
> -> Sort (cost=1426638856.33..1426641356.33 rows=1000000
> width=108)
> Sort Key: s.ad_user_id, s.ad_session_id,
> t."timestamp"
> -> Nested Loop Left Join
> (cost=12046.09..1426539198.49 rows=1000000 width=108)
> Join Filter: ((t."timestamp" >=
> s.session_start) AND (t."timestamp" <= s.session_end) AND ((s.ad_user_id =
> u.ad_user_id) OR (s.ad_user_id = u.merged_id)))
> -> Hash Join (cost=12046.09..203878.09
> rows=1000000 width=92)
> Hash Cond: (t.user_id =
> u.orig_distinct_id)
> -> Seq Scan on ad_events_mv t
> (cost=0.00..173082.00 rows=1000000 width=112)
> Filter: (NOT processed)
> -> Hash (cost=7932.15..7932.15
> rows=329115 width=67)
> -> Seq Scan on ad_users u
> (cost=0.00..7932.15 rows=329115 width=67)
> -> Materialize (cost=0.00..5478.88 rows=63392
> width=32)
> -> Seq Scan on ad_sessions s
> (cost=0.00..5161.92 rows=63392 width=32)
> -> Hash (cost=50938.58..50938.58 rows=2000058 width=101)
> -> Seq Scan on ad_events e (cost=0.00..50938.58
> rows=2000058 width=101)
>
> Second run (completed in ~148 seconds):
> ------------------------------
> Update on ad_events e (cost=39730372.76..39765372.76 rows=1000000
> width=237)
> -> Hash Join (cost=39730372.76..39765372.76 rows=1000000 width=237)
> Hash Cond: (x.ad_event_id = e.ad_event_id)
> -> Subquery Scan on x (cost=39654433.45..39666933.45
> rows=1000000 width=144)
> -> Sort (cost=39654433.45..39656933.45 rows=1000000
> width=108)
> Sort Key: s.ad_user_id, s.ad_session_id, t."timestamp"
> -> Nested Loop Left Join
> (cost=12054.20..39554775.61 rows=1000000 width=108)
> Join Filter: ((t."timestamp" >=
> s.session_start) AND (t."timestamp" <= s.session_end))
> -> Hash Join (cost=12046.09..203878.09
> rows=1000000 width=92)
> Hash Cond: (t.user_id =
> u.orig_distinct_id)
> -> Seq Scan on ad_events_mv t
> (cost=0.00..173082.00 rows=1000000 width=112)
> Filter: (NOT processed)
> -> Hash (cost=7932.15..7932.15
> rows=329115 width=67)
> -> Seq Scan on ad_users u
> (cost=0.00..7932.15 rows=329115 width=67)
> -> Bitmap Heap Scan on ad_sessions s
> (cost=8.11..39.22 rows=8 width=32)
> Recheck Cond: ((ad_user_id =
> u.ad_user_id) OR (ad_user_id = u.merged_id))
> -> BitmapOr (cost=8.11..8.11 rows=8
> width=0)
> -> Bitmap Index Scan on
> ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0)
> Index Cond: (ad_user_id =
> u.ad_user_id)
> -> Bitmap Index Scan on
> ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0)
> Index Cond: (ad_user_id =
> u.merged_id)
> -> Hash (cost=50938.58..50938.58 rows=2000058 width=101)
> -> Seq Scan on ad_events e (cost=0.00..50938.58
> rows=2000058 width=101)
> ---------------
>
> From a quick look, the plans diverge at:
> -> Seq Scan on ad_sessions s (cost=0.00..5161.92 rows=63392 width=32)
> vs
> -> Bitmap Heap Scan on ad_sessions s (cost=8.11..39.22 rows=8 width=32)
>
> I will continue to look further into this in an effort to "fix" the
> queries but it would be great if someone could give me insight as to what
> could be causing the inconsistency in plan.
>
> Cheers!
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Rajeev rastogi 2014-04-21 03:29:30 Re: Disable an index temporarily
Previous Message Rob Sargentg 2014-04-21 02:02:51 Re: Altering array(composite-types) without breaking code when inserting them and similar questions