Query hanging/not finishing inconsistently

From: "Meetesh Karia" <meetesh(dot)karia(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: "Jean-David Dahan" <jd(dot)dahan(at)gmail(dot)com>
Subject: Query hanging/not finishing inconsistently
Date: 2006-05-22 17:20:09
Message-ID: fc5b04ca0605221020k12c503cfx137980624e78b75@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

We've recently started having a problem where a query that normally executes
in ~15ms starts to take upwards of 20s to complete. When the connection
that ran query is returned to the connection pool, it appears as though a
transaction is still in progress so the connection pool tries to cancel the
transaction and close the connection. This fails and the connection is
removed from the connection pool. At this point, the situation rapidly
degrades and we run out of connections to the postgres server.

An inspection of the pg_stat_activity table shows that practically every
connection is running the above-mentioned query and some of those queries
have been active for many minutes! We've looked at the pg_locks table as
well and the only exclusive locks are on transactions that are open. All
other locks are AccessShareLocks. Also, as far as we can tell (from looking
at the Hibernate stats), every db session that is opened is closed.

When this happens, if I kill one of the running postgres processes (just by
picking the last process returned from "ps -ef | grep postgres"), the other
queries will immediately finish and the system will respond. However,
within 15 minutes, we'll be back in the same state as before. At that
point, I've cycled Apache, Tomcat and Postgres and the system then seems to
come back.

This problem appears to be unrelated to load and in fact, the majority of
the time there is very little load on the site when this occurs. We've run
load tests on our dev boxes but we've been unable to reproduce the problem.
We're currently working on playing back the clicks on the site previous to
the weird state the site gets in and at the same time, we were wondering if
anyone has experienced a problem like this or has any suggestions.

The query in question is:

select distinct s.screening_id, f.film_id, f.title, s.period_start,
f.runtime, c.value, v.short_name, s.parent_id,
stats.avg_rating, coalesce(stats.num_adds, 0) as num_adds, coalesce(
stats.unique_visits, 0) as unique_visits,
f.*, s.*
from lte_screening s
inner join lte_film f on s.film_id = f.film_id
inner join lte_venue v on s.venue_id = v.venue_id
inner join lte_film_classifier c on c.film_id = f.film_id
left join lte_film_stats stats on stats.context = :context and
stats.film_id = s.film_id
where c.name = ? and s.period_start is not null and s.festival_id = ?
and s.period_start between ? + ? and ? + ?
order by s.period_start, f.title;

And the result of explain analyze:

QUERY PLAN
Unique (cost=1117.42..1118.71 rows=11 width=866) (actual time=
18.306..18.386 rows=15 loops=1)
-> Sort (cost=1117.42..1117.44 rows=11 width=866) (actual time=
18.300..18.316 rows=15 loops=1)
Sort Key: s.period_start, f.title, s.screening_id, f.film_id,
f.runtime, c.value, v.short_name, s.parent_id, stats.avg_rating, COALESCE(
stats.num_adds, 0), COALESCE(stats.unique_visits, 0::bigint), f.film_id,
f.sku, f.title, f."template", f.release_date, f.runtime, f."language",
f.country, f.mpaa_rating, f.synopsis, f."owner", f.ext_sales_rank,
f.small_image_url, f.medium_image_url, f.large_image_url, f.detail_page,
f.to_delete, f.coalesce_to, (subplan), (subplan), s.screening_id,
s.period_start, s.period_end, s.ticket_price, s.tickets_avail,
s.tickets_sold, s."type", s.venue_id, s.festival_id, s.film_id, s.parent_id,
s.ext_id, s.purchase_url, s.status, s.status_update_time
-> Nested Loop Left Join (cost=2.62..1117.23 rows=11 width=866)
(actual time=2.656..17.773 rows=15 loops=1)
-> Nested Loop (cost=2.62..976.00 rows=11 width=846) (actual
time=2.347..16.162 rows=15 loops=1)
-> Hash Join (cost=2.62..929.09 rows=10 width=831)
(actual time=2.217..15.480 rows=15 loops=1)
Hash Cond: ("outer".venue_id = "inner".venue_id)
-> Nested Loop (cost=0.00..926.32 rows=10
width=818) (actual time=1.915..14.974 rows=15 loops=1)
-> Seq Scan on lte_screening s (cost=
0.00..886.67 rows=10 width=159) (actual time=1.830..14.314 rows=15 loops=1)
Filter: ((period_start IS NOT NULL)
AND (festival_id = 316372) AND (period_start >= '2006-05-19
05:00:00'::timestamp without time zone) AND (period_start <= '2006-05-20
04:59:59'::timestamp without time zone))
-> Index Scan using lte_film_pkey on
lte_film f (cost=0.00..3.95 rows=1 width=659) (actual
time=0.026..0.028rows=1 loops=15)
Index Cond: ("outer".film_id =
f.film_id)
-> Hash (cost=2.50..2.50 rows=50 width=21)
(actual time=0.215..0.215 rows=0 loops=1)
-> Seq Scan on lte_venue v (cost=
0.00..2.50 rows=50 width=21) (actual time=0.012..0.126 rows=52 loops=1)
-> Index Scan using idx_classifier_film on
lte_film_classifier c (cost=0.00..4.67 rows=2 width=23) (actual time=
0.026..0.028 rows=1 loops=15)
Index Cond: (c.film_id = "outer".film_id)
Filter: ((name)::text = 'FestivalCategory'::text)
-> Index Scan using lte_film_stats_pkey on lte_film_stats
stats (cost=0.00..4.34 rows=1 width=28) (actual time=0.034..0.037 rows=1
loops=15)
Index Cond: ((stats.context = 316372) AND
(stats.film_id= "outer".film_id))
SubPlan
-> Index Scan using idx_collateral_film on
lte_film_collateral c (cost=0.00..4.24 rows=1 width=40) (actual time=
0.009..0.011 rows=1 loops=15)
Index Cond: (film_id = $0)
Filter: ((name)::text = 'TVRating'::text)
-> Index Scan using idx_collateral_film on
lte_film_collateral c (cost=0.00..4.24 rows=1 width=40) (actual time=
0.022..0.025 rows=1 loops=15)
Index Cond: (film_id = $0)
Filter: ((name)::text = 'IMDBId'::text)
Total runtime: 19.077 ms

Here is our setup:

We have 2 machines. The first is the web server and the db server and the
second is just another web server:

Machine A
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz HyperThreaded Processor
- CentOS 4.3
- Linux moe 2.6.9-22.ELsmp #1 SMP Sat Oct 8 19:11:43 CDT 2005 i686 i686 i386
GNU/Linux

Machine B
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz Processor
- CentOS 4.3
- Linux larry 2.6.9-22.0.1.EL #1 Thu Oct 27 12:26:11 CDT 2005 i686 i686 i386
GNU/Linux

We're using the following software:
- Apache 2.0.52
- Tomcat 5.5.17
- Postgres 8.0.6
- JDK 1.5.0-Release 6
- Proxool 0.8.3
- Hibernate 3.1.3

Thanks in advance for any help,
Meetesh

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Craig A. James 2006-05-22 18:49:39 Re: Query hanging/not finishing inconsistently
Previous Message Markus Schaber 2006-05-22 15:30:29 Re: How can I make this query faster (resend)