Slow queries / commits, mis-configuration or hardware issues?

From: Cody Caughlan <toolbag(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Slow queries / commits, mis-configuration or hardware issues?
Date: 2011-11-14 18:16:46
Message-ID: CAPVp=gYs7uYF_hy0MDBkpRCxNbWQHPSyP6sXO-6SJZ_TvaU=XQ@mail.gmail.com
Lists: pgsql-performance

Hi, we're running Postgres 9.1.1 on an EC2 m1.xlarge instance. The machine is a
dedicated master with 2 streaming replication nodes.

The machine has 16GB of RAM and 4 cores.

We're starting to see slow queries, and in particular slow COMMITs, with
increasing frequency. The slow queries are against seemingly well-indexed
tables.

I have log_min_duration_statement = 150ms

Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT

These slow COMMITs are against tables that receive a large number of
UPDATEs and are growing fairly rapidly.
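
For what it's worth, a rough way to quantify that churn (just a sketch against
the standard stats views, nothing specific to our schema) would be something
like:

-- Sketch: per-table update volume and dead tuples, to see how hard
-- autovacuum has to work on the heavily UPDATEd tables.
SELECT relname, n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;

A high n_dead_tup relative to n_live_tup, or a stale last_autovacuum, would
suggest autovacuum isn't keeping up.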

And slow queries like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG: duration: 1694.456 ms statement: SELECT "facebook_wall_posts".* FROM "facebook_wall_posts" WHERE "facebook_wall_posts"."token" = '984c44e75975b224b38197cf8f8fc76a' LIMIT 1

Query plan: http://explain.depesz.com/s/wbm
Number of rows in facebook_wall_posts is 5841

Structure of facebook_wall_posts:
Table "public.facebook_wall_posts"
Column | Type |
Modifiers
--------------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default
nextval('facebook_wall_posts_id_seq'::regclass)
album_id | integer | not null
token | character varying(32) | not null
fb_recipient_id | character varying(64) | not null
post_id | character varying(100) | not null
fb_post_created_at | timestamp without time zone |
data | text |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
fb_post_deleted_at | timestamp without time zone |
Indexes:
"facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
"index_facebook_wall_posts_on_token" UNIQUE, btree (token)
"index_facebook_wall_posts_on_album_id" btree (album_id)

And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG: duration: 979.740 ms statement: SELECT "facebook_friends"."friend_id" FROM "facebook_friends" WHERE "facebook_friends"."user_id" = 9134671

Query plan: http://explain.depesz.com/s/x1q
Number of rows in facebook_friends is 27075088

Structure of facebook_friends:
Table "public.facebook_friends"
Column | Type |
Modifiers
------------+-----------------------------+---------------------------------------------------------------
id | integer | not null default
nextval('facebook_friends_id_seq'::regclass)
user_id | integer | not null
friend_id | integer | not null
created_at | timestamp without time zone |
Indexes:
"facebook_friends_pkey" PRIMARY KEY, btree (id)
"index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree
(user_id, friend_id)
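
Since this table has ~27M rows and won't fit entirely in shared_buffers, a
cache hit/read breakdown might also be telling. A sketch against the standard
statio view:

-- Sketch: heap and index block hits vs. reads for facebook_friends,
-- to gauge how often queries on this table actually go to disk.
SELECT relname,
       heap_blks_read, heap_blks_hit,
       idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
WHERE relname = 'facebook_friends';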

We have autovacuum enabled and running, but yesterday I also ran a manual
VACUUM on the database. Autovacuum settings:

 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_analyze_threshold    | 50
 autovacuum_freeze_max_age       | 200000000
 autovacuum_max_workers          | 3
 autovacuum_naptime              | 60
 autovacuum_vacuum_cost_delay    | 20
 autovacuum_vacuum_cost_limit    | -1
 autovacuum_vacuum_scale_factor  | 0.2
 autovacuum_vacuum_threshold     | 50
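
To double-check that autovacuum is actually keeping up under those settings,
something like this (a sketch, nothing schema-specific) shows when each table
was last vacuumed/analyzed and how often:

-- Sketch: last vacuum/analyze times and counts per table, manual and automatic.
SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST;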

Other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
effective_cache_size = 10GB
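
If it's useful, the values the running server actually sees are easy to
re-confirm from a live session, e.g. with a sketch like:

-- Sketch: effective values and their source for the settings listed above.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'wal_buffers', 'checkpoint_segments',
               'checkpoint_completion_target', 'work_mem',
               'maintenance_work_mem', 'effective_cache_size');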

The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak)
drives. I ran some dd tests and received the following output:

--- WRITING ---
root(at)sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real 11m52.199s
user 0m2.720s
sys 0m45.330s

--- READING ---
root(at)sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real 2m35.434s
user 0m2.400s
sys 0m33.160s
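
Since the slow statements are mostly COMMITs, sequential dd throughput may not
tell the whole story; a crude in-database sketch of WAL flush latency
(throwaway table, run under \timing in psql) would be something like:

-- Sketch: with synchronous_commit on (the default), each COMMIT forces a WAL
-- flush, so its elapsed time under \timing roughly reflects fsync latency on
-- the WAL volume.
CREATE TABLE commit_latency_test (id int);
BEGIN;
INSERT INTO commit_latency_test VALUES (1);
COMMIT;
DROP TABLE commit_latency_test;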

I have enabled log_checkpoints and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
2011-11-14 17:44:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:48:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 14159 buffers (3.1%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=269.818 s, sync=2.119 s, total=271.948 s; sync files=71, longest=1.192 s, average=0.029 s
2011-11-14 17:49:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:53:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 11337 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=269.901 s, sync=2.508 s, total=272.419 s; sync files=71, longest=1.867 s, average=0.035 s
2011-11-14 17:54:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:58:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15706 buffers (3.4%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=270.104 s, sync=3.612 s, total=273.727 s; sync files=67, longest=3.051 s, average=0.053 s
2011-11-14 17:59:15 UTC pid:3965 (-0) LOG: checkpoint starting: time

I've been collecting random samples from pg_stat_bgwriter:
https://gist.github.com/4faec2ca9a79ede281e1
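
The samples are just selects from pg_stat_bgwriter, roughly along these lines:

-- Sketch: bgwriter/checkpoint counters being sampled; buffers_backend_fsync
-- climbing over time would point at backends having to do their own fsyncs,
-- which is an I/O pressure smell.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend,
       buffers_backend_fsync, buffers_alloc
FROM pg_stat_bgwriter;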

So given all this information (if you need more just let me know), is
there something fundamentally wrong or mis-configured? Do I have an
I/O issue?

Thanks for any insight.
