Odd query behavior [urgent, but long]

From: Steve Lane <slane(at)soliantconsulting(dot)com>
To: <pgsql-admin(at)postgresql(dot)org>
Cc: Jamie Thomas <jthomas(at)soliantconsulting(dot)com>
Subject: Odd query behavior [urgent, but long]
Date: 2004-08-25 17:06:42
Message-ID: BD523152.1A629%slane@soliantconsulting.com
Lists: pgsql-admin

Hello all:

I'm seeing some very odd query behavior on postgres 7.1.3. I know that's way
out of date, and I do have a plan in place to upgrade, but any immediate
help will be, well, very helpful.

I have a server running two instances of postgres: 7.1.3 and 7.4.3. Each
supports a different web application -- the web apps are served from a
separate middleware server. Formerly both web apps ran on a single
all-in-one server, on postgres 7.1.3.

Since reconfiguring the servers (one web, one database) and deploying on the
two-postgres config, performance of the web app running on the older pg is
terrible, despite the new db-only machine being superficially "better" (in
processor and RAM) than the original.

I have enabled pretty extensive query logging on the 7.1.3 instance. What I
see is that certain queries take hideously long.

What is odd is that these are perfectly well-optimized queries: they
represent updates to a table of web application sessions. Each user's
session is written once per page view, so the query runs very frequently.

Now, this session update is not, I think, the only query that's taking a long
time. But there are so many of these queries that they dominate the slow-query
list (which I'm compiling somewhat laboriously by running the huge logs
through an awk script).

Here's an example EXPLAIN:

datapoint=# explain UPDATE nsse_session SET expiration = 1093541296, value =
'sessIdUser|s:5:\"31991\";sessPriv|s:1:\"u\";sessCustId|s:5:\"14688\";sessSchoolId|s:5:\"14781\";sessUserName|s:6:\"gcruse\";sessImgVersion|i:2;sessAccessPrivs|N;sessAccessSchools|a:1:{i:14781;s:5:\"14781\";}sessSort|a:0:{}!sessBack|sessDPVersion|s:4:\"full\";sessDataSource|s:4:\"live\";sessUserMeasures|s:14:\"1948,1913,1703\";sessUserGroups|s:84:\"3074,3113,3144,3179,3223,3263,5968,5997,6011,6027,6035,6046,6063,6076,6087,6105,6116\";!sessGroupFilter|'
WHERE id_session = '955af0272896baa67956494dcb30a5fe' AND expiration > 1093441296;
NOTICE: QUERY PLAN:

Index Scan using nsse_session_pkey on nsse_session (cost=0.00..8.16 rows=3 width=50)

EXPLAIN

So, an index scan, as it should be.

When I check the postgres log for the query stats, I find this:

QUERY STATISTICS
! system usage stats:
! 0.012477 elapsed 0.000000 user 0.000000 system sec
! [0.070000 user 0.000000 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 3/1 [1044/309] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
! Shared blocks: 0 read, 0 written, buffer hit rate = 100.00%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written
QUERY STATISTICS
! system usage stats:
! 1878.708691 elapsed 194.000000 user 1.180000 system sec
! [194.020000 user 1.190000 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 36441/3233 [37108/3496] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
! Shared blocks: 16863 read, 125 written, buffer hit rate = 99.99%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

Two sets of stats: the first looks OK, and the second is terrible. Virtually
all of the long queries I've seen so far display this behavior -- a double set
of stats, with the second one looking like bad news. The second set shows a
very elevated level of page faults. (What are these page faults against? They
don't seem to be against the pg shared buffers -- are they against the OS
disk cache?)

I did hear a suggestion that a few very expensive queries might be blocking
others (if they're writing), and that I'm seeing the effects mostly on these
session updates because they're so frequent. I guess that's possible, but I'd
like to know more about what the above stats might mean.
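
If blocking really is the culprit, I assume I could confirm it by looking at
the lock tables while the slowdown is happening. A rough sketch of what I have
in mind -- pg_locks is, as far as I know, a 7.2-and-later view, so this may
only be usable on the 7.4.3 instance or after the upgrade:

    -- list lock requests that have not been granted, i.e. backends stuck waiting
    -- (pg_locks may not exist on 7.1.3 -- this is only a sketch)
    SELECT pid, relation, mode, granted
    FROM pg_locks
    WHERE NOT granted;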

Some of these session writes are fairly heavy -- we carry a lot of data in
the session, often tens of kilobytes of text going into a single column.
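
Given how often those rows are rewritten, I'd also like to see how large the
session table and its primary-key index have actually grown. Something like
this should give a rough idea (relpages/reltuples are only as fresh as the
last VACUUM/ANALYZE, and the LIKE pattern is just my guess at catching the
index too):

    -- rough size check: relpages is in 8 kB blocks
    SELECT relname, relpages, reltuples
    FROM pg_class
    WHERE relname LIKE 'nsse_session%';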

Is this a statistics problem, or a symptom of insufficient vacuuming?
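
If it is vacuuming, my plan would be roughly the following, run off-hours
since, as I understand it, 7.1's VACUUM takes an exclusive lock on the table
(and the REINDEX step only if it's available and safe on 7.1.3 -- I'd welcome
confirmation):

    -- reclaim dead rows and refresh planner stats on the session table
    VACUUM VERBOSE ANALYZE nsse_session;
    -- rebuild the primary-key index if it has bloated along with the table
    REINDEX TABLE nsse_session;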

Relevant params: postgres 7.1.3 running on Red Hat ES 2.1. SHMMAX is set to
800000000 (physical RAM is 3 GB). Shared buffers for this install = 20000,
sort_mem is 8 MB, max_connections = 50. (I don't think it's pure concurrent
load; it happened during a user training session with only about 30 users.)
The other pg install on the box is similarly configured. Just to complete
the data dump, here's the output of ipcs:

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x0052e6a9 3538944 postgres 600 167649280 2
0x0052e2c1 3473409 postgres 600 170762240 2

------ Semaphore Arrays --------
key semid owner perms nsems status
0x0052e6a9 11567104 postgres 600 17
0x0052e6aa 11599873 postgres 600 17
0x0052e6ab 11632642 postgres 600 17
0x0052e6ac 11665411 postgres 600 17
0x0052e2c1 10878980 postgres 600 17
0x0052e2c2 10911749 postgres 600 17
0x0052e2c3 10944518 postgres 600 17
0x0052e2c4 10977295 postgres 600 17
0x0052e2c5 11010064 postgres 600 17
0x0052e2c6 11042835 postgres 600 17
0x0052e2c7 11075604 postgres 600 17
0x0052e2c8 11108373 postgres 600 17
0x0052e2c9 11141142 postgres 600 17
0x0052e2ca 11173911 postgres 600 17
0x0052e2cb 11206680 postgres 600 17
0x0052e2cc 11239449 postgres 600 17
0x0052e2cd 11272218 postgres 600 17
0x0052e2ce 11304987 postgres 600 17
0x0052e2cf 11337756 postgres 600 17
0x0052e2d0 11370525 postgres 600 17
0x0052e2d1 11403294 postgres 600 17

------ Message Queues --------
key msqid owner perms used-bytes messages
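
As a sanity check on those segment sizes: 20000 shared buffers x 8 kB is about
164 MB, so the ~167.6 MB and ~170.8 MB segments above look like shared_buffers
plus a few MB of other shared structures, which seems about right to me
(corrections welcome).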

Sorry, I know that's a ton of data, but the problem is urgent and I'm very
perplexed.

For comparison, on the original unified server I had max_connections of 256
(as opposed to 50 on the new one -- could this be the problem?), sort_mem of
6000, and shared buffers of 20000. (Overall, shared buffers have doubled on
the new machine because I've devoted that much to each instance.)
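
To rule out a config mismatch, I'll also double-check the running values on
each instance with SHOW (on 7.1 these come back as NOTICE messages rather than
a result set, if I remember right):

    SHOW shared_buffers;
    SHOW sort_mem;
    SHOW max_connections;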

-- sgl
