Re: postgress 8.1.4 deadlocking??

From: Casey Duncan <casey(at)pandora(dot)com>
To: Rafal Pietrak <rafal(at)zorro(dot)isa-geek(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: postgress 8.1.4 deadlocking??
Date: 2006-09-27 20:09:02
Message-ID: A8F99566-671E-43DD-AA53-145828B7CDA2@pandora.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On Sep 27, 2006, at 12:35 PM, Rafal Pietrak wrote:
> Thenx Duncan for the analysis.
>
> This happend again, so I'm able to peek at the details you've pointed
> out.
>
> On Wed, 2006-09-27 at 09:33 -0700, Casey Duncan wrote:
>> Sounds like it was blocked (unsure by what). You can use pg_locks to
>> check that.
>
> That view currently contais 240 entries .... source of blocking. Now I
> have to dig the ultimate reason why those locks end up there (and are
> not being cleand). Any ideas?

You need to look at the procpid that's holding the lock and then use
pg_stat_activity to see what it is doing. I'm not sure what locks
that query needs to hold, it may be blocked by any concurrent query
from a user in that group, I dunno.

>
> Whould those locks go away by themselves .... in time?

Yeah, once the transactions holding the locks complete.

>
>> What was consuming all the CPU? I assume from what you wrote
>> previously that it is a postgres backend. If so, what was it doing?
>
> ----------------ps output (just one backend)------------
> postgres 4759 0.0 2.0 19368 5292 ? S 21:02 0:00
> postgres: czyn_738773 czyn 127.0.0.1(55431) UPDATE waiting
> -------------------------------------------------------
>
> "czyn" is the name of the database, "czyn_738773" is database
> username.
> IP address is localhost, since the database purpose is web serbvice
> back-end.

If the update is waiting, it is blocked and should not be consuming
any noticeable cpu. This was using lots of CPU at this moment?

>
> Still, I'm not sure what that UPDATE counld that be.
>
>> You can use ps, the pg_stat_activity view and strace to figure
>> that out.
>
> ------------- exerpt from pg_stat_activity-----------------
> 17660 | czyn | 4759 | 19541 | czyn_738773 | <command string not
> enabled> | | 2006-09-27 21:02:05.914728+02 | 127.0.0.1 |
> 55431
> (18 rows)
> ------------------------------------------
>
> But this is not very meaningfull to me.

You need to add this to your postgresql.conf:

stats_command_string = on

Then send a HUP signal to the postmaster (you can use pgctl reload
for this or 'kill -HUP')

Then you can see what the exact command is.

Another option that may be helpful to diagnose is:

log_min_duration_statement = 500

This will output the sql for all commands taking longer than 500ms.
You can adjust this up and down depending on what counts as "slow" in
your app. You probably only want it on temporarily, but I find it
sometimes helpful to always leave it on with a suitably high value
(like 1000). That way you can always troll the logs for laggards.

>
>>> ...another postgress-process restart, and another VACCUM ANALYSE,
>>> this
>>> one completes in 3-5 secs. Now "ALTER GROUP ..." goes OK. And
>>> everything
>>> seams working.
>>
>> Was something else using the database previously? Was something else
>> different this time than the last?
>
> No. My impression was (naive, I know), that not all things were
> correctly cleand-up on the first restart. Now I think, that on the
> second go I was just a little bit more lucky - www service was calm
> for
> a little longer. That was anough for VACUUM to complete.

Yeah, it sounds like things are fairly busy given the locks above.

>
> Still, throuble started later.
>
>> How big is the data on disk? Is it possible that you have queries
>
> close to nothing: c.a. 200MiB

Sounds like the problem is more about concurrency then data size. How
many concurrent connections do you have (select count(*) from
pg_stat_activity)? How many are in transaction and running queries
concurrently?

If you see multiple instances of a given query (or the same one
appearing frequently), try running EXPLAIN ANALYZE on it. That'll
tell you what makes it slow.

>
>> problem (hardware, os, fs). When you saw this happen did you happen
>> to check if it was user or system cpu usage? I had a corrupt
>
> It's the user CPU time.

Yeah, just sounds like a scaling issue. Hopefully easily solved ;^)

>
> As it currently looks like some unknown UPDATE causes the trouble, how
> can I identify what it is?

see config recommendations above.

> 1. the database is not very sophisticated, apart from what I've
> mentioned earlier, it does have just a few stored procedures in
> pgsql -
> nothing fancy. And there is no triggers just specialized views
> (although
> quite a lot of referencial integrity constrains).

These can be a source of lock contention on updates (though this is
much better on 8.1 compared to 7.4). Once you know the update query
that's getting blocked you can know for sure.

> 2. machine is 'from the old days': 450MHz CPU

What is the iowait like on this box? You can use iostat to determine
this. That will tell you if the disks are also struggling.

> 3. the system is currently under 'available-for-public' condition, so
> there may be 'waves' (peeks) of activity.
> 4. when the system was tested in-lab (small number of users - no
> risk /
> not-tested for havier load behavior), I haven't noticed this
> 'blocking'.

With any luck the above will help point out some suboptimal queries
that are causing the headache.

-Casey

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2006-09-27 20:13:34 Re: [GENERAL] Restart after power outage: createdb
Previous Message Karen Hill 2006-09-27 20:05:40 cyclical redundancy checksum algorithm(s)?