Re: pg_cancel_backend() does not work with buzz queries

From: "Sergey Konoplev" <gray(dot)ru(at)gmail(dot)com>
To: "Richard Huxton" <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_cancel_backend() does not work with buzz queries
Date: 2007-10-23 05:56:26
Message-ID: c3a7de1f0710222256x214fa988p36e8f2a57325d6a8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-ru-general

2007/10/17, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>:
> Hello again,
>
> Sorry for the deal with my answer it was realy hectic week so I
> couldn't even check my mail.
>
> 2007/10/3, Richard Huxton <dev(at)archonet(dot)com>:
> > Sergey Konoplev wrote:
> > >> Don't forget to cc: the list.
> > >> Try not to top-post replies, it's easier to read if you reply below the
> > >> text you're replying to.
> > >
> > > Thanx for your advice. I'm just absolutely worned out. Sorry.
> >
> > Know that feeling - let's see if we can't sort this out.
> >
> > >>>> 1. Is it always the same query?
> > >>>> 2. Does the client still think it's connected?
> > >>>> 3. Is that query using up CPU, or just idling?
> > >>>> 4. Anything odd in pg_locks for the problem pid?
> > >>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
> > >>> such situations more often than another but they are called more often
> > >>> also.
> > >> OK, so there's no real pattern. That would suggest it's not a particular
> > >> query-plan that's got something wrong.
> > >>
> > >> Do you always get this problem inside a function?
> > >
> > > As far as I remember I do.
> >
> > Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
> > myself. Are you sure it's not always the same few function(s) that cause
> > this problem?
>
> Yes I'm shure. I've noticed about 10 queries and procedure calls
> buzzing at least.
>
> >
> > >>> 2. The client just waits for query and buzz.
> > >>> 3. They are using CPU in usual way and their pg_lock activity seems normal.
> > >> So the backend that appears "stuck" is still using CPU?
> > >
> > > Yes but the metter is that this procedures usualy use CPU just a
> > > little so I can't find out if there is some oddity or not.
> >
> > OK, so it's not that it's stuck in a loop wasting a lot of CPU
> >
> > >> So - the symptoms are:
> > [snip]
> > > Exactly.
> >
> > So - we need to solve two mysteries
> > 1. Why are these functions not returning?
> > 2. Why does SIGINT not interrupt them?
> >
> > >> Are you happy that your hardware and drivers are OK? There aren't
> > >> problems with any other servers on this machine?
> > >
> > > Yes I'm quite happy. My hardware is: 2 double-core Xeon, 8Gb RAM,
> > > RAID5. What about other software... it's dedicated PG server so I have
> > > no problem with it.
> >
> > Well, the places I'd look would be:
> > 1. Hardware (you're happy that's fine, and it's not quite the problems
> > I'd expect)
> > 2. Drivers (same as #1)
> > 3. Client connectivity (but you say the client is fine)
> > 4. External interactions (see below)
> > 5. Bug in PG extension (pl/python)
>
> I think it's not only lp/python problem cos I saw pl/pgsql and simple
> queries also. For example:
> select * from (
> select d.*, cto.full_name, cast(st.name || ', ' || r.name as
> varchar) as cityname
> from
> drivers d
> join cars_trailers_owners cto on
> d.cars_trailers_owner_id = cto.id
> join settles st on
> d.settle_id = st.id
> join regions r on
> st.region_id = r.id
> order by sname, name, pname
> ) as sq
>
> > 6. Bug in PG core code
> >
> > Do any of your functions interact with the outside world - fetch
> > webpages or similar? It could be they're waiting for that. If you're
> > using a library that could hang waiting for a response and also block
> > SIGINT at the same time that would explain everything.
>
> No, most of them don't.
>
> --
> Regards,
> Sergey Konoplev
>

I've managed to repeat the situation and found out where the problem is.

I run on of the heaviest queries on a client application (Delphi,
psqlodbc.8.01.0101) then without waiting for it (the query run time is
about 5 minutes) I interrupted the client using task manager. There
was backend process still running on my server:

pgdb:/base/PG-Data # ps -ef |awk '/postgres.*konoplev.*SELECT/'
postgres 8590 8073 2 15:46 ? 00:00:36 postgres: konoplev
transport localhost(35442) SELECT
root 8973 7642 0 16:10 pts/0 00:00:00 awk /postgres.*konoplev.*SELECT/
pgdb:/base/PG-Data #

I took a look at TCP state with netstat:

pgdb:/base/PG-Data # netstat -pna |grep 8590
tcp 1 0 127.0.0.1:5432 127.0.0.1:35442
CLOSE_WAIT 8590/postgres: kono
pgdb:/base/PG-Data #

Remote address is 127.0.0.1 because clients connect to postgres
through SSH-tunnel. Then I used strace to make shure that the process
is active and after few times of repeating it I noticed its hunging:

pgdb:/base/PG-Data # strace -dfirtvx -p 8590
Process 8590 attached - interrupt to quit
[wait(0x137f) = 8590]
pid 8590 stopped, [SIGSTOP]
[wait(0x57f) = 8590]
pid 8590 stopped, [SIGTRAP]
0.000000 [ffffe410] send(11,
"\x30\x30\x37\x2d\x31\x30\x2d\x31\x39\x20\x32\x32\x3a\x31"...,
8192, 0

[Output stoped here and after few minutes I interrupted it by pressing Ctrl-C]

cleanup: looking at pid 8590
<unfinished ...>
Process 8590 detached
pgdb:/base/PG-Data #

Next I run gdb and took a look at callstack of the hung process:

pgdb:/base/PG-Data # gdb /opt/PostgreSQL/bin/postgres 8590
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i586-suse-linux"...Using host libthread_db library
"/lib/tls/libthread_db.so.1".

Attaching to program: /opt/PostgreSQL/bin/postgres, process 8590
Reading symbols from /usr/lib/libssl.so.0.9.7...done.
Loaded symbols for /usr/lib/libssl.so.0.9.7
Reading symbols from /usr/lib/libcrypto.so.0.9.7...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.7
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /usr/lib/gconv/KOI8-R.so...done.
Loaded symbols for /usr/lib/gconv/KOI8-R.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/PostgreSQL/lib/postgresql/plpgsql.so...done.
Loaded symbols for /opt/PostgreSQL/lib/postgresql/plpgsql.so
0xffffe410 in ?? ()
(gdb) bt
#0 0xffffe410 in ?? ()
#1 0xbfffd718 in ?? ()
#2 0x082f16c0 in sock_path ()
#3 0xbfffd6f0 in ?? ()
#4 0x40254781 in send () from /lib/tls/libc.so.6
#5 0x0814bdf2 in secure_write ()
#6 0x08151632 in internal_flush ()
#7 0x08151704 in internal_putbytes ()
#8 0x08151782 in pq_putmessage ()
#9 0x08152c6a in pq_endmessage ()
#10 0x0807e5fc in printtup ()
#11 0x08134417 in ExecutorRun ()
#12 0x081b50f4 in PortalRunSelect ()
#13 0x081b6271 in PortalRun ()
#14 0x081b1d96 in exec_simple_query ()
#15 0x081b3552 in PostgresMain ()
#16 0x0818e169 in ServerLoop ()
#17 0x0818ed10 in PostmasterMain ()
#18 0x08153841 in main ()
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/PostgreSQL/bin/postgres, process 8590
pgdb:/base/PG-Data #

I checked the process's TCP stack state again and sent SIGINT to it
(the same as it would be pg_cancel_backend()):

pgdb:/base/PG-Data # netstat -pna |grep 8590
tcp 1 0 127.0.0.1:5432 127.0.0.1:35442
CLOSE_WAIT 8590/postgres: kono
pgdb:/base/PG-Data # kill -INT 8590
pgdb:/base/PG-Data # ps -ef |grep 8590
postgres 8590 8073 0 15:46 ? 00:00:36 postgres: konoplev
transport localhost(35442) SELECT
root 9869 7642 0 17:17 pts/0 00:00:00 grep 8590

So it was alive.

The only solution that allows me to finish it without using SIGQUIT or
SIGKILL was killing an oposit SSH process. I got the process's PID and
killed it using SIGHUP.

pgdb:/base/PG-Data # netstat -pna |grep 35442
tcp 1 131072 127.0.0.1:5432 127.0.0.1:35442
CLOSE_WAIT 8590/postgres: kono
tcp 65536 0 127.0.0.1:35442 127.0.0.1:5432
FIN_WAIT2 7944/sshd: dcsshcli
pgdb:/base/PG-Data # kill -HUP 7944
pgdb:/base/PG-Data # ps -ef |grep 8590
root 9951 7642 0 17:24 pts/0 00:00:00 grep 8590

So there was no hunging postgress backend.

I'm going to write a shell script for periodical checking hunging
processes with stack state CLOSE_WAIT and killing theirs oposit SSH
processes after some time but I would be pleased if somebody could
advise me better solution.

Thank you.

--
Regards,
Sergey Konoplev

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ow Mun Heng 2007-10-23 06:02:14 Determine query run-time from pg_* tables
Previous Message Warren 2007-10-23 03:37:14 pg_dump auto login

Browse pgsql-ru-general by date

  From Date Subject
Next Message Sergey Konoplev 2007-10-23 06:08:49 Re: Fwd: pg_cancel_backend() не снимает зависшие транзакции
Previous Message Sergey Konoplev 2007-10-17 11:53:25 Re: Fwd: pg_cancel_backend() не снимает зависшие транзакции