Re: Help: 8.0.3 Vacuum of an empty table never completes ...

From: James Robinson <jlrobins(at)socialserve(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hackers Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Help: 8.0.3 Vacuum of an empty table never completes ...
Date: 2005-11-28 21:00:38
Message-ID: A5BE3126-1A1A-47B9-BCDE-F7E39FAAD7E9@socialserve.com
Lists: pgsql-hackers


On Nov 28, 2005, at 1:46 PM, Tom Lane wrote:

> James Robinson <jlrobins(at)socialserve(dot)com> writes:
>> backtrace of the sshd doesn't look good:
>
> Stripped executable :-( ... you won't get much info there. What of
> the client at the far end of the ssh connection? You should probably
> assume that the blockage is there, rather than in a commonly used bit
> of software like ssh.

Ok: cron fired off a bash running our script, which performs the backup + vacuuming from the backup box side, and that script was at the point of driving vacuumdb:

sscadmin 20612 20610 0 Nov25 ? 00:00:00 bash /usr/local/bin/db_backup.sh
sscadmin 20622 20612 0 Nov25 ? 00:00:00 ssh -c blowfish [ ssh identity file + host edited out ] /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose

[ yes, verbose vacuum. Who knew that'd be the camel-breaking straw ??! ]

The line in the script invoking the ssh'd vacuumdb is:
--
# Vacuum all databases, storing log results.
$SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose >& $DATE/vacuum.log
--
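
For what it's worth, if we were to keep the remote invocation, a watchdog around the ssh would at least keep a wedged client from hanging the whole backup run indefinitely. A rough, untested sketch (the six-hour limit is an arbitrary assumption on our part):

--
# Run the remote vacuum in the background and kill it ourselves if it
# hasn't finished within six hours, rather than waiting on it forever.
$SSHCMD $DBHOST /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze --verbose \
    >& $DATE/vacuum.log &
SSH_PID=$!
( sleep 21600 && kill $SSH_PID 2>/dev/null ) &
WATCHDOG_PID=$!
wait $SSH_PID
kill $WATCHDOG_PID 2>/dev/null
--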

Unfortunately the dir holding that date + hour's vacuum.log was swept
away by the next day's activities.

The stuck bash is backtraceable:

(gdb) bt
#0 0xffffe410 in ?? ()
#1 0xbffff928 in ?? ()
#2 0x00000000 in ?? ()
#3 0xbffff918 in ?? ()
#4 0xb7ed1513 in __waitpid_nocancel () from /lib/tls/libc.so.6
#5 0x080935bf in default_tty_job_signals ()
#6 0x080949ca in wait_for ()
#7 0x0808acd7 in execute_command_internal ()
#8 0x0808a4f0 in execute_command ()
#9 0x0808241d in reader_loop ()
#10 0x08081364 in main ()
(gdb)

Nothing unexpected there. Funny, though: file reports /bin/bash as being stripped (as it does for /usr/bin/ssh and /usr/sbin/sshd), yet I could get far better debugging support from it. Could the debugging trouble be ssh / sshd's apparent multithreadedness:

gdb /usr/bin/ssh
GNU gdb 6.2.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i586-suse-linux"...(no debugging symbols
found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run localhost
Starting program: /usr/bin/ssh localhost
(no debugging symbols found)... [ repeated many times over; edited for brevity ]
[Thread debugging using libthread_db enabled]
[New Thread 1078408704 (LWP 29932)]
(no debugging symbols found)... [ more of the same edited out ]
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is f4:cd:bc:37:d7:08:bc:4f:04:91:45:9b:44:cf:d5:b9.
Are you sure you want to continue connecting (yes/no)?
Program received signal SIGINT, Interrupt.
[Switching to Thread 1078408704 (LWP 29932)]
0xffffe410 in ?? ()
(gdb) bt
#0 0xffffe410 in ?? ()
#1 0xbfffb9e8 in ?? ()
#2 0x00000001 in ?? ()
#3 0xbfffb3af in ?? ()
#4 0x402f94b3 in __read_nocancel () from /lib/tls/libc.so.6
#5 0x080742e1 in mkstemp64 ()
#6 0x080684c8 in error ()
#7 0x08054e91 in ?? ()
#8 0xbfffcac0 in ?? ()
#9 0x00000001 in ?? ()
#10 0x08096230 in ?? ()
#11 0xbfffcac0 in ?? ()
....

I know you have far better things to do than teach someone how to use gdb on multithreaded programs, but could a proper backtrace be salvageable on the ssh client? If you really care, that is; otherwise I'm off to kill that vacuumdb client.
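
(From what I can tell of the gdb docs, something along these lines ought to dump every thread's stack rather than just the one gdb happens to land on -- untested on this box, though, and 20622 is the stuck ssh client's pid from the ps listing above:

gdb /usr/bin/ssh 20622
(gdb) info threads
(gdb) thread apply all bt
)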

>
>> At this moment in time, should we kill off the offending processes
>> from Nov 25 -- starting from client-most side all the way to the
>> vacuumdb process on the production server. The other vacuums would
>> probably then complete happily, and we'd be cool again, eh?
>
> If you just want to get out of it, killing the vacuumdb should be the
> least dangerous way to get out of the problem. I'd suggest taking a
> little more time to try to find out what's stuck though.

Given that the other culprits in play are bash running a straightforward shell-script line, with output redirected to a simple file on a non-full filesystem, I'm leaning towards the odds that something in the sshd + tcp/ip + ssh client portion of things went crazy. That's simply where more of the complexity lives -- bash is doing nothing but waiting for the ssh client to finish, and its stdout / stderr were redirected to the filesystem before the bash child exec'd ssh. Even if the filesystem had somehow filled up during the run, one would expect the ssh client's writes to the file to error out rather than block. Funky.
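
If it happens again, before reaching for gdb we could at least see which kernel call each process is parked in, and what its file descriptors point at. A quick sketch, using the pids from the ps listing above (Linux /proc assumed):

--
for pid in 20612 20622; do
    echo "== $pid =="
    cat /proc/$pid/wchan; echo      # kernel function the process is sleeping in
    ls -l /proc/$pid/fd             # where stdout / stderr / sockets point
done
--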

>
>> I suppose we're darn lucky the process got ultimately gummed up on a
>> table that sees no traffic at all to it, eh? The lock that vacuum has
>> taken out on it would prevent at least some things happening to the
>> table in question -- possibly even new inserts or updates?
>
> No, since it's a lazy vacuum it's not going to block either reads or
> writes. Just other vacuums and DDL changes on the table.

Good to know.

>
>> Could this potentially be alleviated in the future by a little code
>> reordering in vacuumdb or postmaster by completing working on the
>> current table completely before emitting output,
>
> Wouldn't help. I would imagine that by the time we got to this state,
> the backend side is quite a few tables past the point where the end
> client stopped accepting output. You'd normally expect there to be
> several Kb worth of buffering in the network channel, and the backend
> isn't going to be blocked till that's *all* used up.

But when it finally does get blocked up, albeit well after the fact, it at least wouldn't be holding any postgres-level locks. Since this was a lazy vacuum, only other lazy vacuums or DDL on the table would have been blocked anyway; had some poor chap been running a full vacuum instead, it could have been considerably worse.
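
As an aside, Tom's buffering point is easy to reproduce at the shell, for anyone following along: the writer below (standing in for the verbose backend) runs merrily ahead until the kernel's pipe buffer fills, then blocks in write(), since the reader (standing in for our wedged ssh) never drains it. Ctrl-C to escape:

--
( i=0; while :; do echo "fake VACUUM VERBOSE line $i"
      echo "wrote line $i" >&2; i=$((i+1)); done ) | sleep 3600
--

The progress notes on stderr stop after a few tens of Kb worth of lines -- exactly the after-the-fact blockage described above.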

>
> BTW, I suppose this was a VACUUM VERBOSE? Without the verbose output,
> you'd likely not have seen any problem ...

Yep -- see above.

I do believe we've agreed internally to separate the vacuuming bit from the backup bit, and to have the vacuuming driven by a separate cron'd script on the primary database box itself, simplifying the system for this case. And once we're on 8.1 we'll move to the autovacuum service, obliterating the need altogether. Either way, our vacuuming will be happening in a globally far simpler system -- no remote-host networking involved at all.
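
(Concretely, the crontab entry on the db box would boil down to something like the following -- schedule and log path being placeholders, of course:

--
# Nightly vacuum/analyze run locally on the database host; no ssh anywhere.
30 2 * * * /usr/local/pgsql/bin/vacuumdb -U postgres --all --analyze >> /var/log/pgsql_vacuum.log 2>&1
--
)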

Thanks again for all of your time and superb support.

----
James Robinson
Socialserve.com
