Re: Autovacuum seems to block database: WARNING worker took too long to start

From: Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: Autovacuum seems to block database: WARNING worker took too long to start
Date: 2010-11-15 07:52:53
Message-ID: AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi Tom, thanks for your detailed responses. I really appreciate it.
My comments below inline ...

> > I'm having a strange problem with postgres & autovacuum
> > Everything is working fine until I start getting the following errors ...
> > and postgres stops working shortly after (it stops accepting connections)
>
> > 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08
> > CET|1/44303|0|| LOG: automatic vacuum of table
> > "mrs.pg_catalog.pg_statistic": index scans: 1
> > pages: 0 removed, 189 remain
> > tuples: 132 removed, 4587 remain
> > system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> > 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> > CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> > CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
> > CET||0|| WARNING: worker took too long to start; cancelled
>
> Hm. The code comment above that warning says
>
> * The only problems that may cause this code to
> * fire are errors in the earlier sections of AutoVacWorkerMain,
> * before the worker removes the WorkerInfo from the
> * startingWorker pointer.
>
> but it's hard to see what problem there could lead to an issue. (In
> particular, I discount the idea that AutovacuumLock could be stuck,
> because we had to acquire it in order to issue this message.)
>
> But it strikes me that the code comment is wrong in one significant way:
> if the postmaster were failing to heed SIGUSR1 at all, you could reach
> the timeout here, because the fork-failed signal wouldn't get sent.
>
> Given that you say it also stops accepting connections, I'm thinking
> this is a postmaster problem not an autovacuum problem. But you've
> not provided any information about that end of it. Exactly what happens
> when you try to make a connection? Are there any entries at all in the
> postmaster log? What about the kernel log? Are you sure that new
> connections stop working *after* this happens, and not at exactly the
> same time?
>

When I try to make a connection from the command line I get nothing. I mean,
if I execute ...

*# psql -U postgres -l*

I have to Ctr-C or Ctr-Z (followed by kill %1) because I can wait forever.
It's like no more connections were accepted. Obviously our applications stop
working immediately.
Trying to stop postgres with

*# /etc/init.d/postgres stop*
*Stopping postgresql service: *

I have to strop it with Ctr-C as it does not wotk (I waited up to 3 min.
with no luck). As a consequence I had to kill the postmaster process
manually.

Kernel log says nothing :(

Not sure what started failing first: the connections or the autovacuum. I
just went to the log and saw that sequence of events. Searching in google I
found a few references saying autovacuum and that error log "WARNING:
worker took too long to start; cancelled" were releated.

>
> > OS: Centos 5.5
> > Kernel: 2.6.18-194.26.1.el5
> > Postgres version: 8.4.5 (installation out-of-the-box using yum)
>
> Given that this is a Linux system, I think that an OOM kill on the
> postmaster is a not-insignificant possibility. Or at least I would
> think that if there weren't a PostmasterIsAlive check in the autovac
> launcher loop. It's real hard to see how you could get more than
> one of these messages if the postmaster were gone entirely.
>

I just noticed the error log repeats once every minute but the content is
the same ...
First time the log appeared

*2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| WARNING: worker took too long to start; cancelled*
*
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| WARNING: worker took too long to start; cancelled
*

It repeats once every minute until we fix it

*
2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| WARNING: worker took too long to start; cancelled
*
*2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| WARNING: worker took too long to start; cancelled*

All 1,000+ lines are alike except for the date (every new line adds 1 more
minute)
However the log is the same

*
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| WARNING: worker took too long to start; cancelled
*

According to our postgresql.conf log format ...

*log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| ' *
* # %u = user name*
* # %d = database name*
* # %r = remote host and port*
* # %h = remote host*
* # %p = process ID*
* # %t = timestamp without
milliseconds*
* # %m = timestamp with
milliseconds*
* # %i = command tag*
* # %c = session ID*
* # %l = session line number*
* # %s = session start timestamp*
* # %v = virtual transaction ID*
* # %x = transaction ID (0 if none)
*
* # %q = stop here in non-session*
* # processes*

Splitting the log ...

*
(%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET
(%l = session line number) = 209
(%d = database name) =
(%r = remote host and port) =
(%p = process ID) = 3300
(%i = command tag) =
(%c = session ID) = 4cdc2ae6.ce4
(%s = session start timestamp) = 2010-11-11 18:41:58 CET
(%v = virtual transaction ID) =
(%x = transaction ID (0 if none)) = 0
%u = user name) =
Message = WARNING: worker took too long to start; cancelled
*

I see the following fields are the same in the 1,200+ lines of the log

*
(%d = database name) =
(%r = remote host and port) =
*
*
(%c = session ID)
(%s = session start timestamp)
(%v = virtual transaction ID)
(%x = transaction ID (0 if none))
%u = user name)
*
*
Message
*

Does this info help?
What does "empty database" mean? could be the autovacuum process?
I noticed the "%s" (*
2010-11-11 18:41:58 CET)
*matches the time we started postgres. Here you have the log

*2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58
CET||0|| LOG: database system was shut down at 2010-11-11 18:38:55 CET*
*2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58
CET||0|| LOG: autovacuum launcher started*
*2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56
CET||0|| LOG: database system is ready to accept connections*

> Could you try strace'ing the postmaster process to see what it's doing
> when this is happening?
>

I definitely will.

More information in case that helps. Strangely, we had the same problem in a
temporary database we used last week while migrating some data. This time
there was not autovacuum entry in the log ...

*2010-11-10 20:26:06.441
CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04
CET|114/0|0|postgres| LOG: duration: 1770.500 ms statement: select
id_type,id,id_language_fk,data from cache where id_type=5 and id =
'60820269' and id_language_fk = 2*
*2010-11-10 20:26:07.100
CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15
CET|80/0|0|postgres| LOG: duration: 1178.423 ms statement: SELECT
"reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551')
LIMIT 1*
*2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:28:41.641
CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08
CET|6/701|0|postgres| LOG: duration: 1878.724 ms execute <unnamed>: select
tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on
(tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on
(t.id_token = tu.id_token_fk) inner join useoccupation as uo on
(uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk =
tp.id_tokenprovider) where tp.id_tokenprovider in
(10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020)
order by tp.id_tokenprovider,t.id_token,uo.sort*
*2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:30:01.940
CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15
CET|70/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.942
CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06
CET|60/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.948
CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15
CET|89/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.950
CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28
CET|72/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.951
CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30
CET|83/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.961
CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19
CET|57/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:01.969
CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17
CET|65/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*
*2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51
CEST||0|| WARNING: worker took too long to start; cancelled*

Well, thanks for any information. I can provide more logs/information if
necessary.

Pablo

>
> regards, tom lane
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Pablo Delgado Díaz-Pache 2010-11-15 09:55:38 Postgres error: could not open relation base/xxxxx/yyyyy
Previous Message Tom Lane 2010-11-14 19:15:28 Re: Autovacuum seems to block database: WARNING worker took too long to start