simultaneous autovacuum and postmaster crash

From: lsq(at)nym(dot)hush(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: simultaneous autovacuum and postmaster crash
Date: 2011-12-28 13:53:21
Message-ID: 20111228135321.58B0C6F442@smtp.hushmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I got a simultaneous autovacuum and postmaster crash, then another
postmaster crash a few seconds later on restart. After a third
restart the system was stable. I haven't found anything obviously
matching this defect on-line.

I'm aware there's not an enormous amount of info to go on here - I
cannot reproduce the problem - but I'm hoping someone will
recognize the defect, and if it's fixed somewhere please tell me
where. I couldn't find a match in the bug lists.

Here are the details I have...

Two crashes happened simultaneously (postmaster + autovacuum
process for the same postmaster instance), the third about 5
seconds later when restarting the postmaster. The second attempt
to restart the postmaster, several hours later by hand, succeeded.
We have three postmasters running, on different ports, serving
different databases. Only one, 5432 seems to have been affected.

This particular system is a deployment test VM (Virtualbox 4.0.8),
which has been stable. It is possible the VM is actually at fault,
not PostgreSQL. However, since I don't have a pattern of crashes
and nothing is obvious there either, I can neither rule it in nor
out.

I have not been able to reproduce the problem. After the two
crashes, the system recovered, and has not failed again in the last
two weeks. I have tried tinkering with the amount of shared
memory, loading the system down, etc., but have not reproduced it.
I have a couple dozen other systems with identical software
configurations but using non-virtualized hardware, all x86-64 of
various sorts with 2 to 8 cores, running Centos 6, PostgreSQL
8.4.7. I also have a number of other virtual machines running the
same loads. None have reproduced this problem.

The virtual machine was lightly loaded. It's running Centos 6,
2.6.32-131, 64 bit, PostgreSQL 8.4.7. libc rpm version is glibc-
2.12-1.7.el6_0.5.x86_64. It has 2 GB of memory. The underlying box
(the one running the VM) is an i5 (e.g. amd64 architecture), 2.8
GHz, and has 4 GB of memory. It’s running Ubuntu 10.04, not Centos.

I am aware that I am running 8.4.7, not 8.4.10 or 9.x.

The postgresql packages came from CentOS:
postgresql-contrib-8.4.7-1.el6_0.1.x86_64
postgresql-server-8.4.7-1.el6_0.1.x86_64
postgresql-libs-8.4.7-1.el6_0.1.x86_64
postgresql-8.4.7-1.el6_0.1.x86_64

I don't know what specific queries were running at the time of the
failure, since nothing useful was logged either by postmaster or by
the OS. Neither logged anything within roughly two hours, other
than occasional pgstat wait timeouts and the crash.

2011-12-03 01:42:00.913 UTC,,,20562,,4ed97e63.5052,1,,2011-12-03
01:41:55 UTC,1/0,0,WARNING,01000,"pgstat wait timeout",,,,,,,,
2011-12-03 05:37:56.506 UTC,,,23937,,4ed94d35.5d81,2,,2011-12-02
22:12:05 UTC,,0,LOG,00000,"server process (PID 12604) was
terminated by signal 7: Bus error",,,,,,,,
2011-12-03 05:37:56.520 UTC,,,23937,,4ed94d35.5d81,3,,2011-12-02
22:12:05 UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,
2011-12-03 05:37:56.575 UTC,,,23937,,4ed94d35.5d81,4,,2011-12-02
22:12:05 UTC,,0,LOG,00000,"all server processes terminated;
reinitializing",,,,,,,,

Note the first attempt to restart the postmaster, which also
crashed, did not log anything.

It appears the system had sufficient memory, normal amounts of
shared memory available, etc, but sar samples are not that frequent
and these crashes were several minutes after the immediately
preceding sample, so if something happened between samples to make
shared memory or some other system resource unavailable I'd have
trouble finding it out posthumously.

Non-default configuration changes:
max_connections=200
listen_addresses='localhost,192.168.99.33'
shared_buffers=4000
wal_buffers=1024
log_destination='csvlog,syslog'
logging_collector='on'
log_directory='pg_log'
log_filename='postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size=0
syslog_facility='LOCAL0'
syslog_ident='postgres'
log_min_duration_statement=2147483
log_line_prefix='%m:%h:%u:%d\ '
log_autovacuum_min_duration=1000
autovacuum_naptime='2min'
datestyle = 'iso, mdy'

Here are the two concurrent stack traces, one for the postmaster
and one for the autovacuum worker. I don't know which one of the
two truly crashed first. fstat tells me the creation times are
identical.

This is core 12604, crashed at the same time as core 12603
Core was generated by `postgres: autovacuum worker process '.
Program terminated with signal 7, Bus error.
#0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180
3180 return guc_name_compare(confa->name, confb->name);

(gdb) bt
#0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180
#1 0x00007f62aa8dc300 in bsearch () from /lib64/libc.so.6
#2 0x00000000006ab49e in find_option (name=0x7bb622
"zero_damaged_pages", create_placeholders=1 '\001', elevel=20) at
guc.c:3134
#3 0x00000000006affd5 in set_config_option (name=0x7bb622
"zero_damaged_pages", value=0x6d1879 "false", context=PGC_SUSET,
source=PGC_S_OVERRIDE, action=GUC_ACTION_SET, changeVal=1 '\001')
at guc.c:4537
#4 0x00000000005b4dee in AutoVacWorkerMain (argc=<value optimized
out>, argv=<value optimized out>) at autovacuum.c:1554
#5 0x00000000005b503b in StartAutoVacWorker () at autovacuum.c:1438
#6 0x00000000005be465 in StartAutovacuumWorker
(postgres_signal_arg=<value optimized out>) at postmaster.c:4404
#7 sigusr1_handler (postgres_signal_arg=<value optimized out>) at
postmaster.c:4164
#8 <signal handler called>
#9 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6
#10 0x00000000005bc522 in ServerLoop () at postmaster.c:1347
#11 0x00000000005bf0ec in PostmasterMain (argc=<value optimized
out>,argv=<value optimized out>) at postmaster.c:1040
#12 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188
(gdb)

This is 12603
Core was generated by `/usr/bin/postmaster -i -p 5432 -D
/db/postgres'.
Program terminated with signal 7, Bus error.
#0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554
554 if (size > set->allocChunkLimit)
(gdb) bt
#0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554
#1 0x00000000006b4f61 in MemoryContextAllocZero (context=0xc4ff90,
size=293) at mcxt.c:534
#2 0x00000000005bb511 in ProcessStartupPacket (port=0xc71ac0,
SSLdone=0 '\000') at postmaster.c:1543
#3 0x00000000005bbef0 in BackendInitialize (port=0xc71ac0) at
postmaster.c:3329
#4 0x00000000005bc936 in BackendStartup () at postmaster.c:3078
#5 ServerLoop () at postmaster.c:1387
#6 0x00000000005bf0ec in PostmasterMain (argc=<value optimized
out>, argv=<value optimized out>) at postmaster.c:1040
#7 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188
(gdb)

Then 5 seconds later, on postmaster restart:
Core was generated by `/usr/bin/postmaster -i -p 5432 -D
/db/postgres'
Program terminated with signal 7, Bus error.
#0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214
214 while (context->firstchild != NULL)

(gdb) bt
#0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214
#1 0x00000000006b4c79 in MemoryContextDelete (context=0xc501b0) at
mcxt.c:169
#2 0x00000000005de838 in InitLocks () at lock.c:343
#3 0x00000000005d7995 in CreateSharedMemoryAndSemaphores
(makePrivate=0 '\000', port=5432) at ipci.c:192
#4 0x00000000005bd63b in reset_shared () at postmaster.c:2029
#5 PostmasterStateMachine () at postmaster.c:2933
#6 0x00000000005bda05 in reaper (postgres_signal_arg=<value
optimized out>) at postmaster.c:2475
#7 <signal handler called>
#8 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6
#9 0x00000000005bc522 in ServerLoop () at postmaster.c:1347
#10 0x00000000005bf0ec in PostmasterMain (argc=<value optimized
out>,argv=<value optimized out>) at postmaster.c:1040
#11 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188

Thanks

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message tracy.anderson 2011-12-28 15:23:03 BUG #6362: pg_ctl register puts wrong slash in
Previous Message wcting163 2011-12-28 02:44:39 Re: BUG #6360: with hold cursor, cause function executed twice and wrong results