Re: PostgreSQL is crashing with a signal 11 - Bug?

From: Rafael Martinez Guerrero <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kjetil Torgrim Homme <kjetilho(at)ifi(dot)uio(dot)no>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL is crashing with a signal 11 - Bug?
Date: 2004-09-13 10:41:09
Message-ID: 1095072069.31640.137.camel@bbking.uio.no
Lists: pgsql-bugs

On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <kjetilho(at)ifi(dot)uio(dot)no> writes:
> > how can att[i]->attlen possibly change in the interim? but
> > data_length looks corrupted, too.
>
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together. So I wouldn't put too much stock in the
> reported line number.
>
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases. Have you checked to see what the
> page-on-disk looks like? I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>
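The question about the offset within the page comes down to simple arithmetic once the position of the damaged byte in the table file is known. A minimal sketch, assuming the default PostgreSQL block size of 8192 bytes (BLCKSZ); the helper name `locate_in_page` and the sample file offset are hypothetical and are not values from this report:

```python
BLCKSZ = 8192  # default PostgreSQL page size; an assumption, custom builds may differ


def locate_in_page(file_offset, block_size=BLCKSZ):
    """Return (block number, offset within that block) for a byte offset in a relation file."""
    return divmod(file_offset, block_size)


# Hypothetical file offset, chosen only to illustrate the calculation.
block, offset = locate_in_page(3 * BLCKSZ + 0x0FFF)
print(f"block {block}, offset in page 0x{offset:04x}")
```

An in-page offset of 0x0fff is byte 4095, the last byte before the 4096-byte midpoint of an 8 kB page.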

Hi Tom,

Kjetil will answer you about this.

In the meantime, we got new core dumps while taking a backup of the same
database.

Some more info I got from the department in charge of this database:
-----------------------------------------------------------
We make a backup of our production server every 15 minutes. Recently,
we've seen behaviour like this:

[12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
[12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no
[12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
pg_dump: ERROR: MemoryContextAlloc: invalid request size 1577058307
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table
"paid_quota_history" failed: PQendcopy() failed.
pg_dump: Error message from server: pg_dump: The command was: COPY
public.paid_quota_history (job_id, transaction_type, person_id, tstamp,
update_by, update_program, pageunits_free, pageunits_paid,
pageunits_total) TO stdout;
pg_dumpall: pg_dump failed on cerebrum_prod, exiting
[12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no
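
The rejected request size in the log above is easier to read in hexadecimal. A minimal sketch, assuming the value is a 32-bit length word; the helper name `split_length_word` is hypothetical and only for illustration:

```python
def split_length_word(value):
    """Split a 32-bit value into (high byte, low 24 bits)."""
    value &= 0xFFFFFFFF
    return value >> 24, value & 0x00FFFFFF


# Request size reported by MemoryContextAlloc in the log above.
size = 1577058307
high, low = split_length_word(size)
print(f"{size} = 0x{size:08x} (high byte 0x{high:02x}, low 24 bits {low})")
```

This prints 0x5e000003: without the top byte the value would be 3, an ordinary small length. That resembles a stray byte in the high byte of a length word, although the byte here is 0x5e and this alone does not prove the same cause.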

Every subsequent backup fails with the same message, and then
suddenly:

[12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
[12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no

To me this looks like a cache somewhere that, when read, contained
incorrect data. The cache was somehow flushed two hours later, and
fresh data was then read from disk.

Could this be a PostgreSQL problem, or is it hardware/kernel related?
Upgrading from 7.3.5 to 7.3.7 and then to 7.4.5 did not help. We have
now moved the database between 3 different Dell 2650 servers, and
replaced the memory chips on one system once. Lately, one or more
postgres processes have received signal 11 at least once a day. The
problems started about a week ago, after about 9 months of stable
production.

The backup failures above were accompanied by 4 core dumps. A backtrace
follows:

#0 0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1 0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe)
at numeric.c:2673
#2 0x08171927 in numeric_out (fcinfo=0xbfffc2d0) at numeric.c:373
#3 0x081aa81d in FunctionCall3 (flinfo=0x82cc4e8, arg1=3221209808,
arg2=3221209808, arg3=3221209808) at fmgr.c:1016
#4 0x080c78fb in CopyTo (rel=0xb6800bd0, attnumlist=0x82cb4a0,
binary=0 '\0', oids=0 '\0', delim=0x82232a8 "\t", null_print=0x81fc95d
"\\N")
at copy.c:1096
#5 0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920
#6 0x081507c5 in PortalRunUtility (portal=0x82bdfd8, query=0x82ba220,
dest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:772
#7 0x08150a3e in PortalRunMulti (portal=0x82bdfd8, dest=0x82ba1d8,
altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:836
#8 0x0815033c in PortalRun (portal=0x82bdfd8, count=2147483647,
dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at
pquery.c:494
#9 0x0814d5f8 in exec_simple_query (
query_string=0x82b9bc0 "COPY public.change_log (tstamp, change_id,
subject_entity, change_type_id, dest_entity, change_params, change_by,
change_program, description) TO stdout;") at postgres.c:873
#10 0x0814f660 in PostgresMain (argc=4, argv=0x82701b8,
username=0x8270188 "postgres") at postgres.c:2868
#11 0x0812f5ab in BackendFork (port=0x827d0a0) at postmaster.c:2564
#12 0x0812f09e in BackendStartup (port=0x827d0a0) at postmaster.c:2207
#13 0x0812d95f in ServerLoop () at postmaster.c:1119
#14 0x0812d305 in PostmasterMain (argc=3, argv=0x826e1c0) at
postmaster.c:897
#15 0x08104f10 in main (argc=3, argv=0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell 2650-specific causes.
-----------------------------------------------------------

--
Rafael Martinez, <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Center for Information Technology Services
University of Oslo, Norway
