Re: signal 11 on AIX: 7.4.2

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: signal 11 on AIX: 7.4.2
Date: 2004-06-17 17:12:10
Message-ID: 200406171712.i5HHCAU10882@candle.pha.pa.us
Lists: pgsql-hackers

Andrew Sullivan wrote:
> On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
> >
> > On the weekend, we ran a set of tests on the offending system to see
> > if we could re-create it. We set up the triggering conditions just
> > as they'd been when it happened, and alas, no segfault. So although
> > this was pretty much regularly reproducible when it actually
> > happened, it's now a note to the Journal of Irreproducible Results.
> > I hate when that happens.
>
> I hate it even more when the symptom comes back inexplicably. We had
> it again. For the record, here's what gdb says (there are some
> high-bit characters in here; dunno how they'll come through in mail):
>
> (gdb) bt
> #0 0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
> #1 0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
> #2 0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
> #3 0x10058668 in WriteControlFile () at xlog.c:2121
> #4 0x101f8f78 in init_execution_state (src=0x202acd8c "",
> argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230')
> at functions.c:121
> #5 0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
> #6 0x101fa57c in set_tz (tz=0x7308710b <Address 0x7308710b out of bounds>)
> at variable.c:261
> #7 0x101fa9a4 in assign_timezone (value=0x202ad398 "", doit=-1 '',
> interactive=-8 '') at variable.c:584
> #8 0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 "")
> at postgres.c:2560
> #9 0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef,
> username=0xdeadbeef <Address 0xdeadbeef out of bounds>) at postgres.c:2307
> #10 0x10002530 in exec_parse_message (query_string=0x20000a24 "",
> stmt_name=0x5 "", paramTypes=0x0, numParams=0) at postgres.c:1216
> #11 0x10001f84 in exec_simple_query (
> query_string=0x2005a540 '' <repeats 40 times>) at postgres.c:980
> #12 0x100005f0 in main (argc=1, argv=0xdeadbeef) at main.c:228

Well, the bad news is that this backtrace isn't very useful. It states
the query you sent was 40 0xff's, and it says you called
assign_timezone, which called set_tz, which then shows it calling
init_sql_fcache() (impossible), which later calls WriteControlFile()
(impossible), which calls getaddrinfo() (impossible).

My only guess is that getaddrinfo in your libc has a bug that is
corrupting the stack (hence the improper backtrace), then crashing.

As to the cause, I assume this is not reproducible, right? Is there
something unusual about your DNS setup or something that might have
changed recently that caused getaddrinfo() to do something new?
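
One way to narrow this down would be to call getaddrinfo() from a small
standalone program on the same machine, outside the backend, and see
whether libc misbehaves on its own. The following is only a minimal
sketch under assumptions: the function name count_addresses, the default
host "localhost", and the service "5432" are illustrative and are not
taken from the PostgreSQL source or from your setup.

```c
/* Standalone getaddrinfo() check (illustrative sketch, not backend code). */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Resolve host/service and return the number of addresses found,
 * or -1 if getaddrinfo() reports an error.  The name and signature
 * are hypothetical, chosen for this example only.
 */
int
count_addresses(const char *host, const char *service)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    struct addrinfo *ai;
    int         rc;
    int         n = 0;

    /* This sketch always expects a host name. */
    assert(host != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP only */

    rc = getaddrinfo(host, service, &hints, &res);
    if (rc != 0)
    {
        fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(rc));
        return -1;
    }

    /* Walk the result list and count the entries. */
    for (ai = res; ai != NULL; ai = ai->ai_next)
        n++;

    freeaddrinfo(res);
    return n;
}

int
main(int argc, char **argv)
{
    /* Host comes from the command line; "localhost" is an assumed default. */
    const char *host = (argc > 1) ? argv[1] : "localhost";
    int         n = count_addresses(host, "5432");

    printf("%s: %d address(es)\n", host, n);
    return (n < 0) ? 1 : 0;
}
```

Running it against the host names the server actually resolves would
show whether the crash can happen without PostgreSQL involved. A clean
run would not rule libc out, though, since the failure is intermittent.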

Of course, the memmove() itself might be causing the problem, and the
getaddrinfo frames might be a corrupt part of the backtrace too.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
