Re: BUG #6200: standby bad memory allocations on SELECT

From: Michael Brauwerman <michael(dot)brauwerman(at)redfin(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2012-01-28 21:34:30
Message-ID: CAHDXJ6jes_Zv1OFo=EZn-HGOrjKoy2uLz3Sg4ShXhb0yMY_-5A@mail.gmail.com
Lists: pgsql-bugs

I work with Bridget at Redfin.

We have a core dump from a hot standby segfault in pg 9.1.2 that happens
about once every 5 days (after several million queries). It might or might
not be the same root issue as the "alloc" errors. If I should file a new bug
report, let me know.

The postgres executable that crashed did not have debugging symbols
installed, and we were unable to debug (gdb) the core file using a debug
build of postgres. (Symbols didn't match.) Running gdb against a non-debug
postgres executable gave us this stack trace:

[root(at)query-7 core]# gdb -q -c /postgres/core/query-9.core.19678 /usr/pgsql-9.1/bin/postgres-non-debug
Reading symbols from /usr/pgsql-9.1/bin/postgres-non-debug...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New Thread 19678]

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffdcd58000
Core was generated by `postgres: datamover stingray_prod 10.11.0.134(54140) SELEC'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000045694c in nocachegetattr ()

(gdb) bt
#0 0x000000000045694c in nocachegetattr ()
#1 0x00000000006f93c9 in ?? ()
#2 0x00000000006fa231 in tuplesort_puttupleslot ()
#3 0x0000000000573ad1 in ExecSort ()
#4 0x000000000055cdda in ExecProcNode ()
#5 0x000000000055bcd1 in standard_ExecutorRun ()
#6 0x0000000000623594 in ?? ()
#7 0x0000000000624ae0 in PortalRun ()
#8 0x00000000006220f2 in PostgresMain ()
#9 0x00000000005e6ba4 in ?? ()
#10 0x00000000005e791c in PostmasterMain ()
#11 0x000000000058b9ae in main ()
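
For reading a symbol-less trace like the one above, here is a minimal
Python sketch, assuming the "bt" output has been pasted into a string. It
lists the frames gdb printed as "??", which are the addresses a matching
debug build would need to resolve. The helper names parse_backtrace and
unresolved are illustrative only, not part of gdb or PostgreSQL.

```python
import re

# Frame lines look like "#2 0x00000000006fa231 in tuplesort_puttupleslot ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+in\s+(\S+)\s*\(")


def parse_backtrace(text):
    """Return (frame_number, address, symbol_or_None) for each frame line."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            num, addr, sym = m.groups()
            # gdb prints "??" when it cannot map the address to a symbol.
            frames.append((int(num), int(addr, 16), None if sym == "??" else sym))
    return frames


def unresolved(frames):
    """Return (frame_number, address) for frames without a symbol name."""
    return [(num, addr) for num, addr, sym in frames if sym is None]


# Backtrace copied from the gdb session in this message.
BT = """\
#0 0x000000000045694c in nocachegetattr ()
#1 0x00000000006f93c9 in ?? ()
#2 0x00000000006fa231 in tuplesort_puttupleslot ()
#3 0x0000000000573ad1 in ExecSort ()
#4 0x000000000055cdda in ExecProcNode ()
#5 0x000000000055bcd1 in standard_ExecutorRun ()
#6 0x0000000000623594 in ?? ()
#7 0x0000000000624ae0 in PortalRun ()
#8 0x00000000006220f2 in PostgresMain ()
#9 0x00000000005e6ba4 in ?? ()
#10 0x00000000005e791c in PostmasterMain ()
#11 0x000000000058b9ae in main ()
"""

if __name__ == "__main__":
    for num, addr in unresolved(parse_backtrace(BT)):
        print(f"frame #{num}: {addr:#x} has no symbol")
```

The unresolved frames are most likely static functions with no exported
symbol, so only an executable built from the same source and options as the
one that crashed could name them.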

We have the (5GB) core file, and are happy to do any more forensics anyone
can advise.

Please instruct.

I hope this helps point to a root cause and resolution.

Thank you,

Mike Brauwerman
Data Team, Redfin

On Fri, Jan 27, 2012 at 10:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget(dot)frey(at)redfin(dot)com> wrote:
> > Thanks for the info - that's very helpful. We had also noted that the
> > alloc seems to be -3 bytes. We have run pg_check and it found no
> > instances of corruption. We've also replayed queries that have failed,
> > and have never been able to get the same query to fail twice. In the
> > case you investigated, was there permanent page corruption - e.g. you
> > could run the same query over and over and get the same result?
>
> Yes. I observed that the infomask bits on several tuples had somehow
> been overwritten by nonsense. I am not sure whether there were other
> kinds of corruption as well - I suspect probably so - but that's the
> only one I saw with my own eyes, courtesy of pg_filedump.
>
> > It really does seem like this is an issue either in Hot Standby or very
> > closely related to that feature, where there is temporary corruption of a
> > btree index that then disappears. Our master is not experiencing any
> > malloc issues, while the 3 slaves get about a dozen per day, despite
> > similar workloads. We haven't had a slave segfault since we set it up to
> > produce a core dump, but we're expecting to have that within the next
> > few days (assuming we'll continue to get a segfault every 3-4 days).
> > We're also planning to set up one slave that will panic when it gets a
> > malloc issue, as you (and other posters on 6400) had suggested.
> >
> > Thanks again for the help, and we'll keep you posted as we learn more...
>
> The case I investigated involved corruption on the master, and I think
> it predated Hot Standby. However, the symptom is generic enough that
> it seems quite possible that there's more than one way for it to
> happen. :-(
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

