Re: BUG #6200: standby bad memory allocations on SELECT

From: Michael Brauwerman <michael(dot)brauwerman(at)redfin(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2012-01-28 21:34:30
Message-ID: CAHDXJ6jes_Zv1OFo=EZn-HGOrjKoy2uLz3Sg4ShXhb0yMY_-5A@mail.gmail.com
Lists: pgsql-bugs

I work with Bridget at Redfin.

We have a core dump from a hot standby segfault in pg 9.1.2 that occurs
about once every 5 days (after several million queries). It might or might
not have the same root cause as the "alloc" errors. If I should file a new
bug report, let me know.

The postgres executable that crashed did not have debugging symbols
installed, and we were unable to debug (gdb) the core file using a debug
build of postgres. (Symbols didn't match.) Running gdb against a non-debug
postgres executable gave us this stack trace:


[root(at)query-7 core]# gdb -q -c /postgres/core/query-9.core.19678 /usr/pgsql-9.1/bin/postgres-non-debug
Reading symbols from /usr/pgsql-9.1/bin/postgres-non-debug...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New Thread 19678]

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffdcd58000
Core was generated by `postgres: datamover stingray_prod 10.11.0.134(54140) SELEC'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000045694c in nocachegetattr ()



(gdb) bt
#0  0x000000000045694c in nocachegetattr ()
#1  0x00000000006f93c9 in ?? ()
#2  0x00000000006fa231 in tuplesort_puttupleslot ()
#3  0x0000000000573ad1 in ExecSort ()
#4  0x000000000055cdda in ExecProcNode ()
#5  0x000000000055bcd1 in standard_ExecutorRun ()
#6  0x0000000000623594 in ?? ()
#7  0x0000000000624ae0 in PortalRun ()
#8  0x00000000006220f2 in PostgresMain ()
#9  0x00000000005e6ba4 in ?? ()
#10 0x00000000005e791c in PostmasterMain ()
#11 0x000000000058b9ae in main ()
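
For anyone who wants to work with the trace above programmatically, here is a minimal sketch (not something we ran on the standby) that parses the `bt` output and lists the frames gdb could not resolve to a symbol. The names `parse_backtrace` and `unresolved_frames` are hypothetical helpers made up for this illustration, and the regular expression assumes the plain `#N  0xADDR in symbol ()` layout shown above.

```python
import re

# Matches gdb "bt" lines such as:
#   #2  0x00000000006fa231 in tuplesort_puttupleslot ()
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+) \(")


def parse_backtrace(text):
    """Return a list of (frame_number, address, symbol_or_None) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            number, address, symbol = match.groups()
            # gdb prints "??" when it cannot map the address to a symbol.
            frames.append(
                (int(number), int(address, 16), None if symbol == "??" else symbol)
            )
    return frames


def unresolved_frames(frames):
    """Return (frame_number, hex_address) for frames without a symbol."""
    return [(number, hex(address)) for number, address, symbol in frames if symbol is None]


# The backtrace exactly as posted above.
BT = """\
#0  0x000000000045694c in nocachegetattr ()
#1  0x00000000006f93c9 in ?? ()
#2  0x00000000006fa231 in tuplesort_puttupleslot ()
#3  0x0000000000573ad1 in ExecSort ()
#4  0x000000000055cdda in ExecProcNode ()
#5  0x000000000055bcd1 in standard_ExecutorRun ()
#6  0x0000000000623594 in ?? ()
#7  0x0000000000624ae0 in PortalRun ()
#8  0x00000000006220f2 in PostgresMain ()
#9  0x00000000005e6ba4 in ?? ()
#10 0x00000000005e791c in PostmasterMain ()
#11 0x000000000058b9ae in main ()
"""

if __name__ == "__main__":
    for number, address in unresolved_frames(parse_backtrace(BT)):
        print(f"frame #{number} at {address} has no symbol")
```

The addresses it reports are the ones that would still need to be looked up once a binary with matching debug symbols is available.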



We have the (5GB) core file, and are happy to do any more forensics anyone
can advise.

Please instruct.

I hope this helps point to a root cause and resolution....

Thank you,

Mike Brauwerman
Data Team, Redfin

On Fri, Jan 27, 2012 at 10:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget(dot)frey(at)redfin(dot)com> wrote:
> > Thanks for the info - that's very helpful.  We had also noted that the
> > alloc seems to be -3 bytes.  We have run pg_check and it found no
> > instances of corruption. We've also replayed queries that have failed,
> > and have never been able to get the same query to fail twice.  In the
> > case you investigated, was there permanent page corruption - e.g. you
> > could run the same query over and over and get the same result?
>
> Yes.  I observed that the infomask bits on several tuples had somehow
> been overwritten by nonsense.  I am not sure whether there were other
> kinds of corruption as well - I suspect probably so - but that's the
> only one I saw with my own eyes, courtesy of pg_filedump.
>
> > It really does seem like this is an issue either in Hot Standby or very
> > closely related to that feature, where there is temporary corruption of a
> > btree index that then disappears.  Our master is not experiencing any
> > malloc issues, while the 3 slaves get about a dozen per day, despite
> > similar workloads.  We haven't had a slave segfault since we set it up to
> > produce a core dump, but we're expecting to have that within the next few
> > days (assuming we'll continue to get a segfault every 3-4 days).  We're
> > also planning to set up one slave that will panic when it gets a malloc
> > issue, as you (and other posters on 6400) had suggested.
> >
> > Thanks again for the help, and we'll keep you posted as we learn more...
>
> The case I investigated involved corruption on the master, and I think
> it predated Hot Standby.  However, the symptom is generic enough that
> it seems quite possible that there's more than one way for it to
> happen.  :-(
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs
>



