Re: BUG #6200: standby bad memory allocations on SELECT

From: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2012-01-27 18:31:39
Message-ID: CAHOc93mzK58JO8XJdHHp=6tLRjv45WYY4cxsdkAZYNnpNQ7RjA@mail.gmail.com
Lists: pgsql-bugs

Thanks for the info - that's very helpful. We had also noted that the
alloc seems to be -3 bytes. We have run pg_check and it found no instances
of corruption. We've also replayed queries that have failed, and have never
been able to get the same query to fail twice. In the case you
investigated, was there permanent page corruption - e.g. you could run the
same query over and over and get the same result?
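
As a quick check of the "-3 bytes" reading, here is a minimal sketch (not taken from PostgreSQL; the helper name as_signed_64 is made up for illustration) that reinterprets the reported size as a signed 64-bit integer. It assumes the request size is an unsigned 64-bit value, which matches the hexadecimal form given in the quoted message below.

```python
def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed integer."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


# Size taken from the "invalid memory alloc request size" error in this thread.
reported = 18446744073709551613

print(hex(reported))           # 0xfffffffffffffffd
print(as_signed_64(reported))  # -3
```

So the size in the error is consistent with a length calculation that went slightly negative before being treated as unsigned.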

It really does seem like this is an issue either in Hot Standby or very
closely related to that feature, where there is temporary corruption of a
btree index that then disappears. Our master is not experiencing any
malloc issues, while the 3 slaves get about a dozen per day, despite
similar workloads. We haven't had a slave segfault since we set it up to
produce a core dump, but we're expecting to have that within the next few
days (assuming we'll continue to get a segfault every 3-4 days). We're
also planning to set up one slave that will panic when it gets a malloc
issue, as you (and other posters on 6200) had suggested.

Thanks again for the help, and we'll keep you posted as we learn more...
-B

On Fri, Jan 27, 2012 at 6:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
> wrote:
> > Hello,
> > We upgraded to postgres 9.1.2 two weeks ago, and we are also
> > experiencing an issue that seems very similar to the one reported as
> > bug 6200. We see approximately 2 dozen alloc errors per day across 3
> > slaves, and we are getting one segfault approximately every 3 days.
> > We did not experience this issue before our upgrade (we were on
> > version 8.4, and used skytools for replication).
> >
> > We are attempting to get a core dump on segfault (our last attempt
> > did not work due to a config issue for the core dump). We're also
> > attempting to repro the alloc errors on a test setup, but it seems
> > like we may need quite a bit of load to trigger the issue. We're not
> > certain that the alloc issues and the segfaults are "the same issue" -
> > but it seems that it may be since the OP for bug 6200 sees the same
> > behavior. We have seen no issues on the master; all alloc errors and
> > segfaults have been on the slaves.
> >
> > We've seen the alloc errors on a few different tables, but most
> > frequently on logins. Rows are added to the logins table one-by-one,
> > and updates generally happen one row at a time. The table is pretty
> > basic, it looks like this...
> >
> > CREATE TABLE logins
> > (
> > login_id bigserial NOT NULL,
> > <snip - a bunch of columns>
> > CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
> > <snip - some other constraints...>
> > )
> > WITH (
> > FILLFACTOR=80,
> > OIDS=FALSE
> > );
> >
> > The queries that trigger the alloc error on this table look like this (we
> > use hibernate hence the funny underscoring...)
> > select login0_.login_id as login1_468_0_, l... from logins login0_ where
> > login0_.login_id=$1
> >
> > The alloc error in the logs looks like this:
> > -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR: invalid memory alloc request size 18446744073709551613
> >
> > The alloc error is nearly always for size 18446744073709551613 -
> > though we have seen one time where it was a different amount...
>
> Hmm, that number in hex works out to 0xfffffffffffffffd, which makes
> it sound an awful lot like the system (for some unknown reason)
> attempted to allocate -3 bytes of memory. I've seen something like
> this once before on a customer system running a modified version of
> PostgreSQL. In that case, the problem turned out to be page
> corruption. Circumstances didn't permit determination of the root
> cause of the page corruption, however, nor was I able to figure out
> exactly how the corruption I saw resulted in an allocation request
> like this. It would be nice to figure out where in the code this is
> happening and put in a higher-level guard so that we get a better
> error message.
>
> You may want to compile a modified PostgreSQL executable that puts an
> extremely long sleep (like a year) just before this error is reported.
> Then, when the system hangs at that point, you can attach a debugger
> and pull a stack backtrace. Or you could insert an abort() at that
> point in the code and get a backtrace from the core dump.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
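
Since the quoted report notes that the requested size is nearly always the same but was different at least once, a small log tally may help spot the outliers. The following is only a sketch: the function name tally_alloc_sizes is made up for illustration, the pattern is based solely on the error text quoted above, and the second sample line is filler rather than real log output.

```python
import re
from collections import Counter

# Matches the error text as it appears in the log excerpt quoted above.
ALLOC_ERROR = re.compile(r"invalid memory alloc request size (\d+)")


def tally_alloc_sizes(lines):
    """Count how often each requested size appears in the given log lines."""
    sizes = Counter()
    for line in lines:
        match = ALLOC_ERROR.search(line)
        if match:
            sizes[int(match.group(1))] += 1
    return sizes


sample = [
    # Error line from this thread, joined back onto a single line.
    "2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR: "
    "invalid memory alloc request size 18446744073709551613",
    # Illustrative filler line that should not be counted.
    "some unrelated log line",
]

for size, count in tally_alloc_sizes(sample).items():
    print(size, count)
```

In practice the input would be the lines of the standby's log files; any size other than 18446744073709551613 in the tally would be the "different amount" worth a closer look.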

--
Bridget Frey Director, Data & Analytics Engineering | Redfin

bridget(dot)frey(at)redfin(dot)com | tel: 206.576.5894
