RE: dsa_allocate() faliure

From: Arne Roland <A(dot)Roland(at)index(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>
Cc: Fabio Isabettini <fisabettini(at)voipfuture(dot)com>, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>, Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: RE: dsa_allocate() faliure
Date: 2019-02-04 20:31:47
Message-ID: d9c6cc80e21241349db53b2f64075029@index.de
Lists: pgsql-hackers pgsql-performance

It's definitely a relatively complex pattern. The query I sent you last time was minimal with respect to predicates (removing any single one of the predicates turned it into a working query).
> Huh. Ok well that's a lot more frequent that I thought. Is it always the same query? Any chance you can get the plan? Are there more things going on on the server, like perhaps concurrent parallel queries?
I had this bug occur while I was the only one working on the server. I checked that there was just one transaction with a snapshot at all, and it was an autovacuum busy with a totally unrelated relation my colleague was working on.

The bug is indeed behaving like a ghost.
One child relation needed a few new rows to test a particular application a colleague of mine was working on. The insert triggered an autoanalyze and the explain changed slightly:
Besides the row and cost estimates, the change is that the line
Recheck Cond: (((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1) AND (gid && '{853078,853080,853082}'::integer[]))
is now
Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND ((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1))
and the error vanished.

I could try to hunt down another query by assembling seemingly random queries. I don't see a very clear pattern in the queries aborting with this error on our production servers. I'm not surprised that the bug is hard to chase on production servers. They are usually quite lively.

>If you're able to run a throwaway copy of your production database on another system that you don't have to worry about crashing, you could just replace ERROR with PANIC and run a high-speed loop of the query that crashed in product, or something. This might at least tell us whether it's reach that condition via something dereferencing a dsa_pointer or something manipulating the segment lists while allocating/freeing.

I could take a backup and restore the relevant tables on a throwaway system. You are just suggesting replacing line 728
    elog(FATAL,
         "dsa_allocate could not find %zu free pages", npages);
with
    elog(PANIC,
         "dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would shutting down the whole instance produce more helpful logging?
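
For the high-speed loop on the throwaway system, something along these lines might do. This is only a rough sketch under assumptions: the helper names run_query_once and loop_until_failure are made up for illustration, the query is run through the psql command-line client, and the connection arguments and SQL text are placeholders to be filled in.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Optional


def run_query_once(psql_args: list[str], sql: str) -> tuple[int, str]:
    """Run one query through psql and return (exit code, stderr text).

    Hypothetical helper: psql_args carries the connection options
    (for example ["-d", "some_throwaway_db"]), sql is the failing query.
    """
    proc = subprocess.run(
        ["psql", "-X", "-q", "-v", "ON_ERROR_STOP=1", *psql_args, "-c", sql],
        stdout=subprocess.DEVNULL,  # the result rows are not interesting here
        stderr=subprocess.PIPE,     # error messages arrive on stderr
        text=True,
    )
    return proc.returncode, proc.stderr


def loop_until_failure(
    run: Callable[[], tuple[int, str]],
    max_iterations: int,
    marker: str = "dsa_allocate",
) -> Optional[tuple[int, str]]:
    """Call run() repeatedly and stop at the first failure.

    Returns (iteration number, stderr text) for the first run that exits
    non-zero or whose stderr contains the marker, or None if every one
    of the max_iterations runs succeeded.
    """
    for iteration in range(1, max_iterations + 1):
        code, err = run()
        if code != 0 or marker in err:
            return iteration, err
    return None
```

It would be called as loop_until_failure(lambda: run_query_once(args, sql), 100000), with args and sql set for the restored tables. Note that the loop also stops on any other non-zero exit, for example a lost connection after the server goes down, so the returned stderr text should be checked.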

All the best
Arne
