Re: Re: Proposed Windows-specific change: Enable crash dumps (like core files)

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Proposed Windows-specific change: Enable crash dumps (like core files)
Date: 2010-12-19 11:51:19
Message-ID: AANLkTimFyXg9KYx4S-COj2Ha5EfP_v+y7kXyC2e98ptd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 19, 2010 at 07:26, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote:
> On 18/12/2010 1:13 AM, Magnus Hagander wrote:
>>
>> On Fri, Dec 17, 2010 at 17:42, Magnus Hagander<magnus(at)hagander(dot)net>
>>  wrote:
>>>
>>> On Fri, Dec 17, 2010 at 17:24, Craig Ringer<craig(at)postnewspapers(dot)com(dot)au>
>>>  wrote:
>>>>
>>>> On 17/12/2010 7:17 PM, Magnus Hagander wrote:
>>>
>>> Now, that's annoying. So clearly we can't use that function to
>>> determine which version we're on. Seems it only works for "image help
>>> api", and not the general thing.
>>>
>>> According to
>>> http://msdn.microsoft.com/en-us/library/ms679294(v=vs.85).aspx,
>>> we could look for:
>>>
>>> SysEnumLines - if present, we have at least 6.1.
>>>
>>> However, I don't see any function that appeared in 6.0 only..
>>
>> Actually, I'm wrong - there are functions enough to determine the
>> version. So here's a patch that tries that.
>
> Great. I pulled the latest from your git tree, tested that, and got much
> better results. Crashdump size is back to what I expected. In my test code,
> fcinfo->args and fcinfo->argnull can be examined without problems.
> Backtraces look good; see below. It seems to be including backend private
> memory again now. Thanks _very_ much for your work on this.

Ok, great. I think that leaves us at least complete enough to commit -
we can always improve things further as we get some more real world
testing.

> fcinfo->flinfo is still inaccessible, but I suspect it's in shared memory,
> as it's at 0x00000135 . Ditto fcinfo->resultinfo and fcinfo->context.

Hmm. Not sure why those would be in shared memory, that seems strange.

> This has me wondering - is it going to be necessary to dump shared memory to
> make many backtraces useful? I just responded to Tom mentioning that the
> patch doesn't currently dump shared memory, but I hadn't realized the extent
> to which it's used for _lots_ more than just disk buffers. I'm not sure how
> to handle dumping shared_buffers when someone might be using multi-gigabyte
> shared_buffers, though. Dumping the whole lot would risk sudden
> out-of-disk-space issues, slowdowns as dumps are written, and the backend
> being "frozen" as it's being dumped could delay the system coming back up
> again. Trying to selectively dump critical parts could cause dumps to fail
> if the system is in early startup or a bad state.

I don'tt hink a slowdown issue during dump being written is really a
problem - it's not like this is something that happens normally. In
fact, we're going to restart all the other backends after the crash
anyway, and that's going to be a lot more disruptive.

Filling up the disk, however, is a much bigger issue.. As is having a
huge dump to mail around (for error reports), if it's not actually
necessary...

> The same concern applies to writing backend private memory; it's fine most
> of the time, but if you're doing data warehousing queries with 2GB of
> work_mem, it's going to be nasty having all that extra disk I/O and disk
> space use, not to mention the hold-up while the dump is written. If this is
> something we want to have people running in production "just in case" or to
> track down rare / hard to reproduce faults, that'll be a problem.

Potential problem, yes, but not a very big one I believe. Yes, if that
very big backend crashes there will be a big dump - but how else are
you going to find out what went wrong?

In the shared memory case, *all* dumps will be huge, not just those
that happen to use a lot of local memory.

> OTOH, we can't really go poking around in palloc contexts to decide what to
> dump.

No, that's way more complex stuff than we dare execute in a crash handler.

> I guess we could always do a small, minimalist minidump, then write
> _another_ dump that attempts to include select parts of shm and backend
> private memory.
>
> I just thought of two other things, too:
>
> - Is it possible for this handler to be called recursively if it fails
> during the handler call? If so, do we need to uninstall the handler before
> attempting a dump to avoid such recursion? I need to do some testing and dig
> around MSDN to find out more about this.

I'm pretty sure I read somewhere that it can't happen and we don't
need to do that, but I can't find any reference to it atm :-(

> - Can asynchronous events like signals (or their win32 emulation) interrupt
> an executing crash handler, or are they blocked before the crash handler is
> called? If they're not blocked, do we need to try to block them before
> attempting a dump? Again, I need to do some reading on this.

The exception handler itself is global and will run on the crashing
thread. I'm fairly certain it stops the other threads while running,
but again I can't find a reference to that right now. But ISTM it
would have to.

> Anyway, here's an example of the backtraces I'm currently getting. They're
> clearly missing some parameters (in shm? Unsure) but provide source
> file+line, argument values where resolvable, and the call stack its self.
> Locals are accessible at all levels of the stack when you go poking around
> in windbg.

Yeah, they're still very useful. Is that a release or debug build?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2010-12-19 12:01:48 Re: relaxing sync commit if no WAL written (was Re: unlogged tables)
Previous Message tv 2010-12-19 11:36:20 Re: keeping a timestamp of the last stats reset (for a db, table and function)