Re: Re: Proposed Windows-specific change: Enable crash dumps (like core files)

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Proposed Windows-specific change: Enable crash dumps (like core files)
Date: 2010-12-19 06:26:08
Message-ID: 4D0DA580.1000009@postnewspapers.com.au
Lists: pgsql-hackers

On 18/12/2010 1:13 AM, Magnus Hagander wrote:
> On Fri, Dec 17, 2010 at 17:42, Magnus Hagander<magnus(at)hagander(dot)net> wrote:
>> On Fri, Dec 17, 2010 at 17:24, Craig Ringer<craig(at)postnewspapers(dot)com(dot)au> wrote:
>>> On 17/12/2010 7:17 PM, Magnus Hagander wrote:
>> Now, that's annoying. So clearly we can't use that function to
>> determine which version we're on. Seems it only works for "image help
>> api", and not the general thing.
>>
>> According to http://msdn.microsoft.com/en-us/library/ms679294(v=vs.85).aspx,
>> we could look for:
>>
>> SysEnumLines - if present, we have at least 6.1.
>>
>> However, I don't see any function that appeared in 6.0 only..
>
> Actually, I'm wrong - there are functions enough to determine the
> version. So here's a patch that tries that.

Great. I pulled the latest from your git tree, tested that, and got much
better results. Crashdump size is back to what I expected. In my test
code, fcinfo->args and fcinfo->argnull can be examined without problems.
Backtraces look good; see below. It seems to be including backend
private memory again now. Thanks _very_ much for your work on this.

fcinfo->flinfo is still inaccessible, but I suspect it's in shared
memory, as it's at 0x00000135. Ditto fcinfo->resultinfo and
fcinfo->context.

This has me wondering - is it going to be necessary to dump shared
memory to make many backtraces useful? I just responded to Tom
mentioning that the patch doesn't currently dump shared memory, but I
hadn't realized the extent to which it's used for _lots_ more than just
disk buffers. I'm not sure how to handle dumping shared_buffers when
someone might be using multi-gigabyte shared_buffers, though. Dumping
the whole lot would risk sudden out-of-disk-space issues, slowdowns as
dumps are written, and the backend being "frozen" as it's being dumped
could delay the system coming back up again. Trying to selectively dump
critical parts could cause dumps to fail if the system is in early
startup or a bad state.

The same concern applies to writing backend private memory; it's fine
most of the time, but if you're doing data warehousing queries with 2GB
of work_mem, it's going to be nasty having all that extra disk I/O and
disk space use, not to mention the hold-up while the dump is written. If
this is something we want to have people running in production "just in
case" or to track down rare / hard to reproduce faults, that'll be a
problem.

OTOH, we can't really go poking around in palloc contexts to decide what
to dump.

I guess we could always do a small, minimalist minidump, then write
_another_ dump that attempts to include select parts of shm and backend
private memory.
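
A rough sketch of that two-pass idea, just to make the ordering concrete.
Everything here is hypothetical (the DumpPass/DumpPlan names, the size
estimate, and the byte budget are assumptions, not anything in the patch):
the small dump is always planned first, and the larger one is attempted only
if the estimated private memory fits within a configured budget.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this comes from the patch. */
typedef enum DumpPass
{
    DUMP_PASS_MINIMAL,      /* stacks, registers, module list only */
    DUMP_PASS_EXTENDED      /* additionally selected private memory */
} DumpPass;

typedef struct DumpPlan
{
    int         npasses;    /* number of valid entries in passes[] */
    DumpPass    passes[2];
} DumpPlan;

/*
 * Plan the dumps to write for a crashing backend.
 *
 * The minimal pass always comes first so that something usable is on disk
 * even if the larger dump fails or is skipped. The extended pass is added
 * only when the (assumed, caller-supplied) size estimate fits the budget.
 */
DumpPlan
plan_crash_dumps(uint64_t est_private_bytes, uint64_t budget_bytes)
{
    DumpPlan    plan;

    plan.npasses = 0;
    plan.passes[0] = DUMP_PASS_MINIMAL;
    plan.passes[1] = DUMP_PASS_MINIMAL;

    plan.passes[plan.npasses++] = DUMP_PASS_MINIMAL;
    if (est_private_bytes <= budget_bytes)
        plan.passes[plan.npasses++] = DUMP_PASS_EXTENDED;

    assert(plan.npasses >= 1 && plan.npasses <= 2);
    return plan;
}
```

With a 512MB budget, a backend holding 64MB of private memory would get both
passes, while one holding 2GB of work_mem would get only the minimal dump.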

I just thought of two other things, too:

- Is it possible for this handler to be called recursively if it fails
during the handler call? If so, do we need to uninstall the handler
before attempting a dump to avoid such recursion? I need to do some
testing and dig around MSDN to find out more about this.

- Can asynchronous events like signals (or their win32 emulation)
interrupt an executing crash handler, or are they blocked before the
crash handler is called? If they're not blocked, do we need to try to
block them before attempting a dump? Again, I need to do some reading on
this.
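
Whatever the answer on recursion turns out to be, a defensive guard is cheap.
A minimal sketch using a C11 atomic flag follows; the function and variable
names are hypothetical, and this does not settle whether Windows actually
re-invokes the handler. It only ensures a second entry would skip the dump.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guard; not part of the patch under discussion. */
static atomic_flag crash_dump_in_progress = ATOMIC_FLAG_INIT;

/*
 * Returns true for the first caller only.
 *
 * A crash handler would call this on entry and skip the dump attempt when
 * it returns false, so a fault raised while writing the dump cannot start
 * a second, nested dump.
 */
bool
crash_dump_try_enter(void)
{
    /* test_and_set returns the previous value: false means we are first */
    return !atomic_flag_test_and_set(&crash_dump_in_progress);
}
```

The flag is never cleared on purpose: once a backend has crashed, one dump
attempt is all it should get.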

Anyway, here's an example of the backtraces I'm currently getting.
They're clearly missing some parameters (in shm? Unsure) but provide
source file+line, argument values where resolvable, and the call stack
itself. Locals are accessible at all levels of the stack when you go
poking around in windbg.

> This dump file has an exception of interest stored in it.
> The stored exception information can be accessed via .ecxr.
> (930.12e8): Access violation - code c0000005 (first/second chance not available)
> eax=00bce2c0 ebx=72d0e800 ecx=000002e4 edx=72cb81c8 esi=000000f0 edi=00000930
> eip=771464f4 esp=00bce294 ebp=00bce2a4 iopl=0 nv up ei pl zr na pe nc
> cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
> ntdll!KiFastSystemCallRet:
> 771464f4 c3 ret
> 0:000> .ecxr
> eax=00000000 ebx=00000000 ecx=015fd7d8 edx=7362100f esi=015fd7c8 edi=015fd804
> eip=73621052 esp=00bcf284 ebp=015fd7c8 iopl=0 nv up ei pl zr na pe nc
> cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
> crashme!crashdump_crashme+0x2:
> 73621052 c70001000000 mov dword ptr [eax],1 ds:0023:00000000=????????
> 0:000> kp
> *** Stack trace for last set context - .thread/.cxr resets it
> ChildEBP RetAddr
> 00bcf280 0031c797 crashme!crashdump_crashme(struct FunctionCallInfoData * fcinfo = 0x015e3318)+0x2 [c:\users\craig\developer\postgres\contrib\crashme\crashme.c @ 14]
> 00bcf2e4 0031c804 postgres!ExecMakeFunctionResult(struct FuncExprState * fcache = 0x015e3318, struct ExprContext * econtext = 0x00319410, char * isNull = 0x00000000 "", ExprDoneCond * isDone = 0x7362100f)+0x427 [c:\users\craig\developer\postgres\src\backend\executor\execqual.c @ 1824]
> 00bcf30c 0031b760 postgres!ExecEvalFunc(struct FuncExprState * fcache = 0x00000000, struct ExprContext * econtext = 0x00000000, char * isNull = 0x00000000 "", ExprDoneCond * isDone = 0x00000000)+0x34 [c:\users\craig\developer\postgres\src\backend\executor\execqual.c @ 2260]
> 00bcf338 0031ba83 postgres!ExecTargetList(struct List * targetlist = 0x00000000, struct ExprContext * econtext = 0x00000000, unsigned int * values = 0x00000000, char * isnull = 0x00000000 "", ExprDoneCond * itemIsDone = 0x00000000, ExprDoneCond * isDone = 0x00000000)+0x70 [c:\users\craig\developer\postgres\src\backend\executor\execqual.c @ 5095]
> 00bcf378 0032f074 postgres!ExecProject(struct ProjectionInfo * projInfo = 0x00000000, ExprDoneCond * isDone = 0x00000000)+0x173 [c:\users\craig\developer\postgres\src\backend\executor\execqual.c @ 5312]
> 00bcf38c 00317e07 postgres!ExecResult(struct ResultState * node = <Memory access error>)+0x94 [c:\users\craig\developer\postgres\src\backend\executor\noderesult.c @ 157]
> 00bcf39c 00315ccd postgres!ExecProcNode(struct PlanState * node = <Memory access error>)+0x67 [c:\users\craig\developer\postgres\src\backend\executor\execprocnode.c @ 361]
> 00bcf3b0 00316ace postgres!ExecutePlan(struct EState * estate = 0x015fd7c8, struct PlanState * planstate = <Memory access error>, CmdType operation = <Memory access error>, char sendTuples = <Memory access error>, long numberTuples = <Memory access error>, ScanDirection direction = NoMovementScanDirection (0n0), struct _DestReceiver * dest = <Memory access error>)+0x2d [c:\users\craig\developer\postgres\src\backend\executor\execmain.c @ 1236]
> 00bcf3e0 0041ec5d postgres!standard_ExecutorRun(struct QueryDesc * queryDesc = <Memory access error>, ScanDirection direction = <Memory access error>, long count = <Memory access error>)+0x8e [c:\users\craig\developer\postgres\src\backend\executor\execmain.c @ 288]
> 00bcf404 0041f270 postgres!PortalRunSelect(struct PortalData * portal = 0x00000000, char forward = <Memory access error>, long count = <Memory access error>, struct _DestReceiver * dest = <Memory access error>)+0x6d [c:\users\craig\developer\postgres\src\backend\tcop\pquery.c @ 953]
> 00bcf48c 0041c292 postgres!PortalRun(struct PortalData * portal = 0x015fb5b8, long count = 0n2147483647, char isTopLevel = 0n1 '', struct _DestReceiver * dest = 0x015e3418, struct _DestReceiver * altdest = 0x015e3418, char * completionTag = 0x00bcf500 "")+0x190 [c:\users\craig\developer\postgres\src\backend\tcop\pquery.c @ 803]
> 00bcf540 0041cbc5 postgres!exec_simple_query(char * query_string = 0x015fd7d8 "???")+0x3a2 [c:\users\craig\developer\postgres\src\backend\tcop\postgres.c @ 1067]
> 00bcf5c4 003e2bdc postgres!PostgresMain(int argc = 0n2, char ** argv = 0x01555138, char * username = 0x00d484a0 "Craig")+0x575 [c:\users\craig\developer\postgres\src\backend\tcop\postgres.c @ 3935]
> 00bcf5e4 003e58a9 postgres!BackendRun(struct Port * port = 0x00000000)+0x19c [c:\users\craig\developer\postgres\src\backend\postmaster\postmaster.c @ 3562]
> 00bcf788 003475bc postgres!SubPostmasterMain(int argc = 0n13900471, char ** argv = 0x00d41ac5)+0x2f9 [c:\users\craig\developer\postgres\src\backend\postmaster\postmaster.c @ 4058]
> 00bcf7a0 0051845d postgres!main(int argc = 0n1990922644, char ** argv = 0x7ffdf000)+0x1ec [c:\users\craig\developer\postgres\src\backend\main\main.c @ 173]
> 00bcf7e4 76ab1194 postgres!__tmainCRTStartup(void)+0x10f [f:\dd\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 586]
> 00bcf7f0 7715b495 kernel32!BaseThreadInitThunk+0xe
> 00bcf830 7715b468 ntdll!__RtlUserThreadStart+0x70
> 00bcf848 00000000 ntdll!_RtlUserThreadStart+0x1b

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
