Re: Race-condition with failed block-write?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen van der Meijden <acm(at)tweakers(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Race-condition with failed block-write?
Date: 2005-09-12 22:26:07
Message-ID: 6160.1126563967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-bugs

Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
> It had taken over 800MB of memory. The machine is a P4 3.0 Ghz (with HT
> enabled), 1GB of memory a few SATA disks and runs with a recent Gentoo +
> 2.6.12.5 kernel.
> The postgresql version that failed was 8.0.3, it may or may not be worth
> knowing that it runs a 8.1devel on another port and from another directory.

> In the postgresql.log a write-failure messages was repeated enough to
> make the log file 50MB larger:

I was able to replicate the memory leak in 8.0 by artificially poking a
bad value into the LSN field of a dirty page. CVS tip did not seem to
leak anything under the same conditions --- probably the difference is
that BufferSync() uses palloc'd temporary space in 8.0 but not in HEAD.
Nonetheless, we ought to have a positive defense against such leaks in
future, so I've applied the attached patch to both branches.

There is still the question of how the LSN value got corrupted, but
we've probably missed any chance of finding that out now.

regards, tom lane

Index: bgwriter.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.19
diff -c -r1.19 bgwriter.c
*** bgwriter.c 20 Aug 2005 23:26:17 -0000 1.19
--- bgwriter.c 12 Sep 2005 21:42:48 -0000
***************
*** 160,165 ****
--- 160,166 ----
BackgroundWriterMain(void)
{
sigjmp_buf local_sigjmp_buf;
+ MemoryContext bgwriter_context;

Assert(BgWriterShmem != NULL);
BgWriterShmem->bgwriter_pid = MyProcPid;
***************
*** 208,213 ****
--- 209,227 ----
last_checkpoint_time = time(NULL);

/*
+ * Create a memory context that we will do all our work in. We do this
+ * so that we can reset the context during error recovery and thereby
+ * avoid possible memory leaks. Formerly this code just ran in
+ * TopMemoryContext, but resetting that would be a really bad idea.
+ */
+ bgwriter_context = AllocSetContextCreate(TopMemoryContext,
+ "Background Writer",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ MemoryContextSwitchTo(bgwriter_context);
+
+ /*
* If an exception is encountered, processing resumes here.
*
* See notes in postgres.c about the design of this coding.
***************
*** 247,255 ****
* Now return to normal top-level context and clear ErrorContext
* for next time.
*/
! MemoryContextSwitchTo(TopMemoryContext);
FlushErrorState();

/* Now we can allow interrupts again */
RESUME_INTERRUPTS();

--- 261,272 ----
* Now return to normal top-level context and clear ErrorContext
* for next time.
*/
! MemoryContextSwitchTo(bgwriter_context);
FlushErrorState();

+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(bgwriter_context);
+
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alexei 2005-09-12 23:44:10 BUG #1878: Different execution plans for the same query.
Previous Message Oliver Jowett 2005-09-12 22:06:25 Re: Strange behavoir of batches