> Heikki Linnakangas wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>> wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>>
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information per-
>>> page anyway, like a relfilenode and block number that indicates
>>> which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself, you can put one in the double-write metadata. I've never seen that discussed before, but I'm embarrassed that it never occurred to me.

>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double
> write buffer that has a valid checksum, you copy to the main
> storage. There's no need to check validity of pages in the main
> storage.

Right. I'll recap my understanding of double-write (from memory -- if there's a material error or omission, I hope someone will correct me).

The write-ups I've seen on double-write techniques have all the writes go to the double-write buffer (a single, sequential file that stays around). This is done as sequential writing to a file which is overwritten pretty frequently, making the writes to a controller very fast, and a BBU write-back cache unlikely to actually write to disk very often. On good server-quality hardware, it should be blasting RAM-to-RAM very efficiently. The file is fsync'd (like I said, hopefully to BBU cache), then each page in the double-write buffer is written to the normal page location, and that is fsync'd.
Once that is done, the database writes have no risk of being torn, and the double-write buffer is marked as empty. This all happens at the point when you would be writing the page to the database, after the WAL-logging.

On crash recovery you read through the double-write buffer from the start and write the pages which look good (including a good checksum) to the database before replaying WAL. If you find a checksum error in processing the double-write buffer, you assume that you never got as far as the fsync of the double-write buffer, which means you never started writing the buffer contents to the database, which means there can't be any torn pages there. If you get to the end and fsync, you can be sure any torn pages from a previous attempt to write to the database itself have been overwritten with the good copy in the double-write buffer. Either way, you move on to WAL processing.

You wind up with a database free of torn pages before you apply WAL. full_page_writes to the WAL are not needed as long as double-write is used for any pages which would have been written to the WAL. If checksums were written to the double-buffer metadata instead of adding them to the page itself, this could be implemented alone. It would probably allow a modest speed improvement over using full_page_writes and would eliminate those full-page images from the WAL files, making them smaller.

If we do add a checksum to the page header, that could be used for testing for torn pages in the double-write buffer without needing a redundant calculation for double-write. With no torn pages in the actual database, checksum failures there would never be false positives. To get this right for a checksum in the page header, double-write would need to be used for all cases where full_page_writes now are used (i.e., the first write of a page after a checkpoint), and for all unlogged writes (e.g., hint-bit-only writes).
There would be no correctness problem for always using double-write, but it would be unnecessary overhead for other page writes, which I think we can avoid.

-Kevin

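The recap above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PostgreSQL code: the slot layout (relfilenode, block number and a CRC32 in a small header ahead of each 8K page image), the names dw_pack and dw_recover, and the choice to stop scanning at the first slot whose checksum fails are all hypothetical. Storage is modeled as a dict; a real implementation would write and fsync files.

```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical per-slot metadata: relfilenode, block number, CRC32 of the page.
HEADER = struct.Struct("<III")
SLOT_SIZE = HEADER.size + PAGE_SIZE


def dw_pack(relfilenode, blockno, page):
    """Build one double-write slot: metadata (with checksum) plus page image."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    return HEADER.pack(relfilenode, blockno, zlib.crc32(page)) + page


def dw_recover(dw_bytes, storage):
    """Scan the double-write buffer from the start before WAL replay.

    Every slot with a valid checksum is copied over the page in main
    storage, without inspecting the stored page. Scanning stops at the
    first short or invalid slot (an assumption of this sketch).
    Returns the number of pages restored.
    """
    restored = 0
    for offset in range(0, len(dw_bytes), SLOT_SIZE):
        slot = dw_bytes[offset:offset + SLOT_SIZE]
        if len(slot) < SLOT_SIZE:
            break  # truncated slot: the buffer write never completed
        relfilenode, blockno, crc = HEADER.unpack_from(slot)
        page = slot[HEADER.size:]
        if zlib.crc32(page) != crc:
            break  # torn slot in the double-write buffer itself
        storage[(relfilenode, blockno)] = page
        restored += 1
    return restored
```

Note that the checksum lives only in the slot header, so the page format itself is untouched, which is the point made about implementing this without page checksums.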
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> positives. To get this right for a checksum in the page header,
> double-write would need to be used for all cases where
> full_page_writes now are used (i.e., the first write of a page after
> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
> writes). There would be no correctness problem for always using
> double-write, but it would be unnecessary overhead for other page
> writes, which I think we can avoid.

Unless I'm missing something, double-writes are needed for all writes, not only the first page after a checkpoint. Consider this sequence of events:

1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
   ... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!

Now recovery comes along, page A is broken in the heap with no double-write buffer backup nor anything to recover it by in the WAL.

--
Ants Aasma

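The six steps can be replayed as a toy simulation. Everything here is a hypothetical model for illustration: pages are byte strings, the heap and the double-write buffer are dicts, a 512-byte sector is assumed to be the unit the OS writes atomically, and the WAL is assumed to hold no full-page image of page A. The flag release_dw_after_heap_sync is an invented knob that contrasts the scenario as described (buffer released at step 3) with a buffer that is kept around.

```python
PAGE_SIZE = 8192
SECTOR_SIZE = 512  # assumed atomic write unit; real hardware varies


def torn_write(old, new, sectors_written):
    """Return the on-disk page after a crash that persisted only the
    first `sectors_written` sectors of `new` over `old`."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


def intact_copy_after_crash(release_dw_after_heap_sync):
    """Walk through steps 1-6 and report whether any untorn copy of
    page A is still available to recovery."""
    v1 = b"\x01" * PAGE_SIZE  # page A as double-written in step 2
    v2 = b"\x02" * PAGE_SIZE  # page A as modified before step 4
    heap, dw_buffer = {}, {}

    # Step 1: checkpoint (nothing to model here).
    # Step 2: double-write of page A, then the heap write.
    dw_buffer["A"] = v1
    heap["A"] = v1
    # Step 3: heap is synced; the DW buffer may be released for reuse.
    if release_dw_after_heap_sync:
        dw_buffer.clear()
    # Steps 4-6: regular write of A, only partly persisted, then crash.
    heap["A"] = torn_write(v1, v2, sectors_written=3)

    heap_is_whole = heap["A"] in (v1, v2)
    return heap_is_whole or "A" in dw_buffer
```

With the buffer released the function reports that no whole copy of page A survives, which is the failure described above; with the buffer kept, the step 2 image is still there to start from.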
2011/12/30 Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>:
> On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> positives. To get this right for a checksum in the page header,
>> double-write would need to be used for all cases where
>> full_page_writes now are used (i.e., the first write of a page after
>> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
>> writes). There would be no correctness problem for always using
>> double-write, but it would be unnecessary overhead for other page
>> writes, which I think we can avoid.
>
> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
> ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

I guess the assumption is that the write in (4) is either backed by the WAL, or made safe by double writing. ISTM that such reasoning is only correct if the change that is expressed by the WAL record can be applied in the context of inconsistent (i.e., partially written) pages, which I assume is not the case (excuse my ignorance regarding such basic facts).

So I think you are right.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Heikki Linnakangas wrote:
>> On 28.12.2011 01:39, Simon Riggs wrote:
>>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>>> wrote:
>>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>>
>>>>> I don't believe that. Double-writing is a technique to avoid
>>>>> torn pages, but it requires a checksum to work. This chicken-
>>>>> and-egg problem requires the checksum to be implemented first.
>>>>
>>>> I don't think double-writes require checksums on the data pages
>>>> themselves, just on the copies in the double-write buffers. In
>>>> the double-write buffer, you'll need some extra information per-
>>>> page anyway, like a relfilenode and block number that indicates
>>>> which page it is in the buffer.
>
> You are clearly right -- if there is no checksum in the page itself,
> you can put one in the double-write metadata. I've never seen that
> discussed before, but I'm embarrassed that it never occurred to me.

Heikki's idea for double writes works well. It solves the problems of torn pages in a way that would make FPW redundant.

However, I don't see that it provides protection across non-crash write problems. We know we have these since many systems have run without a crash for years and yet still experience corrupt data.

Double writes do not require page checksums but neither do they replace page checksums.

So I think we need page checksums plus either FPWs or double writes.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> You wind up with a database free of torn pages before you apply WAL.
> full_page_writes to the WAL are not needed as long as double-write is
> used for any pages which would have been written to the WAL. If
> checksums were written to the double-buffer metadata instead of
> adding them to the page itself, this could be implemented alone. It
> would probably allow a modest speed improvement over using
> full_page_writes and would eliminate those full-page images from the
> WAL files, making them smaller.

Correct. So now lots of people seem to be jumping on the double-write bandwagon and looking at some of the things it promises: all writes are durable.

This solves 2 big issues:
- Remove torn-page problem
- Remove FPW from WAL

That up front looks pretty attractive. But we need to look at the tradeoffs, and then decide (benchmark, anyone?).

Remember, PostgreSQL is a double-write system right now. The 1st, checksummed write is the FPW in WAL. It's fsynced. And the 2nd synced write is when the file is synced during checkpoint.

So PostgreSQL currently has an optimization: not every write has a *requirement* for atomic, instant durability. And so PostgreSQL gets to do lots of writes to the OS cache and *not* request them to be instantly synced. And then at some point, when it's ready to clear the 1st checksummed write, it makes sure every write is synced. And lots of work went into PG recently to get even better at the collection of writes/syncs that happen at checkpoint time, to take even bigger advantage of the fact that it's better to write everything in a file first, then call a single sync.

So moving to this new double-write-area bandwagon, we move from a "WAL FPW synced at the commit, collect as many other writes, then final sync" type system to a system where *EVERY* write requires syncs of 2 separate 8K writes at buffer write-out time.

So we avoid the FPW at commit (yes, that's nice for latency), and we guarantee every buffer written is consistent (that keeps our hint-bit-only dirty writes from being torn). And we do that at a cost of every buffer write requiring 2 fsyncs, in a serial fashion. Come checkpoint, I'm wondering....

Again, all that to avoid a single "optimization" that PostgreSQL currently has:
1) writes for hint-bit-only buffers don't need to be durable

And the problem that optimization introduces:
1) Since they aren't guaranteed durable, we can't believe a checksum

--
Aidan Van Dyk                                   Create like a god,
aidan(at)highrise(dot)ca                       command like a king,
http://www.highrise.ca/                         work like a slave.

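The "2 fsyncs per buffer write" cost depends heavily on how many pages share one pass through the double-write area. A rough counting model, assuming (hypothetically) that one fsync covers the double-write file and one more covers the data files touched by the batch, looks like this. The function name and the batch_size parameter are invented for the sketch, and it counts fsync calls only; it says nothing about how long each one takes.

```python
import math


def fsyncs_for_double_write(pages, batch_size):
    """Count fsync calls needed to write `pages` buffers when
    `batch_size` pages go through the double-write area together:
    one fsync of the double-write file plus one fsync of the data
    files per batch (an assumption of this model)."""
    if pages < 0 or batch_size < 1:
        raise ValueError("pages must be >= 0 and batch_size >= 1")
    return 2 * math.ceil(pages / batch_size)
```

With batch_size set to 1 the count is twice the number of pages, which is the worst case described above; larger batches amortize the two syncs across many pages.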
On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
> ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

Isn't 3 the very definition of a checkpoint, meaning that 4 is not really a regular write as it is the first one after a checkpoint?

But it doesn't seem safe to me to replace a page from the DW buffer and then apply WAL to that replaced page which preceded the age of the page in the buffer.

Cheers,

Jeff

On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>> Unless I'm missing something, double-writes are needed for all writes,
>> not only the first page after a checkpoint. Consider this sequence of
>> events:
>>
>> 1. Checkpoint
>> 2. Double-write of page A (DW buffer write, sync, heap write)
>> 3. Sync of heap, releasing DW buffer for new writes.
>> ... some time goes by
>> 4. Regular write of page A
>> 5. OS writes one part of page A
>> 6. Crash!
>>
>> Now recovery comes along, page A is broken in the heap with no
>> double-write buffer backup nor anything to recover it by in the WAL.
>
> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
> really a regular write as it is the first one after a checkpoint?

I think you nailed it.

> But it doesn't seem safe to me to replace a page from the DW buffer and
> then apply WAL to that replaced page which preceded the age of the
> page in the buffer.

That's what LSNs are for.

If we write the page to the checkpoint buffer just once per checkpoint, recovery can restore the double-written versions of the pages and then begin WAL replay, which will restore all the subsequent changes made to the page. Recovery may also need to do additional double-writes if it encounters pages for which we wrote WAL but never flushed the buffer, because a crash during recovery can also create torn pages. When we reach a restartpoint, we fsync everything down to disk and then nuke the double-write buffer. Similarly, in normal running, we can nuke the double-write buffer at checkpoint time, once the fsyncs are complete.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

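The role of the LSN in that recovery order can be shown with a small model. This is a sketch under simplifying assumptions, not how PostgreSQL represents pages or WAL: a page is a Page object holding an LSN and a list of rows, a WAL record is a hypothetical (lsn, page_key, row) tuple that appends one row, and the WAL list is assumed to be in LSN order. The name recover is invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                      # LSN of the last change applied
    rows: list = field(default_factory=list)


def recover(heap, dw_images, wal):
    """Restore double-written page images, then replay WAL.

    A WAL record is applied only if its LSN is newer than the page's
    LSN, so records older than the restored image are skipped.
    """
    # Step 1: overwrite possibly torn heap pages with the intact images.
    for key, image in dw_images.items():
        heap[key] = Page(image.lsn, list(image.rows))

    # Step 2: replay WAL; the LSN check keeps old records from being
    # applied on top of a page image that already contains them.
    for lsn, key, row in wal:
        page = heap.setdefault(key, Page())
        if lsn <= page.lsn:
            continue
        page.rows.append(row)
        page.lsn = lsn
    return heap
```

For example, with a torn heap page, a double-written image at LSN 10 and WAL records at LSNs 10, 12 and 15, the record at LSN 10 is skipped and the other two are applied on top of the restored image.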
On Wed, Jan 4, 2012 at 3:49 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>>> Unless I'm missing something, double-writes are needed for all writes,
>>> not only the first page after a checkpoint. Consider this sequence of
>>> events:
>>>
>>> 1. Checkpoint
>>> 2. Double-write of page A (DW buffer write, sync, heap write)
>>> 3. Sync of heap, releasing DW buffer for new writes.
>>> ... some time goes by
>>> 4. Regular write of page A
>>> 5. OS writes one part of page A
>>> 6. Crash!
>>>
>>> Now recovery comes along, page A is broken in the heap with no
>>> double-write buffer backup nor anything to recover it by in the WAL.
>>
>> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
>> really a regular write as it is the first one after a checkpoint?
>
> I think you nailed it.

No, I should have explicitly stated that no checkpoint happens in between. I think the confusion here is because I assumed Kevin described a fixed size d-w buffer in this message:

On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> ... The file is fsync'd (like I said,
> hopefully to BBU cache), then each page in the double-write buffer is
> written to the normal page location, and that is fsync'd. Once that
> is done, the database writes have no risk of being torn, and the
> double-write buffer is marked as empty. ...

If the double-write buffer survives until the next checkpoint, double-writing only the first write should work just fine. The advantage over current full-page writes is that the write is not into the WAL stream and is done (hopefully) by the bgwriter/checkpointer in the background.

--
Ants Aasma

On 12/30/11 9:44 AM, Aidan Van Dyk wrote:
> So moving to this new double-write-area bandwagon, we move from a "WAL
> FPW synced at the commit, collect as many other writes, then final
> sync" type system to a system where *EVERY* write requires syncs of 2
> separate 8K writes at buffer write-out time.

It's not quite that bad. The double-write area is going to be a small chunk of re-used sequential I/O, like the current WAL. And if this approach shifts some of the full-page writes out of the WAL and toward the new area instead, that's not a real doubling either. Could probably put both on the same disk, and in situations where you don't have a battery-backed write cache it's possible to get a write to both per rotation.

This idea has been tested pretty extensively as part of MySQL's InnoDB engine. Results there suggest the overhead is in the 5% to 30% range; some examples mentioning both extremes of that:

http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
http://www.bigdbahead.com/?p=392

Makes me wish I knew off the top of my head how expensive WAL logging hint bits would be, for comparison's sake.

--
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com