Skip site navigation (1) Skip section navigation (2)

Re: [PATCHES] Full page writes improvement, code update

From: Koichi Suzuki <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] Full page writes improvement, code update
Date: 2007-03-29 08:50:03
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-hackerspgsql-patches
Hi, Here're some feedback to the comment:

Simon Riggs wrote:
> On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:
>> As written below, full page write can be
>> categolized as follows:
>> 1) Needed for crash recovery: first page update after each checkpoint.
>> This has to be kept in WAL.
>> 2) Needed for archive recovery: page update between pg_start_backup and
>> pg_stop_backup. This has to be kept in archive log.
>> 3) For log-shipping slave such as pg_standby: no full page writes will
>> be needed for this purpose.
>> My proposal deals with 2). So, if we mark each "full_page_write", I'd
>> rather mark when this is needed. Still need only one bit because the
>> case 3) does not need any mark.
> I'm very happy with this proposal, though I do still have some points in
> detailed areas.
> If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
> also valid goals, ISTM. i.e. you might choose to use full_page_writes on
> the primary and yet would like to see optimised data transfer to the
> standby server. In that case, you would need the mark.

Yes, I need the mark.  In my proposal, only unmarked full-page-writes, 
which were written as the first update after a checkpoint, are to be 
removed offline (pg_compresslog).

>>> - Not sure why we need "full_page_compress", why not just mark them
>>> always? That harms noone. (Did someone else ask for that? If so, keep
>>> it)
>> No, no one asked to have a separate option. There'll be no bad
>> influence to do so.  So, if we mark each "full_page_write", I'd
>> rather mark when this is needed. Still need only one bit because the
>> case 3) does not need any mark.
> OK, different question: 
> Why would anyone ever set full_page_compress = off? 
> Why have a parameter that does so little? ISTM this is:
> i) one more thing to get wrong
> ii) cheaper to mark the block when appropriate than to perform the if()
> test each time. That can be done only in the path where backup blocks
> are present.
> iii) If we mark the blocks every time, it allows us to do an offline WAL
> compression. If the blocks aren't marked that option is lost. The bit is
> useful information, so we should have it in all cases.

Not only full-page-writes are written as WAL record.   In my proposal, 
both full-page-writes and logical log are written in a WAL record, which 
will make WAL size slightly bigger (five percent or so).   If 
full_page_compress = off, only a full-page-write will be written in a 
WAL record.   I thought someone will not be happy with this size growth.

I agree to make this mandatory if every body is happy with extra logical 
  log in WAL records with full page writes.

I'd like to have your opinion.

>>> - OTOH I'd like to see an explicit parameter set during recovery since
>>> you're asking the main recovery path to act differently in case a single
>>> bit is set/unset. If you are using that form of recovery, we should say
>>> so explicitly, to keep everybody else safe.
>> Only one thing I had to do is to create "dummy" full page write to
>> maintain LSNs. Full page writes are omitted in archive log. We have to
>> LSNs same as those in the original WAL. In this case, recovery has to
>> read logical log, not "dummy" full page writes. On the other hand, if
>> both logical log and "real" full page writes are found in a log record,
>> the recovery has to use "real" full page writes.
> I apologise for not understanding your reply, perhaps my original
> request was unclear.
> In recovery.conf, I'd like to see a parameter such as
> dummy_backup_blocks = off (default) | on
> to explicitly indicate to the recovery process that backup blocks are
> present, yet they are garbage and should be ignored. Having garbage data
> within the system is potentially dangerous and I want to be told by the
> user that they were expecting that and its OK to ignore that data.
> Otherwise I want to throw informative errors. Maybe it seems OK now, but
> the next change to the system may have unintended consequences and it
> may not be us making the change. "It's OK the Alien will never escape
> from the lab" is the starting premise for many good sci-fi horrors and I
> want to watch them, not be in one myself. :-)
> We can call it other things, of course. e.g.
> ignore_dummy_blocks
> decompressed_blocks
> apply_backup_blocks

So far, we don't need any modification to the recovery and redo 
functions.   They ignore the dummy and apply logical logs.   Also, if 
there are both full page writes and logical log, current recovery 
selects full page writes to apply.

I agree to introduce this option if 8.3 code introduces any conflict to 
the current.   Or, we could introduce this option for future safety.  Do 
you think we should introduce this option?

If this should be introduced now, what we should do is to check this 
option when dummy full-page-write appears.

>> Yes I believe so. As pg_standby does not include any chance to meet
>> partial writes of pages, I believe you can omit all the full page
>> writes. Of course, as Tom Lange suggested in
>> removing full page writes can lose a chance to recover from
>> partial/inconsisitent writes in the crash of pg_standby. In this case,
>> we have to import a backup and archive logs (with full page writes
>> during the backup) to recover. (We have to import them when the file
>> system crashes anyway). If it's okay, I believe
>> pg_compresslog/pg_decompresslog can be integrated with pg_standby.
>> Maybe we can work together to include pg_compresslog/pg_decompresslog in 
>> pg_standby.
> ISTM there are two options.
> I think this option is already possible:
> 1. Allow pg_decompresslog to operate on a file, replacing it with the
> expanded form, like gunzip, so we would do this:
>   restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog
> decomp.tmp %p'
> though the decomp.tmp file would not get properly initialised or cleaned
> up when we finish.
> whereas this will take additional work
> 2. Allow pg_standby to write to stdin, so that we can do this:
>   restore_command = 'pg_standby %f | pg_decompresslog - %p'

Both seem to work fine.   pg_decompresslog will read entire file at the 
beginning and so both two will be equivallent.   To make the second 
option run quicker, pg_decompresslog needs some modification to read WAL 
record one after another.

Anyway, could you try to run pg_standby with pg_compresslog and 

Additional recomment on page header removal:

I found that it is not simple to keep page header in the compressed 
archive log.   Because we eliminate unmarked full page writes and shift 
the rest of the WAL file data, it is not simple to keep page header as 
the page header in the compressed archive log.   It is much simpler to 
remove page header as well and rebuild them.   I'd like to keep current 
implementation in this point.

Any suggestions are welcome.

I'll modify the name of the commands and post it hopefully within 20hours.

With Best Regards;

Koichi Suzuki

In response to


pgsql-hackers by date

Next:From: Zeugswetter Andreas ADI SDDate: 2007-03-29 09:06:14
Subject: Re: Patch queue concern
Previous:From: Tom LaneDate: 2007-03-29 05:34:04
Subject: Re: Patch queue concern

pgsql-patches by date

Next:From: Michael MeskesDate: 2007-03-29 09:18:02
Subject: Re: ecpg threading vs win32
Previous:From: Heikki LinnakangasDate: 2007-03-29 08:44:26
Subject: Re: Small addition to PGInterval

Privacy Policy | About PostgreSQL
Copyright © 1996-2018 The PostgreSQL Global Development Group