Re: [PATCHES] Full page writes improvement, code update

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Koichi Suzuki" <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [PATCHES] Full page writes improvement, code update
Date: 2007-03-28 09:07:27
Message-ID: 1175072848.4386.323.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:

> As written below, full page write can be
> categolized as follows:
>
> 1) Needed for crash recovery: first page update after each checkpoint.
> This has to be kept in WAL.
>
> 2) Needed for archive recovery: page update between pg_start_backup and
> pg_stop_backup. This has to be kept in archive log.
>
> 3) For log-shipping slave such as pg_standby: no full page writes will
> be needed for this purpose.
>
> My proposal deals with 2). So, if we mark each "full_page_write", I'd
> rather mark when this is needed. Still need only one bit because the
> case 3) does not need any mark.

I'm very happy with this proposal, though I do still have some points in
detailed areas.

If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
also valid goals, ISTM. i.e. you might choose to use full_page_writes on
the primary and yet would like to see optimised data transfer to the
standby server. In that case, you would need the mark.

> > - Not sure why we need "full_page_compress", why not just mark them
> > always? That harms noone. (Did someone else ask for that? If so, keep
> > it)
>
> No, no one asked to have a separate option. There'll be no bad
> influence to do so. So, if we mark each "full_page_write", I'd
> rather mark when this is needed. Still need only one bit because the
> case 3) does not need any mark.

OK, different question:
Why would anyone ever set full_page_compress = off?

Why have a parameter that does so little? ISTM this is:

i) one more thing to get wrong

ii) cheaper to mark the block when appropriate than to perform the if()
test each time. That can be done only in the path where backup blocks
are present.

iii) If we mark the blocks every time, it allows us to do an offline WAL
compression. If the blocks aren't marked that option is lost. The bit is
useful information, so we should have it in all cases.

> > - OTOH I'd like to see an explicit parameter set during recovery since
> > you're asking the main recovery path to act differently in case a single
> > bit is set/unset. If you are using that form of recovery, we should say
> > so explicitly, to keep everybody else safe.
>
> Only one thing I had to do is to create "dummy" full page write to
> maintain LSNs. Full page writes are omitted in archive log. We have to
> LSNs same as those in the original WAL. In this case, recovery has to
> read logical log, not "dummy" full page writes. On the other hand, if
> both logical log and "real" full page writes are found in a log record,
> the recovery has to use "real" full page writes.

I apologise for not understanding your reply, perhaps my original
request was unclear.

In recovery.conf, I'd like to see a parameter such as

dummy_backup_blocks = off (default) | on

to explicitly indicate to the recovery process that backup blocks are
present, yet they are garbage and should be ignored. Having garbage data
within the system is potentially dangerous and I want to be told by the
user that they were expecting that and its OK to ignore that data.
Otherwise I want to throw informative errors. Maybe it seems OK now, but
the next change to the system may have unintended consequences and it
may not be us making the change. "It's OK the Alien will never escape
from the lab" is the starting premise for many good sci-fi horrors and I
want to watch them, not be in one myself. :-)

We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

> Yes I believe so. As pg_standby does not include any chance to meet
> partial writes of pages, I believe you can omit all the full page
> writes. Of course, as Tom Lange suggested in
> http://archives.postgresql.org/pgsql-hackers/2007-02/msg00034.php
> removing full page writes can lose a chance to recover from
> partial/inconsisitent writes in the crash of pg_standby. In this case,
> we have to import a backup and archive logs (with full page writes
> during the backup) to recover. (We have to import them when the file
> system crashes anyway). If it's okay, I believe
> pg_compresslog/pg_decompresslog can be integrated with pg_standby.
>
> Maybe we can work together to include pg_compresslog/pg_decompresslog in
> pg_standby.

ISTM there are two options.

I think this option is already possible:

1. Allow pg_decompresslog to operate on a file, replacing it with the
expanded form, like gunzip, so we would do this:
restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog
decomp.tmp %p'

though the decomp.tmp file would not get properly initialised or cleaned
up when we finish.

whereas this will take additional work

2. Allow pg_standby to write to stdin, so that we can do this:
restore_command = 'pg_standby %f | pg_decompresslog - %p'

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2007-03-28 10:04:44 Re: Concurrent connections in psql
Previous Message Magnus Hagander 2007-03-28 08:36:54 Re: O_DIRECT support for Windows

Browse pgsql-patches by date

  From Date Subject
Next Message Heikki Linnakangas 2007-03-28 09:09:40 Re: [PATCH] add CLUSTER table ORDER BY index
Previous Message Heikki Linnakangas 2007-03-28 09:00:15 Re: LIMIT/SORT optimization