Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date: 2021-09-03 14:37:59
Message-ID: CA+TgmoaMHFaOrVO-Ejrt2ce8K=yCUW0vw6hSjPEv6f2wCKU9Vg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 3, 2021 at 6:23 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> + /* Built-in oids are mapped directly */
>> + if (classForm->oid < FirstGenbkiObjectId)
>> + relfilenode = classForm->oid;
>> + else if (OidIsValid(classForm->relfilenode))
>> + relfilenode = classForm->relfilenode;
>> + else
>> + continue;
>>
>> Am I missing something, or is this totally busted?
>
> Oops, I think the condition should be like below, but I will think carefully before posting the next version if there is something else I am missing.
>
> if (OidIsValid(classForm->relfilenode))
>     relfilenode = classForm->relfilenode;
> else if (classForm->oid < FirstGenbkiObjectId)
>     relfilenode = classForm->oid;
> else
>     continue;

What about mapped rels that have been rewritten at some point?
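
To spell that concern out, here is a toy model in plain C. It is not PostgreSQL source, and every Toy/toy_ name in it is made up for illustration. It assumes what I believe to be the case: a mapped relation stores 0 in pg_class.relfilenode, its real file is tracked in the relation map, and a rewrite (VACUUM FULL, say) assigns it a new filenode there, so the "filenode equals OID" fallback stops being true.

```c
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Toy stand-in for a pg_class row; field names mirror the quoted patch. */
typedef struct ToyClassRow
{
    Oid oid;
    Oid relfilenode;    /* InvalidOid (0) for a mapped relation */
} ToyClassRow;

/* Toy stand-in for one relation map entry (hypothetical layout). */
typedef struct ToyMapEntry
{
    Oid oid;
    Oid filenode;       /* current on-disk file, changes on rewrite */
} ToyMapEntry;

/* Resolve the file via the map when pg_class does not record it. */
static Oid
toy_resolve_filenode(const ToyClassRow *row, const ToyMapEntry *map, int nmap)
{
    if (row->relfilenode != InvalidOid)
        return row->relfilenode;
    for (int i = 0; i < nmap; i++)
        if (map[i].oid == row->oid)
            return map[i].filenode;
    return InvalidOid;  /* not mapped and no filenode: caller should skip */
}

/* The fallback from the quoted snippet: assume filenode == oid. */
static Oid
toy_guess_filenode(const ToyClassRow *row)
{
    return (row->relfilenode != InvalidOid) ? row->relfilenode : row->oid;
}
```

For a row {1259, InvalidOid} whose map entry was rewritten to the arbitrary filenode 16500, toy_resolve_filenode() returns 16500 while toy_guess_filenode() returns 1259, which would point at a file that no longer holds the relation.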

> Agreed to all, but in general, I think WAL hitting the disk before data is more applicable to shared buffers, where we want to flush the WAL before writing the shared buffer so that in case of a torn page we have the option to recover the page from the previous FPI. But in cases where we are creating a directory or file there is no such requirement. Anyway, I agree with the comments that it should be more uniform and that the comment should be correct.

There have been previous debates about whether WAL records for
filesystem operations should be issued before or after those
operations are performed. I'm not sure how easy those discussions are
to find in the archives, but they're very relevant here. I think the
short version is: if we write a WAL record first and then the
operation fails afterward, we have to PANIC. But if we perform the
operation first and then write the WAL record if it succeeds, then we
could crash before writing WAL and end up out of sync with our
standbys. If we then later do any WAL-logged operation locally that
depends on that operation having been performed, replay will fail on
the standby. There used to be, or maybe still are, comments in the
code defending the latter approach, but more recently it's been
strongly criticized. The thinking, AIUI, is basically that filesystem
operations really ought not to fail, because nobody should be doing
weird things to the data directory, and if they do, panicking is OK.
But having replay fail in strange ways on the standby later is not OK.

I'm not sure if everyone agrees with that logic; it seems somewhat
debatable. I *think* I personally agree with it but ... I'm not even
100% sure about that.
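
For what it's worth, the trade-off between the two orderings can be written down as a toy state model. This is only a sketch in plain C, not anything from the tree: all names are hypothetical, the standby is assumed to replay exactly what is in WAL, crash recovery is assumed to replay a logged record on the primary, and the "plain error" outcome for a failed unlogged operation is my inference rather than something stated above.

```c
#include <stdbool.h>

typedef enum { TOY_WAL_FIRST, TOY_OP_FIRST } ToyOrdering;

typedef enum
{
    TOY_IN_SYNC,            /* primary and standby agree */
    TOY_PLAIN_ERROR,        /* nothing logged, nothing done: ordinary failure */
    TOY_PRIMARY_PANIC,      /* WAL claims an operation that cannot be done */
    TOY_STANDBY_DIVERGED    /* primary has the file, WAL never mentions it */
} ToyOutcome;

/*
 * op_fails:      the filesystem operation itself fails.
 * crash_between: the primary crashes between the two steps.
 */
static ToyOutcome
toy_create_dir(ToyOrdering ordering, bool op_fails, bool crash_between)
{
    bool primary_has_dir = false;
    bool wal_has_record = false;    /* also what the standby will see */

    if (ordering == TOY_WAL_FIRST)
    {
        wal_has_record = true;
        /* Whether run directly or by crash recovery, the op must succeed. */
        if (op_fails)
            return TOY_PRIMARY_PANIC;
        primary_has_dir = true;
    }
    else
    {
        if (op_fails)
            return TOY_PLAIN_ERROR;
        primary_has_dir = true;
        if (!crash_between)
            wal_has_record = true;
    }

    /* Later WAL that depends on the directory cannot replay without it. */
    if (primary_has_dir && !wal_has_record)
        return TOY_STANDBY_DIVERGED;
    return TOY_IN_SYNC;
}
```

In this model the only way to reach TOY_STANDBY_DIVERGED is TOY_OP_FIRST with a crash between the steps, and the only way to reach TOY_PRIMARY_PANIC is TOY_WAL_FIRST with a failing operation, which is the whole disagreement in two lines.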

--
Robert Haas
EDB: http://www.enterprisedb.com
