Re: [Patch] ALTER SYSTEM READ ONLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amul Sul <sulamul(at)gmail(dot)com>
Cc: Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2021-10-14 17:40:12
Message-ID: CA+TgmobppXiL598ytRivAX6-MoRxDxNV4np1h+ybbJdxf2uxgw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> In the attached version I have fixed this issue by restoring missingContrecPtr.
>
> To handle abortedRecPtr and missingContrecPtr newly added global
> variables thought the commit # ff9f111bce24, we don't need to store
> them in the shared memory separately, instead, we need a flag that
> indicates a broken record found previously, at the end of recovery, so
> that we can overwrite contrecord.
>
> The missingContrecPtr is assigned to the EndOfLog, and we have handled
> EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to
> ensure that the lastReplayedEndRecPtr value is the same as the
> abortedRecPtr, but I think that is not needed, we can go ahead and
> write an overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not all processes are necessarily equally privileged, but
whether or not we're in recovery is supposed to be something about
which everyone agrees, so answering that question differently in
different processes doesn't seem nice. XLogInsertAllowed() could be
rewritten to check the state directly and make its own determination,
without relying on RecoveryInProgress(), and I think that might be the
right way to go here.

But that isn't entirely problem-free either, because there's a lot of
code that uses RecoveryInProgress() to answer the question "should I
write WAL?" and therefore it's not great if RecoveryInProgress() is
returning an answer that is inconsistent with XLogInsertAllowed().
MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
kind of coding. It probably wouldn't break in practice right away,
because most of that code never runs in the startup process or the
checkpointer and would therefore never notice the difference in
behavior between those two functions, but if in the future we get the
read-only feature that this thread is supposed to be about, we'd have
problems. Not all RecoveryInProgress() calls have this sense - e.g.
sendDir() in basebackup.c is trying to figure out whether recovery
ended during the backup, not whether we can write WAL. But perhaps
this is a good time to go and replace RecoveryInProgress() checks that
are intending to decide whether or not it's OK to write WAL with
XLogInsertAllowed() checks (noting that the return value is reversed).
If we did that, then I think RecoveryInProgress() could also NOT call
InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
which seems like it might be better. But I haven't really tried to
code all of this up, so I'm not really sure how it all works out.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-10-14 17:43:21 Re: should we allow users with a predefined role to access pg_backend_memory_contexts view and pg_log_backend_memory_contexts function?
Previous Message Andres Freund 2021-10-14 17:26:38 Re: [RFC] building postgres with meson