Re: New WAL record to detect the checkpoint redo location

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New WAL record to detect the checkpoint redo location
Date: 2023-09-22 06:07:27
Message-ID: CAA4eK1JVKZGRHLOEotWi+e+09jucNedqpkkc-Do4dh5FTAU+5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 21, 2023 at 9:06 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > After the 0003 patch, do we need acquire exclusive lock via
> > WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
> > comment "We must block concurrent insertions while examining insert
> > state to determine the checkpoint REDO pointer." seems to indicate
> > that it is not required. If it is required then we may want to change
> > the comments and also acquiring the locks twice will have more cost
> > than acquiring it once and write the new WAL record under that lock.
>
> I think the comment needs updating. I don't think we can do curInsert
> = XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
> Same for Insert->fullPageWrites.
>

If we can't do those without taking all the locks then it is fine but
just wanted to give it a try to see if there is a way to avoid in case
of online (non-shutdown) checkpoints. For example, curInsert is used
only for the shutdown path, so we don't need to acquire all locks for
it in the cases except for the shutdown case. Here, we are reading
Insert->fullPageWrites which requires an insertion lock but not all
the locks (as per comments in structure XLogCtlInsert). Now, I haven't
done detailed analysis for
XLogCtl->InsertTimeLineID/XLogCtl->PrevTimeLineID but some places
reading InsertTimeLineID have a comment like "Given that we're not in
recovery, InsertTimeLineID is set and can't change, so we can read it
without a lock." which suggests that some analysis is required whether
reading those requires all locks in this code path. OTOH, it won't
matter to acquire all locks in this code path for the reasons
mentioned by you and it may help in keeping the code simple. So, it is
up to you to take the final call on this matter. I am fine with your
decision.

>
> > BTW, I would like to mention that there is a slight interaction of
> > this work with the patch to upgrade/migrate slots [1]. Basically in
> > [1], to allow slots migration from lower to higher version, we need to
> > ensure that all the WAL has been consumed by the slots before clean
> > shutdown. However, during upgrade we can generate few records like
> > checkpoint which we will ignore for the slot consistency checking as
> > such records doesn't matter for data consistency after upgrade. We
> > probably need to add this record to that list. I'll keep an eye on
> > both the patches so that we don't miss that interaction but mentioned
> > it here to make others also aware of the same.
>
> If your approach requires a code change every time someone adds a new
> WAL record that doesn't modify table data, you might want to rethink
> the approach a bit.
>

I understand your hesitation and we have discussed several approaches
that do not rely on the WAL record type to determine if the slots have
caught up but the other approaches seem to have different other
downsides. I know it may not be a good idea to discuss those here but
as there was a slight interaction with this work, so I thought to
bring it up. To be precise, we need to ensure that we ignore WAL
records that got generated during pg_upgrade operation (say during
pg_upgrade --check).

The approach we initially followed was to check if the slot's
confirmed_flush_lsn is equal to the latest checkpoint in
pg_controldata (which is the shutdown checkpoint after stopping the
server). This approach doesn't work for the use case where the user
runs pg_upgrade --check before actually performing the upgrade [1].
This is because during the upgrade check, the server will be
stopped/started and update the position of the latest checkpoint,
causing the check to fail in the actual upgrade and leading pg_upgrade
to believe that the slot has not been caught up.

To address the issues in the above approach, we also discussed several
alternative approaches[2][3]: a) Adding a new field in pg_controldata
to record the last checkpoint that happens in non-upgrade mode, so
that we can compare the slot's confirmed_flush_lsn with this value.
However, we were not sure if this was a good enough reason to add a
new field in controldata field and sprinkle IsBinaryUpgrade check in
checkpointer code path. b) Advancing each slot's confirmed_flush_lsn
to the latest position if the first upgrade check passes. This way,
when performing the actual upgrade, the confirmed_flush_lsn will also
pass. However, internally advancing the LSN seems unconventional. c)
Introducing a new pg_upgrade option to skip the check for slot
catch-up so that if it is already done at the time of pg_upgrade
--check, we can avoid rechecking during actual upgrade. Although this
might work, the user would need to specify this manually, which is not
ideal. d) Document this and suggest users consume the WALs, but this
doesn't look acceptable to users.

All the above approaches have their downsides, prompting us to
consider the WAL scan approach which is to scan the end of the WAL for
records that should have been streamed out. This approach was first
proposed by Andres[4] and was chosen[5] after considering all other
approaches. If we don't like relying on WAL record types then I think
the alternative (a) to add a new field in ControlDataFile is worth
considering.

[1] https://www.postgresql.org/message-id/CAA4eK1LzeZLoTLaAuadmuiggc5mq39oLY6fK95oFKiPBPBf%2BeQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/OS0PR01MB571640E1B58741979A5E586594F7A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[3] https://www.postgresql.org/message-id/TYAPR01MB5866EF7398CB13FFDBF230E7F5F0A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] https://www.postgresql.org/message-id/20230725170319.h423jbthfohwgnf7%40awork3.anarazel.de
[5] https://www.postgresql.org/message-id/CAA4eK1KqqWayKtRhvyRgkhEHvAUemW_dEqgFn7UOG3D4B6f0ew%40mail.gmail.com

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-09-22 06:29:14 Re: [PoC] pg_upgrade: allow to upgrade publisher node
Previous Message Peter Eisentraut 2023-09-22 06:06:57 Re: Remove MSVC scripts from the tree