Re: archive status ".ready" files may be created too early

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: matsumura(dot)ryo(at)fujitsu(dot)com
Cc: bossartn(at)amazon(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2020-06-25 06:35:32
Message-ID: 20200625.153532.379700510444980240.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello. Matsumura-san.

I agree that WAL writer is not the place to notify segmnet. And the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +0000, "matsumura(dot)ryo(at)fujitsu(dot)com" <matsumura(dot)ryo(at)fujitsu(dot)com> wrote in
> 1. Description in primary side
>
> [Basic problem]
> A process flushing WAL record doesn't know whether the flushed RecPtr is
> EndRecPtr of cross-segment-boundary WAL record or not because only process
> inserting the WAL record knows it and it never memorizes the information to anywhere.
>
> [Basic concept of the patch in primary]
> A process inserting a cross-segment-boundary WAL record memorizes its EndRecPtr
> (I call it CrossBoundaryEndRecPtr) to a new structure in XLogCtl.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
...
> See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in my patch.

I think we don't need most of that shmem stuff. XLogWrite is called
after WAL buffer is filled up to the requested position. So when it
crosses segment boundary we know the all past corss segment-boundary
records are stable. That means all we need to remember is only the
position of the latest corss-boundary record.

> * Action of inserting process
> A inserting process memorie its CrossBoundaryEndRecPtr to CrossBoundaryEndRecPtr
> array element calculated by XLogRecPtrToBufIdx with its CrossBoundaryEndRecPtr.
> If the WAL record crosses many segments, only element against last segment
> including the EndRecPtr is set and others are not set.
> See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which are wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
> Overview has been already written as the follwing.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
>
> An additional detail is as the following. The flushing process may notify
> many segments if the record crosses many segments, so the process memorizes
> latest notified segment number to latestArchiveNotifiedSegNo in XLogCtl.
> The process notifies from latestArchiveNotifiedSegNo + 1 to
> flushing segment number - 1.
>
> And latestArchiveNotifiedSegNo is set to EndOfLog after Startup process exits
> replay-loop. Standby also set same timing (= before promoting).
>
> Mutual exlusion about latestArchiveNotifiedSegNo is not required because
> the timing of accessing has been already included in WALWriteLock critical section.

Looks reasonable.

> 2. Description in standby side
>
> * Who notifies?
> walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
> cross-segment-boundary WAL record or not. In standby, only Startup process
> knows the information because it is hidden in WAL record itself and only
> Startup process reads and builds WAL record.

Standby doesn't write it's own WAL records. Even if primary sent an
immature record on segment boundary, it just would promote to a new
TLI and starts its own history. Nothing breaks. However it could be a
problem if a standby that crashed the problematic way were started
as-is as a primary, such scenario is out of our scope.

Now we can identify stable portion of WAL stream. It's enough to
prevent walsender from sending data that can be overwritten
afterwards. GetReplicationTargetRecPtr() in the attached does that.

> * Action of Statup process
> Therefore, I implemented that walreceiver never notify and Startup process does it.
> In detail, when Startup process reads one full-length WAL record, it notifies
> from a segment that includes head(ReadRecPtr) of the record to a previous segment that
> includes EndRecPtr of the record.

I don't like that archiving on standby relies on replay progress. We
should avoid that and fortunately I think we dont need it.

> Now, we must pay attention about switching time line.
> The last segment of previous TimeLineID must be notified before switching.
> This case is considered when RM_XLOG_ID is detected.

That segment is archived after renamed as ".partial" later. We don't
archive the last incomplete segment of the previous timeline as-is.

> 3. About other notifying for segment
> Two notifyings for segment are remain. They are not needed to fix.
>
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special consideration.
>
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a safe-net for failure of notifying by backend or WAL writer.
> Backend or WAL writer doesn't retry to notify if falis, but Checkpointer retries
> to notify when it removes old segment. If it fails to notify, then it does not
> remove the segment. It makes Checkpointer to retry notify until the notifying suceeeds.
> Also, in this case, we can just notify whithout special consideration
> because Checkpointer guarantees that all WAL record included in the segment have been already flushed.

So it can be simplified as the attached. Any thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Avoid-to-archive-immature-records.patch text/x-patch 7.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2020-06-25 06:58:58 Re: some more pg_dump refactoring
Previous Message Fujii Masao 2020-06-25 05:35:34 Re: Review for GetWALAvailability()