Re: .ready and .done files considered harmful

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Jeevan Ladhe <jeevan(dot)ladhe(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Hannu Krosing <hannuk(at)google(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .ready and .done files considered harmful
Date: 2021-09-14 16:18:34
Message-ID: 40DBACB6-963B-434D-99A3-B21E607FF240@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 9/14/21, 7:23 AM, "Dipesh Pandit" <dipesh(dot)pandit(at)gmail(dot)com> wrote:
> I agree that when we are creating a .ready file we should compare
> the current .ready file with the last .ready file to check if this file is
> created out of order. We can store the state of the last .ready file
> in shared memory and compare it with the current .ready file. I
> believe that archiver specific shared memory area can be used
> to store the state of the last .ready file unless I am missing
> something and this needs to be stored in a separate shared
> memory area.
>
> With this change, we have the flexibility to move the current archiver
> state out of shared memory and keep it local to archiver. I have
> incorporated these changes and updated a new patch.

I wonder if this can be simplified even further. If we don't bother
trying to catch out-of-order .ready files in XLogArchiveNotify() and
just depend on the per-checkpoint/restartpoint directory scans, we can
probably remove lastReadySegNo from archiver state completely.

+ /* Force a directory scan if we are archiving anything but a regular
+ * WAL file or if this WAL file is being created out-of-order.
+ */
+ if (!IsXLogFileName(xlog))
+ PgArchForceDirScan();
+ else
+ {
+ TimeLineID tli;
+ XLogSegNo last_segno;
+ XLogSegNo this_segno;
+
+ last_segno = PgArchGetLastReadySegNo();
+ XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size);
+
+ /*
+ * Force a directory scan in case if this .ready file created out of
+ * order.
+ */
+ last_segno++;
+ if (last_segno != this_segno)
+ PgArchForceDirScan();
+
+ PgArchSetLastReadySegNo(this_segno);
+ }

This is an interesting idea, but the "else" block here seems prone to
race conditions. I think we'd have to hold arch_lck to prevent that.
But as I mentioned above, if we are okay with depending on the
fallback directory scans, I think we can remove the "else" block
completely.

+ /* Initialize the current state of archiver */
+ xlogState.lastSegNo = MaxXLogSegNo;
+ xlogState.lastTli = MaxTimeLineID;

It looks like we have two ways to force a directory scan. We can
either set force_dir_scan to true, or lastSegNo can be set to
MaxXLogSegNo. Why not just set force_dir_scan to true here so that we
only have one way to force a directory scan?

+ /*
+ * Failed to archive, make sure that archiver performs a
+ * full directory scan in the next cycle to avoid missing
+ * the WAL file which could not be archived due to some
+ * failure in current cycle.
+ */
+ PgArchForceDirScan();

Don't we also need to force a directory scan in the other cases we
return early from pgarch_ArchiverCopyLoop()? We will have already
advanced the archiver state in pgarch_readyXlog(), so I think we'd end
up skipping files if we didn't. For example, if archive_command isn't
set, we'll just return, and the next call to pgarch_readyXlog() might
return the next file.

+ /* Continue directory scan until we find a regular WAL file */
+ SpinLockAcquire(&PgArch->arch_lck);
+ PgArch->force_dir_scan = true;
+ SpinLockRelease(&PgArch->arch_lck);

nitpick: I think we should just call PgArchForceDirScan() here.

Nathan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-09-14 17:44:26 Re: Getting ERROR "subplan "SubPlan 1" was not initialized" in EXISTS subplan when using for list partition.
Previous Message Alvaro Herrera 2021-09-14 15:57:47 Re: Remove duplicate static function check_permissions in slotfuncs.c and logicalfuncs.c