Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Date: 2022-01-20 19:15:08
Message-ID: 21F86990-CC52-42B2-8A85-A3DA110F57F1@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I took the liberty of adjusting Bharath's patch based on the latest
feedback.

On 1/19/22, 10:35 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> We should be fsync'ing the directory not the files
> themselves, no?

I added a directory sync at the end of CheckPointLogicalRewriteHeap(),
which IIUC is enough.

> * Why is the check for "map-" prefix after, rather than before,
> the lstat?

I swapped these checks. I stopped short of moving the sscanf() before
the lstat(), though, as 1) I don't think it will help very much and 2)
it seemed weird to start emitting "could not parse filename" logs for
non-regular files we presently skip silently.

> * Why is it okay to ignore lstat failure? Seems like we might
> as well not even have the lstat.

I added error checking for lstat().

> * The sscanf on the file name would not notice trailing junk,
> such as an editor backup marker. Is that okay?

I think this is okay. The absolute worst that would happen would be
that the extra file would be deleted. This might eventually become a
problem if files with the same prefix format were created by the
server. However, CheckPointSnapBuild() already has this problem with
temporary files, and it claims not to need any extra handling:

* temporary filenames from SnapBuildSerialize() include the LSN and
* everything but are postfixed by .$pid.tmp. We can just remove them
* the same as other files because there can be none that are
* currently being written that are older than cutoff.

> (Actually, isn't the
> separate check for "map-" useless, given that sscanf will make
> the equivalent check?)

The only benefit I see from the extra "map-" check is that it'll avoid
"could not parse filename" logs for files that clearly aren't related
to the task at hand. I don't know if this is expected during normal
operation at the moment. I've left the "map-" check for now.

> I started out wondering why the patch didn't also change the loop's
> other ERROR conditions to LOG. But we do want to ERROR if we're
> unable to sync transient state down to disk, and that is what
> the other steps (think they) are doing. It might be worth a
> comment to point that out though, before someone decides that
> if these errors are just LOG level then the others can be too.

I added such a comment.

I also updated CheckPointSnapBuild() and UpdateLogicalMappings() with
similar adjustments.

Nathan

Attachment Content-Type Size
v2-0001-fix-up-code-in-directory-scans-for-logical-replic.patch application/octet-stream 4.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-01-20 19:23:45 autovacuum prioritization
Previous Message Nikita Malakhov 2022-01-20 18:24:56 Re: Pluggable toaster