From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: bharath(dot)rupireddyforpostgres(at)gmail(dot)com
Cc: robertmhaas(at)gmail(dot)com, nathandbossart(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: thinko in basic_archive.c
Date: 2022-10-20 01:26:47
Message-ID: 20221020.102647.746666963360821887.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Wed, 19 Oct 2022 10:21:12 +0530, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote in
> On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > An archive module could clean up them at startup or at the first call
> > but that might be dangerous (since archive directory is I think
> > thought as external resource).
>
> The archive module must be responsible for cleaning up the temp file
> that it creates. One way to do it is in the archive module's shutdown
> callback, this covers most of the cases, but not all.
True. But I agree with Robert that such temporary files should be
cleanup-able without needing temporarily-valid knowledge (the current
file name, in this case). A common strategy for this is to give those
files names that can be identified as garbage.
> > Honestly, I'm not sure about a reasonable scenario where simultaneous
> > archiving of the same file is acceptable, though. I feel that we should
> > not allow simultaneous archiving of the same segment by some kind of
> > interlocking. In other words, we might have to serialize duplicate
> > archiving of the same file.
>
> In a typical production environment, there's some kind of locking (for
> instance lease files) that allow/disallow file
> creation/deletion/writes/reads which guarantees that the same file
> isn't put into a directory (can be archive location) many times. And
> as you rightly said archive_directory is something external to
> postgres and we really can't deal with concurrent writers
> writing/creating the same files. Even if we somehow try to do it, it
> makes things complicated. This is true for any PGDATA directories.
> However, the archive module implementers can choose to define such a
> locking strategy.
flock() on nfs..
> > In another direction, the current code allows duplicate simultaneous
> > copying to temporary files with different names then the latest
> > renaming wins. We reach the almost same result (on Linuxen (or
> > POSIX?)) by unlinking the existing tempfile first then create a new
> > one with the same name then continue. Even if the tempfile were left
> > alone after a crash, that file would be unlinked at the next trial of
> > archiving. But I'm not sure how this can be done on Windows.. In the
> > first place I'm not sure that the latest-unconditionally-wins policy
> > is appropriate or not, though.
>
> We really can't just unlink the temp file because it has pid and
> timestamp in the filename and it's hard to determine the temp file
> that we created earlier.
But since a power cut is a typical crash source, we need to identify
ruined temporary files, and the current naming convention is
incomplete in this regard.
The worst case I can come up with, regardless of feasibility, is a
multi-standby physical replication set where all hosts share one
archive directory. Indeed, live and dead temporary files can coexist
there. However, I think we can identify truly rotten temp files by
inserting the host name or cluster name (meaning cluster_name in
postgresql.conf) even in that case. This premises that the DBA names
every cluster differently, but I think DBAs who are going to configure
such a system are required to be very cautious about that kind of
aspect.
> As far as the basic_archive module is concerned, we ought to keep it
> simple. I still think the simplest we can do is to use the
> basic_archive's shutdown_cb to delete (in most of the cases, but not
(Sorry, my memory was confused at the time. That callback feature
already existed.)
> all) the left-over temp file that the module is dealing with
> as-of-the-moment and add a note about the users dealing with
> concurrent writers to the basic_archive.archive_directory like the
> attached v2 patch.
>
> > > A simple server restart while the basic_archive module is copying
> > > to/from temp file would leave the file behind, see[1]. I think we can
> > > fix this by defining shutdown_cb for the basic_archive module, like
> > > the attached patch. While this helps in most of the crashes, but not
> > > all. However, this is better than what we have right now.
> >
> > ShutdownWalRecovery() does the similar thing, but as you say this one
> > covers rather narrow cases than that since RestoreArchiveFile()
> > finally overwrites the left-alone files at the next call for that
> > file.
>
> We're using unique temp file names in the basic_archive module so we
> can't overwrite the same upon restart.
Of course, it premised that a cluster uses the same name for a
segment. If we insert cluster_name into the temporary name, a starting
cluster can identify garbage files to clean up. For example, if we
name them as follows:
ARCHTEMP_cluster1_pid_time_<lsn>
a starting cluster can clean up all files starting with
"ARCHTEMP_cluster1_*". (We need to choose the delimiter carefully,
though..)
> > # The patch seems to forget to clear the tempfilepath *after* a
> > # successful renaming.
>
> It does so at the beginning of basic_archive_file() which is sufficient.
No, I didn't mean that. If the server stops after a successful
durable_rename but before the next call to
basic_archive_file_internal, that callback makes a false complaint
since the temporary file is actually gone.
> > And I don't see how the callback is called.
>
> call_archive_module_shutdown_callback()->basic_archive_shutdown().
Yeah, sorry for the noise.
> Please see the attached v2 patch.
+static char tempfilepath[MAXPGPATH + 256];
MAXPGPATH is the maximum length of a file path that PG assumes it can
handle, so extending the buffer beyond that length seems wrong.
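A way to stay within MAXPGPATH, rather than widening the buffer, is to
check snprintf's return value and error out on truncation. A sketch
(MAXPGPATH and the "ARCHTEMP_" prefix are defined locally here for
illustration):

```c
#include <stdio.h>

#define MAXPGPATH 1024			/* local stand-in for PG's definition */

/*
 * Build the temp-file path in a MAXPGPATH-sized buffer, failing
 * explicitly if the result would not fit instead of enlarging the
 * buffer past what the rest of the system is assumed to handle.
 */
static int
build_temp_path(char *buf, const char *archivedir, const char *fname)
{
	int			len;

	len = snprintf(buf, MAXPGPATH, "%s/ARCHTEMP_%s", archivedir, fname);
	if (len < 0 || len >= MAXPGPATH)
		return -1;				/* would be truncated: report an error */
	return 0;
}
```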
--
Kyotaro Horiguchi
NTT Open Source Software Center