Re: remove more archiving overhead

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: remove more archiving overhead
Date: 2022-02-24 17:55:53
Message-ID: 20220224175553.GA660690@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for taking a look!

On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote:
> https://www.postgresql.org/docs/14/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory).
>
> The recommended behavior of archive_command above is to avoid this
> kind of corruption. If we officially admit that a WAL segment can be
> archive twice, we need to revise that part (and the following part) of
> the doc.

Yeah, we might want to recommend succeeding if the archive already exists
with identical contents (like basic_archive does). IMO this would best be
handled in a separate thread that is solely focused on documentation
updates. AFAICT this is an existing problem.

>> IIUC this means that in theory, a crash at an unfortunate moment could
>> leave you with a .ready file, the file to archive, and another link to that
>> file with a "future" WAL filename. If you re-archive the file after it has
>> been reused, you could end up with corrupt WAL archives. I think this
>
> IMHO the difference would be that the single system call (maybe)
> cannot be interrupted by sigkill. So I'm not sure that that is worth
> considering. The sigkill window can be elimianted if we could use
> renameat(RENAME_NOREPLACE). But finally power-loss can cause that.

Perhaps durable_rename_excl() could use renameat2() for systems that
support it. However, the problem would still exist for systems that have
link() but not renameat2(). In this particular case, there really
shouldn't be an existing file in pg_wal.

>> might already be possible today. Syncing the directory after every rename
>> probably reduces the likelihood quite substantially, but if
>> RemoveOldXlogFiles() quickly picks up the .done file and attempts
>> recycling before durable_rename() calls fsync() on the directory,
>> presumably the same problem could occur.
>
> If we don't fsync at every rename, we don't even need for
> RemoveOldXlogFiles to pickup .done very quickly. Isn't it a big
> difference?

Yeah, it probably increases the chances quite a bit.

> I think we don't want make checkpointer slower in exchange of making
> archiver faster. Perhaps we need to work using a condensed
> information rather than repeatedly scanning over the distributed
> information like archive_status?

v2 avoids giving the checkpointer more work.

> At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
>> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
>> > In my testing, I found that when I killed the server just before unlink()
>> > during WAL recyling, I ended up with links to the same file in pg_wal after
>> > restarting. My latest test produced links to the same file for the current
>> > WAL file and the next one. Maybe WAL recyling should use durable_rename()
>> > instead of durable_rename_excl().
>>
>> Here is an updated patch.
>
> Is it intentional that v2 loses the xlogarchive.c part?

Yes. I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing. Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-24 18:17:38 Re: fix crash with Python 3.11
Previous Message Andres Freund 2022-02-24 17:35:43 Re: convert libpq uri-regress tests to tap test