remove more archiving overhead

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: remove more archiving overhead
Date: 2022-02-22 01:19:48
Message-ID: 20220222011948.GA3850532@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving
improvements (e.g., beb4e9b), the amount of archiving overhead has been
reduced. I spent some time measuring the remaining overhead, and the
following function stood out:

/*
* pgarch_archiveDone
*
* Emit notification that an xlog file has been successfully archived.
* We do this by renaming the status file from NNN.ready to NNN.done.
* Eventually, a checkpoint process will notice this and delete both the
* NNN.done file and the xlog file itself.
*/
static void
pgarch_archiveDone(char *xlog)
{
char rlogready[MAXPGPATH];
char rlogdone[MAXPGPATH];

StatusFilePath(rlogready, xlog, ".ready");
StatusFilePath(rlogdone, xlog, ".done");
(void) durable_rename(rlogready, rlogdone, WARNING);
}

Specifically, the call to durable_rename(), which calls fsync() a few
times, can cause a decent delay. On my machine, archiving ~12k WAL files
using a bare-bones archive module that did nothing but return true took
over a minute. When I changed this to rename() (i.e., reverting the change
to pgarch.c in 1d4a0ab), the same test took a second or less.

I also spent some time investigating whether durably renaming the archive
status files was even necessary. In theory, it shouldn't be. If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that. If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f). But when digging further, I found that WAL file recyling uses
durable_rename_excl(), which has the following note:

* Note that a crash in an unfortunate moment can leave you with two links to
* the target file.

IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename. If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives. I think this
might already be possible today. Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already
exists. The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents. Since something along these
lines is recommended as a best practice, perhaps this is okay. But I think
we might be able to do better. If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.

The drawback of this is that we are essentially offloading more work to the
checkpointer process. In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times. I'm not sure
how big of a problem this is. It should be good to ensure that archiving
keeps up, and there are far more expensive tasks that the checkpointer must
perform that could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above. Thoughts?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v1-0001-remove-more-archiving-overhead.patch text/x-diff 3.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message osumi.takamichi@fujitsu.com 2022-02-22 01:34:25 RE: Failed transaction statistics to measure the logical replication progress
Previous Message tanghy.fnst@fujitsu.com 2022-02-22 01:15:24 RE: Failed transaction statistics to measure the logical replication progress