Re: .ready and .done files considered harmful

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .ready and .done files considered harmful
Date: 2021-05-04 04:27:55
Message-ID: 20210504042755.ehuaoz5blcnjw5yk@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> I have two possible ideas for addressing this; perhaps other people
> will have further suggestions. A relatively non-invasive fix would be
> to teach pgarch.c how to increment a WAL file name. After archiving
> segment N, check using stat() whether there's a .ready file for
> segment N+1. If so, do that one next. If not, then fall back to
> performing a full directory scan.

Hm. I wonder whether it wouldn't be better to determine multiple files to
be archived in one readdir() pass?
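
Just to sketch what I mean (untested and purely illustrative, names made
up rather than anything from pgarch.c): a single readdir() pass could
pick up a whole batch of .ready entries, which the archiver then works
through before scanning the directory again.

/*
 * Illustrative only: collect up to 'maxfiles' segment names that have a
 * ".ready" entry in one readdir() pass, instead of rescanning the
 * directory after every archived segment.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int
collect_ready_files(const char *statusdir, char files[][64], int maxfiles)
{
    DIR        *dir = opendir(statusdir);
    struct dirent *de;
    int         n = 0;

    if (dir == NULL)
        return -1;

    while (n < maxfiles && (de = readdir(dir)) != NULL)
    {
        size_t      len = strlen(de->d_name);

        /* keep only "<segment>.ready" entries */
        if (len > 6 && strcmp(de->d_name + len - 6, ".ready") == 0)
            snprintf(files[n++], 64, "%.*s", (int) (len - 6), de->d_name);
    }
    closedir(dir);

    /* caller would sort these and archive them oldest-first */
    return n;
}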

> As far as I can see, this is just cheap insurance. If archiving is
> keeping up, the extra stat() won't matter much. If it's not, this will
> save more system calls than it costs. Since during normal operation it
> shouldn't really be possible for files to show up in pg_wal out of
> order, I don't really see a scenario where this changes the behavior,
> either. If there are gaps in the sequence at startup time, this will
> cope with it exactly the same as we do now, except with a better
> chance of finishing before I retire.

There are definitely gaps in practice :(. Due to the massive performance
issues with archiving, there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).

> However, that's still pretty wasteful. Every time we have to wait for
> the next file to be ready for archiving, we'll basically fall back to
> repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived. I hate that that's
a thing, but given the serial nature of archiving, with its high
per-call overhead, I don't think it'd be ok to just break that without a
replacement :(.

> But perhaps we could work around this by allowing pgarch.c to access
> shared memory, in which case it could examine the current timeline
> whenever it wants, and probably also whatever LSNs it needs to know
> what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.

> If we did that, could we just get rid of the .ready and .done files
> altogether? Are they just a really expensive IPC mechanism to avoid a
> shared memory connection, or is there some more fundamental reason why
> we need them?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.
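
To illustrate (purely hypothetical, not a proposal for concrete fields):
even a minimal shared-memory handoff would need more than a single LSN,
something along the lines of

/*
 * Hypothetical sketch only: the sort of state the archiver might read
 * from shared memory instead of scanning for .ready files.  A bare LSN
 * counter isn't enough, because a timeline switch also requires the
 * .history file to be archived.
 */
#include "access/xlogdefs.h"    /* XLogRecPtr, TimeLineID */

typedef struct ArchiverSharedState
{
    XLogRecPtr  last_ready_lsn;     /* WAL up to here is ready to archive */
    TimeLineID  ready_tli;          /* timeline that WAL belongs to */
    bool        history_pending;    /* a timeline history file needs archiving */
} ArchiverSharedState;

and even that glosses over partial segments at a timeline switch.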

I think the aforementioned "batching" archive commands are part of the
problem :(.

> And is there any good reason why the archiver shouldn't be connected
> to shared memory? It is certainly nice to avoid having more processes
> connected to shared memory than necessary, but the current scheme is
> so inefficient that I think we end up worse off.

I think there is no fundamental reason for avoiding shared memory in the
archiver. I guess there's a minor robustness advantage, because the
forked shell used to start the archive command won't be attached to
shared memory. But that's only until the child exec()s the archive
command.

There is some minor performance advantage as well: not having to process
the often large and contended memory mapping for shared_buffers is
probably measurable - but it's swamped by the cost of actually archiving
the segment.

My only "concern" with doing anything around this is that I think the
whole approach of archive_command is just hopelessly broken, with even
just halfway busy servers only able to keep up archiving if they muck
around with postgres internal data during archive command execution. Add
to that how hard it is to write a robust archive command (e.g. the one
in our docs still suggests test ! -f && cp, which means that a copy
failing in the middle yields an incomplete archive)...
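
To make that failure mode concrete (rough, untested sketch, not a
proposal; a real implementation needs far more care around error
handling and directory fsyncs): the usual fix is to copy into a
temporary name, fsync, and only then rename into place, so the final
archive name is never visible in a half-written state.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
archive_copy(const char *src, const char *dst)
{
    char        tmp[1024];
    char        buf[65536];
    ssize_t     nread;
    int         in,
                out;

    snprintf(tmp, sizeof(tmp), "%s.tmp", dst);

    in = open(src, O_RDONLY);
    out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (in < 0 || out < 0)
        return -1;

    while ((nread = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, nread) != nread)
            return -1;
    }

    /* make the copy durable before it becomes visible under dst */
    if (nread < 0 || fsync(out) != 0)
        return -1;
    close(in);
    close(out);

    /*
     * rename() is atomic: dst either doesn't exist or contains the whole
     * file; a real implementation would also fsync the directory.
     */
    return rename(tmp, dst);
}

A plain cp that dies partway through instead leaves a truncated file
under the final name - exactly the incomplete-archive problem above.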

While I don't think it's all that hard to design a replacement, it's
likely still more work than addressing the O(n^2) issue, so ...

Greetings,

Andres Freund
