Re: pg_receivewal makes a bad daemon

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_receivewal makes a bad daemon
Date: 2021-05-05 16:34:36
Message-ID: CABUevExjQ9VHtx8VpRC8OZYQs2c5gQjTuH5wf=Yz1pArgXqyiQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 5, 2021 at 5:04 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> You might want to use pg_receivewal to save all of your WAL segments
> somewhere instead of relying on archive_command. It has, at the least,
> the advantage of working on the byte level rather than the segment
> level. But it seems to me that it is not entirely suitable as a
> substitute for archiving, for a couple of reasons. One is that as soon
> as it runs into a problem, it exits, which is not really what you want
> out of a daemon that's critical to the future availability of your
> system. Another is that you can't monitor it aside from looking at
> what it prints out, which is also not really what you want for a piece
> of critical infrastructure.
>
> The first problem seems somewhat more straightforward. Suppose we add
> a new command-line option, perhaps --daemon but we can bikeshed. If
> this option is specified, then it tries to keep going when it hits a
> problem, rather than just giving up. There's some fuzziness in my mind
> about exactly what this should mean. If the problem we hit is that we
> lost the connection to the remote server, then we should try to
> reconnect. But if the problem is something like a failure inside
> open_walfile() or close_walfile(), like a failed open() or fsync() or
> close() or something, it's a little less clear what to do. Maybe one
> idea would be to have a parent process and a child process, where the
> child process does all the work and the parent process just keeps
> re-launching it if it dies. It's not entirely clear that this is a
> suitable way of recovering from, say, an fsync() failure, given
> previous discussions claiming that - and I might be exaggerating a bit
> here - there is essentially no way to recover from a failed fsync()
> because the kernel might have already thrown out your data and you
> might as well just set the data center on fire - but perhaps a retry
> system that can't cope with certain corner cases is better than not
> having one at all, and perhaps we could revise the logic here and
> there to have the process doing the work take some action other than
> exiting when that's an intelligent approach.

Is this really a problem we should fix ourselves? Most daemon managers
today can be configured to restart a daemon automatically on failure
with a single setting, and have been able to for a long time. E.g. in
systemd (which most Linux distributions use now) you just set
Restart=on-failure (or maybe even Restart=always) and something like
RestartSec=10.

That said, it wouldn't treat an fsync() error any differently from
other failures -- the daemon would simply be restarted every time. The
way to handle that is perhaps for the operator to capture the error
message and just "deal with it"?
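
For anyone who hasn't looked at what such a restart policy amounts to,
here is a rough Python sketch of the behaviour. It only approximates
Restart=on-failure plus RestartSec; the function name supervise() and
the max_restarts parameter are made up for illustration and are not
part of systemd or PostgreSQL. Note that the loop looks only at the
exit status, so it cannot tell a dropped connection from a failed
fsync().

```python
import subprocess
import sys
import time


def supervise(cmd, restart_sec=10.0, max_restarts=None, sleep=time.sleep):
    """Run cmd and restart it after restart_sec seconds whenever it exits non-zero.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    while True:
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0:
            return exit_codes      # clean exit: do not restart
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes      # give up and leave it to the operator
        sleep(restart_sec)         # wait before the next attempt
```

It could be called as supervise(["pg_receivewal", "-D", "/path/to/archive",
"-S", "some_slot"]), where the directory and slot name are placeholders.
A real daemon manager does this for you, which is rather the point.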

What could be more interesting there in a "systemd world" would be to
add watchdog support. That'd obviously only be interesting on systemd
platforms, but we already have some of that basic notification support
in the postmaster for those.
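
To make the watchdog idea concrete: the systemd notification protocol
is just datagrams sent to the socket named in the NOTIFY_SOCKET
environment variable, and when WatchdogSec= is set in the unit the
service is expected to send "WATCHDOG=1" at intervals shorter than
WATCHDOG_USEC. A minimal Python sketch of the client side follows; the
helper names sd_notify() and watchdog_interval() are my own, and this
says nothing about how it would actually be wired into pg_receivewal.

```python
import os
import socket


def sd_notify(message):
    """Send one notification datagram to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. not running under systemd.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]     # leading "@" means an abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii"), path)
    return True


def watchdog_interval():
    """Seconds between keep-alive pings: half of WATCHDOG_USEC, or None if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else None
```

The main loop would then call sd_notify("WATCHDOG=1") each time it has
made progress, so that a hung process gets killed and restarted rather
than sitting there looking alive.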

> The second problem is a bit more complex. If you were transferring WAL
> to another PostgreSQL instance rather than to a frontend process, you
> could log to some place other than standard output, like for example a
> file, and you could periodically rotate that file, or alternatively
> you could log to syslog or the Windows event log. Even better, you
> could connect to PostgreSQL and run SQL queries against monitoring
> views and see what results you get. If the existing monitoring views
> don't give users what they need, we can improve them, but the whole
> infrastructure needed for this kind of thing is altogether lacking for
> any frontend program. It does not seem very appealing to reinvent log
> rotation, connection management, and monitoring views inside
> pg_receivewal, let alone in every frontend process where similar
> monitoring might be useful. But at least for me, without such
> capabilities, it is a little hard to take pg_receivewal seriously.

Again, isn't this the job of the daemon runner? At least in cases
where it's not Windows :)? That is, taking the output and putting it
in a log, and interfacing with log rotation.

Now, having some sort of statistics *other* than parsing a log would
definitely be useful. But perhaps that could be something as simple
as having a --statsfile=/foo/bar parameter and then updating that file
at regular intervals with "whatever is the current state"?
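
As a sketch of what updating such a file could look like (no
--statsfile option exists today, and the function name write_stats()
and the JSON format are purely assumptions), the usual trick is to
write a temporary file in the same directory and rename it into place,
so that a monitoring tool never reads a half-written file:

```python
import json
import os
import tempfile


def write_stats(path, stats):
    """Atomically replace path with a JSON snapshot of stats."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".stats-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(stats, f)
            f.write("\n")
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Something like write_stats("/foo/bar", {"last_flushed_lsn": "0/3000060"})
called every few seconds would do; the field name is hypothetical.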

And of course, the other point to monitor is the replication slot on
the server it's connected to -- but I agree that being able to monitor
both sides there would be good.

> I wonder first of all whether other people agree with these concerns,
> and secondly what they think we ought to do about it. One option is -
> do nothing. This could be based either on the idea that pg_receivewal
> is hopeless, or else on the idea that pg_receivewal can be restarted
> by some external system when required and monitored well enough as
> things stand. A second option is to start building out capabilities in
> pg_receivewal to turn it into something closer to what you'd expect of
> a normal daemon, with the addition of a retry capability as probably
> the easiest improvement. A third option is to somehow move towards a
> world where you can use the server to move WAL around even if you
> don't really want to run the server. Imagine a server running with no
> data directory and only a minimal set of running processes, just (1) a
> postmaster and (2) a walreceiver that writes to an archive directory
> and (3) non-database-connected backends that are just smart enough to
> handle queries for status information. This has the same problem that
> I mentioned on the thread about monitoring the recovery process,
> namely that we haven't got pg_authid. But against that, you get a lot
> of infrastructure for free: configuration files, process management,
> connection management, an existing wire protocol, memory contexts,
> rich error reporting, etc.
>
> I am curious to hear what other people think about the usefulness (or
> lack thereof) of pg_receivewal as things stand today, as well as ideas
> about future direction.

Per above, I'm thinking maybe our efforts are better directed at
documenting ways to do it now?

Also, all of the above also applies to pg_recvlogical, right? So if we
do want to invent our own daemon-init-system, we should probably make
it generic enough to handle both.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
