Re: parallelizing the archiver

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelizing the archiver
Date: 2021-09-10 15:22:18
Message-ID: CA+TgmoZUd6zBNb+boukVXrGAVgLyU-fPY+6yfiKj6abmNUCvWA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 10, 2021 at 10:19 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> Those approaches don't really seem mutually exclusive? In both cases
> you will need to internally track the status of each WAL file and
> handle non-contiguous file sequences. In the case of parallel
> commands you only need the additional knowledge that some command is
> already working on a file. Wouldn't it be even better to eventually
> be able to launch multiple batches of multiple files rather than a
> single batch?

Well, I guess I'm not convinced. People with more knowledge of this
than I have may already know why it's beneficial, but in my experience
commands like 'cp' and 'scp' are usually limited by the speed of I/O,
not the fact that you only have one of them running at once. Running
several at once, again in my experience, is typically not much faster.
On the other hand, scp has a LOT of startup overhead, so it's easy to
see the benefits of batching.

[rhaas pgsql]$ touch x y z
[rhaas pgsql]$ time sh -c 'scp x cthulhu: && scp y cthulhu: && scp z cthulhu:'
x 100% 207KB 78.8KB/s 00:02
y 100% 0 0.0KB/s 00:00
z 100% 0 0.0KB/s 00:00

real 0m9.418s
user 0m0.045s
sys 0m0.071s
[rhaas pgsql]$ time sh -c 'scp x y z cthulhu:'
x 100% 207KB 273.1KB/s 00:00
y 100% 0 0.0KB/s 00:00
z 100% 0 0.0KB/s 00:00

real 0m3.216s
user 0m0.017s
sys 0m0.020s

> If we start with parallelism first, the whole ecosystem could
> immediately benefit from it as is. To be able to handle multiple
> files in a single command, we would need some way to let the server
> know which files were successfully archived and which files weren't,
> so it requires a different communication mechanism than the
> command's return code.

That is possibly true. I think it might work to just assume that you
have to retry everything if it exits non-zero, but that requires the
archive command to be smart enough to do something sensible if an
identical file is already present in the archive.
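
To make that concrete, here is a rough sketch of what "smart enough"
could look like -- purely hypothetical, not something anyone has
posted: a tiny standalone helper used as the archive command that
treats an identical, already-archived file as success, so the server
can blindly retry a whole failed batch. The program name
(safe_archive) and the archive path are made up for illustration.

/*
 * Hypothetical helper, not from any posted patch.  Assumed usage:
 *
 *     archive_command = 'safe_archive %p /mnt/archive/%f'
 *
 * If the destination already holds an identical copy, exit 0 so a
 * retried batch doesn't fail; if it holds something different, fail.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if both files can be read and have identical contents. */
static int
same_contents(const char *a, const char *b)
{
    FILE   *fa = fopen(a, "rb");
    FILE   *fb = fopen(b, "rb");
    int     same = (fa != NULL && fb != NULL);
    char    ba[8192], bb[8192];
    size_t  na, nb;

    while (same)
    {
        na = fread(ba, 1, sizeof(ba), fa);
        nb = fread(bb, 1, sizeof(bb), fb);
        if (na != nb || memcmp(ba, bb, na) != 0)
            same = 0;
        if (na == 0)
            break;
    }
    if (fa)
        fclose(fa);
    if (fb)
        fclose(fb);
    return same;
}

int
main(int argc, char **argv)
{
    FILE   *src, *dst;
    char    buf[8192];
    size_t  n;

    if (argc != 3)
        return 2;

    /* Already archived?  Succeed if identical, fail loudly if not. */
    if (access(argv[2], F_OK) == 0)
        return same_contents(argv[1], argv[2]) ? 0 : 1;

    if ((src = fopen(argv[1], "rb")) == NULL)
        return 1;
    if ((dst = fopen(argv[2], "wb")) == NULL)
    {
        fclose(src);
        return 1;
    }
    while ((n = fread(buf, 1, sizeof(buf), src)) > 0)
    {
        if (fwrite(buf, 1, n, dst) != n)
        {
            fclose(src);
            fclose(dst);
            return 1;
        }
    }
    fclose(src);

    /*
     * NOTE: a production-grade archive command must also fsync the new
     * file (and its directory) before exiting 0; omitted for brevity.
     */
    return fclose(dst) == 0 ? 0 : 1;
}

The interesting part is only the exit-status convention; the copy
itself could just as well be scp or anything else.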

> But as I said, I'm not convinced that using the archive_command
> approach is the best option for that. If I understand correctly,
> most of the backup solutions would prefer to have a daemon launched
> and to use it as a queuing system. Wouldn't it be better to have a
> new archive_mode, e.g. "daemon", have postgres responsible for
> (re)starting it, and pass information through the daemon's
> stdin/stdout or something like that?

Sure. Actually, I think a background worker would be better than a
separate daemon. Then it could just talk to shared memory directly.
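
For illustration, a minimal sketch of what that could look like -- the
extension name (archive_worker), the function names, and the empty
work loop are all made up, and a real version would need the actual
shared-memory queue, but this is roughly all it takes to get a
postmaster-managed process with shared memory access (loaded via
shared_preload_libraries):

/* Hypothetical sketch only, not a patch from this thread. */
#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"

PG_MODULE_MAGIC;

void        _PG_init(void);
PGDLLEXPORT void archive_worker_main(Datum main_arg);

void
archive_worker_main(Datum main_arg)
{
    /* Signals start out blocked; unblock so SIGTERM can stop us. */
    BackgroundWorkerUnblockSignals();

    for (;;)
    {
        /*
         * Hypothetical work loop: pull pending WAL file names from a
         * shared-memory queue, archive them (possibly many at a time),
         * and record which ones succeeded.
         */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         10 * 1000L,
                         PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
    }
}

void
_PG_init(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = 10;   /* postmaster restarts it after 10s */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "archive_worker");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "archive_worker_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "archive worker");
    snprintf(worker.bgw_type, BGW_MAXLEN, "archive worker");

    RegisterBackgroundWorker(&worker);
}

That way the postmaster owns starting the worker at the right time and
restarting it after a crash, which is exactly the babysitting an
external daemon would have to reinvent.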

--
Robert Haas
EDB: http://www.enterprisedb.com
