Re: parallelizing the archiver

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelizing the archiver
Date: 2021-09-10 15:48:54
Message-ID: CAOBaU_YpHNp4aCEL5v-3UFVSdN65nCZ6=AR+o6q7H+A=C5huNg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 10, 2021 at 11:22 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Well, I guess I'm not convinced. Perhaps people with more knowledge of
> this than I may already know why it's beneficial, but in my experience
> commands like 'cp' and 'scp' are usually limited by the speed of I/O,
> not the fact that you only have one of them running at once. Running
> several at once, again in my experience, is typically not much faster.
> On the other hand, scp has a LOT of startup overhead, so it's easy to
> see the benefits of batching.

I totally agree that batching as many files as possible in a single
command is probably what's going to achieve the best performance. But
if the archiver only gets an answer from the archive_command once it
has tried to process all of the files, it also means that postgres
won't be able to remove any WAL file until all of them have been
processed. That means users will likely have to limit the batch size
and therefore pay more startup overhead than they would like. In the
case of archiving to a server with high latency / connection overhead,
it may be better to be able to run multiple commands in parallel. I
may be overthinking this, and feedback from people with more
experience in that area would definitely be welcome.

> That is possibly true. I think it might work to just assume that you
> have to retry everything if it exits non-zero, but that requires the
> archive command to be smart enough to do something sensible if an
> identical file is already present in the archive.

Yes, it could be. I think we need more feedback on that too.
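The "smart enough" behaviour mentioned above could look something like
the following sketch (hypothetical names, not any existing
archive_command): on retry, an already-present, byte-identical file is
treated as success, while a conflicting file with different contents
fails loudly:

```python
# Hypothetical sketch of an idempotent archive step: safe to retry
# after a non-zero exit, because a re-run of an already-archived file
# is recognized and treated as success.
import filecmp
import os
import shutil

def idempotent_archive(src, archive_dir):
    dst = os.path.join(archive_dir, os.path.basename(src))
    if os.path.exists(dst):
        if filecmp.cmp(src, dst, shallow=False):
            # Identical file already archived: a previous attempt
            # actually succeeded, so report success on retry.
            return True
        # Same name, different contents: never silently overwrite.
        raise RuntimeError("%s exists with different contents" % dst)
    shutil.copy(src, dst)
    return True
```

Under this scheme the archiver can simply retry the whole set after a
failure, since re-archiving a file that already made it is harmless.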

> Sure. Actually, I think a background worker would be better than a
> separate daemon. Then it could just talk to shared memory directly.

I thought about that too, but I was under the impression that most
people would want to implement a custom daemon (or already have one)
in some more parallel/thread-friendly language.
