Re: parallelizing the archiver

From: Jacob Champion <pchampion(at)vmware(dot)com>
To: "rjuju123(at)gmail(dot)com" <rjuju123(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>
Cc: "bossartn(at)amazon(dot)com" <bossartn(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelizing the archiver
Date: 2021-09-10 17:07:01
Message-ID: 78728f8c5e413c05e00426369f79780a35caef5c.camel@vmware.com
Lists: pgsql-hackers

On Fri, 2021-09-10 at 23:48 +0800, Julien Rouhaud wrote:
> I totally agree that batching as many files as possible in a single
> command is probably what's going to achieve the best performance. But
> if the archiver only gets an answer from the archive_command once it
> has tried to process all of the files, it also means that Postgres
> won't be able to remove any WAL file until all of them have been
> processed. That means users will likely have to limit the batch size
> and therefore pay more startup overhead than they would like. In the
> case of archiving to a server with high latency / connection
> overhead, it may be better to be able to run multiple commands in
> parallel.

Well, users would also have to limit the parallelism, right? If
connections are high-overhead, I wouldn't imagine that running hundreds
of them simultaneously would work very well in practice. (The proof
would be in an actual benchmark, obviously, but usually I would rather
have one process handling a hundred items than a hundred processes
handling one item each.)

For a batching scheme, would it be that big a deal to wait for all of
them to be archived before removal?
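
For illustration, here's a rough sketch (Python; the multi-file
calling convention and ARCHIVE_DIR are invented for the example, since
no such interface exists today) of a batched command that pays its
startup cost once and reports on the batch as a whole:

#!/usr/bin/env python3
# Hypothetical batched archiver: invoked once with many WAL segment
# paths, it pays connection/startup overhead a single time and exits
# zero only when every file in the batch is safely archived.

import shutil
import sys
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/archive")  # assumed destination

def main(paths):
    failures = 0
    for src in map(Path, paths):
        try:
            # copy2 preserves mtimes; a real command would also fsync
            shutil.copy2(src, ARCHIVE_DIR / src.name)
        except OSError as exc:
            print(f"failed to archive {src}: {exc}", file=sys.stderr)
            failures += 1
    # Nonzero means the server must keep the *whole* batch and retry;
    # zero means every segment in the batch may now be recycled.
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

Under that contract the server simply keeps every segment in the batch
until the command exits zero, then recycles them all at once.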

> > That is possibly true. I think it might work to just assume that you
> > have to retry everything if it exits non-zero, but that requires the
> > archive command to be smart enough to do something sensible if an
> > identical file is already present in the archive.
>
> Yes, it could be. I think we need more feedback on that too.

Seems like this is the sticking point. What would be the smartest thing
for the command to do? If there's a destination file already, checksum
it and make sure it matches the source before continuing?
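
For instance, something along these lines (a sketch only; the function
names are mine, and a real command would also need to fsync before
reporting success):

#!/usr/bin/env python3
# Sketch of an "idempotent" archive step: if the destination already
# exists, checksum it against the source rather than failing or
# blindly overwriting. Names and layout are invented for illustration.

import hashlib
import shutil
import sys
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def archive_one(src, dest):
    if dest.exists():
        if sha256(src) == sha256(dest):
            # Identical file already archived (e.g. by an earlier
            # attempt that "failed" after the copy): report success.
            return 0
        # Same name, different contents: never overwrite silently.
        print(f"checksum mismatch for {dest}", file=sys.stderr)
        return 1
    tmp = dest.parent / (dest.name + ".tmp")
    shutil.copy2(src, tmp)  # copy to a temp name, then rename, so a
    tmp.rename(dest)        # crash can't leave a half-written segment
    return 0

if __name__ == "__main__":
    sys.exit(archive_one(Path(sys.argv[1]), Path(sys.argv[2])))

Treating a byte-identical destination as success makes a blind retry
of the whole batch safe, while a mismatch still fails loudly instead
of silently overwriting.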

--Jacob
