Re: WIP/PoC for parallel backup

From: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-01-04 06:53:52
Message-ID: CADM=JejLkDmtQ1md9y5KeLVx_pungb75popLO_q9gHKjFTuWag@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 19, 2019 at 10:47 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Dec 12, 2019 at 10:20 AM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
> wrote:
> > I have updated the patches (v7 attached) and have taken care of all
> issues pointed by Jeevan, additionally
> > ran the pgindent on each patch. Furthermore, Command names have been
> renamed as suggested and I
> > have simplified the SendFiles function. Client can only request the
> regular files, any other kind such as
> > directories or symlinks will be skipped, the client will be responsible
> for taking care of such.
>
> Hi,
>
> Patch 0001 of this series conflicts with my recent commit
> 303640199d0436c5e7acdf50b837a027b5726594; that commit was actually
> inspired by some previous study of 0001. That being said, I think 0001
> has the wrong idea. There's no reason that I can see why it should be
> correct to remove the PG_ENSURE_ERROR_CLEANUP calls from
> perform_base_backup(). It's true that if we register a long-lived
> before_shmem_exit hook, then the backup will get cleaned up even
> without the PG_ENSURE_ERROR_CLEANUP block, but there's also the
> question of the warning message. I think that our goal should be to
> emit the warning message about a backup being stopped too early if the
> user uses either pg_start_backup() or the new START_BACKUP command and
> does not end the backup with either pg_stop_backup() or the new
> STOP_BACKUP command -- but not if a single command that both starts
> and ends a backup, like BASE_BACKUP, is interrupted. To accomplish
> that goal in the wake of 303640199d0436c5e7acdf50b837a027b5726594, we
> need to temporarily register do_pg_abort_backup() as a
> before_shmem_exit() handler using PG_ENSURE_ERROR_CLEANUP() during
> commands like BASE_BACKUP() -- and for things like pg_start_backup()
> or the new START_BACKUP command, we just need to add a single call to
> register_persistent_abort_backup_handler().
>
> So I think you can drop 0001, and then in the patch that actually
> introduces START_BACKUP, add the call to
> register_persistent_abort_backup_handler() before calling
> do_pg_start_backup(). Also in that patch, also adjust the warning text
> that do_pg_abort_backup() emits to be more generic e.g. "aborting
> backup due to backend exiting while a non-exclusive backup is in
> progress".
>
Sure, will do.

> 0003 creates three new functions, moving code from
> do_pg_start_backup() to a new function collectTablespaces() and from
> perform_base_backup() to new functions setup_throttle() and
> include_wal_files(). I'm skeptical about all of these changes. One
> general nitpick is that the way these function names are capitalized
> and punctuated does not seem to have been chosen very consistently;
> how about name_like_this() throughout? A bit more substantively:
>
> - collectTablespaces() is factored out of do_pg_start_backup() so that
> it can also be used by SendFileList(), but that means that a client is
> going to invoke START_BACKUP, indirectly calling collectTablespaces(),
> and then immediately afterward the client is probably going to call
> SEND_FILE_LIST, which will again call collectTablespaces(). That does
> not appear to be super-great. For one thing, it's duplicate work,
> although because SendFileList() is going to pass infotbssize as false,
> it's not a lot of duplicated work.

I'll remove this duplication by eliminating this call from the START_BACKUP
and SEND_FILE_LIST functions. More on this is explained later in this email.

> Also, what happens if the two calls
> to collectTablespaces() return different answers due to concurrent
> CREATE/DROP TABLESPACE commands? Maybe it would all work out fine, but
> it seems like there is at least the possibility of bugs if different
> parts of the backup have different notions of what tablespaces exist.
>

Concurrent CREATE/DROP TABLESPACE commands can happen, and that will be
resolved by the WAL files collected for the backup. I don't think we can do
anything when objects are created or dropped between start and stop backup.
BASE_BACKUP also relies on the WAL files to handle such a scenario and does
not error out when some relation files go away.

>
> - setup_throttle() is factored out of perform_base_backup() so that it
> can be called in StartBackup() and StopBackup() and SendFiles(). This
> seems extremely odd. Why does it make any sense to give the user an
> option to activate throttling when *ending* a backup? Why does it make
> sense to give the user a chance to enable throttling *both* at the
> startup of a backup *and also* for each individual file. If we're
> going to support throttling here, it seems like it should be either a
> backup-level property or a file-level property, not both.
>

It's a file-level property only. The throttle functionality relies on global
variables; StartBackup() and StopBackup() call the setup_throttle() function
only to disable throttling.

I should have been more explicit here by passing -1 to setup_throttle(),
making it clear that throttling is disabled, instead of passing 'opt->maxrate'
(although it defaults to -1 for these functions anyway).

I'll remove the setup_throttle() call from both functions.

>
> - include_wal_files() is factored out of perform_base_backup() so that
> it can be called by StopBackup(). This seems like a poor design
> decision. The idea behind the BASE_BACKUP command is that you run that
> one command, and the server sends you everything. The idea in this new
> way of doing business is that the client requests the individual files
> it wants -- except for the WAL files, which are for some reason not
> requested individually but sent all together as part of the
> STOP_BACKUP response. It seems like it would be more consistent if the
> client were to decide which WAL files it needs and request them one by
> one, just as we do with other files.
>

As I understand it, you are suggesting adding another command to fetch the
list of WAL files, which the client would call after executing stop backup.
Once the client gets that list, it starts requesting the WAL files one by one.

So I will add a LIST_WAL_FILES command that takes start_lsn and end_lsn as
arguments and returns the list of WAL files between those LSNs.

Something like this:
LIST_WAL_FILES 'start_lsn' 'end_lsn';
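
For illustration, the server side of this command could derive the segment
names the same way perform_base_backup() already does, using the
XLByteToSeg()/XLogFileName() macros. This is a rough sketch only; the function
name and the way rows are returned are placeholders, not actual patch code:

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"

/*
 * Sketch: collect the WAL segment file names covering startptr..endptr,
 * roughly what LIST_WAL_FILES would send back to the client.
 */
static void
list_wal_segments(XLogRecPtr startptr, XLogRecPtr endptr, TimeLineID tli)
{
    XLogSegNo   startsegno;
    XLogSegNo   endsegno;
    XLogSegNo   segno;
    char        fname[MAXFNAMELEN];

    XLByteToSeg(startptr, startsegno, wal_segment_size);
    XLByteToPrevSeg(endptr, endsegno, wal_segment_size);

    for (segno = startsegno; segno <= endsegno; segno++)
    {
        XLogFileName(fname, tli, segno, wal_segment_size);
        /* emit fname as one row of the LIST_WAL_FILES result set */
    }
}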

>
> I think there's a common theme to all of these complaints, which is
> that you haven't done enough to move things that are the
> responsibility of the backend in the BASE_BACKUP model to the frontend
> in this model. I started wondering, for example, whether it might not
> be better to have the client rather than the server construct the
> tablespace_map file. After all, the client needs to get the list of
> files anyway (hence SEND_FILE_LIST) and if it's got that then it knows
> almost enough to construct the tablespace map. The only additional
> thing it needs is the full pathname to which the link points. But, it
> seems that we could fairly easily extend SEND_FILE_LIST to send, for
> files that are symbolic links, the target of the link, using a new
> column. Or alternatively, using a separate command, so that instead of
> just sending a single SEND_FILE_LIST command, the client might first
> ask for a tablespace list and then might ask for a list of files
> within each tablespace (e.g. LIST_TABLESPACES, then LIST_FILES <oid>
> for each tablespace, with 0 for the main tablespace, perhaps). I'm not
> sure which way is better.
>

do_pg_start_backup() collects the tablespace information anyway, in order to
build the tablespace_map for BASE_BACKUP. So returning the same data seemed
better than adding a new command for the same information; hence the multiple
calls to collectTablespaces() [to be renamed collect_tablespaces()].

The tablespace_map can be constructed by the client, but BASE_BACKUP currently
returns it as part of the full backup. If clients in parallel mode are to
construct it themselves, that leaves us with two different approaches.
Perhaps the client should construct it for BASE_BACKUP as well?

I'll refactor do_pg_start_backup() to remove the code for tablespace
information collection (moving it to collect_tablespaces()) and for
tablespace_map file creation, so that this function does not collect that
information unnecessarily. perform_base_backup() can collect and send the
tablespace information to the client, and the client can then construct the
tablespace_map file.
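
As a rough illustration of the client side, writing the tablespace_map then
amounts to emitting one "<oid> <link target>" line per tablespace returned by
LIST_TABLESPACES, the same format the server produces today. The struct and
function below are invented for this sketch and are not part of the patch:

#include <stdbool.h>
#include <stdio.h>

/* Sketch: one entry per tablespace row returned by LIST_TABLESPACES. */
typedef struct
{
    const char *oid;            /* tablespace OID, as a string */
    const char *path;           /* symlink target reported by the server */
} tablespace_info;

static bool
write_tablespace_map(const char *basedir, const tablespace_info *ts, int nts)
{
    char        mapfile[1024];
    FILE       *fp;
    int         i;

    snprintf(mapfile, sizeof(mapfile), "%s/tablespace_map", basedir);
    fp = fopen(mapfile, "w");
    if (fp == NULL)
        return false;

    for (i = 0; i < nts; i++)
        fprintf(fp, "%s %s\n", ts[i].oid, ts[i].path);

    fclose(fp);
    return true;
}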

I'll add a new command, LIST_TABLESPACES, to fetch the list of tablespaces;
it will return the tablespace information to the client in parallel mode. I
will also refactor the START_BACKUP and STOP_BACKUP commands so that they only
do the specific job of putting the system into or out of backup mode, nothing
else. These commands should only return the start and end LSN to the client.

>
> Similarly, for throttling, I have a hard time understanding how what
> you've got here is going to work reasonably. It looks like each client
> is just going to request whatever MAX_RATE the user specifies, but the
> result of that will be that the actual transfer rate is probably a
> multiple of the specified rate, approximately equal to the specified
> rate times the number of clients. That's probably not what the user
> wants. You could take the specified rate and divide it by the number
> of workers, but limiting each of 4 workers to a quarter of the rate
> will probably lead to a combined rate of less than the specified
> rate, because if one worker doesn't use all of the bandwidth to which
> it's entitled, or even exits earlier than the others, the other
> workers don't get to go any faster as a result. Another problem is
> that, in the current approach, throttling applies overall to the
> entire backup, but in this approach, it is applied separately to each
> SEND_FILE command. In the current approach, if one file finishes a
> little faster or slower than anticipated, the next file in the tarball
> will be sent a little slower or faster to compensate. But in this
> approach, each SEND_FILES command is throttled separately, so this
> property is lost. Furthermore, while BASE_BACKUP sends data
> continuously, this approach naturally involves pauses between
> commands. If files are large, that won't matter much, but if they're
> small and numerous, it will tend to cause the actual transfer rate to
> be less than the throttling rate.
>
> One potential way to solve this problem is... move it to the client
> side. Instead of making it the server's job not to send data too fast,
> make it the client's job not to receive data too fast. Let the server
> backends write as fast as they want, and on the pg_basebackup side,
> have the threads coordinate with each other so that they don't read
> data faster than the configured rate. That's not quite the same thing,
> though, because the server can get ahead by the size of the client's
> receive buffers plus whatever data is on the wire. I don't know
> whether that's a big enough problem to be worth caring about. If it
> is, then I think we need some server infrastructure to "group
> throttle" a group of cooperating backends.
>

That was a mistake in my code; maxrate should have been divided equally
amongst all threads. I agree that we should move this to the client side.
When a thread exits, its share should also be divided equally amongst the
remaining threads (i.e. recalculate maxrate for each remaining thread).

Say we have 4 running threads, each allocated 25% of the bandwidth. Thread 1
exits. We recalculate the bandwidth and assign the remaining 3 threads 33.33%
each. This solves one of the problems you identified. However, it doesn't
solve the case where one (or more) thread is not fully consuming its allocated
share. I'm not really sure how we can solve that properly. Suggestions are
welcome.
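
In code, the bookkeeping could be as small as the sketch below (all names are
invented for illustration); each worker re-reads its share before throttling
the next chunk, so an exiting worker's bandwidth is redistributed to the
survivors automatically:

#include <pthread.h>

/*
 * Sketch: throttling state shared by the pg_basebackup worker threads.
 * Each worker limits itself to maxrate / n_active bytes per second.
 */
typedef struct
{
    pthread_mutex_t lock;
    int             maxrate;    /* user-specified --max-rate, in bytes/sec */
    int             n_active;   /* workers still transferring files */
} throttle_state;

/* Called by a worker before each chunk to learn its current rate limit. */
static int
current_share(throttle_state *ts)
{
    int         share;

    pthread_mutex_lock(&ts->lock);
    share = (ts->n_active > 0) ? ts->maxrate / ts->n_active : ts->maxrate;
    pthread_mutex_unlock(&ts->lock);
    return share;
}

/* Called by a worker once it has no more files to fetch. */
static void
worker_done(throttle_state *ts)
{
    pthread_mutex_lock(&ts->lock);
    ts->n_active--;
    pthread_mutex_unlock(&ts->lock);
}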

>
> A general comment about 0004 is that it seems like you've proceeded by
> taking the code from perform_base_backup() and spreading it across
> several different functions without, necessarily, as much thought as
> is needed there. For instance, StartBackup() looks like just the
> beginning of perform_base_backup(). But, why shouldn't it instead look
> like pg_start_backup() -- in fact, a simplified version that only
> handles the non-exclusive backup case? Is the extra stuff it's doing
> really appropriate? I've already complained about the
> tablespace-related stuff here and the throttling, but there's more.
> Setting statrelpath here will probably break if somebody tries to use
> SEND_FILES without first calling START_BACKUP. Sending the
> backup_label file here is oddly asymmetric, because that's done by
> pg_stop_backup(), not pg_start_backup(). And similarly, StopBackup()
> looks like it's just the end of perform_base_backup(), and that looks
> pretty strange too. Again, I've already complained about
> include_wal_files() being part of this, but there's also:
>
> + /* ... and pg_control after everything else. */
>
> ...which (1) is an odd thing to say when this is the first thing this
> particular function is to send and (2) is another example of a sloppy
> division of labor between client and server; apparently, the client is
> supposed to know not to request pg_control, because the server is
> going to send it unsolicited. There's no particular reason to have
> this a special case. The client could just request it last. And then
> the server code wouldn't need a special case, and you wouldn't have
> this odd logic split between the client and the server.
>
> Overall, I think this needs a lot more work. The overall idea's not
> wrong, but there seem to be a very large number of details which, at
> least to me, do not seem to be correct.
>
>

Thank you, Robert, for the detailed review. I really appreciate your insights
and very precise feedback.

After the changes suggested above, the high-level design will look something
like this:

=== SEQUENTIAL EXECUTION ===
START_BACKUP [LABEL | FAST]
- Starts backup on the server
- Returns the start LSN to client

LIST_TABLESPACES
- Sends a list of all tablespaces to the client

Loop over the tablespaces returned by LIST_TABLESPACES
- LIST_FILES [tablespace]
- Sends file list for the given tablespace
- Create a list of all files

=== PARALLEL EXECUTION ===
Threads loop until the list of files is exhausted
SEND_FILE <file(s)> [CHECKSUM | WAL_START_LOCATION]
- If checksums are enabled, then WAL_START_LOCATION is required.
- Can request the server to send one or more files, but we request one at a time
- Pick the next file from the list of files (see the worker-loop sketch after this outline)

- Threads sleep after the list is exhausted
- All threads are sleeping

=== SEQUENTIAL EXECUTION ===
STOP_BACKUP [NOWAIT]
- Stops backup mode
- Return end LSN

If --wal-method=fetch then
LIST_WAL_FILES 'start_lsn' 'end_lsn'
- Sends a list of WAL files between start LSN and end LSN

=== PARALLEL EXECUTION ===
Threads loop until the list of WAL files is exhausted
SEND_FILE <WAL file>
- Can request the server to send one or more files, but we request one WAL file at a time
- Pick the next file from the list of WAL files

- Threads terminate and set their status as completed/terminated

=== SEQUENTIAL EXECUTION ===
Cleanup
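
To make the parallel phases above concrete, each worker thread would run
something like the sketch below. All names are placeholders; fetch_one_file()
stands in for the libpq plumbing that issues SEND_FILE and drains the data it
returns:

#include <pthread.h>
#include <stddef.h>

extern void fetch_one_file(const char *path);   /* hypothetical libpq helper */

/* Shared, mutex-protected list of files returned by LIST_FILES. */
typedef struct
{
    pthread_mutex_t lock;
    char          **files;
    int             nfiles;
    int             next;       /* index of the next unclaimed file */
} work_queue;

static const char *
claim_next_file(work_queue *q)
{
    const char *file = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->next < q->nfiles)
        file = q->files[q->next++];
    pthread_mutex_unlock(&q->lock);
    return file;
}

static void *
backup_worker(void *arg)
{
    work_queue *q = (work_queue *) arg;
    const char *file;

    while ((file = claim_next_file(q)) != NULL)
    {
        /* e.g. SEND_FILE 'base/13593/16384' CHECKSUM ... */
        fetch_one_file(file);
    }

    /* List exhausted: the thread goes idle until the next phase. */
    return NULL;
}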

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
