Re: WIP/PoC for parallel backup

From: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-01-30 13:39:51
Message-ID: CADM=Jejh2gFzeOpYCVJPWmL9NqOPhOcqYLLWcF0CLFjq1JcvqA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 4, 2020 at 11:53 AM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> wrote:

>
>
> On Thu, Dec 19, 2019 at 10:47 PM Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>> On Thu, Dec 12, 2019 at 10:20 AM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
>> wrote:
>> > I have updated the patches (v7 attached) and have taken care of all
>> issues pointed by Jeevan, additionally
>> > ran the pgindent on each patch. Furthermore, Command names have been
>> renamed as suggested and I
>> > have simplified the SendFiles function. Client can only request the
>> regular files, any other kind such as
>> > directories or symlinks will be skipped, the client will be responsible
>> for taking care of such.
>>
>> Hi,
>>
>> Patch 0001 of this series conflicts with my recent commit
>> 303640199d0436c5e7acdf50b837a027b5726594; that commit was actually
>> inspired by some previous study of 0001. That being said, I think 0001
>> has the wrong idea. There's no reason that I can see why it should be
>> correct to remove the PG_ENSURE_ERROR_CLEANUP calls from
>> perform_base_backup(). It's true that if we register a long-lived
>> before_shmem_exit hook, then the backup will get cleaned up even
>> without the PG_ENSURE_ERROR_CLEANUP block, but there's also the
>> question of the warning message. I think that our goal should be to
>> emit the warning message about a backup being stopped too early if the
>> user uses either pg_start_backup() or the new START_BACKUP command and
>> does not end the backup with either pg_stop_backup() or the new
>> STOP_BACKUP command -- but not if a single command that both starts
>> and ends a backup, like BASE_BACKUP, is interrupted. To accomplish
>> that goal in the wake of 303640199d0436c5e7acdf50b837a027b5726594, we
>> need to temporarily register do_pg_abort_backup() as a
>> before_shmem_exit() handler using PG_ENSURE_ERROR_CLEANUP() during
>> commands like BASE_BACKUP() -- and for things like pg_start_backup()
>> or the new START_BACKUP command, we just need to add a single call to
>> register_persistent_abort_backup_handler().
>>
>> So I think you can drop 0001, and then in the patch that actually
>> introduces START_BACKUP, add the call to
>> register_persistent_abort_backup_handler() before calling
>> do_pg_start_backup(). Also in that patch, also adjust the warning text
>> that do_pg_abort_backup() emits to be more generic e.g. "aborting
>> backup due to backend exiting while a non-exclusive backup is in
>> progress".
>>
> Sure. Will do.
>
>
>> 0003 creates three new functions, moving code from
>> do_pg_start_backup() to a new function collectTablespaces() and from
>> perform_base_backup() to new functions setup_throttle() and
>> include_wal_files(). I'm skeptical about all of these changes. One
>> general nitpick is that the way these function names are capitalized
>> and punctuated does not seem to have been chosen very consistently;
>> how about name_like_this() throughout? A bit more substantively:
>>
>> - collectTablespaces() is factored out of do_pg_start_backup() so that
>> it can also be used by SendFileList(), but that means that a client is
>> going to invoke START_BACKUP, indirectly calling collectTablespaces(),
>> and then immediately afterward the client is probably going to call
>> SEND_FILE_LIST, which will again call collectTablespaces(). That does
>> not appear to be super-great. For one thing, it's duplicate work,
>> although because SendFileList() is going to pass infotbssize as false,
>> it's not a lot of duplicated work.
>
>
> I'll remove this duplication by eliminating this call from START_BACKUP and
> SEND_FILE_LIST functions. More about this is explained later in this email.
>
>
>> Also, what happens if the two calls
>> to collectTablespaces() return different answers due to concurrent
>> CREATE/DROP TABLESPACE commands? Maybe it would all work out fine, but
>> it seems like there is at least the possibility of bugs if different
>> parts of the backup have different notions of what tablespaces exist.
>>
>
> Concurrent CREATE/DROP TABLESPACE commands can happen, and the resulting
> differences will be resolved by the WAL files collected for the backup. I
> don't think we can do anything when objects are created or dropped between
> start and stop backup. BASE_BACKUP also relies on the WAL files to handle
> such a scenario and does not error out when some relation files go away.
>
>
>>
>> - setup_throttle() is factored out of perform_base_backup() so that it
>> can be called in StartBackup() and StopBackup() and SendFiles(). This
>> seems extremely odd. Why does it make any sense to give the user an
>> option to activate throttling when *ending* a backup? Why does it make
>> sense to give the user a chance to enable throttling *both* at the
>> startup of a backup *and also* for each individual file. If we're
>> going to support throttling here, it seems like it should be either a
>> backup-level property or a file-level property, not both.
>>
>
> It's a file-level property only. The throttle functionality relies on
> global variables. StartBackup() and StopBackup() call setup_throttle()
> only to disable throttling.
>
> I should have been more explicit here by passing -1 to setup_throttle(),
> making it clear that throttling is disabled, instead of passing
> 'opt->maxrate' (although it defaults to -1 for these functions).
>
> I'll remove the setup_throttle() call from both functions.
>
>
>>
>> - include_wal_files() is factored out of perform_base_backup() so that
>> it can be called by StopBackup(). This seems like a poor design
>> decision. The idea behind the BASE_BACKUP command is that you run that
>> one command, and the server sends you everything. The idea in this new
>> way of doing business is that the client requests the individual files
>> it wants -- except for the WAL files, which are for some reason not
>> requested individually but sent all together as part of the
>> STOP_BACKUP response. It seems like it would be more consistent if the
>> client were to decide which WAL files it needs and request them one by
>> one, just as we do with other files.
>>
>
> As I understand it, you are suggesting adding another command to fetch
> the list of WAL files, which the client would call after executing stop
> backup. Once the client gets that list, it starts requesting the WAL
> files one by one.
>
> So I will add a LIST_WAL_FILES command that will take start_lsn and
> end_lsn as arguments and return the list of WAL files between these LSNs.
>
> Something like this:
> LIST_WAL_FILES 'start_lsn' 'end_lsn';
>
>
>>
>> I think there's a common theme to all of these complaints, which is
>> that you haven't done enough to move things that are the
>> responsibility of the backend in the BASE_BACKUP model to the frontend
>> in this model. I started wondering, for example, whether it might not
>> be better to have the client rather than the server construct the
>> tablespace_map file. After all, the client needs to get the list of
>> files anyway (hence SEND_FILE_LIST) and if it's got that then it knows
>> almost enough to construct the tablespace map. The only additional
>> thing it needs is the full pathname to which the link points. But, it
>> seems that we could fairly easily extend SEND_FILE_LIST to send, for
>> files that are symbolic links, the target of the link, using a new
>> column. Or alternatively, using a separate command, so that instead of
>> just sending a single SEND_FILE_LIST command, the client might first
>> ask for a tablespace list and then might ask for a list of files
>> within each tablespace (e.g. LIST_TABLESPACES, then LIST_FILES <oid>
>> for each tablespace, with 0 for the main tablespace, perhaps). I'm not
>> sure which way is better.
>>
>
> do_pg_start_backup is collecting the tablespace information anyway to
> build the tablespace_map for BASE_BACKUP, so returning the same data
> seemed better than adding a new command for it; hence the multiple calls
> to collectTablespaces() [to be renamed to collect_tablespaces].
>
> tablespace_map can be constructed by the client, but BASE_BACKUP
> currently returns it as part of the full backup. If clients in parallel
> mode are to construct it themselves, the two modes will take different
> approaches. Perhaps the client should construct it for BASE_BACKUP as
> well?
>
> I'll refactor the do_pg_start_backup function to remove the code related
> to tablespace information collection (moving it to collect_tablespaces)
> and tablespace_map file creation, so that this function does not collect
> this information unnecessarily. The perform_base_backup function can
> collect and send the tablespace information to the client, and then the
> client can construct the tablespace_map file.
>
> I'll add a new command, LIST_TABLESPACES, to return the tablespace
> information to the client for parallel mode. And I will refactor the
> START_BACKUP and STOP_BACKUP commands so that they only do the specific
> job of putting the system into backup mode or taking it out of it,
> nothing else. These commands should only return the start and end LSN to
> the client.
>
>
>
>>
>> Similarly, for throttling, I have a hard time understanding how what
>> you've got here is going to work reasonably. It looks like each client
>> is just going to request whatever MAX_RATE the user specifies, but the
>> result of that will be that the actual transfer rate is probably a
>> multiple of the specified rate, approximately equal to the specified
>> rate times the number of clients. That's probably not what the user
>> wants. You could take the specified rate and divide it by the number
>> of workers, but limiting each of 4 workers to a quarter of the rate
>> will probably lead to a combined rate of less than than the specified
>> rate, because if one worker doesn't use all of the bandwidth to which
>> it's entitled, or even exits earlier than the others, the other
>> workers don't get to go any faster as a result. Another problem is
>> that, in the current approach, throttling applies overall to the
>> entire backup, but in this approach, it is applied separately to each
>> SEND_FILE command. In the current approach, if one file finishes a
>> little faster or slower than anticipated, the next file in the tarball
>> will be sent a little slower or faster to compensate. But in this
>> approach, each SEND_FILES command is throttled separately, so this
>> property is lost. Furthermore, while BASE_BACKUP sends data
>> continuously, this approach naturally involves pauses between
>> commands. If files are large, that won't matter much, but if they're
>> small and numerous, it will tend to cause the actual transfer rate to
>> be less than the throttling rate.
>>
>> One potential way to solve this problem is... move it to the client
>> side. Instead of making it the server's job not to send data too fast,
>> make it the client's job not to receive data too fast. Let the server
>> backends write as fast as they want, and on the pg_basebackup side,
>> have the threads coordinate with each other so that they don't read
>> data faster than the configured rate. That's not quite the same thing,
>> though, because the server can get ahead by the size of the client's
>> receive buffers plus whatever data is on the wire. I don't know
>> whether that's a big enough problem to be worth caring about. If it
>> is, then I think we need some server infrastructure to "group
>> throttle" a group of cooperating backends.
>>
>
> That was a mistake in my code; maxrate should have been divided equally
> among all threads. I agree that we should move this to the client side.
> When a thread exits, its share should also be divided equally among the
> remaining threads (i.e. recalculate maxrate for each remaining thread).
>
> Say we have 4 running threads, each allocated 25% of the bandwidth.
> Thread 1 exits. We recalculate the bandwidth and assign the remaining 3
> threads 33.33% each. This solves one problem that you had identified.
> However, it doesn't solve the case where one (or more) threads are not
> fully consuming their allocated share. I'm not really sure how we can
> solve that properly. Suggestions are welcome.
>
>
>>
>> A general comment about 0004 is that it seems like you've proceeded by
>> taking the code from perform_base_backup() and spreading it across
>> several different functions without, necessarily, as much thought as
>> is needed there. For instance, StartBackup() looks like just the
>> beginning of perform_base_backup(). But, why shouldn't it instead look
>> like pg_start_backup() -- in fact, a simplified version that only
>> handles the non-exclusive backup case? Is the extra stuff it's doing
>> really appropriate? I've already complained about the
>> tablespace-related stuff here and the throttling, but there's more.
>> Setting statrelpath here will probably break if somebody tries to use
>> SEND_FILES without first calling START_BACKUP. Sending the
>> backup_label file here is oddly asymmetric, because that's done by
>> pg_stop_backup(), not pg_start_backup(). And similarly, StopBackup()
>> looks like it's just the end of perform_base_backup(), but that's
>> pretty strange-looking too. Again, I've already complained about
>> include_wal_files() being part of this, but there's also:
>>
>> + /* ... and pg_control after everything else. */
>>
>> ...which (1) is an odd thing to say when this is the first thing this
>> particular function is to send and (2) is another example of a sloppy
>> division of labor between client and server; apparently, the client is
>> supposed to know not to request pg_control, because the server is
>> going to send it unsolicited. There's no particular reason to have
>> this a special case. The client could just request it last. And then
>> the server code wouldn't need a special case, and you wouldn't have
>> this odd logic split between the client and the server.
>>
>> Overall, I think this needs a lot more work. The overall idea's not
>> wrong, but there seem to be a very large number of details which, at
>> least to me, do not seem to be correct.
>>
>>
>
> Thank you Robert for the detailed review. I really appreciate your insights
> and very precise feedback.
>
> After the changes suggested above, the high-level design will look
> something like this:
>
> === SEQUENTIAL EXECUTION ===
> START_BACKUP [LABEL | FAST]
> - Starts backup on the server
> - Returns the start LSN to client
>
> LIST_TABLESPACES
> - Sends a list of all tablespaces to the client
>
> Loop over the tablespaces returned by LIST_TABLESPACES:
> - LIST_FILES [tablespace]
> - Sends the file list for the given tablespace
> - Build a list of all files
>
> === PARALLEL EXECUTION ===
> Each thread loops until the list of files is exhausted:
> SEND_FILE <file(s)> [CHECKSUM | WAL_START_LOCATION]
> - If checksum verification is enabled then WAL_START_LOCATION is required.
> - Can request the server to send one or more files, but we request one
> at a time
> - Pick the next file from the list of files
>
> - Threads sleep after the list is exhausted
> - All threads are sleeping
>
> === SEQUENTIAL EXECUTION ===
> STOP_BACKUP [NOWAIT]
> - Stops backup mode
> - Return end LSN
>
> If --wal-method=fetch then
> LIST_WAL_FILES 'start_lsn' 'end_lsn'
> - Sends a list of WAL files between start LSN and end LSN
>
> === PARALLEL EXECUTION ===
> Each thread loops until the list of WAL files is exhausted:
> SEND_FILE <WAL file>
> - Can request the server to send one or more files, but we request one
> WAL file at a time
> - Pick the next file from the list of WAL files
>
> - Threads terminate and set their status as completed/terminated
>
> === SEQUENTIAL EXECUTION ===
> Cleanup
>
>
>
>
Here are the updated patches, taking care of the issues pointed out
earlier. This patch set adds the following commands (with the specified
options):

START_BACKUP [LABEL '<label>'] [FAST]
STOP_BACKUP [NOWAIT]
LIST_TABLESPACES [PROGRESS]
LIST_FILES [TABLESPACE]
LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
SEND_FILES '(' FILE, FILE... ')' [START_WAL_LOCATION 'X/X']
[NOVERIFY_CHECKSUMS]
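
As a rough, server-independent illustration of the parallel phase, each
worker pulls the next file from a shared list and fetches it over its own
connection. This is only a sketch: the fetch_file callback stands in for
issuing SEND_FILES on a replication connection, and all names here are my
own assumptions, not code from the patches.

```python
import queue
import threading

def run_workers(files, fetch_file, num_workers=4):
    """Drain a shared list of files with a pool of worker threads.

    fetch_file(path) stands in for requesting one file from the server
    (e.g. via SEND_FILES); each thread picks the next file until the
    list is exhausted, then terminates.
    """
    work = queue.Queue()
    for f in files:
        work.put(f)

    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            try:
                path = work.get_nowait()  # pick the next file from the list
            except queue.Empty:
                return  # list exhausted; this thread completes
            data = fetch_file(path)
            with done_lock:
                done.append((path, data))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until all threads have completed/terminated
    return done
```

The same loop is reused for the WAL-file phase; only the input list changes.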

Parallel backup does not make any use of the tablespace map, so I have
removed that option from the above commands. There is a patch pending to
remove exclusive backup; we can further refactor the do_pg_start_backup
function at that time, to remove the tablespace information and move the
creation of the tablespace_map file to the client.
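
For illustration, constructing the tablespace_map contents on the client
could be as simple as the sketch below. The (oid, path) input is assumed to
come from a LIST_TABLESPACES-style command; the one-line-per-tablespace
"<oid> <path>" layout matches what the server writes today.

```python
def build_tablespace_map(tablespaces):
    """Construct the contents of a tablespace_map file on the client.

    tablespaces is a list of (oid, link_target) pairs, as could be
    returned by a LIST_TABLESPACES-style command (an assumption here).
    Each entry becomes one "<oid> <path>" line.
    """
    return "".join("%s %s\n" % (oid, path) for oid, path in tablespaces)
```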

I have disabled the maxrate option for parallel backup. I intend to send
out a separate patch for it. Robert previously suggested implementing
throttling on the client side. I found the original email thread [1]
where throttling was proposed and added to the server. In that thread, it
was originally implemented on the client side, but following many
suggestions, it was moved to the server side.
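
For the client-side approach, one way to avoid partitioning the bandwidth
per worker at all is a single token bucket shared by every receiving
thread. The sketch below is only an illustration of that idea under my own
naming, not code from any patch; as Robert noted, the server can still run
ahead by the size of the receive buffers plus data on the wire.

```python
import threading
import time

class SharedRateLimiter:
    """A token bucket shared by all receiving threads (sketch only).

    Each worker calls acquire(n) before reading n bytes off its
    connection. Because the bucket is shared, bandwidth that one worker
    leaves unused (or frees by exiting) is automatically available to
    the others.
    """
    def __init__(self, max_rate_bytes_per_sec):
        self.rate = float(max_rate_bytes_per_sec)
        self.tokens = self.rate          # allow up to one second's burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, nbytes):
        assert nbytes <= self.rate, "read chunks must fit one second's quota"
        while True:
            with self.lock:
                now = time.monotonic()
                # refill for the elapsed time, capped at one second's worth
                self.tokens = min(self.rate,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                deficit = nbytes - self.tokens
            # sleep outside the lock so other workers can keep going
            time.sleep(deficit / self.rate)
```

A worker would call limiter.acquire(len(chunk)) before each read from its
connection.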

So, I have a few suggestions on how we can implement this:

1- Add another option to pg_basebackup (i.e. per-worker-maxrate) letting
the user choose the bandwidth allocation for each worker. This approach
can be implemented on the client side as well as on the server side.

2- Divide maxrate equally among the workers at first, and then let the
main thread keep adjusting it whenever one of the workers finishes. I
believe this would only be possible if we handle throttling on the
client. Also, as I understand it, implementing this will introduce an
additional mutex for the bookkeeping of bandwidth consumption, so that
the rate may be adjusted according to the data received by the threads.
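
The rebalancing in suggestion 2 could be sketched as below, with a mutex
protecting the shared bookkeeping; the class and method names are
illustrative assumptions, not code from the patches.

```python
import threading

class WorkerRates:
    """Divide max_rate equally among active workers and rebalance when
    one finishes.

    A mutex protects the shared counters; each worker reads its current
    allowance through per_worker_rate() before throttling its reads.
    """
    def __init__(self, max_rate, num_workers):
        self.max_rate = float(max_rate)
        self.active = num_workers
        self.lock = threading.Lock()

    def per_worker_rate(self):
        with self.lock:
            return self.max_rate / self.active if self.active else 0.0

    def worker_finished(self):
        with self.lock:
            self.active -= 1  # the remaining workers' shares grow
```

As noted above, this still does not help when an active worker simply
under-uses its share; only a shared bucket addresses that case.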

[1]
https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

Attachment Content-Type Size
0001-Rename-sizeonly-to-dryrun-for-few-functions-in-baseb.patch application/octet-stream 7.3 KB
0002-Refactor-some-backup-code-to-increase-reusability.-T.patch application/octet-stream 15.6 KB
0004-Parallel-Backup-pg_basebackup.patch application/octet-stream 32.2 KB
0005-parallel-backup-testcase.patch application/octet-stream 18.0 KB
0003-Parallel-Backup-Backend-Replication-commands.patch application/octet-stream 33.3 KB
