Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: asifr(dot)rehman(at)gmail(dot)com
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-11 09:38:20
Message-ID: CAKcux6mMk-F2LmUy9arVu0QQiZfVCuBWGZbSxjn=dAjUWMSvew@mail.gmail.com
Lists: pgsql-hackers

Hi Asif

I have started testing this feature. I applied the v6 patch on commit
a069218163704c44a8996e7e98e765c56e2b9c8e (30 Jan).
I have a few observations; please take a look.

*--if the backup fails, the backup directory is not removed.*
[edb(at)localhost bin]$ ./pg_basebackup -p 5432 --jobs=9 -D /tmp/test_bkp/bkp6
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
[edb(at)localhost bin]$ ./pg_basebackup -p 5432 --jobs=8 -D /tmp/test_bkp/bkp6
pg_basebackup: error: directory "/tmp/test_bkp/bkp6" exists but is not empty

*--giving a large number of jobs leads to a segmentation fault.*
./pg_basebackup -p 5432 --jobs=1000 -D /tmp/t3
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
.
.
.
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: error: could not connect to server: could not fork new
process for connection: Resource temporarily unavailable

could not fork new process for connection: Resource temporarily unavailable
pg_basebackup: error: failed to create thread: Resource temporarily
unavailable
Segmentation fault (core dumped)

--stack-trace
gdb -q -c core.11824 pg_basebackup
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./pg_basebackup -p 5432 --jobs=1000 -D
/tmp/test_bkp/bkp10'.
Program terminated with signal 11, Segmentation fault.
#0 pthread_join (threadid=140503120623360, thread_return=0x0) at
pthread_join.c:46
46 if (INVALID_NOT_TERMINATED_TD_P (pd))
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 pthread_join (threadid=140503120623360, thread_return=0x0) at
pthread_join.c:46
#1 0x0000000000408e21 in cleanup_workers () at pg_basebackup.c:2840
#2 0x0000000000403846 in disconnect_atexit () at pg_basebackup.c:316
#3 0x0000003921235a02 in __run_exit_handlers (status=1) at exit.c:78
#4 exit (status=1) at exit.c:100
#5 0x0000000000408aa6 in create_parallel_workers (backupinfo=0x1a4b8c0) at
pg_basebackup.c:2713
#6 0x0000000000407946 in BaseBackup () at pg_basebackup.c:2127
#7 0x000000000040895c in main (argc=6, argv=0x7ffd566f4718) at
pg_basebackup.c:2668
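
The backtrace shows exit() being called from create_parallel_workers() after
"failed to create thread", and the exit handler then crashing in
pthread_join() inside cleanup_workers(). One possible explanation (I have not
confirmed it in the patch) is that the cleanup joins thread handles that were
never created. A minimal sketch of guarding against that, with hypothetical
names (WorkerSlot, WorkerPool, pool_start, pool_cleanup) that do not come
from the patch:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical worker bookkeeping; names do not come from the patch. */
typedef struct WorkerSlot
{
    pthread_t   thread;
    bool        started;    /* true only after pthread_create() succeeded */
} WorkerSlot;

typedef struct WorkerPool
{
    WorkerSlot *slots;
    int         nslots;
} WorkerPool;

static void *
worker_main(void *arg)
{
    (void) arg;             /* a real worker would fetch files here */
    return NULL;
}

/* Allocate 'n' (> 0) slots, all marked as not started. */
static bool
pool_init(WorkerPool *pool, int n)
{
    pool->slots = calloc((size_t) n, sizeof(WorkerSlot));
    pool->nslots = (pool->slots != NULL) ? n : 0;
    return pool->slots != NULL;
}

/* Start up to 'n' workers; stop at the first failure. Returns number started. */
static int
pool_start(WorkerPool *pool, int n)
{
    int         started = 0;

    for (int i = 0; i < pool->nslots && started < n; i++)
    {
        if (pool->slots[i].started)
            continue;
        if (pthread_create(&pool->slots[i].thread, NULL, worker_main, NULL) != 0)
            break;
        pool->slots[i].started = true;
        started++;
    }
    return started;
}

/*
 * Join only the workers that were really created. Safe to call from an
 * exit handler and safe to call twice. Returns the number of threads joined.
 */
static int
pool_cleanup(WorkerPool *pool)
{
    int         joined = 0;

    for (int i = 0; i < pool->nslots; i++)
    {
        if (!pool->slots[i].started)
            continue;
        if (pthread_join(pool->slots[i].thread, NULL) == 0)
            joined++;
        pool->slots[i].started = false;
    }
    free(pool->slots);
    pool->slots = NULL;
    pool->nslots = 0;
    return joined;
}
```

Independently of the crash, it might also be worth rejecting a --jobs value
that cannot be satisfied up front, instead of attempting every connection.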

*--with a tablespace in the same directory as data, parallel backup
crashes*
[edb(at)localhost bin]$ ./initdb -D /tmp/data
[edb(at)localhost bin]$ ./pg_ctl -D /tmp/data -l /tmp/logfile start
[edb(at)localhost bin]$ mkdir /tmp/ts
[edb(at)localhost bin]$ ./psql postgres
psql (13devel)
Type "help" for help.

postgres=# create tablespace ts location '/tmp/ts';
CREATE TABLESPACE
postgres=# create table tx (a int) tablespace ts;
CREATE TABLE
postgres=# \q
[edb(at)localhost bin]$ ./pg_basebackup -j 2 -D /tmp/tts -T /tmp/ts=/tmp/ts1
Segmentation fault (core dumped)

--stack-trace
[edb(at)localhost bin]$ gdb -q -c core.15778 pg_basebackup
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./pg_basebackup -j 2 -D /tmp/tts -T
/tmp/ts=/tmp/ts1'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000409442 in get_backup_filelist (conn=0x140cb20,
backupInfo=0x14210a0) at pg_basebackup.c:3000
3000 backupInfo->curr->next = file;
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000000000409442 in get_backup_filelist (conn=0x140cb20,
backupInfo=0x14210a0) at pg_basebackup.c:3000
#1 0x0000000000408b56 in parallel_backup_run (backupinfo=0x14210a0) at
pg_basebackup.c:2739
#2 0x0000000000407955 in BaseBackup () at pg_basebackup.c:2128
#3 0x000000000040895c in main (argc=7, argv=0x7ffca2910c58) at
pg_basebackup.c:2668
(gdb)
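
The faulting statement is "backupInfo->curr->next = file;", which would crash
if curr is still NULL when the first entry is appended (for example, when the
file list for a tablespace starts out empty). I have not verified that this
is the actual cause. A minimal sketch of an append that handles the empty
list; only the field names curr and next come from the trace, everything else
(FileEntry, FileList, filelist_append, filelist_free) is hypothetical:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list types; only 'curr' and 'next' appear in the backtrace. */
typedef struct FileEntry
{
    char       *path;
    struct FileEntry *next;
} FileEntry;

typedef struct FileList
{
    FileEntry  *head;       /* first entry, NULL when the list is empty */
    FileEntry  *curr;       /* last entry appended, NULL when empty */
    int         count;
} FileList;

/* Append a copy of 'path'; returns false on out-of-memory. */
static bool
filelist_append(FileList *list, const char *path)
{
    size_t      len = strlen(path) + 1;
    FileEntry  *file = malloc(sizeof(FileEntry));

    if (file == NULL)
        return false;
    file->path = malloc(len);
    if (file->path == NULL)
    {
        free(file);
        return false;
    }
    memcpy(file->path, path, len);
    file->next = NULL;

    if (list->head == NULL)
        list->head = file;          /* first entry: there is no curr yet */
    else
        list->curr->next = file;    /* safe: curr is non-NULL here */
    list->curr = file;
    list->count++;
    return true;
}

/* Free all entries and reset the list to empty. */
static void
filelist_free(FileList *list)
{
    FileEntry  *e = list->head;

    while (e != NULL)
    {
        FileEntry  *next = e->next;

        free(e->path);
        free(e);
        e = next;
    }
    list->head = NULL;
    list->curr = NULL;
    list->count = 0;
}
```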

Thanks & Regards,
Rajkumar Raghuwanshi

On Tue, Feb 25, 2020 at 7:49 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> wrote:

> Hi,
>
> I have created a commitfest entry.
> https://commitfest.postgresql.org/27/2472/
>
>
> On Mon, Feb 17, 2020 at 1:39 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
> wrote:
>
>> Thanks Jeevan. Here is the documentation patch.
>>
>> On Mon, Feb 10, 2020 at 6:49 PM Jeevan Chalke <
>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>
>>> Hi Asif,
>>>
>>> On Thu, Jan 30, 2020 at 7:10 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
>>> wrote:
>>>
>>>>
>>>> Here are the updated patches, taking care of the issues pointed out
>>>> earlier. This patch adds the following commands (with the specified options):
>>>>
>>>> START_BACKUP [LABEL '<label>'] [FAST]
>>>> STOP_BACKUP [NOWAIT]
>>>> LIST_TABLESPACES [PROGRESS]
>>>> LIST_FILES [TABLESPACE]
>>>> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
>>>> SEND_FILES '(' FILE, FILE... ')' [START_WAL_LOCATION 'X/X']
>>>> [NOVERIFY_CHECKSUMS]
>>>>
>>>>
>>>> Parallel backup is not making any use of tablespace map, so I have
>>>> removed that option from the above commands. There is a patch pending
>>>> to remove the exclusive backup; we can further refactor the
>>>> do_pg_start_backup
>>>> function at that time, to remove the tablespace information and move the
>>>> creation of tablespace_map file to the client.
>>>>
>>>>
>>>> I have disabled the maxrate option for parallel backup. I intend to send
>>>> out a separate patch for it. Robert previously suggested implementing
>>>> throttling on the client side. I found the original email thread [1]
>>>> where throttling was proposed and added to the server. In that thread,
>>>> it was originally implemented on the client side, but per many
>>>> suggestions, it was moved to the server side.
>>>>
>>>> So, I have a few suggestions on how we can implement this:
>>>>
>>>> 1- Have another option for pg_basebackup (i.e. per-worker-maxrate) where
>>>> the user could choose the bandwidth allocation for each worker. This
>>>> approach can be implemented on the client side as well as on the
>>>> server side.
>>>>
>>>> 2- Have the maxrate divided equally among the workers at first, and then
>>>> let the main thread keep adjusting it whenever one of the workers
>>>> finishes. I believe this would only be possible if we handle throttling
>>>> on the client. Also, as I understand it, implementing this will
>>>> introduce an additional mutex for handling the bandwidth consumption
>>>> data, so that the rate may be adjusted according to the data received by
>>>> the threads.
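
On option 2 above: a rough sketch of the bookkeeping it seems to imply, with
the total maxrate split equally among the workers that are still running and
the split recomputed when a worker finishes. This is only an illustration on
the client side; RateBudget and the rate_budget_* functions are hypothetical
names, and the actual throttling (sleeping to stay under the share) is left
out.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared state; none of these names come from the patch. */
typedef struct RateBudget
{
    pthread_mutex_t lock;
    uint64_t    maxrate;        /* total limit, e.g. in kB/s */
    int         active_workers; /* workers still transferring data */
} RateBudget;

static void
rate_budget_init(RateBudget *b, uint64_t maxrate, int nworkers)
{
    pthread_mutex_init(&b->lock, NULL);
    b->maxrate = maxrate;
    b->active_workers = nworkers;
}

/* Current per-worker limit: an equal split among the workers still running. */
static uint64_t
rate_budget_share(RateBudget *b)
{
    uint64_t    share = 0;

    pthread_mutex_lock(&b->lock);
    if (b->active_workers > 0)
        share = b->maxrate / (uint64_t) b->active_workers;
    pthread_mutex_unlock(&b->lock);
    return share;
}

/* Called by a worker when it finishes; remaining workers get a larger share. */
static void
rate_budget_worker_done(RateBudget *b)
{
    pthread_mutex_lock(&b->lock);
    if (b->active_workers > 0)
        b->active_workers--;
    pthread_mutex_unlock(&b->lock);
}
```

Each worker would re-read its share periodically, so a finished worker's
bandwidth is picked up by the others without the main thread pushing updates.
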
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>>>
>>>> --
>>>> Asif Rehman
>>>> Highgo Software (Canada/China/Pakistan)
>>>> URL : www.highgo.ca
>>>>
>>>>
>>>
>>> The latest changes look good to me. However, the patch set is missing
>>> the documentation. Please add it.
>>>
>>> Thanks
>>>
>>> --
>>> Jeevan Chalke
>>> Associate Database Architect & Team Lead, Product Development
>>> EnterpriseDB Corporation
>>> The Enterprise PostgreSQL Company
>>>
>>>
>>
>> --
>> --
>> Asif Rehman
>> Highgo Software (Canada/China/Pakistan)
>> URL : www.highgo.ca
>>
>>
>
> --
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
