Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Kashif Zeeshan <kashif(dot)zeeshan(at)enterprisedb(dot)com>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-04-02 09:57:52
Message-ID: CAKcux6mU3sJRUJxvdeajZ+kDnmxcnfsJNGbY72KLd+=X3_Ymmw@mail.gmail.com
Lists: pgsql-hackers

Hi Asif,

My colleague Kashif Zeeshan reported an issue off-list; I am posting it here.
Please take a look.

When two backups are executed at the same time, some workers fail with a
FATAL error due to max_wal_senders, but instead of exiting, pg_basebackup
reports that the backup completed. Starting a server from the resulting
backup directory then fails.

[edb(at)localhost bin]$ ./pgbench -i -s 200 -h localhost -p 5432 postgres
[edb(at)localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C2000270 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57849"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C3000050
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
[edb(at)localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup1/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C20001C0 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57848"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL: number of
requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C2000348
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
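
In the logs above, each backup appears to get five connections (the main
connection, the WAL receiver, and workers 0-2) before the limit of 10 is hit.
A rough pre-flight check for this can be sketched as below. The function names
are hypothetical, and the "-j N needs N + 2 walsenders" rule is only an
assumption inferred from these logs, not something the patch documents.

```shell
#!/bin/sh
# Hypothetical helpers; not part of pg_basebackup or the patch.

# Assumption (inferred from the logs): a backup run with -j N opens
# N worker connections plus one main connection and one WAL receiver.
walsenders_needed() {
    njobs=$1
    echo $((njobs + 2))
}

# Succeeds (exit status 0) only if the free walsender slots cover the backup.
# $1 = max_wal_senders, $2 = walsenders currently in use, $3 = value of -j
can_run_backup() {
    max=$1
    used=$2
    njobs=$3
    free=$((max - used))
    [ "$free" -ge "$(walsenders_needed "$njobs")" ]
}
```

With max_wal_senders = 10, `can_run_backup 10 0 8` succeeds for a single
backup, while `can_run_backup 10 5 8` fails, which matches the situation where
a second -j 8 backup has already taken some of the slots.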

[edb(at)localhost bin]$ ./pg_ctl -D /home/edb/Desktop/backup1/ -o "-p 5438"
start
pg_ctl: directory "/home/edb/Desktop/backup1" is not a database cluster
directory
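
Until pg_basebackup itself exits with an error here, a script could at least
check the backup directory before trying to start it. This is a minimal
sketch with a hypothetical function name. It assumes that a missing
PG_VERSION file is what makes pg_ctl print the message above, which is also
the file checked with ls elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical sanity check; not part of pg_basebackup or the patch.
# Succeeds (exit status 0) only if the directory contains a PG_VERSION file.
# This does not prove the backup is complete, only that it is not empty
# in the way seen in this report.
backup_has_pg_version() {
    [ -f "$1/PG_VERSION" ]
}

# Example use (path taken from the report above):
#   backup_has_pg_version /home/edb/Desktop/backup1 || echo "backup looks incomplete"
```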

Thanks & Regards,
Rajkumar Raghuwanshi

On Mon, Mar 30, 2020 at 6:28 PM Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com> wrote:

>
>
> On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <
> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>
>> Thanks Asif,
>>
>> I have re-verified the reported issues. Except for the standby backup, the
>> others are fixed.
>>
>
> Yes. As Asif mentioned, he is working on the standby issue and on adding
> bandwidth throttling functionality to parallel backup.
>
> It would be good to get some feedback from Robert on Asif's previous email
> about the design considerations for standby server support and throttling. I
> believe all the other points mentioned by Robert in this thread have been
> addressed by Asif, so it would be good to hear about any other concerns that
> are not addressed.
>
> Thanks,
>
> -- Ahsan
>
>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>> While testing further, I observed that parallel backup is not able to
>>>> take a backup of a standby server.
>>>>
>>>> mkdir /tmp/archive_dir
>>>> echo "archive_mode='on'">> data/postgresql.conf
>>>> echo "archive_command='cp %p /tmp/archive_dir/%f'">>
>>>> data/postgresql.conf
>>>>
>>>> ./pg_ctl -D data -l logs start
>>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>>
>>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
>>>> /tmp/slave/postgresql.conf
>>>> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
>>>> /tmp/slave/postgresql.conf
>>>> echo "promote_trigger_file='/tmp/failover.log'">>
>>>> /tmp/slave/postgresql.conf
>>>>
>>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>>
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "select
>>>> pg_is_in_recovery();"
>>>> pg_is_in_recovery
>>>> -------------------
>>>> f
>>>> (1 row)
>>>>
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5433 -c "select
>>>> pg_is_in_recovery();"
>>>> pg_is_in_recovery
>>>> -------------------
>>>> t
>>>> (1 row)
>>>>
>>>>
>>>>
>>>>
>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>>>> pg_basebackup: error: could not list backup files: ERROR: the standby was
>>>> promoted during online backup
>>>> HINT: This means that the backup being taken is corrupt and should not be
>>>> used. Try taking another online backup.
>>>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>>>
>>>> #same is working fine without parallel backup
>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>>> [edb(at)localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>>> /tmp/bkp_s/PG_VERSION
>>>>
>>>> Thanks & Regards,
>>>> Rajkumar Raghuwanshi
>>>>
>>>>
>>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>> In another scenario, the backup data is corrupted for a tablespace.
>>>>> Again, this is not reproducible every time,
>>>>> but when I run the same set of commands I get the same
>>>>> error.
>>>>>
>>>>> [edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
>>>>> waiting for server to start.... done
>>>>> server started
>>>>> [edb(at)localhost bin]$
>>>>> [edb(at)localhost bin]$ mkdir /tmp/tblsp
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
>>>>> tblsp location '/tmp/tblsp';"
>>>>> CREATE TABLESPACE
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database
>>>>> testdb tablespace tblsp;"
>>>>> CREATE DATABASE
>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl
>>>>> (a text);"
>>>>> CREATE TABLE
>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>>>>> values ('parallel_backup with tablespace');"
>>>>> INSERT 0 1
>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>>>>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>>>> [edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p
>>>>> 5555" start
>>>>> waiting for server to start.... done
>>>>> server started
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
>>>>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>>> oid | spcname | spcowner | spcacl | spcoptions
>>>>> -------+------------+----------+--------+------------
>>>>> 1663 | pg_default | 10 | |
>>>>> 16384 | tblsp | 10 | |
>>>>> (2 rows)
>>>>>
>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>>>> psql: error: could not connect to server: FATAL:
>>>>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>>>> missing.
>>>>> [edb(at)localhost bin]$
>>>>> [edb(at)localhost bin]$ ls
>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> [edb(at)localhost bin]$ ls
>>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> ls: cannot access
>>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>>>>> directory
>>>>>
>>>>>
>>>>> Thanks & Regards,
>>>>> Rajkumar Raghuwanshi
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>>
>>>>>> Hi Asif,
>>>>>>
>>>>>> On testing further, I found that when taking a backup with -R,
>>>>>> pg_basebackup crashed.
>>>>>> This crash is not consistently reproducible.
>>>>>>
>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test
>>>>>> (a text);"
>>>>>> CREATE TABLE
>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test
>>>>>> values ('parallel_backup with -R recovery-conf');"
>>>>>> INSERT 0 1
>>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D
>>>>>> /tmp/test_bkp/bkp -R
>>>>>> Segmentation fault (core dumped)
>>>>>>
>>>>>> The stack trace looks the same as in the earlier reported crash with
>>>>>> tablespace.
>>>>>> --stack trace
>>>>>> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D
>>>>>> /tmp/test_bkp/bkp -R'.
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>> pg_basebackup.c:3175
>>>>>> 3175 backupinfo->curr = fetchfile->next;
>>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>>>> (gdb) bt
>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>> pg_basebackup.c:3175
>>>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>>>>> pg_basebackup.c:2715
>>>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>>>>> pthread_create.c:301
>>>>>> #3 0x00000039212e8c4d in clone () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>>>> (gdb)
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Rajkumar Raghuwanshi
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>>>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>>>>>
>>>>>>> Hi Asif,
>>>>>>>
>>>>>>>
>>>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>>>>> patch to the latest master (b7f64c64).
>>>>>>>> (V9 of the patches are attached).
>>>>>>>>
>>>>>>>
>>>>>>> I had a further review of the patches and here are my few
>>>>>>> observations:
>>>>>>>
>>>>>>> 1.
>>>>>>> +/*
>>>>>>> + * stop_backup() - ends an online backup
>>>>>>> + *
>>>>>>> + * The function is called at the end of an online backup. It sends
>>>>>>> out pg_control
>>>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>>>> + */
>>>>>>>
>>>>>>> Comments seem out-dated.
>>>>>>>
>>>>>>
>>> Fixed.
>>>
>>>
>>>>
>>>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are
>>>>>>> now requesting data in multiple threads, throttling seems important
>>>>>>> here. Can you please explain why you have disabled it?
>>>>>>>
>>>>>>> 3. As we are always fetching a single file, and as Robert suggested,
>>>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>>>
>>>>>>
>>> Yes, we are fetching a single file. However, SEND_FILES is still capable
>>> of fetching multiple files in one
>>> go, that's why the name.
>>>
>>>
>>>>>>> 4. Does this work on Windows? I mean does pthread_create() work on
>>>>>>> Windows?
>>>>>>> I asked this as I see that pgbench has its own implementation for
>>>>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>>>>
>>>>>>
>>> patch is updated to add support for the Windows platform.
>>>
>>>
>>>>>>> 5. Typos:
>>>>>>> tablspace => tablespace
>>>>>>> safly => safely
>>>>>>>
>>>>>>> Done.
>>>
>>>
>>>> 6. parallel_backup_run() needs some comments explaining the states it
>>>>>>> goes
>>>>>>> through PB_* states.
>>>>>>>
>>>>>>> 7.
>>>>>>> + case PB_FETCH_REL_FILES: /* fetch files from server
>>>>>>> */
>>>>>>> + if (backupinfo->activeworkers == 0)
>>>>>>> + {
>>>>>>> + backupinfo->backupstate = PB_STOP_BACKUP;
>>>>>>> + free_filelist(backupinfo);
>>>>>>> + }
>>>>>>> + break;
>>>>>>> + case PB_FETCH_WAL_FILES: /* fetch WAL files from
>>>>>>> server */
>>>>>>> + if (backupinfo->activeworkers == 0)
>>>>>>> + {
>>>>>>> + backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>>>> + }
>>>>>>> + break;
>>>>>>>
>>>>>> Done.
>>>
>>>
>>>>
>>>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>>>
>>>>>> Done.
>>>
>>> The corrupted tablespace and crash, reported by Rajkumar, have been
>>> fixed. A pointer
>>> variable remained uninitialized which in turn caused the system to
>>> misbehave.
>>>
>>> Attached is the updated set of patches. AFAIK, to complete parallel
>>> backup feature
>>> set, there remain three sub-features:
>>>
>>> 1- parallel backup does not work with a standby server. In parallel
>>> backup, the server
>>> spawns multiple processes and there is no shared state being maintained.
>>> So currently,
>>> there is no way to tell multiple processes whether the standby was
>>> promoted during the backup since
>>> START_BACKUP was called.
>>>
>>> 2- throttling. Robert previously suggested that we implement
>>> throttling on the client-side.
>>> However, I found a previous discussion where it was advocated to be
>>> added to the
>>> backend instead[1].
>>>
>>> So, it was better to have a consensus before moving the throttle
>>> function to the client.
>>> That’s why for the time being I have disabled it and have asked for
>>> suggestions on it
>>> to move forward.
>>>
>>> It seems to me that we have to maintain a shared state in order to
>>> support taking backup
>>> from standby. Also, there is a new feature recently committed for backup
>>> progress
>>> reporting in the backend (pg_stat_progress_basebackup). This
>>> functionality was recently
>>> added via this commit ID: e65497df. For parallel backup to update these
>>> stats, a shared
>>> state will be required.
>>>
>>> Since multiple pg_basebackup can be running at the same time,
>>> maintaining a shared state
>>> can become a little complex, unless we disallow taking multiple parallel
>>> backups.
>>>
>>> So proceeding on with this patch, I will be working on:
>>> - throttling to be implemented on the client-side.
>>> - adding a shared state to handle backup from the standby.
>>>
>>>
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>>
>>>
>>> --
>>> Asif Rehman
>>> Highgo Software (Canada/China/Pakistan)
>>> URL : www.highgo.ca
>>>
>>>
>
> --
> Highgo Software (Canada/China/Pakistan)
> URL : http://www.highgo.ca
> ADDR: 10318 WHALLEY BLVD, Surrey, BC
> EMAIL: mailto: ahsan(dot)hadi(at)highgo(dot)ca
>
