Re: WIP/PoC for parallel backup

From: Kashif Zeeshan <kashif(dot)zeeshan(at)enterprisedb(dot)com>
To: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-04-02 11:29:47
Message-ID: CAKfXphqhzCr-8ggS9-o_ctMiLm7h+4bkcUP1un087K3sS2EPjw@mail.gmail.com
Lists: pgsql-hackers

Hi Asif

The backup failed with the error "could not connect to server: could not
look up local user ID 1000: Too many open files" when max_wal_senders was
set to 2000.
The errors started at backup worker (1017).
Please note that the backup directory was also not cleaned up after the
backup failed.
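For reproducing or avoiding this, it helps to compare the per-process
open-file limit against the worker count before launching the run; a quick
sketch, assuming a bash shell (the per-worker descriptor usage is an
estimate, not measured from the patch):

```shell
# Each of the ~1990 workers holds at least one connection socket plus the
# data files it is writing, so the soft open-file limit must sit well
# above the job count (rough estimate; exact usage is an assumption).
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "soft=$soft hard=$hard"
# The soft limit can be raised up to the hard limit without privileges:
ulimit -S -n "$hard" 2>/dev/null || true
```

With the common default soft limit of 1024, a 1990-job run is expected to
hit EMFILE, which matches the errors appearing around worker 1017.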

Steps
=======
1) Generate data in DB
./pgbench -i -s 600 -h localhost -p 5432 postgres
2) Set max_wal_senders = 2000 in postgresql.conf.
3) Generate the backup

[edb(at)localhost bin]$
[edb(at)localhost bin]$ ./pg_basebackup -v -j 1990 -D
/home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 1/F1000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_58692"
pg_basebackup: backup worker (0) created
...
pg_basebackup: backup worker (1017) created
pg_basebackup: error: could not connect to server: could not look up local
user ID 1000: Too many open files
pg_basebackup: backup worker (1018) created
pg_basebackup: error: could not connect to server: could not look up local
user ID 1000: Too many open files



pg_basebackup: error: could not connect to server: could not look up local
user ID 1000: Too many open files
pg_basebackup: backup worker (1989) created
pg_basebackup: error: could not create file
"/home/edb/Desktop/backup//global/4183": Too many open files
pg_basebackup: error: could not create file
"/home/edb/Desktop/backup//global/3592": Too many open files
pg_basebackup: error: could not create file
"/home/edb/Desktop/backup//global/4177": Too many open files
[edb(at)localhost bin]$

4) The backup directory was not cleaned up

[edb(at)localhost bin]$
[edb(at)localhost bin]$ ls /home/edb/Desktop/backup
base pg_commit_ts pg_logical pg_notify pg_serial pg_stat
pg_subtrans pg_twophase pg_xact
global pg_dynshmem pg_multixact pg_replslot pg_snapshots pg_stat_tmp
pg_tblspc pg_wal
[edb(at)localhost bin]$

Kashif Zeeshan
EnterpriseDB

On Thu, Apr 2, 2020 at 2:58 PM Rajkumar Raghuwanshi <
rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:

> Hi Asif,
>
> My colleague Kashif Zeeshan reported an issue off-list, posting here,
> please take a look.
>
> When executing two backups at the same time, we get FATAL errors because
> max_wal_senders is exceeded, but instead of exiting, the backup reports
> completion. And when we then try to start the server from the backup
> cluster, we get an error.
>
> [edb(at)localhost bin]$ ./pgbench -i -s 200 -h localhost -p 5432 postgres
> [edb(at)localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup/
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/C2000270 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: created temporary replication slot "pg_basebackup_57849"
> pg_basebackup: backup worker (0) created
> pg_basebackup: backup worker (1) created
> pg_basebackup: backup worker (2) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (3) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (4) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (5) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (6) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (7) created
> pg_basebackup: write-ahead log end point: 0/C3000050
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: syncing data to disk ...
> pg_basebackup: base backup completed
> [edb(at)localhost bin]$ ./pg_basebackup -v -j 8 -D
> /home/edb/Desktop/backup1/
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/C20001C0 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: created temporary replication slot "pg_basebackup_57848"
> pg_basebackup: backup worker (0) created
> pg_basebackup: backup worker (1) created
> pg_basebackup: backup worker (2) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (3) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (4) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (5) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (6) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (7) created
> pg_basebackup: write-ahead log end point: 0/C2000348
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: syncing data to disk ...
> pg_basebackup: base backup completed
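The failure mode above is simple arithmetic: each parallel worker appears to
open its own replication connection, and the background WAL receiver takes
one more, so the combined connection count must fit within max_wal_senders.
A rough sketch of the accounting (the one-slot-per-worker and WAL-receiver
assumptions are inferred from the log above, not taken from the patch):

```shell
JOBS=8                # value passed to -j in the runs above
MAX_WAL_SENDERS=10    # server setting, per the FATAL messages
BACKUPS=2             # two pg_basebackup runs were executing concurrently
# assumed: one WAL-sender slot per worker, plus one per backup for the
# background WAL receiver
NEEDED=$((BACKUPS * (JOBS + 1)))
if [ "$NEEDED" -gt "$MAX_WAL_SENDERS" ]; then
  echo "would exceed max_wal_senders: need $NEEDED, have $MAX_WAL_SENDERS"
fi
```

This prints "would exceed max_wal_senders: need 18, have 10", consistent
with the repeated connection failures in the log; arguably the client should
treat those failures as fatal rather than reporting the backup as complete.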
>
> [edb(at)localhost bin]$ ./pg_ctl -D /home/edb/Desktop/backup1/ -o "-p 5438"
> start
> pg_ctl: directory "/home/edb/Desktop/backup1" is not a database cluster
> directory
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Mon, Mar 30, 2020 at 6:28 PM Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com> wrote:
>
>>
>>
>> On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <
>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>
>>> Thanks Asif,
>>>
>>> I have re-verified the reported issues. Except for the standby backup,
>>> the others are fixed.
>>>
>>
>> Yes, as Asif mentioned, he is working on the standby issue and adding
>> bandwidth-throttling functionality to parallel backup.
>>
>> It would be good to get some feedback from Robert on Asif's previous email
>> about the design considerations for standby server support and throttling.
>> I believe all the other points mentioned by Robert in this thread have been
>> addressed by Asif, so it would be good to hear about any other concerns
>> that are not addressed.
>>
>> Thanks,
>>
>> -- Ahsan
>>
>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>> While testing further, I observed that parallel backup is not able to
>>>>> take a backup of a standby server.
>>>>>
>>>>> mkdir /tmp/archive_dir
>>>>> echo "archive_mode='on'">> data/postgresql.conf
>>>>> echo "archive_command='cp %p /tmp/archive_dir/%f'">>
>>>>> data/postgresql.conf
>>>>>
>>>>> ./pg_ctl -D data -l logs start
>>>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>>>
>>>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
>>>>> /tmp/slave/postgresql.conf
>>>>> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
>>>>> /tmp/slave/postgresql.conf
>>>>> echo "promote_trigger_file='/tmp/failover.log'">>
>>>>> /tmp/slave/postgresql.conf
>>>>>
>>>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>>>
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "select
>>>>> pg_is_in_recovery();"
>>>>> pg_is_in_recovery
>>>>> -------------------
>>>>> f
>>>>> (1 row)
>>>>>
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5433 -c "select
>>>>> pg_is_in_recovery();"
>>>>> pg_is_in_recovery
>>>>> -------------------
>>>>> t
>>>>> (1 row)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *[edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs
>>>>> 6pg_basebackup: error: could not list backup files: ERROR: the standby was
>>>>> promoted during online backupHINT: This means that the backup being taken
>>>>> is corrupt and should not be used. Try taking another online
>>>>> backup.pg_basebackup: removing data directory "/tmp/bkp_s"*
>>>>>
>>>>> #same is working fine without parallel backup
>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>>>> [edb(at)localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>>>> /tmp/bkp_s/PG_VERSION
>>>>>
>>>>> Thanks & Regards,
>>>>> Rajkumar Raghuwanshi
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>>
>>>>>> Hi Asif,
>>>>>>
>>>>>> In another scenario, the backup data is corrupted for a tablespace.
>>>>>> Again, this is not reproducible every time, but if I run the same set
>>>>>> of commands I get the same error.
>>>>>>
>>>>>> [edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
>>>>>> waiting for server to start.... done
>>>>>> server started
>>>>>> [edb(at)localhost bin]$
>>>>>> [edb(at)localhost bin]$ mkdir /tmp/tblsp
>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
>>>>>> tblsp location '/tmp/tblsp';"
>>>>>> CREATE TABLESPACE
>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database
>>>>>> testdb tablespace tblsp;"
>>>>>> CREATE DATABASE
>>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl
>>>>>> (a text);"
>>>>>> CREATE TABLE
>>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>>>>>> values ('parallel_backup with tablespace');"
>>>>>> INSERT 0 1
>>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>>>>>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>>>>> [edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p
>>>>>> 5555" start
>>>>>> waiting for server to start.... done
>>>>>> server started
>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
>>>>>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>>>>   oid  |  spcname   | spcowner | spcacl | spcoptions
>>>>>> -------+------------+----------+--------+------------
>>>>>>   1663 | pg_default |       10 |        |
>>>>>>  16384 | tblsp      |       10 |        |
>>>>>> (2 rows)
>>>>>>
>>>>>> [edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from
>>>>>> testtbl";
>>>>>> psql: error: could not connect to server: FATAL:
>>>>>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>>>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>>>>> missing.
>>>>>> [edb(at)localhost bin]$
>>>>>> [edb(at)localhost bin]$ ls
>>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> [edb(at)localhost bin]$ ls
>>>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> ls: cannot access
>>>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>>>>>> directory
>>>>>>
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Rajkumar Raghuwanshi
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>>>
>>>>>>> Hi Asif,
>>>>>>>
>>>>>>> On testing further, I found that when taking a backup with -R,
>>>>>>> pg_basebackup crashed. This crash is not consistently reproducible.
>>>>>>>
>>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test
>>>>>>> (a text);"
>>>>>>> CREATE TABLE
>>>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test
>>>>>>> values ('parallel_backup with -R recovery-conf');"
>>>>>>> INSERT 0 1
>>>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D
>>>>>>> /tmp/test_bkp/bkp -R
>>>>>>> Segmentation fault (core dumped)
>>>>>>>
>>>>>>> The stack trace looks the same as in the earlier reported crash with
>>>>>>> tablespaces.
>>>>>>> --stack trace
>>>>>>> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D
>>>>>>> /tmp/test_bkp/bkp -R'.
>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>>> pg_basebackup.c:3175
>>>>>>> 3175 backupinfo->curr = fetchfile->next;
>>>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>>>>> (gdb) bt
>>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>>> pg_basebackup.c:3175
>>>>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>>>>>> pg_basebackup.c:2715
>>>>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>>>>>> pthread_create.c:301
>>>>>>> #3 0x00000039212e8c4d in clone () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>>>>> (gdb)
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Rajkumar Raghuwanshi
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>>>>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>>>>>>
>>>>>>>> Hi Asif,
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased
>>>>>>>>> the patch to the latest master (b7f64c64).
>>>>>>>>> (V9 of the patches are attached).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I had a further review of the patches and here are my few
>>>>>>>> observations:
>>>>>>>>
>>>>>>>> 1.
>>>>>>>> +/*
>>>>>>>> + * stop_backup() - ends an online backup
>>>>>>>> + *
>>>>>>>> + * The function is called at the end of an online backup. It sends
>>>>>>>> out pg_control
>>>>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>>>>> + */
>>>>>>>>
>>>>>>>> Comments seem outdated.
>>>>>>>>
>>>>>>>
>>>> Fixed.
>>>>
>>>>
>>>>>
>>>>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>>>>>>>> requesting data in multiple threads, throttling seems important here.
>>>>>>>> Can you please explain why you have disabled it?
>>>>>>>>
>>>>>>>> 3. As we are always fetching a single file, and as Robert suggested,
>>>>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>>>>
>>>>>>>
>>>> Yes, we are fetching a single file. However, SEND_FILES is still capable
>>>> of fetching multiple files in one go; hence the name.
>>>>
>>>>
>>>>>>>> 4. Does this work on Windows? I mean, does pthread_create() work on
>>>>>>>> Windows? I ask because pgbench has its own implementation of
>>>>>>>> pthread_create() for WIN32, but this patch doesn't.
>>>>>>>>
>>>>>>>
>>>> The patch has been updated to add support for the Windows platform.
>>>>
>>>>
>>>>>>>> 5. Typos:
>>>>>>>> tablspace => tablespace
>>>>>>>> safly => safely
>>>>>>>>
>>>>>>>> Done.
>>>>
>>>>
>>>>>>>> 6. parallel_backup_run() needs some comments explaining the PB_*
>>>>>>>> states it goes through.
>>>>>>>>
>>>>>>>> 7.
>>>>>>>> + case PB_FETCH_REL_FILES: /* fetch files from server
>>>>>>>> */
>>>>>>>> + if (backupinfo->activeworkers == 0)
>>>>>>>> + {
>>>>>>>> + backupinfo->backupstate = PB_STOP_BACKUP;
>>>>>>>> + free_filelist(backupinfo);
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>> + case PB_FETCH_WAL_FILES: /* fetch WAL files from
>>>>>>>> server */
>>>>>>>> + if (backupinfo->activeworkers == 0)
>>>>>>>> + {
>>>>>>>> + backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>>
>>>>>>> Done.
>>>>
>>>>
>>>>>
>>>>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>>>>
>>>>>>> Done.
>>>>
>>>> The corrupted tablespace and the crash reported by Rajkumar have been
>>>> fixed. A pointer variable remained uninitialized, which in turn caused
>>>> the system to misbehave.
>>>>
>>>> Attached is the updated set of patches. AFAIK, to complete the parallel
>>>> backup feature set, three sub-features remain:
>>>>
>>>> 1- Parallel backup does not work with a standby server. In parallel
>>>> backup, the server spawns multiple processes and no shared state is
>>>> maintained, so there is currently no way to tell the multiple processes
>>>> whether the standby was promoted after START_BACKUP was called.
>>>>
>>>> 2- Throttling. Robert previously suggested that we implement throttling
>>>> on the client side. However, I found a previous discussion where it was
>>>> advocated that it be added to the backend instead [1].
>>>>
>>>> So it seemed better to reach a consensus before moving the throttling
>>>> function to the client. That's why I have disabled it for the time being
>>>> and have asked for suggestions on how to move forward.
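For reference, client-side throttling of this kind reduces to pacing each
received chunk; a minimal sketch of the arithmetic only (illustrative, not
the patch's implementation; MAXRATE and CHUNK are hypothetical values):

```shell
MAXRATE=$((1024 * 1024))   # target ceiling: 1 MiB/s (hypothetical)
CHUNK=$((256 * 1024))      # bytes received per iteration (hypothetical)
# Sleep long enough after each chunk that the long-run average transfer
# rate stays at or below MAXRATE:
SLEEP_US=$((CHUNK * 1000000 / MAXRATE))
echo "sleep ${SLEEP_US}us per ${CHUNK}-byte chunk"
```

With per-worker pacing like this, each of N workers would be given
MAXRATE/N, which is part of why a shared state (or server-side throttling)
keeps the accounting simpler.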
>>>>
>>>> It seems to me that we have to maintain a shared state in order to
>>>> support taking a backup from a standby. Also, backup progress reporting
>>>> in the backend (pg_stat_progress_basebackup) was recently committed
>>>> (commit ID: e65497df). For parallel backup to update these stats, a
>>>> shared state will be required.
>>>>
>>>> Since multiple pg_basebackup instances can be running at the same time,
>>>> maintaining a shared state can become a little complex, unless we
>>>> disallow taking multiple parallel backups.
>>>>
>>>> So, proceeding with this patch, I will be working on:
>>>> - implementing throttling on the client side.
>>>> - adding a shared state to handle backup from a standby.
>>>>
>>>>
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>>>
>>>>
>>>> --
>>>> Asif Rehman
>>>> Highgo Software (Canada/China/Pakistan)
>>>> URL : www.highgo.ca
>>>>
>>>>
>>
>> --
>> Highgo Software (Canada/China/Pakistan)
>> URL : http://www.highgo.ca
>> ADDR: 10318 WHALLEY BLVD, Surrey, BC
>> EMAIL: mailto: ahsan(dot)hadi(at)highgo(dot)ca
>>
>

--
Regards
====================================
Kashif Zeeshan
Lead Quality Assurance Engineer / Manager

EnterpriseDB Corporation
The Enterprise Postgres Company
