Re: WIP/PoC for parallel backup

From: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>
To: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-30 12:58:18
Message-ID: CA+9bhCLGA8LF8oBjroWcz5Gu2oNUw6hUDa0k9F+nnxgnj0fAMg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <
rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:

> Thanks Asif,
>
> I have re-verified the reported issues. Except for the standby backup, the
> others are fixed.
>

Yes. As Asif mentioned, he is working on the standby issue and on adding
bandwidth throttling functionality to parallel backup.

It would be good to get some feedback from Robert on Asif's previous email
regarding the design considerations for standby server support and
throttling. I believe all the other points Robert raised in this thread
have been addressed by Asif, so it would be good to hear about any
remaining concerns.

Thanks,

-- Ahsan

> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
> wrote:
>
>>
>>
>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>
>>> Hi Asif,
>>>
>>> While testing further I observed parallel backup is not able to take
>>> backup of standby server.
>>>
>>> mkdir /tmp/archive_dir
>>> echo "archive_mode='on'">> data/postgresql.conf
>>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>>
>>> ./pg_ctl -D data -l logs start
>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>
>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'" >> /tmp/slave/postgresql.conf
>>> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/slave/postgresql.conf
>>> echo "promote_trigger_file='/tmp/failover.log'" >> /tmp/slave/postgresql.conf
>>>
>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>
>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "select
>>> pg_is_in_recovery();"
>>> pg_is_in_recovery
>>> -------------------
>>> f
>>> (1 row)
>>>
>>> [edb(at)localhost bin]$ ./psql postgres -p 5433 -c "select
>>> pg_is_in_recovery();"
>>> pg_is_in_recovery
>>> -------------------
>>> t
>>> (1 row)
>>>
>>>
>>>
>>>
>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>>> pg_basebackup: error: could not list backup files: ERROR: the standby was
>>> promoted during online backup
>>> HINT: This means that the backup being taken is corrupt and should not be
>>> used. Try taking another online backup.
>>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>>
>>> #same is working fine without parallel backup
>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>> [edb(at)localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>> /tmp/bkp_s/PG_VERSION
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>> In another scenario, the backup data is corrupted for a tablespace.
>>>> Again, this is not reproducible every time, but if I run the same set
>>>> of commands I get the same error.
>>>>
>>>> [edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
>>>> waiting for server to start.... done
>>>> server started
>>>> [edb(at)localhost bin]$
>>>> [edb(at)localhost bin]$ mkdir /tmp/tblsp
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
>>>> tblsp location '/tmp/tblsp';"
>>>> CREATE TABLESPACE
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database
>>>> testdb tablespace tblsp;"
>>>> CREATE DATABASE
>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
>>>> text);"
>>>> CREATE TABLE
>>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>>>> values ('parallel_backup with tablespace');"
>>>> INSERT 0 1
>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>>>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>>> [edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p
>>>> 5555" start
>>>> waiting for server to start.... done
>>>> server started
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
>>>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>> oid | spcname | spcowner | spcacl | spcoptions
>>>> -------+------------+----------+--------+------------
>>>> 1663 | pg_default | 10 | |
>>>> 16384 | tblsp | 10 | |
>>>> (2 rows)
>>>>
>>>> [edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>>> psql: error: could not connect to server: FATAL:
>>>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>>> missing.
>>>> [edb(at)localhost bin]$
>>>> [edb(at)localhost bin]$ ls
>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>> [edb(at)localhost bin]$ ls
>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>> ls: cannot access
>>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>>>> directory
>>>>
>>>>
>>>> Thanks & Regards,
>>>> Rajkumar Raghuwanshi
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>> On testing further, I found that pg_basebackup crashed when taking a
>>>>> backup with -R. This crash is not consistently reproducible.
>>>>>
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test (a
>>>>> text);"
>>>>> CREATE TABLE
>>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test
>>>>> values ('parallel_backup with -R recovery-conf');"
>>>>> INSERT 0 1
>>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D
>>>>> /tmp/test_bkp/bkp -R
>>>>> Segmentation fault (core dumped)
>>>>>
>>>>> The stack trace looks the same as in the earlier reported crash with a
>>>>> tablespace.
>>>>> --stack trace
>>>>> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D
>>>>> /tmp/test_bkp/bkp -R'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>> pg_basebackup.c:3175
>>>>> 3175 backupinfo->curr = fetchfile->next;
>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>>> (gdb) bt
>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>> pg_basebackup.c:3175
>>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>>>> pg_basebackup.c:2715
>>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>>>> pthread_create.c:301
>>>>> #3 0x00000039212e8c4d in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>>> (gdb)
>>>>>
>>>>> Thanks & Regards,
>>>>> Rajkumar Raghuwanshi
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>>>>
>>>>>> Hi Asif,
>>>>>>
>>>>>>
>>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>>>> patch to the latest master (b7f64c64).
>>>>>>> (V9 of the patches are attached).
>>>>>>>
>>>>>>
>>>>>> I had a further review of the patches and here are my few
>>>>>> observations:
>>>>>>
>>>>>> 1.
>>>>>> +/*
>>>>>> + * stop_backup() - ends an online backup
>>>>>> + *
>>>>>> + * The function is called at the end of an online backup. It sends
>>>>>> out pg_control
>>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>>> + */
>>>>>>
>>>>>> Comments seem out-dated.
>>>>>>
>>>>>
>> Fixed.
>>
>>
>>>
>>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>>>>>> requesting data in multiple threads, throttling seems important here.
>>>>>> Can you please explain why you have disabled it?
>>>>>>
>>>>>> 3. As we are always fetching a single file, and as Robert suggested,
>>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>>
>>>>>
>> Yes, we are fetching a single file. However, SEND_FILES is still capable
>> of fetching multiple files in one
>> go, that's why the name.
>>
>>
>>>>>> 4. Does this work on Windows? I mean does pthread_create() work on
>>>>>> Windows?
>>>>>> I asked this as I see that pgbench has its own implementation for
>>>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>>>
>>>>>
>> patch is updated to add support for the Windows platform.
>>
>>
>>>>>> 5. Typos:
>>>>>> tablspace => tablespace
>>>>>> safly => safely
>>>>>>
>> Done.
>>
>>
>>>>>> 6. parallel_backup_run() needs some comments explaining the PB_* states
>>>>>> it goes through.
>>>>>>
>>>>>> 7.
>>>>>> +        case PB_FETCH_REL_FILES:    /* fetch files from server */
>>>>>> +            if (backupinfo->activeworkers == 0)
>>>>>> +            {
>>>>>> +                backupinfo->backupstate = PB_STOP_BACKUP;
>>>>>> +                free_filelist(backupinfo);
>>>>>> +            }
>>>>>> +            break;
>>>>>> +        case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
>>>>>> +            if (backupinfo->activeworkers == 0)
>>>>>> +            {
>>>>>> +                backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>>> +            }
>>>>>> +            break;
>>>>>>
>> Done.
>>
>>
>>>
>>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>>
>> Done.
>>
>> The corrupted tablespace and crash, reported by Rajkumar, have been
>> fixed. A pointer
>> variable remained uninitialized which in turn caused the system to
>> misbehave.
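
For readers following the crash at "backupinfo->curr = fetchfile->next;" in
the stack trace, here is a minimal sketch of a shared file-list cursor that
is fully initialized before workers start and never dereferences an empty
entry. It is an illustration only, not the patch code: FileEntry, FileList,
filelist_init() and filelist_next() are hypothetical names, and the idea that
workers pull files from a singly linked list through a shared cursor is an
assumption inferred from the stack trace.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types, for illustration only; not taken from the patch. */
typedef struct FileEntry
{
    const char *path;
    struct FileEntry *next;
} FileEntry;

typedef struct FileList
{
    pthread_mutex_t lock;       /* protects curr */
    FileEntry  *curr;           /* next file to hand out, NULL when exhausted */
} FileList;

/* Initialize every field before any worker thread is started. */
static void
filelist_init(FileList *list, FileEntry *head)
{
    assert(list != NULL);
    pthread_mutex_init(&list->lock, NULL);
    list->curr = head;
}

/* Hand out the next file to a worker, or NULL when none are left. */
static FileEntry *
filelist_next(FileList *list)
{
    FileEntry  *fetchfile;

    pthread_mutex_lock(&list->lock);
    fetchfile = list->curr;
    if (fetchfile != NULL)      /* never dereference an empty cursor */
        list->curr = fetchfile->next;
    pthread_mutex_unlock(&list->lock);

    return fetchfile;
}
```

A worker loop would call filelist_next() until it returns NULL and then exit.
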
>>
>> Attached is the updated set of patches. AFAIK, to complete parallel
>> backup feature
>> set, there remain three sub-features:
>>
>> 1- parallel backup does not work with a standby server. In parallel
>> backup, the server spawns multiple processes and no shared state is
>> maintained between them. So currently, there is no way to tell those
>> processes whether the standby was promoted during the backup, i.e. since
>> START_BACKUP was called.
>>
>> 2- throttling. Robert previously suggested that we implement
>> throttling on the client-side.
>> However, I found a previous discussion where it was advocated to be added
>> to the
>> backend instead[1].
>>
>> So, it was better to have a consensus before moving the throttle function
>> to the client.
>> That’s why for the time being I have disabled it and have asked for
>> suggestions on it
>> to move forward.
>>
>> It seems to me that we have to maintain a shared state in order to
>> support taking backup
>> from standby. Also, there is a new feature recently committed for backup
>> progress
>> reporting in the backend (pg_stat_progress_basebackup). This
>> functionality was recently
>> added via this commit ID: e65497df. For parallel backup to update these
>> stats, a shared
>> state will be required.
>>
>> Since multiple pg_basebackup can be running at the same time, maintaining
>> a shared state
>> can become a little complex, unless we disallow taking multiple parallel
>> backups.
>>
>> So proceeding on with this patch, I will be working on:
>> - throttling to be implemented on the client-side.
>> - adding a shared state to handle backup from the standby.
>>
>>
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>
>>
>> --
>> Asif Rehman
>> Highgo Software (Canada/China/Pakistan)
>> URL : www.highgo.ca
>>
>>

--
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: ahsan(dot)hadi(at)highgo(dot)ca
