Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: asifr(dot)rehman(at)gmail(dot)com
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-30 10:43:47
Message-ID: CAKcux6=Wu91dyXWALOzQ7NGX1fkgWHPjZjxZEsFJfOKvrc8pBw@mail.gmail.com
Lists: pgsql-hackers

Thanks Asif,

I have re-verified the reported issues. Except for the standby backup, the others are fixed.

Thanks & Regards,
Rajkumar Raghuwanshi

On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> wrote:

>
>
> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>
>> Hi Asif,
>>
>> While testing further I observed parallel backup is not able to take
>> backup of standby server.
>>
>> mkdir /tmp/archive_dir
>> echo "archive_mode='on'">> data/postgresql.conf
>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>
>> ./pg_ctl -D data -l logs start
>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>
>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
>> /tmp/slave/postgresql.conf
>> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
>> /tmp/slave/postgresql.conf
>> echo "promote_trigger_file='/tmp/failover.log'">>
>> /tmp/slave/postgresql.conf
>>
>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>
>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "select
>> pg_is_in_recovery();"
>> pg_is_in_recovery
>> -------------------
>> f
>> (1 row)
>>
>> [edb(at)localhost bin]$ ./psql postgres -p 5433 -c "select
>> pg_is_in_recovery();"
>> pg_is_in_recovery
>> -------------------
>> t
>> (1 row)
>>
>>
>>
>>
>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>> pg_basebackup: error: could not list backup files: ERROR: the standby was promoted during online backup
>> HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>
>> #same is working fine without parallel backup
>> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>> [edb(at)localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>> /tmp/bkp_s/PG_VERSION
>>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>
>>> Hi Asif,
>>>
>>> In another scenario, the backup data for a tablespace is corrupted. Again,
>>> this is not reproducible every time, but when I run the same set of
>>> commands I get the same error.
>>>
>>> [edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
>>> waiting for server to start.... done
>>> server started
>>> [edb(at)localhost bin]$
>>> [edb(at)localhost bin]$ mkdir /tmp/tblsp
>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace
>>> tblsp location '/tmp/tblsp';"
>>> CREATE TABLESPACE
>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database testdb
>>> tablespace tblsp;"
>>> CREATE DATABASE
>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
>>> text);"
>>> CREATE TABLE
>>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>>> values ('parallel_backup with tablespace');"
>>> INSERT 0 1
>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>> [edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555"
>>> start
>>> waiting for server to start.... done
>>> server started
>>> [edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
>>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>> oid | spcname | spcowner | spcacl | spcoptions
>>> -------+------------+----------+--------+------------
>>> 1663 | pg_default | 10 | |
>>> 16384 | tblsp | 10 | |
>>> (2 rows)
>>>
>>> [edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>> psql: error: could not connect to server: FATAL:
>>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>> missing.
>>> [edb(at)localhost bin]$
>>> [edb(at)localhost bin]$ ls
>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> [edb(at)localhost bin]$ ls
>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> ls: cannot access
>>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>>> directory
>>>
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>> On testing further, I found that pg_basebackup crashed when taking a
>>>> backup with -R. This crash is not consistently reproducible.
>>>>
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test (a
>>>> text);"
>>>> CREATE TABLE
>>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test
>>>> values ('parallel_backup with -R recovery-conf');"
>>>> INSERT 0 1
>>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp
>>>> -R
>>>> Segmentation fault (core dumped)
>>>>
>>>> stack trace looks the same as it was on earlier reported crash with
>>>> tablespace.
>>>> --stack trace
>>>> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D
>>>> /tmp/test_bkp/bkp -R'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>> pg_basebackup.c:3175
>>>> 3175 backupinfo->curr = fetchfile->next;
>>>> Missing separate debuginfos, use: debuginfo-install
>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>> (gdb) bt
>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>> pg_basebackup.c:3175
>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>>> pg_basebackup.c:2715
>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>>> pthread_create.c:301
>>>> #3 0x00000039212e8c4d in clone () at
>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>> (gdb)
>>>>
>>>> Thanks & Regards,
>>>> Rajkumar Raghuwanshi
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>>
>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>>> patch to the latest master (b7f64c64).
>>>>>> (V9 of the patches are attached).
>>>>>>
>>>>>
>>>>> I had a further review of the patches and here are my few observations:
>>>>>
>>>>> 1.
>>>>> +/*
>>>>> + * stop_backup() - ends an online backup
>>>>> + *
>>>>> + * The function is called at the end of an online backup. It sends
>>>>> out pg_control
>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>> + */
>>>>>
>>>>> Comments seem out-dated.
>>>>>
>>>>
> Fixed.
>
>
>>
>>>>> 2. With parallel jobs, maxrate is no longer supported. Since we are now
>>>>> requesting data from multiple threads, throttling seems important here.
>>>>> Can you please explain why you have disabled it?
>>>>>
>>>>> 3. As we are always fetching a single file, and as Robert suggested,
>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>
>>>>
> Yes, we are fetching a single file. However, SEND_FILES is still capable
> of fetching multiple files in one
> go, that's why the name.
>
>
>>>>> 4. Does this work on Windows? I mean does pthread_create() work on
>>>>> Windows?
>>>>> I asked this as I see that pgbench has its own implementation for
>>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>>
>>>>
> patch is updated to add support for the Windows platform.
>
>
>>>>> 5. Typos:
>>>>> tablspace => tablespace
>>>>> safly => safely
>>>>>
> Done.
>
>
>>>>> 6. parallel_backup_run() needs some comments explaining the PB_* states
>>>>> it goes through.
>>>>>
>>>>> 7.
>>>>> + case PB_FETCH_REL_FILES: /* fetch files from server */
>>>>> + if (backupinfo->activeworkers == 0)
>>>>> + {
>>>>> + backupinfo->backupstate = PB_STOP_BACKUP;
>>>>> + free_filelist(backupinfo);
>>>>> + }
>>>>> + break;
>>>>> + case PB_FETCH_WAL_FILES: /* fetch WAL files from
>>>>> server */
>>>>> + if (backupinfo->activeworkers == 0)
>>>>> + {
>>>>> + backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>> + }
>>>>> + break;
>>>>>
> Done.
>
>
>>
>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>
> Done.
>
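
As a reading aid for items 6 and 7, here is a minimal sketch of the state
progression. Only the four PB_* names and the "activeworkers == 0" checks come
from the quoted hunk. The enum type name, the pb_next_state() helper and the
PB_STOP_BACKUP -> PB_FETCH_WAL_FILES step are assumptions made for
illustration, not the patch's code.

```c
/* Hypothetical type name; the PB_* values are the ones quoted in item 7. */
typedef enum
{
	PB_FETCH_REL_FILES,			/* workers are fetching relation files */
	PB_STOP_BACKUP,				/* leader ends the online backup */
	PB_FETCH_WAL_FILES,			/* workers are fetching WAL files */
	PB_BACKUP_COMPLETE			/* nothing left to do */
} ParallelBackupState;

/*
 * Return the state that follows 'state', given how many workers are still
 * busy. A fetch state is only left once every worker has gone idle.
 */
static ParallelBackupState
pb_next_state(ParallelBackupState state, int activeworkers)
{
	switch (state)
	{
		case PB_FETCH_REL_FILES:
			return (activeworkers == 0) ? PB_STOP_BACKUP : state;
		case PB_STOP_BACKUP:
			/* Assumed: WAL files are fetched after the backup is stopped. */
			return PB_FETCH_WAL_FILES;
		case PB_FETCH_WAL_FILES:
			return (activeworkers == 0) ? PB_BACKUP_COMPLETE : state;
		case PB_BACKUP_COMPLETE:
			break;
	}
	return state;
}
```

In a layout like this, the file list would be freed at both transitions out of
a fetch state, which is the point item 7 raises.
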
> The corrupted tablespace and crash, reported by Rajkumar, have been fixed.
> A pointer
> variable remained uninitialized which in turn caused the system to
> misbehave.
>
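
For anyone following the crash at "backupinfo->curr = fetchfile->next;", a
minimal sketch of a worker taking the next file from a shared list. Only the
names backupinfo, curr and next appear in the stack trace. The struct layouts,
the lock and the helper functions below are assumptions for illustration, not
the patch's code. The idea is simply that the cursor is set to a known value
before any worker starts, and is checked for NULL under a lock before it is
dereferenced.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical layout; only the field names curr/next come from the trace. */
typedef struct BackupFile
{
	const char *path;
	struct BackupFile *next;
} BackupFile;

typedef struct BackupInfo
{
	pthread_mutex_t lock;		/* protects curr */
	BackupFile *curr;			/* next file to hand out, NULL when done */
} BackupInfo;

/* Initialize every field so no worker ever sees an indeterminate pointer. */
static void
backupinfo_init(BackupInfo *backupinfo, BackupFile *head)
{
	pthread_mutex_init(&backupinfo->lock, NULL);
	backupinfo->curr = head;
}

/* Hand out the next file, or NULL once the list is exhausted. */
static BackupFile *
next_file(BackupInfo *backupinfo)
{
	BackupFile *fetchfile;

	pthread_mutex_lock(&backupinfo->lock);
	fetchfile = backupinfo->curr;
	if (fetchfile != NULL)
		backupinfo->curr = fetchfile->next;
	pthread_mutex_unlock(&backupinfo->lock);

	return fetchfile;
}
```
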
> Attached is the updated set of patches. AFAIK, to complete parallel backup
> feature
> set, there remain three sub-features:
>
> 1- parallel backup does not work with a standby server. In parallel backup,
> the server spawns multiple processes and no shared state is maintained
> between them. So currently there is no way to tell those processes whether
> the standby was promoted during the backup, that is, since START_BACKUP
> was called.
>
> 2- throttling. Robert previously suggested that we implement throttling on
> the client-side.
> However, I found a previous discussion where it was advocated to be added
> to the
> backend instead[1].
>
> So, it was better to have a consensus before moving the throttle function
> to the client.
> That’s why for the time being I have disabled it and have asked for
> suggestions on it
> to move forward.
>
> It seems to me that we have to maintain a shared state in order to support
> taking backup
> from standby. Also, there is a new feature recently committed for backup
> progress
> reporting in the backend (pg_stat_progress_basebackup). This functionality
> was recently
> added via this commit ID: e65497df. For parallel backup to update these
> stats, a shared
> state will be required.
>
> Since multiple pg_basebackup can be running at the same time, maintaining
> a shared state
> can become a little complex, unless we disallow taking multiple parallel
> backups.
>
> So proceeding on with this patch, I will be working on:
> - throttling to be implemented on the client-side.
> - adding a shared state to handle backup from the standby.
>
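
To make the client-side throttling item concrete, here is a rough sketch of one
limiter shared by all worker threads. Every name in it (Throttle,
throttle_init, throttle_delay_us, throttle_account) is hypothetical, and it
enforces a simple average rate since the start of the backup, which is an
assumption on my part and not necessarily how the patch will do it. Each worker
would call throttle_account() after receiving a chunk. The byte counter is
updated under a mutex and the sleep happens outside the lock, so one sleeping
worker does not block the others from accounting.

```c
#define _POSIX_C_SOURCE 200809L

#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical shared limiter; one instance for all worker threads. */
typedef struct Throttle
{
	pthread_mutex_t lock;		/* protects sent */
	uint64_t	rate;			/* allowed bytes per second, 0 = unlimited */
	uint64_t	sent;			/* bytes accounted for so far */
	struct timespec start;		/* when the transfer began */
} Throttle;

static void
throttle_init(Throttle *t, uint64_t rate)
{
	pthread_mutex_init(&t->lock, NULL);
	t->rate = rate;
	t->sent = 0;
	clock_gettime(CLOCK_MONOTONIC, &t->start);
}

/*
 * Microseconds to sleep so that 'sent' bytes take at least sent/rate
 * seconds in total, given that 'elapsed_us' have already passed.
 */
static int64_t
throttle_delay_us(uint64_t sent, uint64_t rate, int64_t elapsed_us)
{
	int64_t		due_us;

	if (rate == 0)
		return 0;

	/* Split the division so sent * 1000000 cannot overflow. */
	due_us = (int64_t) ((sent / rate) * 1000000 +
						(sent % rate) * 1000000 / rate);

	return (due_us > elapsed_us) ? due_us - elapsed_us : 0;
}

/* Called by a worker after it has received 'nbytes' from the server. */
static void
throttle_account(Throttle *t, uint64_t nbytes)
{
	struct timespec now;
	struct timespec pause;
	int64_t		elapsed_us;
	int64_t		delay_us;

	pthread_mutex_lock(&t->lock);
	t->sent += nbytes;
	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed_us = (int64_t) (now.tv_sec - t->start.tv_sec) * 1000000 +
		(now.tv_nsec - t->start.tv_nsec) / 1000;
	delay_us = throttle_delay_us(t->sent, t->rate, elapsed_us);
	pthread_mutex_unlock(&t->lock);

	if (delay_us > 0)
	{
		pause.tv_sec = delay_us / 1000000;
		pause.tv_nsec = (delay_us % 1000000) * 1000;
		nanosleep(&pause, NULL);
	}
}
```
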
>
>
> [1]
> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
