Re: WIP/PoC for parallel backup

From: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
To: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-27 17:33:28
Message-ID: CADM=JegwcTCfdv6pKfbzQPc-hqLjir-ZdXeKqDydz5xAF1RL0g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:

> Hi Asif,
>
> While testing further I observed that parallel backup is not able to take
> a backup of a standby server.
>
> mkdir /tmp/archive_dir
> echo "archive_mode='on'">> data/postgresql.conf
> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>
> ./pg_ctl -D data -l logs start
> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>
> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">>
> /tmp/slave/postgresql.conf
> echo "restore_command='cp /tmp/archive_dir/%f %p'">>
> /tmp/slave/postgresql.conf
> echo "promote_trigger_file='/tmp/failover.log'">>
> /tmp/slave/postgresql.conf
>
> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>
> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "select
> pg_is_in_recovery();"
> pg_is_in_recovery
> -------------------
> f
> (1 row)
>
> [edb(at)localhost bin]$ ./psql postgres -p 5433 -c "select
> pg_is_in_recovery();"
> pg_is_in_recovery
> -------------------
> t
> (1 row)
>
>
>
>
> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
> pg_basebackup: error: could not list backup files: ERROR: the standby was promoted during online backup
> HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
> pg_basebackup: removing data directory "/tmp/bkp_s"
>
> #same is working fine without parallel backup
> [edb(at)localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
> [edb(at)localhost bin]$ ls /tmp/bkp_s/PG_VERSION
> /tmp/bkp_s/PG_VERSION
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>
>> Hi Asif,
>>
>> In another scenario, the backup data is corrupted for a tablespace. Again,
>> this is not reproducible every time,
>> but if I run the same set of commands I get the same error.
>>
>> [edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
>> waiting for server to start.... done
>> server started
>> [edb(at)localhost bin]$
>> [edb(at)localhost bin]$ mkdir /tmp/tblsp
>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp
>> location '/tmp/tblsp';"
>> CREATE TABLESPACE
>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database testdb
>> tablespace tblsp;"
>> CREATE DATABASE
>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
>> text);"
>> CREATE TABLE
>> [edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl
>> values ('parallel_backup with tablespace');"
>> INSERT 0 1
>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
>> /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>> [edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555"
>> start
>> waiting for server to start.... done
>> server started
>> [edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
>> pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>> oid | spcname | spcowner | spcacl | spcoptions
>> -------+------------+----------+--------+------------
>> 1663 | pg_default | 10 | |
>> 16384 | tblsp | 10 | |
>> (2 rows)
>>
>> [edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>> psql: error: could not connect to server: FATAL:
>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>> missing.
>> [edb(at)localhost bin]$
>> [edb(at)localhost bin]$ ls
>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>> [edb(at)localhost bin]$ ls
>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>> ls: cannot access
>> /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
>> directory
>>
>>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>>
>>> Hi Asif,
>>>
>>> On testing further, I found that pg_basebackup crashed when taking a
>>> backup with -R.
>>> This crash is not consistently reproducible.
>>>
>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test (a
>>> text);"
>>> CREATE TABLE
>>> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test
>>> values ('parallel_backup with -R recovery-conf');"
>>> INSERT 0 1
>>> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp
>>> -R
>>> Segmentation fault (core dumped)
>>>
>>> stack trace looks the same as it was on earlier reported crash with
>>> tablespace.
>>> --stack trace
>>> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>> Loaded symbols for /lib64/libnss_files.so.2
>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp
>>> -R'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>> pg_basebackup.c:3175
>>> 3175 backupinfo->curr = fetchfile->next;
>>> Missing separate debuginfos, use: debuginfo-install
>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>> (gdb) bt
>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>> pg_basebackup.c:3175
>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>> pg_basebackup.c:2715
>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>> pthread_create.c:301
>>> #3 0x00000039212e8c4d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>> (gdb)
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>>
>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>> patch to the latest master (b7f64c64).
>>>>> (V9 of the patches are attached).
>>>>>
>>>>
>>>> I had a further review of the patches and here are my few observations:
>>>>
>>>> 1.
>>>> +/*
>>>> + * stop_backup() - ends an online backup
>>>> + *
>>>> + * The function is called at the end of an online backup. It sends out pg_control
>>>> + * file, optionally WAL segments and ending WAL location.
>>>> + */
>>>>
>>>> Comments seem out-dated.
>>>>
>>>
Fixed.

>
>>>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>>>> requesting data in multiple threads, throttling seems important here.
>>>> Can you please explain why you have disabled that?
>>>>
>>>> 3. As we are always fetching a single file and as Robert suggested, let's
>>>> rename SEND_FILES to SEND_FILE instead.
>>>>
>>>
Yes, we are fetching a single file. However, SEND_FILES is still capable of
fetching multiple files in one go, which is why it has that name.

>>>> 4. Does this work on Windows? I mean does pthread_create() work on
>>>> Windows?
>>>> I asked this as I see that pgbench has its own implementation for
>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>
>>>
The patch has been updated to add support for the Windows platform.

>>>> 5. Typos:
>>>> tablspace => tablespace
>>>> safly => safely
>>>>
Done.

>>>> 6. parallel_backup_run() needs some comments explaining the states it goes
>>>> through PB_* states.
>>>>
>>>> 7.
>>>> + case PB_FETCH_REL_FILES: /* fetch files from server */
>>>> + if (backupinfo->activeworkers == 0)
>>>> + {
>>>> + backupinfo->backupstate = PB_STOP_BACKUP;
>>>> + free_filelist(backupinfo);
>>>> + }
>>>> + break;
>>>> + case PB_FETCH_WAL_FILES: /* fetch WAL files from server */
>>>> + if (backupinfo->activeworkers == 0)
>>>> + {
>>>> + backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>> + }
>>>> + break;
>>>>
Done.

>
>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>
Done.

The corrupted tablespace and the crash reported by Rajkumar have been fixed.
A pointer variable was left uninitialized, which in turn caused the
misbehavior.
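
For illustration only, here is a minimal sketch of the kind of guard that
avoids this class of crash when several worker threads pull entries from a
shared file list. The type and function names (BackupFile, BackupInfo,
backupinfo_init, next_file) are hypothetical and only loosely modeled on the
identifiers visible in the stack trace quoted above; this is not the code
from the patch, and the actual fix may look different.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types, not taken from the patch. */
typedef struct BackupFile
{
	const char *path;
	struct BackupFile *next;
} BackupFile;

typedef struct BackupInfo
{
	pthread_mutex_t lock;		/* protects curr */
	BackupFile *curr;			/* next file to hand out, NULL when done */
} BackupInfo;

/* Initialize every field so no worker ever sees an indeterminate pointer. */
static void
backupinfo_init(BackupInfo *info, BackupFile *head)
{
	assert(info != NULL);
	pthread_mutex_init(&info->lock, NULL);
	info->curr = head;			/* may legitimately be NULL (empty list) */
}

/* Hand out the next file, or NULL once the list is exhausted. */
static BackupFile *
next_file(BackupInfo *info)
{
	BackupFile *fetchfile;

	pthread_mutex_lock(&info->lock);
	fetchfile = info->curr;
	if (fetchfile != NULL)		/* never dereference an exhausted list */
		info->curr = fetchfile->next;
	pthread_mutex_unlock(&info->lock);

	return fetchfile;
}
```

Each worker would call next_file() in a loop and stop when it returns NULL,
so the list pointer is only read and advanced while the mutex is held.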

Attached is the updated set of patches. AFAIK, to complete the parallel backup
feature set, there remain three sub-features:

1- Parallel backup does not work with a standby server. In parallel backup,
the server spawns multiple processes and no shared state is maintained
between them. So currently there is no way to tell those processes whether
the standby was promoted during the backup, i.e. since START_BACKUP was
called.

2- Throttling. Robert previously suggested that we implement throttling on
the client side. However, I found a previous discussion where it was
advocated that it be added to the backend instead[1].

So, it seemed better to reach a consensus before moving the throttle function
to the client. That's why I have disabled it for the time being and have
asked for suggestions on how to move forward.
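
To make the client-side option concrete, here is a minimal sketch of how
several worker threads could share one rate limit. Everything in it is an
assumption for illustration: the Throttle struct and the throttle_init and
throttle_account functions are hypothetical names, not part of the patch or
of pg_basebackup, and the caller is assumed to supply the elapsed time and
to do the actual sleeping.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical limiter shared by all worker threads of one backup. */
typedef struct Throttle
{
	pthread_mutex_t lock;		/* protects bytes_total */
	uint64_t	maxrate;		/* bytes per second, all workers combined */
	uint64_t	bytes_total;	/* bytes received so far by all workers */
} Throttle;

static void
throttle_init(Throttle *t, uint64_t maxrate)
{
	assert(maxrate > 0);
	pthread_mutex_init(&t->lock, NULL);
	t->maxrate = maxrate;
	t->bytes_total = 0;
}

/*
 * Account for nbytes just received by one worker. Returns how many
 * microseconds that worker should sleep so that the combined rate stays at
 * or below maxrate. elapsed_us is the time since the backup started.
 */
static uint64_t
throttle_account(Throttle *t, uint64_t nbytes, uint64_t elapsed_us)
{
	uint64_t	due_us;

	pthread_mutex_lock(&t->lock);
	t->bytes_total += nbytes;
	/* Earliest time by which bytes_total may have been transferred. */
	due_us = t->bytes_total / t->maxrate * UINT64_C(1000000)
		+ t->bytes_total % t->maxrate * UINT64_C(1000000) / t->maxrate;
	pthread_mutex_unlock(&t->lock);

	return due_us > elapsed_us ? due_us - elapsed_us : 0;
}
```

A worker would call throttle_account() after each chunk it receives and
sleep for the returned duration. Note that this simple form lets unused
budget accumulate while workers are idle, so a real implementation would
probably want to cap that credit.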

It seems to me that we have to maintain a shared state in order to support
taking a backup from a standby. Also, a new feature for backup progress
reporting in the backend (pg_stat_progress_basebackup) was recently
committed as e65497df. For parallel backup to update these stats, a shared
state will be required.
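
As a rough sketch of the promotion check such a shared state would enable:
the recovery status is recorded once when the backup starts, and any process
serving the backup can later compare it with the current status. The
SharedBackupState struct and both functions are hypothetical names used only
for illustration; how the struct would actually be placed in shared memory
and protected is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state visible to every process serving one backup. */
typedef struct SharedBackupState
{
	bool		started_in_recovery;	/* recorded once, at START_BACKUP */
} SharedBackupState;

/* Called by the process that handles START_BACKUP. */
static void
shared_backup_start(SharedBackupState *state, bool in_recovery_now)
{
	assert(state != NULL);
	state->started_in_recovery = in_recovery_now;
}

/* True if the backup began on a standby that has since been promoted. */
static bool
standby_was_promoted(const SharedBackupState *state, bool in_recovery_now)
{
	return state->started_in_recovery && !in_recovery_now;
}
```

With something like this, a process handling a later command of the same
backup could raise the "standby was promoted during online backup" error
only when standby_was_promoted() returns true, instead of failing
unconditionally on a standby.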

Since multiple pg_basebackup instances can be running at the same time,
maintaining a shared state can become a little complex, unless we disallow
taking multiple parallel backups.

So, proceeding with this patch, I will be working on:
- implementing throttling on the client side.
- adding a shared state to handle backup from the standby.

[1]
https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

Attachment Content-Type Size
v10-parallel-backup.zip application/zip 40.6 KB
