Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
Cc: asifr(dot)rehman(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-19 10:41:24
Message-ID: CAKcux6mBX+Y0wcmtqEQjW9Y7PvTwN0OyjhLiV78Sc8oCgpkrOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Asif,

In another scenarios, bkp data is corrupted for tablespace. again this is
not reproducible everytime,
but If I am running the same set of commands I am getting the same error.

[edb(at)localhost bin]$ ./pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
[edb(at)localhost bin]$
[edb(at)localhost bin]$ mkdir /tmp/tblsp
[edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp
location '/tmp/tblsp';"
CREATE TABLESPACE
[edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create database testdb
tablespace tblsp;"
CREATE DATABASE
[edb(at)localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a
text);"
CREATE TABLE
[edb(at)localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values
('parallel_backup with tablespace');"
INSERT 0 1
[edb(at)localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T
/tmp/tblsp=/tmp/tblsp_bkp --jobs 2
[edb(at)localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555"
start
waiting for server to start.... done
server started
[edb(at)localhost bin]$ ./psql postgres -p 5555 -c "select * from
pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
oid | spcname | spcowner | spcacl | spcoptions
-------+------------+----------+--------+------------
1663 | pg_default | 10 | |
16384 | tblsp | 10 | |
(2 rows)

[edb(at)localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
psql: error: could not connect to server: FATAL:
"pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
[edb(at)localhost bin]$
[edb(at)localhost bin]$ ls
data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
[edb(at)localhost bin]$ ls
/tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
ls: cannot access
/tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or
directory

Thanks & Regards,
Rajkumar Raghuwanshi

On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:

> Hi Asif,
>
> On testing further, I found when taking backup with -R, pg_basebackup
> crashed
> this crash is not consistently reproducible.
>
> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "create table test (a
> text);"
> CREATE TABLE
> [edb(at)localhost bin]$ ./psql postgres -p 5432 -c "insert into test values
> ('parallel_backup with -R recovery-conf');"
> INSERT 0 1
> [edb(at)localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
> Segmentation fault (core dumped)
>
> stack trace looks the same as it was on earlier reported crash with
> tablespace.
> --stack trace
> [edb(at)localhost bin]$ gdb -q -c core.37915 pg_basebackup
> Loaded symbols for /lib64/libnss_files.so.2
> Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp
> -R'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
> pg_basebackup.c:3175
> 3175 backupinfo->curr = fetchfile->next;
> Missing separate debuginfos, use: debuginfo-install
> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) bt
> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
> pg_basebackup.c:3175
> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
> pthread_create.c:301
> #3 0x00000039212e8c4d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> (gdb)
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
> jeevan(dot)chalke(at)enterprisedb(dot)com> wrote:
>
>> Hi Asif,
>>
>>
>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>> patch to the latest master (b7f64c64).
>>> (V9 of the patches are attached).
>>>
>>
>> I had a further review of the patches and here are my few observations:
>>
>> 1.
>> +/*
>> + * stop_backup() - ends an online backup
>> + *
>> + * The function is called at the end of an online backup. It sends out
>> pg_control
>> + * file, optionally WAL segments and ending WAL location.
>> + */
>>
>> Comments seem out-dated.
>>
>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>> asking
>> data in multiple threads throttling seems important here. Can you please
>> explain why have you disabled that?
>>
>> 3. As we are always fetching a single file and as Robert suggested, let
>> rename
>> SEND_FILES to SEND_FILE instead.
>>
>> 4. Does this work on Windows? I mean does pthread_create() work on
>> Windows?
>> I asked this as I see that pgbench has its own implementation for
>> pthread_create() for WIN32 but this patch doesn't.
>>
>> 5. Typos:
>> tablspace => tablespace
>> safly => safely
>>
>> 6. parallel_backup_run() needs some comments explaining the states it goes
>> through PB_* states.
>>
>> 7.
>> + case PB_FETCH_REL_FILES: /* fetch files from server */
>> + if (backupinfo->activeworkers == 0)
>> + {
>> + backupinfo->backupstate = PB_STOP_BACKUP;
>> + free_filelist(backupinfo);
>> + }
>> + break;
>> + case PB_FETCH_WAL_FILES: /* fetch WAL files from server */
>> + if (backupinfo->activeworkers == 0)
>> + {
>> + backupinfo->backupstate = PB_BACKUP_COMPLETE;
>> + }
>> + break;
>>
>> Why free_filelist() is not called in PB_FETCH_WAL_FILES case?
>>
>> Thanks
>> --
>> Jeevan Chalke
>> Associate Database Architect & Team Lead, Product Development
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>> Phone: +91 20 66449694
>>
>> Website: www.enterprisedb.com
>> EnterpriseDB Blog: http://blogs.enterprisedb.com/
>> Follow us on Twitter: http://www.twitter.com/enterprisedb
>>
>> This e-mail message (and any attachment) is intended for the use of the
>> individual or entity to whom it is addressed. This message contains
>> information from EnterpriseDB Corporation that may be privileged,
>> confidential, or exempt from disclosure under applicable law. If you are
>> not the intended recipient or authorized to receive this for the intended
>> recipient, any use, dissemination, distribution, retention, archiving, or
>> copying of this communication is strictly prohibited. If you have received
>> this e-mail in error, please notify the sender immediately by reply e-mail
>> and delete this message.
>>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-03-19 10:47:37 Re: [PATCH] Skip llvm bytecode generation if LLVM is missing
Previous Message Atsushi Torikoshi 2020-03-19 10:39:33 Re: Wait event that should be reported while waiting for WAL archiving to finish