Re: WIP/PoC for parallel backup

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: asifr(dot)rehman(at)gmail(dot)com
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-03-16 06:08:31
Message-ID: CAKcux6kUyLCagQthb7pXzTNSom9qw7JS3kF-0FWQ80ZAvD7+pg@mail.gmail.com
Lists: pgsql-hackers

Thanks for the patches.

I have verified the reported issues with the new patches; they are fixed now.

I have one more observation: if a new slot name is given without the -C option,
it leads to a server crash error.

[edb@localhost bin]$ ./pg_basebackup -p 5432 -j 4 -D /tmp/bkp --slot test_bkp_slot
pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: replication slot "test_bkp_slot" does not exist
pg_basebackup: error: could not list backup files: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: removing data directory "/tmp/bkp"

Thanks & Regards,
Rajkumar Raghuwanshi

On Fri, Mar 13, 2020 at 9:51 PM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> wrote:

>
> On Wed, Mar 11, 2020 at 2:38 PM Rajkumar Raghuwanshi <
> rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>
>> Hi Asif
>>
>> I have started testing this feature. I have applied the v6 patch on commit
>> a069218163704c44a8996e7e98e765c56e2b9c8e (30 Jan).
>> I have a few observations; please take a look.
>>
>> *--if the backup fails, the backup directory is not removed.*
>> [edb@localhost bin]$ ./pg_basebackup -p 5432 --jobs=9 -D /tmp/test_bkp/bkp6
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> [edb@localhost bin]$ ./pg_basebackup -p 5432 --jobs=8 -D /tmp/test_bkp/bkp6
>> pg_basebackup: error: directory "/tmp/test_bkp/bkp6" exists but is not empty
>>
>>
>> *--giving a large number of jobs leads to a segmentation fault.*
>> ./pg_basebackup -p 5432 --jobs=1000 -D /tmp/t3
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> .
>> .
>> .
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
>> pg_basebackup: error: could not connect to server: could not fork new process for connection: Resource temporarily unavailable
>>
>> could not fork new process for connection: Resource temporarily unavailable
>> pg_basebackup: error: failed to create thread: Resource temporarily unavailable
>> Segmentation fault (core dumped)
>>
>> --stack-trace
>> gdb -q -c core.11824 pg_basebackup
>> Loaded symbols for /lib64/libnss_files.so.2
>> Core was generated by `./pg_basebackup -p 5432 --jobs=1000 -D /tmp/test_bkp/bkp10'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 pthread_join (threadid=140503120623360, thread_return=0x0) at pthread_join.c:46
>> 46 if (INVALID_NOT_TERMINATED_TD_P (pd))
>> Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>> (gdb) bt
>> #0 pthread_join (threadid=140503120623360, thread_return=0x0) at pthread_join.c:46
>> #1 0x0000000000408e21 in cleanup_workers () at pg_basebackup.c:2840
>> #2 0x0000000000403846 in disconnect_atexit () at pg_basebackup.c:316
>> #3 0x0000003921235a02 in __run_exit_handlers (status=1) at exit.c:78
>> #4 exit (status=1) at exit.c:100
>> #5 0x0000000000408aa6 in create_parallel_workers (backupinfo=0x1a4b8c0) at pg_basebackup.c:2713
>> #6 0x0000000000407946 in BaseBackup () at pg_basebackup.c:2127
>> #7 0x000000000040895c in main (argc=6, argv=0x7ffd566f4718) at pg_basebackup.c:2668
>>
>> *--with a tablespace in the same directory as data, parallel backup crashed*
>> [edb@localhost bin]$ ./initdb -D /tmp/data
>> [edb@localhost bin]$ ./pg_ctl -D /tmp/data -l /tmp/logfile start
>> [edb@localhost bin]$ mkdir /tmp/ts
>> [edb@localhost bin]$ ./psql postgres
>> psql (13devel)
>> Type "help" for help.
>>
>> postgres=# create tablespace ts location '/tmp/ts';
>> CREATE TABLESPACE
>> postgres=# create table tx (a int) tablespace ts;
>> CREATE TABLE
>> postgres=# \q
>> [edb@localhost bin]$ ./pg_basebackup -j 2 -D /tmp/tts -T /tmp/ts=/tmp/ts1
>> Segmentation fault (core dumped)
>>
>> --stack-trace
>> [edb@localhost bin]$ gdb -q -c core.15778 pg_basebackup
>> Loaded symbols for /lib64/libnss_files.so.2
>> Core was generated by `./pg_basebackup -j 2 -D /tmp/tts -T /tmp/ts=/tmp/ts1'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x0000000000409442 in get_backup_filelist (conn=0x140cb20, backupInfo=0x14210a0) at pg_basebackup.c:3000
>> 3000 backupInfo->curr->next = file;
>> Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>> (gdb) bt
>> #0 0x0000000000409442 in get_backup_filelist (conn=0x140cb20, backupInfo=0x14210a0) at pg_basebackup.c:3000
>> #1 0x0000000000408b56 in parallel_backup_run (backupinfo=0x14210a0) at pg_basebackup.c:2739
>> #2 0x0000000000407955 in BaseBackup () at pg_basebackup.c:2128
>> #3 0x000000000040895c in main (argc=7, argv=0x7ffca2910c58) at pg_basebackup.c:2668
>> (gdb)
>>
>
>
> Thanks, Rajkumar. I have fixed the above issues and rebased the patch
> onto the latest master (b7f64c64).
> (V9 of the patches is attached.)
>
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
