Re: WIP/PoC for parallel backup

From: Sumanta Mukherjee <sumanta(dot)mukherjee(at)enterprisedb(dot)com>
To: David Zhang <david(dot)zhang(at)highgo(dot)ca>
Cc: Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Kashif Zeeshan <kashif(dot)zeeshan(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-04-30 09:18:23
Message-ID: CAMSJAiqUAsHzkoFqeOTVafN99vy7yX2=b5qeLqJYLOGARVzXpA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Would it be possible to include the absolute numbers from the perf runs,
so that it is easier to understand the amount of improvement with and
without the patch, and across the different loads and worker counts?
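
For example (just a sketch of one way to do it, assuming the profiles
were collected with "perf record" into a perf.data file):

    # -n / --show-nr-samples adds an absolute "Samples" column next to
    # the percentages in the per-symbol breakdown
    perf report -n --stdio -i perf.data

    # or count the events directly, without sampling percentages
    perf stat -a -e block:block_rq_insert -e skb:consume_skb -- sleep 60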

I am also unsure why swapper is taking such a huge percentage of the
absolute time in the base run with just the postgres server and the
pg_basebackup client.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com

On Thu, Apr 30, 2020 at 1:18 PM David Zhang <david(dot)zhang(at)highgo(dot)ca> wrote:

> Hi,
>
> Thanks a lot for sharing the test results. Here are our test results
> using perf on three AWS t2.xlarge instances with the configuration below.
>
> Machine configuration:
> Instance Type :t2.xlarge
> Volume type :io1
> Memory (GB) :16
> vCPU # :4
> Architecture :x86_64
> IOPS :6000
> Database Size (GB) :45 (Server)
>
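> (Illustrative only -- the exact perf invocations are not part of this
> report. Profiles of this shape can be captured with something like:
>
>     # record the three event classes system-wide while the backup runs
>     perf record -a -e cpu-clock -e block:block_rq_insert -e skb:consume_skb \
>         -o server.perf.data -- sleep 600
>     # per-symbol breakdown grouped by command and shared object
>     perf report --stdio --sort comm,dso,symbol -i server.perf.data
> )
>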
> case 1: postgres server: without patch and without load
>
> * Disk I/O:
>
> # Samples: 342K of event 'block:block_rq_insert'
> # Event count (approx.): 342834
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. .....................
> #
> 97.65% postgres [kernel.kallsyms] [k] __elv_add_request
> 2.27% kworker/u30:0 [kernel.kallsyms] [k] __elv_add_request
>
>
> * CPU:
>
> # Samples: 6M of event 'cpu-clock'
> # Event count (approx.): 1559444750000
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .................... .............................................
> #
> 64.73% swapper [kernel.kallsyms] [k] native_safe_halt
> 10.89% postgres [vdso] [.] __vdso_gettimeofday
> 5.64% postgres [kernel.kallsyms] [k] do_syscall_64
> 5.43% postgres libpthread-2.26.so [.] __libc_recv
> 1.72% postgres [kernel.kallsyms] [k] pvclock_clocksource_read
>
> * Network:
>
> # Samples: 2M of event 'skb:consume_skb'
> # Event count (approx.): 2739785
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. ...........................
> #
> 91.58% swapper [kernel.kallsyms] [k] consume_skb
> 7.09% postgres [kernel.kallsyms] [k] consume_skb
> 0.61% kswapd0 [kernel.kallsyms] [k] consume_skb
> 0.44% ksoftirqd/3 [kernel.kallsyms] [k] consume_skb
>
>
> case 1: pg_basebackup client: without patch and without load
>
> * Disk I/O:
>
> # Samples: 371K of event 'block:block_rq_insert'
> # Event count (approx.): 371362
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. .....................
> #
> 96.78% kworker/u30:0 [kernel.kallsyms] [k] __elv_add_request
> 2.82% pg_basebackup [kernel.kallsyms] [k] __elv_add_request
> 0.29% kworker/u30:1 [kernel.kallsyms] [k] __elv_add_request
> 0.09% xfsaild/xvda1 [kernel.kallsyms] [k] __elv_add_request
>
>
> * CPU:
>
> # Samples: 3M of event 'cpu-clock'
> # Event count (approx.): 903527000000
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .................. .............................................
> #
> 87.99% swapper [kernel.kallsyms] [k] native_safe_halt
> 3.14% swapper [kernel.kallsyms] [k] __lock_text_start
> 0.48% swapper [kernel.kallsyms] [k] __softirqentry_text_start
> 0.37% pg_basebackup [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 0.35% swapper [kernel.kallsyms] [k] do_csum
>
> * Network:
>
> # Samples: 12M of event 'skb:consume_skb'
> # Event count (approx.): 12260713
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. ...........................
> #
> 95.12% swapper [kernel.kallsyms] [k] consume_skb
> 3.23% pg_basebackup [kernel.kallsyms] [k] consume_skb
> 0.83% ksoftirqd/1 [kernel.kallsyms] [k] consume_skb
> 0.45% kswapd0 [kernel.kallsyms] [k] consume_skb
>
>
> case 2: postgres server: with patch and with load, 4 backup workers on
> client side
>
> * Disk I/O:
>
> # Samples: 3M of event 'block:block_rq_insert'
> # Event count (approx.): 3634542
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. .....................
> #
> 98.88% postgres [kernel.kallsyms] [k] __elv_add_request
> 0.66% perf [kernel.kallsyms] [k] __elv_add_request
> 0.42% kworker/u30:1 [kernel.kallsyms] [k] __elv_add_request
> 0.01% sshd [kernel.kallsyms] [k] __elv_add_request
>
> * CPU:
>
> # Samples: 9M of event 'cpu-clock'
> # Event count (approx.): 2299129250000
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ..................... .............................................
> #
> 52.73% swapper [kernel.kallsyms] [k] native_safe_halt
> 8.31% postgres [vdso] [.] __vdso_gettimeofday
> 4.46% postgres [kernel.kallsyms] [k] do_syscall_64
> 4.16% postgres libpthread-2.26.so [.] __libc_recv
> 1.58% postgres [kernel.kallsyms] [k] __lock_text_start
> 1.52% postgres [kernel.kallsyms] [k] pvclock_clocksource_read
> 0.81% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>
>
> * Network:
>
> # Samples: 6M of event 'skb:consume_skb'
> # Event count (approx.): 6048795
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. ...........................
> #
> 85.81% postgres [kernel.kallsyms] [k] consume_skb
> 12.03% swapper [kernel.kallsyms] [k] consume_skb
> 0.97% postgres [kernel.kallsyms] [k] __consume_stateless_skb
> 0.85% ksoftirqd/3 [kernel.kallsyms] [k] consume_skb
> 0.24% perf [kernel.kallsyms] [k] consume_skb
>
>
> case 2: pg_basebackup 4 workers: with patch and with load
>
> * Disk I/O:
>
> # Samples: 372K of event 'block:block_rq_insert'
> # Event count (approx.): 372360
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. .....................
> #
> 97.26% kworker/u30:0 [kernel.kallsyms] [k] __elv_add_request
> 1.45% pg_basebackup [kernel.kallsyms] [k] __elv_add_request
> 0.95% kworker/u30:1 [kernel.kallsyms] [k] __elv_add_request
> 0.14% xfsaild/xvda1 [kernel.kallsyms] [k] __elv_add_request
>
>
> * CPU:
>
> # Samples: 4M of event 'cpu-clock'
> # Event count (approx.): 1234071000000
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ........................ .................................................
> #
> 89.25% swapper [kernel.kallsyms] [k] native_safe_halt
> 0.93% pg_basebackup [kernel.kallsyms] [k] __lock_text_start
> 0.91% swapper [kernel.kallsyms] [k] __lock_text_start
> 0.69% pg_basebackup [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 0.45% swapper [kernel.kallsyms] [k] do_csum
>
>
> * Network:
>
> # Samples: 6M of event 'skb:consume_skb'
> # Event count (approx.): 6449013
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... ................. ...........................
> #
> 90.28% pg_basebackup [kernel.kallsyms] [k] consume_skb
> 9.09% swapper [kernel.kallsyms] [k] consume_skb
> 0.29% ksoftirqd/1 [kernel.kallsyms] [k] consume_skb
> 0.21% sshd [kernel.kallsyms] [k] consume_skb
>
>
> The detailed perf reports are attached, covering the different scenarios,
> i.e. without the patch (with and without load, for server and client) and
> with the patch (with and without load, for 1, 2, 4 and 8 workers, for both
> server and client). The file names should be self-explanatory.
>
> Let me know if more information is required.
>
> Best regards,
>
> David
> On 2020-04-29 5:41 a.m., Suraj Kharage wrote:
>
> Hi,
>
> We at EnterpriseDB did some performance testing of this parallel backup
> patch to check how beneficial it is, and the results are below.
> In this testing, we ran the backup:
> 1) Without Asif’s patch
> 2) With Asif’s patch, using 1, 2, 4 and 8 workers (see the sketch below).
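>
> (A sketch of the kind of invocation used for each run -- the exact
> command lines are not reproduced here, and the worker-count option name
> is an assumption, so check the patch for the actual flag:
>
>     # baseline, without the patch
>     time pg_basebackup -h <server> -p 5432 -D /backup/base -Fp
>     # patched build, hypothetical --jobs option for 1/2/4/8 workers
>     time pg_basebackup -h <server> -p 5432 -D /backup/base -Fp --jobs=4
> )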
>
> We ran those tests on two setups:
>
> 1) Client and Server both on the same machine (Local backups)
>
> 2) Client and server on a different machine (remote backups)
>
>
> Machine details:
>
> 1: Server (on which the local backups were performed; also used as the
> server for the remote backups)
>
> 2: Client (used as the client for the remote backups)
>
>
> Server:
> RAM: 500 GB
> CPU details:
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 8
> NUMA node(s): 8
> Filesystem: ext4
>
>
> Client:
> RAM: 490 GB
> CPU details:
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 192
> On-line CPU(s) list: 0-191
> Thread(s) per core: 8
> Core(s) per socket: 1
> Socket(s): 24
> Filesystem: ext4
>
> Below are the results for the local test:
>
> (All percentages are relative to the normal backup without the patch.)
>
> 10 GB (10 tables - each table around 1.05 GB)
>   without patch:         real 0m27.016s   user 0m3.378s    sys 0m23.059s
>   1 worker with patch:   real 0m30.314s   user 0m3.575s    sys 0m22.946s    12% performance decrease
>   2 workers with patch:  real 0m20.400s   user 0m3.622s    sys 0m29.670s    27% performance increase
>   4 workers with patch:  real 0m15.331s   user 0m3.706s    sys 0m39.189s    43% performance increase
>   8 workers with patch:  real 0m15.094s   user 0m3.915s    sys 1m23.350s    44% performance increase
>
> 50 GB (50 tables - each table around 1.05 GB)
>   without patch:         real 2m11.049s   user 0m16.464s   sys 2m1.757s
>   1 worker with patch:   real 2m26.621s   user 0m18.497s   sys 2m4.792s     21% performance decrease
>   2 workers with patch:  real 1m9.581s    user 0m18.298s   sys 2m12.030s    46% performance increase
>   4 workers with patch:  real 0m53.894s   user 0m18.588s   sys 2m47.390s    58% performance increase
>   8 workers with patch:  real 0m55.373s   user 0m18.423s   sys 5m57.470s    57% performance increase
>
> 100 GB (100 tables - each table around 1.05 GB)
>   without patch:         real 4m4.776s    user 0m33.699s   sys 3m27.777s
>   1 worker with patch:   real 4m20.862s   user 0m35.753s   sys 3m28.262s    6% performance decrease
>   2 workers with patch:  real 2m37.411s   user 0m36.440s   sys 4m16.424s    35% performance increase
>   4 workers with patch:  real 1m49.503s   user 0m37.200s   sys 5m58.077s    55% performance increase
>   8 workers with patch:  real 1m36.762s   user 0m36.987s   sys 9m36.906s    60% performance increase
>
> 200 GB (200 tables - each table around 1.05 GB)
>   without patch:         real 10m34.998s  user 1m8.471s    sys 7m21.520s
>   1 worker with patch:   real 11m30.899s  user 1m12.933s   sys 8m14.496s    8% performance decrease
>   2 workers with patch:  real 6m8.481s    user 1m13.771s   sys 9m31.216s    41% performance increase
>   4 workers with patch:  real 4m2.403s    user 1m18.331s   sys 12m29.661s   61% performance increase
>   8 workers with patch:  real 4m3.768s    user 1m24.547s   sys 15m21.421s   61% performance increase
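>
> (The percentage columns appear to be derived from the "real" times,
> relative to the run without the patch; e.g. for 10 GB with 4 workers:
> (27.016s - 15.331s) / 27.016s ≈ 43% faster.)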
>
> Results for the remote test:
>
> (All percentages are relative to the normal backup without the patch.)
>
> 10 GB (10 tables - each table around 1.05 GB)
>   without patch:         real 1m36.829s   user 0m2.124s    sys 0m14.004s
>   1 worker with patch:   real 1m37.598s   user 0m3.272s    sys 0m11.110s    0.8% performance decrease
>   2 workers with patch:  real 1m36.753s   user 0m2.627s    sys 0m15.312s    0.08% performance increase
>   4 workers with patch:  real 1m37.212s   user 0m3.835s    sys 0m13.221s    0.3% performance decrease
>   8 workers with patch:  real 1m36.977s   user 0m4.475s    sys 0m17.937s    0.1% performance decrease
>
> 50 GB (50 tables - each table around 1.05 GB)
>   without patch:         real 7m54.211s   user 0m10.826s   sys 1m10.435s
>   1 worker with patch:   real 7m55.603s   user 0m16.535s   sys 1m8.147s     0.2% performance decrease
>   2 workers with patch:  real 7m53.499s   user 0m18.131s   sys 1m8.822s     0.1% performance increase
>   4 workers with patch:  real 7m54.687s   user 0m15.818s   sys 1m30.991s    0.1% performance decrease
>   8 workers with patch:  real 7m54.658s   user 0m20.783s   sys 1m34.460s    0.1% performance decrease
>
> 100 GB (100 tables - each table around 1.05 GB)
>   without patch:         real 15m45.776s  user 0m21.802s   sys 2m59.006s
>   1 worker with patch:   real 15m46.315s  user 0m32.499s   sys 2m47.245s    0.05% performance decrease
>   2 workers with patch:  real 15m46.065s  user 0m28.877s   sys 2m21.181s    0.03% performance decrease
>   4 workers with patch:  real 15m47.793s  user 0m30.932s   sys 2m36.708s    0.2% performance decrease
>   8 workers with patch:  real 15m47.129s  user 0m35.151s   sys 3m23.572s    0.14% performance decrease
>
> 200 GB (200 tables - each table around 1.05 GB)
>   without patch:         real 32m55.720s  user 0m50.602s   sys 5m38.875s
>   1 worker with patch:   real 31m30.602s  user 0m45.377s   sys 4m57.405s    4% performance increase
>   2 workers with patch:  real 31m30.214s  user 0m55.023s   sys 5m8.689s     4% performance increase
>   4 workers with patch:  real 31m31.187s  user 1m13.390s   sys 5m40.861s    4% performance increase
>   8 workers with patch:  real 31m31.729s  user 1m4.955s    sys 6m35.774s    4% performance increase
>
>
> With the client and server on the same machine, the results show around 50%
> improvement in the parallel runs with 4 and 8 workers. We don't see a huge
> additional performance improvement as more workers are added.
>
>
> Whereas, when the client and server are on different machines, we don't see
> any major benefit in performance. This matches the test results posted by
> David Zhang upthread.
>
>
>
> We ran the 100 GB backup test with 4 parallel workers to look at CPU
> usage and other information. What we noticed is that the server consumes
> almost 100% CPU the whole time, and pg_stat_activity shows that the server
> is busy with ClientWrite most of the time.
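>
> (For reference, the wait events can be seen with a query along these
> lines; the exact query we used is not included here:
>
>     psql -X -c "SELECT pid, state, wait_event_type, wait_event FROM pg_stat_activity;"
> )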
>
>
> Attaching the captured output for:
>
> 1) top command output on the server, every 5 seconds
>
> 2) pg_stat_activity output, every 5 seconds
>
> 3) top command output on the client, every 5 seconds (one way to capture
> this is sketched below)
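>
> (One possible way to capture this kind of periodic output -- the exact
> commands we used are not attached:
>
>     # top in batch mode, one snapshot every 5 seconds
>     top -b -d 5 > top_server.log &
>     # repeat the pg_stat_activity query above every 5 seconds
>     while sleep 5; do
>         psql -X -c "SELECT now(), * FROM pg_stat_activity;" >> pg_stat_activity.log
>     done
> )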
>
>
> Do let me know if anyone has further questions/inputs for the
> benchmarking.
>
> Thanks to Rushabh Lathia for helping me with this testing.
>
> On Tue, Apr 28, 2020 at 8:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> On Mon, Apr 27, 2020 at 10:23 PM David Zhang <david(dot)zhang(at)highgo(dot)ca>
>> wrote:
>> >
>> > Hi,
>> >
>> > Here are the parallel backup performance test results with and without
>> > the patch "parallel_backup_v15" in an AWS cloud environment. Two
>> > "t2.xlarge" machines were used: one for the Postgres server and the
>> > other for pg_basebackup, with the same machine configuration shown below.
>> >
>> > Machine configuration:
>> > Instance Type :t2.xlarge
>> > Volume type :io1
>> > Memory (GB) :16
>> > vCPU # :4
>> > Architecture :x86_64
>> > IOPS :6000
>> > Database Size (GB) :108
>> >
>> > Performance test results:
>> > without patch:
>> > real 18m49.346s
>> > user 1m24.178s
>> > sys 7m2.966s
>> >
>> > 1 worker with patch:
>> > real 18m43.201s
>> > user 1m55.787s
>> > sys 7m24.724s
>> >
>> > 2 workers with patch:
>> > real 18m47.373s
>> > user 2m22.970s
>> > sys 11m23.891s
>> >
>> > 4 workers with patch:
>> > real 18m46.878s
>> > user 2m26.791s
>> > sys 13m14.716s
>> >
>> > As requested, I didn't have pgbench running in parallel like we did
>> > in the previous benchmark.
>> >
>>
>> So, there doesn't seem to be any significant improvement in this
>> scenario. Now, it is not clear why there was a significant
>> improvement in the previous run where pgbench was also running
>> simultaneously. I am not sure, but maybe it is because, when a lot of
>> other backends were running (performing a read-only workload), the
>> backend responsible for doing the backup was getting frequently
>> scheduled out, which slowed down the overall backup process; and when
>> we start using multiple backends for the backup, one or another backup
>> process is always running, making the overall backup faster. One idea
>> to find this out is to check how much time the backup takes when we
>> run it with and without the pgbench workload on HEAD (aka unpatched
>> code), as sketched below. Even if what I am saying is true, or there
>> is some other reason why we are seeing a speedup in some cases (where
>> there is a concurrent workload), it might not make the case for using
>> multiple backends for backup; but still, it is good to find that
>> information, as it might help in designing this feature better.
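>>
>> Something along these lines would do it (a sketch only; adjust the
>> scale and client counts to match the earlier runs):
>>
>>     # backup on HEAD with no concurrent load
>>     time pg_basebackup -h <server> -D /backup/noload -Fp
>>     # backup on HEAD while a read-only pgbench workload is running
>>     pgbench -S -c 16 -j 4 -T 1200 <dbname> &
>>     time pg_basebackup -h <server> -D /backup/load -Fp
>>     wait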
>>
>> > The perf report files for both the Postgres server and pg_basebackup
>> > sides are attached.
>> >
>>
>> It is not clear which functions are taking more time, or for which
>> functions time is reduced, as function symbols are not present in the
>> reports. You can refer to
>> "https://wiki.postgresql.org/wiki/Profiling_with_perf" to see how to
>> take profiles, and additionally build with -fno-omit-frame-pointer
>> (you can pass CFLAGS="-fno-omit-frame-pointer" to configure).
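>>
>> For example (one possible invocation, adjust to your environment):
>>
>>     ./configure CFLAGS="-O2 -fno-omit-frame-pointer" --prefix=/usr/local/pgsql
>>     make -j4 && make install
>>     # then record with call graphs so the report resolves symbols
>>     perf record -g -a -- sleep 60
>>     perf report --stdio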
>>
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>>
>>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
>
> --
> David
>
> Software Engineer
> Highgo Software Inc. (Canada)
> www.highgo.ca
>
