Re: WIP/PoC for parallel backup

From: David Zhang <david(dot)zhang(at)highgo(dot)ca>
To: Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Kashif Zeeshan <kashif(dot)zeeshan(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2020-04-30 06:26:16
Message-ID: 52a86c49-ff6c-3825-1d25-20596907b7a7@highgo.ca
Lists: pgsql-hackers

Hi,

Thanks a lot for sharing the test results. Here are our test results,
collected with perf on three AWS t2.xlarge instances with the configuration
below.

Machine configuration:
      Instance Type        :t2.xlarge
      Volume type          :io1
      Memory               :16 GB
      vCPU #               :4
      Architecture         :x86_64
      IOPS                 :6000
      Database Size (GB)   :45 (Server)
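
For reference, the profiles below were collected with perf (events
block:block_rq_insert for disk I/O, cpu-clock for CPU, and skb:consume_skb
for network). The commands were roughly along the lines of the sketch
below; the output file names and capture duration are illustrative only,
not the exact invocation we used:

    # run on each machine while the backup is in progress
    perf record -a -g -e block:block_rq_insert -o perf_block.data -- sleep 600
    perf record -a -g -e cpu-clock             -o perf_cpu.data   -- sleep 600
    perf record -a -g -e skb:consume_skb       -o perf_net.data   -- sleep 600

    # generate the text reports afterwards
    perf report --stdio -i perf_block.data
    perf report --stdio -i perf_cpu.data
    perf report --stdio -i perf_net.data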

case 1: postgres server: without patch and without load

* Disk I/O:

# Samples: 342K of event 'block:block_rq_insert'
# Event count (approx.): 342834
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. .....................
#
    97.65%  postgres         [kernel.kallsyms]  [k] __elv_add_request
     2.27%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 6M of event 'cpu-clock'
# Event count (approx.): 1559444750000
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  .................... .............................................
#
    64.73%  swapper          [kernel.kallsyms]     [k] native_safe_halt
    10.89%  postgres         [vdso]                [.] __vdso_gettimeofday
     5.64%  postgres         [kernel.kallsyms]     [k] do_syscall_64
     5.43%  postgres         libpthread-2.26.so    [.] __libc_recv
     1.72%  postgres         [kernel.kallsyms]     [k] pvclock_clocksource_read

* Network:

# Samples: 2M of event 'skb:consume_skb'
# Event count (approx.): 2739785
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. ...........................
#
    91.58%  swapper          [kernel.kallsyms]  [k] consume_skb
     7.09%  postgres         [kernel.kallsyms]  [k] consume_skb
     0.61%  kswapd0          [kernel.kallsyms]  [k] consume_skb
     0.44%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb

case 1: pg_basebackup client: without patch and without load

* Disk I/O:

# Samples: 371K of event 'block:block_rq_insert'
# Event count (approx.): 371362
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. .....................
#
    96.78%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
     2.82%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
     0.29%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.09%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 3M of event 'cpu-clock'
# Event count (approx.): 903527000000
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  .................. .............................................
#
    87.99%  swapper          [kernel.kallsyms]   [k] native_safe_halt
     3.14%  swapper          [kernel.kallsyms]   [k] __lock_text_start
     0.48%  swapper          [kernel.kallsyms]   [k] __softirqentry_text_start
     0.37%  pg_basebackup    [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
     0.35%  swapper          [kernel.kallsyms]   [k] do_csum

* Network:

# Samples: 12M of event 'skb:consume_skb'
# Event count (approx.): 12260713
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. ...........................
#
    95.12%  swapper          [kernel.kallsyms]  [k] consume_skb
     3.23%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
     0.83%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
     0.45%  kswapd0          [kernel.kallsyms]  [k] consume_skb

case 2: postgres server: with patch and with load, 4 backup workers on
client side

* Disk I/O:

# Samples: 3M of event 'block:block_rq_insert'
# Event count (approx.): 3634542
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. .....................
#
    98.88%  postgres         [kernel.kallsyms]  [k] __elv_add_request
     0.66%  perf             [kernel.kallsyms]  [k] __elv_add_request
     0.42%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.01%  sshd             [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 9M of event 'cpu-clock'
# Event count (approx.): 2299129250000
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ..................... .............................................
#
    52.73%  swapper          [kernel.kallsyms]      [k] native_safe_halt
     8.31%  postgres         [vdso]                 [.] __vdso_gettimeofday
     4.46%  postgres         [kernel.kallsyms]      [k] do_syscall_64
     4.16%  postgres         libpthread-2.26.so     [.] __libc_recv
     1.58%  postgres         [kernel.kallsyms]      [k] __lock_text_start
     1.52%  postgres         [kernel.kallsyms]      [k] pvclock_clocksource_read
     0.81%  postgres         [kernel.kallsyms]      [k] copy_user_enhanced_fast_string

* Network:

# Samples: 6M of event 'skb:consume_skb'
# Event count (approx.): 6048795
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. ...........................
#
    85.81%  postgres         [kernel.kallsyms]  [k] consume_skb
    12.03%  swapper          [kernel.kallsyms]  [k] consume_skb
     0.97%  postgres         [kernel.kallsyms]  [k] __consume_stateless_skb
     0.85%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb
     0.24%  perf             [kernel.kallsyms]  [k] consume_skb

case 2: pg_basebackup 4 workers: with patch and with load

* Disk I/O:

# Samples: 372K of event 'block:block_rq_insert'
# Event count (approx.): 372360
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. .....................
#
    97.26%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
     1.45%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
     0.95%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.14%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 4M of event 'cpu-clock'
# Event count (approx.): 1234071000000
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ........................ .................................................
#
    89.25%  swapper          [kernel.kallsyms]         [k] native_safe_halt
     0.93%  pg_basebackup    [kernel.kallsyms]         [k] __lock_text_start
     0.91%  swapper          [kernel.kallsyms]         [k] __lock_text_start
     0.69%  pg_basebackup    [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
     0.45%  swapper          [kernel.kallsyms]         [k] do_csum

* Network:

# Samples: 6M of event 'skb:consume_skb'
# Event count (approx.): 6449013
#
# Overhead  Command          Shared Object Symbol
# ........  ...............  ................. ...........................
#
    90.28%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
     9.09%  swapper          [kernel.kallsyms]  [k] consume_skb
     0.29%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
     0.21%  sshd             [kernel.kallsyms]  [k] consume_skb

The detailed perf reports are attached, covering the different scenarios:
without the patch (with and without load, for both server and client), and
with the patch (with and without load, for 1, 2, 4, and 8 workers, for both
server and client). The file names should make the cases self-explanatory.
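
For anyone who wants to reproduce the "with load" cases, a run like case 2
above can be driven roughly as sketched below. The host name, pgbench
settings, and backup directory are placeholders, and the --jobs option for
the worker count is assumed from the patch under discussion (its exact name
may differ between patch versions):

    # on the server: generate a concurrent read-only load with pgbench
    pgbench -c 8 -j 4 -T 1200 -S postgres &

    # on the client: take the base backup with 4 parallel workers
    # (assumes the patched pg_basebackup exposes the worker count via --jobs)
    time pg_basebackup -h <server-host> -U <user> -D /path/to/backupdir --jobs=4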

Let me know if more information is required.

Best regards,

David

On 2020-04-29 5:41 a.m., Suraj Kharage wrote:
> Hi,
>
> We at EnterpriseDB did some performance testing around this
> parallel backup to check how beneficial it is, and below are the
> results. In this testing, we ran the backup:
> 1) Without Asif’s patch
> 2) With Asif’s patch and combinations of 1, 2, 4, and 8 workers.
>
> We ran those tests on two setups:
>
> 1) Client and Server both on the same machine (Local backups)
>
> 2) Client and server on a different machine (remote backups)
>
>
> *Machine details:*
>
> 1: Server (on which the local backups were performed; also used as the
> server for remote backups)
>
> 2: Client (used as the client for remote backups)
>
>
> *Server:*
>
> RAM:500 GB
> CPU details:
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 8
> NUMA node(s): 8
> Filesystem:ext4
>
>
> *Client:*
> RAM:490 GB
> CPU details:
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 192
> On-line CPU(s) list: 0-191
> Thread(s) per core: 8
> Core(s) per socket: 1
> Socket(s): 24
> Filesystem:ext4
>
> Below are the results for the local test:
>
> (percentages are the change in backup performance relative to the run
> without the patch)
>
> Data size: 10 GB (10 tables, each around 1.05 GB)
>     without patch   real 0m27.016s    user 0m3.378s    sys 0m23.059s
>     1 worker        real 0m30.314s    user 0m3.575s    sys 0m22.946s    12% decrease
>     2 workers       real 0m20.400s    user 0m3.622s    sys 0m29.670s    27% increase
>     4 workers       real 0m15.331s    user 0m3.706s    sys 0m39.189s    43% increase
>     8 workers       real 0m15.094s    user 0m3.915s    sys 1m23.350s    44% increase
>
> Data size: 50 GB (50 tables, each around 1.05 GB)
>     without patch   real 2m11.049s    user 0m16.464s   sys 2m1.757s
>     1 worker        real 2m26.621s    user 0m18.497s   sys 2m4.792s     21% decrease
>     2 workers       real 1m9.581s     user 0m18.298s   sys 2m12.030s    46% increase
>     4 workers       real 0m53.894s    user 0m18.588s   sys 2m47.390s    58% increase
>     8 workers       real 0m55.373s    user 0m18.423s   sys 5m57.470s    57% increase
>
> Data size: 100 GB (100 tables, each around 1.05 GB)
>     without patch   real 4m4.776s     user 0m33.699s   sys 3m27.777s
>     1 worker        real 4m20.862s    user 0m35.753s   sys 3m28.262s    6% decrease
>     2 workers       real 2m37.411s    user 0m36.440s   sys 4m16.424s    35% increase
>     4 workers       real 1m49.503s    user 0m37.200s   sys 5m58.077s    55% increase
>     8 workers       real 1m36.762s    user 0m36.987s   sys 9m36.906s    60% increase
>
> Data size: 200 GB (200 tables, each around 1.05 GB)
>     without patch   real 10m34.998s   user 1m8.471s    sys 7m21.520s
>     1 worker        real 11m30.899s   user 1m12.933s   sys 8m14.496s    8% decrease
>     2 workers       real 6m8.481s     user 1m13.771s   sys 9m31.216s    41% increase
>     4 workers       real 4m2.403s     user 1m18.331s   sys 12m29.661s   61% increase
>     8 workers       real 4m3.768s     user 1m24.547s   sys 15m21.421s   61% increase
>
>
> Results for the remote test:
>
> Data size: 10 GB (10 tables, each around 1.05 GB)
>     without patch   real 1m36.829s    user 0m2.124s    sys 0m14.004s
>     1 worker        real 1m37.598s    user 0m3.272s    sys 0m11.110s    0.8% decrease
>     2 workers       real 1m36.753s    user 0m2.627s    sys 0m15.312s    0.08% increase
>     4 workers       real 1m37.212s    user 0m3.835s    sys 0m13.221s    0.3% decrease
>     8 workers       real 1m36.977s    user 0m4.475s    sys 0m17.937s    0.1% decrease
>
> Data size: 50 GB (50 tables, each around 1.05 GB)
>     without patch   real 7m54.211s    user 0m10.826s   sys 1m10.435s
>     1 worker        real 7m55.603s    user 0m16.535s   sys 1m8.147s     0.2% decrease
>     2 workers       real 7m53.499s    user 0m18.131s   sys 1m8.822s     0.1% increase
>     4 workers       real 7m54.687s    user 0m15.818s   sys 1m30.991s    0.1% decrease
>     8 workers       real 7m54.658s    user 0m20.783s   sys 1m34.460s    0.1% decrease
>
> Data size: 100 GB (100 tables, each around 1.05 GB)
>     without patch   real 15m45.776s   user 0m21.802s   sys 2m59.006s
>     1 worker        real 15m46.315s   user 0m32.499s   sys 2m47.245s    0.05% decrease
>     2 workers       real 15m46.065s   user 0m28.877s   sys 2m21.181s    0.03% decrease
>     4 workers       real 15m47.793s   user 0m30.932s   sys 2m36.708s    0.2% decrease
>     8 workers       real 15m47.129s   user 0m35.151s   sys 3m23.572s    0.14% decrease
>
> Data size: 200 GB (200 tables, each around 1.05 GB)
>     without patch   real 32m55.720s   user 0m50.602s   sys 5m38.875s
>     1 worker        real 31m30.602s   user 0m45.377s   sys 4m57.405s    4% increase
>     2 workers       real 31m30.214s   user 0m55.023s   sys 5m8.689s     4% increase
>     4 workers       real 31m31.187s   user 1m13.390s   sys 5m40.861s    4% increase
>     8 workers       real 31m31.729s   user 1m4.955s    sys 6m35.774s    4% increase
>
>
>
> With client and server on the same machine, the results show around 50%
> improvement in the parallel runs with 4 and 8 workers.  We don’t see a
> large further improvement as more workers are added.
>
>
> Whereas, when the client and server are on different machines, we don’t
> see any major benefit in performance.  This matches the test results
> posted by David Zhang upthread.
>
>
>
> We ran the 100GB backup test with 4 parallel workers to observe the CPU
> usage and other information. What we noticed is that the server is
> consuming almost 100% CPU the whole time, and pg_stat_activity
> shows that the server is busy with ClientWrite most of the time.
>
>
> Attaching captured output for
>
> 1) Top command output on the server every 5 seconds
>
> 2) pg_stat_activity output every 5 seconds
>
> 3) Top command output on the client every 5 seconds
>
>
> Do let me know if anyone has further questions/inputs for the
> benchmarking.
>
>
> Thanks to Rushabh Lathia for helping me with this testing.
>
> On Tue, Apr 28, 2020 at 8:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com
> <mailto:amit(dot)kapila16(at)gmail(dot)com>> wrote:
>
> On Mon, Apr 27, 2020 at 10:23 PM David Zhang
> <david(dot)zhang(at)highgo(dot)ca <mailto:david(dot)zhang(at)highgo(dot)ca>> wrote:
> >
> > Hi,
> >
> > Here is the parallel backup performance test results with and
> without
> > the patch "parallel_backup_v15" on AWS cloud environment. Two
> > "t2.xlarge" machines were used: one for Postgres server and the
> other
> > one for pg_basebackup with the same machine configuration
> showing below.
> >
> > Machine configuration:
> >      Instance Type        :t2.xlarge
> >      Volume type          :io1
> >      Memory (MiB)         :16GB
> >      vCPU #               :4
> >      Architecture         :x86_64
> >      IOP                  :6000
> >      Database Size (GB)   :108
> >
> > Performance test results:
> > without patch:
> >      real 18m49.346s
> >      user 1m24.178s
> >      sys 7m2.966s
> >
> > 1 worker with patch:
> >      real 18m43.201s
> >      user 1m55.787s
> >      sys 7m24.724s
> >
> > 2 worker with patch:
> >      real 18m47.373s
> >      user 2m22.970s
> >      sys 11m23.891s
> >
> > 4 worker with patch:
> >      real 18m46.878s
> >      user 2m26.791s
> >      sys 13m14.716s
> >
> > As required, I didn't have the pgbench running in parallel like
> we did
> > in the previous benchmark.
> >
>
> So, there doesn't seem to be any significant improvement in this
> scenario.  Now, it is not clear why there was a significant
> improvement in the previous run where pgbench was also running
> simultaneously.  I am not sure but maybe it is because when a lot of
> other backends were running (performing read-only workload) the
> backend that was responsible for doing backup was getting frequently
> scheduled out and it slowed down the overall backup process. And when
> we start using multiple backends for backup one or other backup
> process is always running making the overall backup faster. One idea
> to find this out is to check how much time backup takes when we run it
> with and without pgbench workload on HEAD (aka unpatched code).  Even
> if what I am saying is true or there is some other reason due to which
> we are seeing speedup in some cases (where there is a concurrent
> workload), it might not make the case for using multiple backends for
> backup but still, it is good to find that information as it might help
> in designing this feature better.
>
> > The perf report files for both Postgres server and pg_basebackup
> sides
> > are attached.
> >
>
> It is not clear which functions are taking more time or for which
> functions time is reduced as function symbols are not present in the
> reports.  I think you can refer to
> "https://wiki.postgresql.org/wiki/Profiling_with_perf" to see how to
> take profiles and additionally use -fno-omit-frame-pointer during
> configure (you can use CFLAGS="-fno-omit-frame-pointer" during
> configure).
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

Attachment: perf_report.tar.gz (application/x-gzip, 939.5 KB)
