Re: Parallel copy

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Parallel copy
Date: 2020-09-03 06:50:31
Message-ID: CAJcOf-fea8MrCV+2M=w8jMjp3L7UwJWtEyFsRXV5-gMHxr89CQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 2, 2020 at 3:40 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql file has the table that I
> used, insert_data_gen.txt has the insert data generation scripts. I
> varied the count in insert_data_gen to generate csv files of 1GB, 2GB
> & 5GB & varied the data to generate 1 char, 10 char & 100 char for
> each column for various testing. You can rename insert_data_gen.txt to
> insert_data_gen.sh & generate the csv file.

Hi Vignesh,

I used your script and table definition, multiplying the number of
records to produce 5GB and 9.5GB CSV files.
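
For reference, the generation step can also be done entirely in SQL;
the sketch below is not your insert_data_gen script, and the column
list and repeat() widths are placeholders standing in for the 1, 10
and 100 char variations you describe:

  COPY (
      SELECT g AS id,
             repeat('x', 10)  AS col_10char,
             repeat('y', 100) AS col_100char
      FROM generate_series(1, 530000) AS g
  ) TO '/tmp/parallel_copy_test.csv' WITH (FORMAT csv);

Scaling the generate_series bound scales the file to the target size.
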
I got the following results:

(1) Postgres default settings, 5GB CSV (530000 rows):

Copy Type        Duration (s)   Load factor
===============================================
Normal Copy         132.197          -

Parallel Copy
(#workers)
 1                   98.428         1.34
 2                   52.753         2.51
 3                   37.630         3.51
 4                   33.554         3.94
 5                   33.636         3.93
 6                   33.821         3.91
 7                   34.270         3.86
 8                   34.465         3.84
 9                   34.315         3.85
10                   33.543         3.94
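
"Load factor" above is the normal-copy duration divided by the
parallel-copy duration, e.g. 132.197 / 98.428 = 1.34. For anyone
reproducing, each run would look something like the following sketch,
assuming the COPY ... WITH (PARALLEL n) option syntax proposed in this
patch series (not in released PostgreSQL); the table and file names
are placeholders:

  \timing on
  TRUNCATE test_table;
  COPY test_table FROM '/tmp/parallel_copy_test.csv' WITH (FORMAT csv);
  TRUNCATE test_table;
  COPY test_table FROM '/tmp/parallel_copy_test.csv' WITH (FORMAT csv, PARALLEL 4);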

(2) Postgres increased resources, 5GB CSV (530000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB
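
If reproducing, one way to apply these settings is via ALTER SYSTEM;
note that shared_buffers and max_worker_processes only take effect
after a server restart, while the rest take effect on a reload:

  ALTER SYSTEM SET shared_buffers = '76GB';
  ALTER SYSTEM SET work_mem = '38GB';
  ALTER SYSTEM SET maintenance_work_mem = '38GB';
  ALTER SYSTEM SET max_worker_processes = 16;
  ALTER SYSTEM SET max_parallel_workers = 16;
  ALTER SYSTEM SET checkpoint_timeout = '30min';
  ALTER SYSTEM SET max_wal_size = '2GB';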

Copy Type        Duration (s)   Load factor
===============================================
Normal Copy         131.835          -

Parallel Copy
(#workers)
 1                   98.301         1.34
 2                   53.261         2.48
 3                   37.868         3.48
 4                   34.224         3.85
 5                   33.831         3.90
 6                   34.229         3.85
 7                   34.512         3.82
 8                   34.303         3.84
 9                   34.690         3.80
10                   34.479         3.82

(3) Postgres default settings, 9.5GB CSV (1000000 rows):

Copy Type        Duration (s)   Load factor
===============================================
Normal Copy         248.503          -

Parallel Copy
(#workers)
 1                  185.724         1.34
 2                   99.832         2.49
 3                   70.560         3.52
 4                   63.328         3.92
 5                   63.182         3.93
 6                   64.108         3.88
 7                   64.131         3.87
 8                   64.350         3.86
 9                   64.293         3.87
10                   63.818         3.89

(4) Postgres increased resources, 9.5GB CSV (1000000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type        Duration (s)   Load factor
===============================================
Normal Copy         248.647          -

Parallel Copy
(#workers)
 1                  182.236         1.36
 2                   92.814         2.68
 3                   67.347         3.69
 4                   63.839         3.89
 5                   62.672         3.97
 6                   63.873         3.89
 7                   64.930         3.83
 8                   63.885         3.89
 9                   62.397         3.98
10                   64.477         3.86

So, as you found, with this particular table definition and data, 1
parallel worker always performs better than normal copy.
The difference from the results I reported earlier seems to be caused
by the following factors:
- a different table definition (my earlier tests used a variety of
column types)
- the amount of data per row (my earlier tests used less data per row,
so more rows for the same size of data file)

As I previously observed, if the target table has no indexes,
increasing resources beyond the default settings makes little
difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia
