Re: Parallel copy

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Parallel copy
Date: 2020-09-01 10:09:11
Message-ID: CAJcOf-cQ8r1P8gLE+PFR9BWMi2hPLZSXE=deK4iOx0eHsYLFUA@mail.gmail.com
Lists: pgsql-hackers

Hi Vignesh,

>Can you share with me the script you used to generate the data & the ddl of the table, so that it will help me check that scenario you faced the problem.

Unfortunately I can't directly share it (it's considered company IP),
though having said that, it only does something relatively simple and
unremarkable, so I'd expect it to be much like what you are currently
doing. I can describe it in general terms.

The table being used contains 100 columns (as I pointed out earlier),
with the first column of "bigserial" type, and the others of different
types like "character varying(255)", "numeric", "date" and "time
without time zone". There are about 60 of the "character varying(255)"
columns overall, with the other types interspersed.

When testing with indexes, four b-tree indexes were used, each
including the first column plus nine other columns distinct to that
index.
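
To give a concrete picture, here's a made-up sketch of that kind of DDL (the table name, column names and exact type mix are invented, since I can't share the real schema):

```shell
#!/bin/sh
# Emit a scaled-down stand-in for the schema described above.
# All identifiers here are hypothetical.
cat <<'EOF' > create_table.sql
CREATE TABLE test_copy (
    id   bigserial,
    col2 character varying(255),
    col3 numeric,
    col4 date,
    col5 time without time zone
    -- ... continues to 100 columns, roughly 60 of them
    -- character varying(255), the rest interspersed as above
);
-- Each of the four b-tree indexes covers the first column
-- plus nine columns unique to that index, e.g.:
CREATE INDEX test_copy_idx1 ON test_copy (id, col2, col3, col4, col5);
EOF
```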

A CSV record (row) template file was created with test data
(corresponding to the table), and that was simply copied and appended
over and over with a record prefix in order to create the test data
file.
The following shell script basically does it (but very slowly); I was
using a small C program to do the same thing, a lot faster.
In my case, N=2550000 produced about a 5GB CSV file.

file_out=data.csv; N=2550000
for i in $(seq 1 $N); do echo -n "$i," >> $file_out;
cat sample_record.csv >> $file_out; done
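
A much faster pure-shell equivalent is to read the template once and let awk emit every prefixed copy in a single process, instead of forking cat per row. This sketch assumes sample_record.csv holds a single one-line record (a tiny stand-in template is generated here just so the snippet runs on its own, and N is kept small; the real run used N=2550000):

```shell
#!/bin/sh
# Stand-in one-line record so the sketch is self-contained.
printf 'a,b,c\n' > sample_record.csv
N=5                     # real runs used N=2550000 (~5GB output)

# Read the record template once, then print N prefixed copies.
awk -v n="$N" 'BEGIN {
    getline rec < "sample_record.csv"
    for (i = 1; i <= n; i++)
        print i "," rec
}' > data.csv
```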

One other thing I should mention is that between each test run, I
cleared the OS page cache, as described here:
https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data
from a previous COPY FROM.
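
Concretely, the cache-clearing step from that page boils down to flushing dirty pages and then asking the kernel to drop the page cache, dentries and inodes (root is required, so the sketch below skips the write when not run as root):

```shell
#!/bin/sh
# Clear the OS page cache between benchmark runs.
if [ "$(id -u)" -eq 0 ]; then
    sync                                # flush dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries, inodes
else
    echo "skipping drop_caches: not running as root"
fi
```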

If your data is somehow significantly different and you want to (and
can) share your script, then I can try it in my environment.

Regards,
Greg
