Re: Parallel copy

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-29 08:42:50
Message-ID: CAFiTN-vM-X6hU1H2m=zK98pjVv-2tGqQ1ZTiuxFX8P1N24FzPw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object      Symbol
> # ........  ........  .................  .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

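Just to put that Amdahl's law point into concrete numbers (treating the ~18% spent in CopyReadLine in the profile above as the part that stays serial, which is an assumption on my part):

speedup <= 1 / (s + (1 - s)/N),  with s ~= 0.18  =>  a ceiling of about 1/0.18 ~= 5.5x even as N grows

so unless the line splitting itself is made cheaper or parallelized, adding workers alone has a fairly low limit.
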
I ran a very simple case on a table with 2 indexes, and I can see that a lot of time is spent in index insertion. I agree that a good amount of time is spent in tokenizing, but it is not very large compared to the index insertion.

I have expanded the time spent in the CopyFrom function from my perf
report and pasted it here. We can see that a lot of time is spent in
ExecInsertIndexTuples (~77%). I agree that we need to further evaluate
how much of that is I/O vs. CPU, but the point I want to make is that
it is not true for all cases that parsing takes the maximum amount of
time.

- 99.50% CopyFrom
- 82.90% CopyMultiInsertInfoFlush
- 82.85% CopyMultiInsertBufferFlush
+ 77.68% ExecInsertIndexTuples
+ 3.74% table_multi_insert
+ 0.89% ExecClearTuple
- 12.54% NextCopyFrom
- 7.70% NextCopyFromRawFields
- 5.72% CopyReadLine
3.96% CopyReadLineText
+ 1.49% pg_any_to_server
1.86% CopyReadAttributesCSV
+ 3.68% InputFunctionCall
+ 2.11% ExecMaterializeSlot
+ 0.94% MemoryContextReset

My test:
-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from
generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);
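
For completeness, the profile above was captured with perf attached to
the backend running the COPY; roughly something like the following (not
my exact command line, just the general shape):

perf record -e cycles:u -g -p <backend pid>
perf report -g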

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
