Re: Parallel copy

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-03-03 06:14:05
Message-ID: CALDaNm0fpzFy2kjqzRfb8aaiD6MCLheoFK1DDvGQ09z4peZDTg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be good, including
> things like narrow integer-only tables, strings with newlines and
> escapes in them, and extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like. I
> grabbed a random open data set from the web, about 800MB of narrow-row
> CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead Command Shared Object Symbol
> # ........  ........  ..................  .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that parsing is a huge chunk of the overhead, with
> the current splitting into lines being the largest portion. Amdahl's
> law says that splitting into tuples needs to be made fast before
> parallelizing the rest makes any sense.
>
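
As a rough Amdahl's law illustration of that point (my own back-of-envelope numbers, taking the ~18% CopyReadLine share from the profile above as the serial fraction s): the best possible speedup with N workers is 1 / (s + (1 - s)/N). With s = 0.18 and N = 8 that gives 1 / (0.18 + 0.82/8) ~= 3.5x, and even with unlimited workers it caps out at 1 / 0.18 ~= 5.6x. The line-splitting step bounds the whole operation unless it is itself made faster or parallelised.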

I took a perf report with the same test data that you used and got the
following results:
.....
+   99.61%     0.00%  postgres  postgres  [.] PortalRun
+   99.61%     0.00%  postgres  postgres  [.] PortalRunMulti
+   99.61%     0.00%  postgres  postgres  [.] PortalRunUtility
+   99.61%     0.00%  postgres  postgres  [.] ProcessUtility
+   99.61%     0.00%  postgres  postgres  [.] standard_ProcessUtility
+   99.61%     0.00%  postgres  postgres  [.] DoCopy
+   99.30%     0.94%  postgres  postgres  [.] CopyFrom
+   51.61%     7.76%  postgres  postgres  [.] NextCopyFrom
+   23.66%     0.01%  postgres  postgres  [.] CopyMultiInsertInfoFlush
+   23.61%     0.28%  postgres  postgres  [.] CopyMultiInsertBufferFlush
+   21.99%     1.02%  postgres  postgres  [.] NextCopyFromRawFields
+   19.79%     0.01%  postgres  postgres  [.] table_multi_insert
+   19.32%     3.00%  postgres  postgres  [.] heap_multi_insert
+   18.27%     2.44%  postgres  postgres  [.] InputFunctionCall
+   15.19%     0.89%  postgres  postgres  [.] CopyReadLine
+   13.05%     0.18%  postgres  postgres  [.] ExecMaterializeSlot
+   13.00%     0.55%  postgres  postgres  [.] tts_buffer_heap_materialize
+   12.31%     1.77%  postgres  postgres  [.] heap_form_tuple
+   10.43%     0.45%  postgres  postgres  [.] int4in
+   10.18%     8.92%  postgres  postgres  [.] CopyReadLineText
......
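
(For reference, a children-inclusive call-graph profile like this can be
captured with something along the lines of "perf record -g -p <backend pid>"
while the COPY runs, followed by "perf report"; treat that as an approximate
recipe rather than the exact invocation.)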

In my results, table_multi_insert accounted for nearly 20% of the execution
time. I also feel that once CopyReadLine has produced a few tuples, the
parallel workers should be able to start consuming and processing them; we
need not wait for tokenisation of the entire input to finish. Once a few
tuples are tokenised, workers can consume them in parallel while tokenisation
continues concurrently. Done that way, the total execution time of a parallel
copy should come down to roughly the CopyReadLine time plus a small delta of
processing time.
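
To make that concrete, below is a minimal standalone sketch of the pipeline
shape I have in mind (plain C with pthreads, purely illustrative: the queue
size, worker count, and the pthreads machinery are my own stand-ins, not
PostgreSQL's shared memory queues or background worker APIs). Compile with
"cc -pthread" and feed a file on stdin:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QUEUE_SIZE 1024
#define NWORKERS 4

/* bounded queue of raw lines shared between leader and workers */
static struct
{
    char       *lines[QUEUE_SIZE];
    int         head, tail, count;
    bool        done;           /* leader finished splitting input */
    pthread_mutex_t lock;
    pthread_cond_t notempty, notfull;
} q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .notempty = PTHREAD_COND_INITIALIZER,
    .notfull = PTHREAD_COND_INITIALIZER,
};

/* leader side: hand off one line as soon as it has been split out */
static void
push_line(char *line)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == QUEUE_SIZE)
        pthread_cond_wait(&q.notfull, &q.lock);
    q.lines[q.tail] = line;
    q.tail = (q.tail + 1) % QUEUE_SIZE;
    q.count++;
    pthread_cond_signal(&q.notempty);
    pthread_mutex_unlock(&q.lock);
}

/* worker side: consume lines concurrently with further line splitting */
static void *
worker(void *arg)
{
    long        nlines = 0;

    for (;;)
    {
        char       *line;

        pthread_mutex_lock(&q.lock);
        while (q.count == 0 && !q.done)
            pthread_cond_wait(&q.notempty, &q.lock);
        if (q.count == 0)       /* queue drained and leader is done */
        {
            pthread_mutex_unlock(&q.lock);
            break;
        }
        line = q.lines[q.head];
        q.head = (q.head + 1) % QUEUE_SIZE;
        q.count--;
        pthread_cond_signal(&q.notfull);
        pthread_mutex_unlock(&q.lock);

        /* stand-in for parsing the fields and inserting the tuple */
        free(line);
        nlines++;
    }
    printf("worker %ld consumed %ld lines\n", (long) arg, nlines);
    return NULL;
}

int
main(void)
{
    pthread_t   workers[NWORKERS];
    char        buf[8192];

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *) i);

    /* the CopyReadLine analogue: split stdin and hand off immediately */
    while (fgets(buf, sizeof(buf), stdin))
        push_line(strdup(buf));

    pthread_mutex_lock(&q.lock);
    q.done = true;
    pthread_cond_broadcast(&q.notempty);
    pthread_mutex_unlock(&q.lock);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}

The point is that worker start-up is gated only on the first few lines being
split out, not on the whole file having been tokenised.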

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
