Re: Parallel copy

From: Alastair Turner <minion(at)decodable(dot)me>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ants Aasma <ants(at)cybertec(dot)at>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-26 15:16:11
Message-ID: CAC0Gmywn0TcsqWVsWMy4wzoqQOSvdxw1W6J30d4e=Ek4a=wJEw@mail.gmail.com
Lists: pgsql-hackers

On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
...
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know the best way to define that. I can think of the
> tests below.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load 1 million rows (the data
> should probably be 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one of which can be unique/primary. Load 1
> million rows (the data should probably be 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one of which can be unique/primary, plus before
> and after triggers. Load 1 million rows (the data should probably be
> 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one of which can be unique/primary. Load
> 1 million rows (the data should probably be 5-10 GB).
>
> Among all these tests, we can check how much time we spend reading
> and parsing the csv files vs. the rest of the execution.

That's a good set of tests of what happens after the parse. Two
simpler test runs may provide useful baselines: no constraints or
indexes with all columns as varchar, and no constraints or indexes
with the columns correctly typed.
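
For timing any of those loads, something like the rough sketch below
should be enough (assuming psycopg2 is available; the connection
string, table name and file path are all placeholders):

import time
import psycopg2

# Time a single COPY ... FROM STDIN load of a CSV file and report the
# wall-clock duration. Connection string, table and file names are
# placeholders.
def time_copy(conninfo, table, csv_path):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        start = time.monotonic()
        with open(csv_path) as f:
            cur.copy_expert(
                "COPY %s FROM STDIN WITH (FORMAT csv)" % table, f)
        conn.commit()
        return time.monotonic() - start

if __name__ == "__main__":
    elapsed = time_copy("dbname=copytest", "copy_target",
                        "base_dataset.csv")
    print("load took %.1f s" % elapsed)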

For testing the impact of various parts of the parse process, my idea would be:
- A base dataset with 10 columns including int, date and text. One
text field quoted and containing both delimiters and line terminators
- A derivative to measure just line parsing - strip the quotes around
the text field and quote the whole row as one text field
- A derivative to measure the impact of quoted fields - clean up the
text field so it doesn't require quoting
- A derivative to measure the impact of row length - run ten rows
together to make 100-column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.
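
Roughly what I have in mind for the generator, as a first cut (the
column mix and value distributions below are just placeholders):

import csv
import datetime
import random

# Rough sketch of the base dataset: 10 columns of ints, a date and some
# text, with the last text field containing both delimiters and line
# terminators so the csv writer has to quote it.
def base_row(i):
    messy = "text with, a comma and a\nline break (%d)" % i
    day = datetime.date(2020, 1, 1) + datetime.timedelta(days=i % 365)
    return [i, random.randint(0, 10**6), random.randint(0, 10**6),
            i % 97, i % 11, day.isoformat(),
            "plain_text_%d" % i, "more_text_%d" % i, i * 3, messy]

def write_base(path, nrows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)  # QUOTE_MINIMAL quotes the messy field
        for i in range(nrows):
            writer.writerow(base_row(i))

# The derivatives would be transformations of the same rows: quoting
# the whole row as one text field, cleaning the messy field so nothing
# needs quoting, and concatenating ten rows into one 100-column row.

if __name__ == "__main__":
    write_base("base_dataset.csv", 1000000)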

--
Alastair
