Re: COPY v. java performance comparison

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY v. java performance comparison
Date: 2014-04-02 22:46:06
Message-ID: 533C932E.3070804@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 04/02/2014 04:36 PM, Jeff Janes wrote:
> On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent(at)gmail(dot)com
> <mailto:robjsargent(at)gmail(dot)com>> wrote:
>
> I'm playing with various data models to compare performance and
> practicalities and not sure if I should be surprised by the
> numbers I'm getting. I hope this report isn't too wishy-washy for
> reasoned comment.
>
> One model says a genotype is defined as follows:
>
> Table "public.oldstyle"
> +-------------+--------------+-----------+
> | Column | Type | Modifiers |
> +-------------+--------------+-----------+
> | id | uuid | not null |
> | sample_name | text | not null |
> | marker_name | text | not null |
> | allele1 | character(1) | |
> | allele2 | character(1) | |
> +-------------+--------------+-----------+
> (0. id is a Primary Key)
> (1. Take what you will from the table name.)
> (2. I hadn't thought of "char" type at this point)
> (3. Ultimately the names would become ids, RI included)
> (4. We're loading 39 samples and ~950K markers)
>
> I loaded 37M+ records using jOOQ (batching every 1000 lines) in
> 12+ hours (800+ records/sec). Then I tried COPY and killed that
> after 11.25 hours when I realised that I had added on non-unque
> index on the name fields after the first load. By that point is
> was on line 28301887, so ~0.75 done which implies it would have
> take ~15hours to complete.
>
> Would the overhead of the index likely explain this decrease in
> throughput?
>
>
> Absolutely.
>
>
> Impatience got the better of me and I killed the second COPY.
> This time it had done 54% of the file in 6.75 hours, extrapolating
> to roughly 12 hours to do the whole thing.
>
>
> Are you sure you actually dropped the indices? (And the primary key?)
>
> I get about 375,000 lines per second with no indexes, triggers,
> constraints.
>
> perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join
> "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c
> 'truncate oldstyle; copy oldstyle from stdin;'
>
> (More if I truncate it in the same transaction as the copy)
>
> If you can't drop the pk constraint, can you at least generate the
> values in sort-order?
>
> Cheers,
>
> Jeff
No I'll leave the pk in at the very least. My example load (37M
records) will not be the last word by any means. That's one experiment,
if you will. My goal is not to see how fast I can get records in,
rather to see what I can expect going forward. Yes, I'm pretty sure I
dropped the index afore the second kick at copy.
I'm about restart after some config changes (doubled the
checkpoint_segments - I have no idea what the value should be.)

Hope you'll stay tuned.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Si Chen 2014-04-02 22:51:11 Re: simple update query stuck
Previous Message Jeff Janes 2014-04-02 22:36:46 Re: COPY v. java performance comparison