Re: COPY v. java performance comparison

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Rob Sargent <robjsargent(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY v. java performance comparison
Date: 2014-04-03 19:00:50
Message-ID: CAMkU=1zGS-qyUsa5-QiOLeF4bWOVdGce8HoxAhh9ZqPryTk3iQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Apr 2, 2014 at 5:11 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:

> On 04/02/2014 04:36 PM, Jeff Janes wrote:
>
> On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent(at)gmail(dot)com>wrote:
>
>> I'm playing with various data models to compare performance and
>> practicalities and not sure if I should be surprised by the numbers I'm
>> getting. I hope this report isn't too wishy-washy for reasoned comment.
>>
>> One model says a genotype is defined as follows:
>>
>> Table "public.oldstyle"
>> +-------------+--------------+-----------+
>> | Column | Type | Modifiers |
>> +-------------+--------------+-----------+
>> | id | uuid | not null |
>> | sample_name | text | not null |
>> | marker_name | text | not null |
>> | allele1 | character(1) | |
>> | allele2 | character(1) | |
>> +-------------+--------------+-----------+
>> (0. id is a Primary Key)
>> (1. Take what you will from the table name.)
>> (2. I hadn't thought of "char" type at this point)
>> (3. Ultimately the names would become ids, RI included)
>> (4. We're loading 39 samples and ~950K markers)
>>
>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours
>> (800+ records/sec). Then I tried COPY and killed that after 11.25 hours
>> when I realised that I had added on non-unque index on the name fields
>> after the first load. By that point is was on line 28301887, so ~0.75 done
>> which implies it would have take ~15hours to complete.
>>
>> Would the overhead of the index likely explain this decrease in
>> throughput?
>>
>
> Absolutely.
>
>
>>
>> Impatience got the better of me and I killed the second COPY. This time
>> it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
>> hours to do the whole thing.
>>
>
> Are you sure you actually dropped the indices? (And the primary key?)
>
> I get about 375,000 lines per second with no indexes, triggers,
> constraints.
>
> perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t",
> "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate
> oldstyle; copy oldstyle from stdin;'
>
> (More if I truncate it in the same transaction as the copy)
>
> If you can't drop the pk constraint, can you at least generate the
> values in sort-order?
>
> Cheers,
>
> Jeff
>
> I restarted the java-based loading. Still with pk in place, (and
> actually generating the UUIDs etc) I got ~1.1M rows in one minute. And one "LOG:
> checkpoints are occurring too frequently (24 seconds apart)" with
> checkpoint_segment now at 6 (v. default 3). In plotting the records v.
> time in that minute at least it's pretty linear. The time per batch is
> relatively constant with a few hiccups. (.5 sec per 10K lines). So I've
> improved things immensely but not entirely sure that the simple config
> change is the reason. If this continued I would be done inside 40 minutes.
>
> With copy interuptus I now get 2.9M records per minute so the load would
> take only 12 or so minutes. I did get four reports of too-frequent
> checkpoints 2 at 15 seconds 2 at 9 seconds.
>
> I'll need to let each on go to completion.
>
> If these numbers are at all accurate and realistic, I'm still impressed
> with jOOQ, though COPY is rightfully back to its proper place as fastest
> way to load.
>

JOOQ is probably hooking into the same API that COPY uses (or if not, it
should be), so it isn't surprising that they perform similarly.

Cheers,

Jeff

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2014-04-03 19:13:19 Re: SSD Drives
Previous Message Jeff Janes 2014-04-03 18:58:06 Re: COPY v. java performance comparison