From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Rob Sargent <robjsargent(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY v. java performance comparison
Date: 2014-04-03 19:00:50
Message-ID: CAMkU=1zGS-qyUsa5-QiOLeF4bWOVdGce8HoxAhh9ZqPryTk3iQ@mail.gmail.com
Lists: pgsql-general
On Wed, Apr 2, 2014 at 5:11 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
> On 04/02/2014 04:36 PM, Jeff Janes wrote:
>
> On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent(at)gmail(dot)com>wrote:
>
>> I'm playing with various data models to compare performance and
>> practicalities and not sure if I should be surprised by the numbers I'm
>> getting. I hope this report isn't too wishy-washy for reasoned comment.
>>
>> One model says a genotype is defined as follows:
>>
>> Table "public.oldstyle"
>> +-------------+--------------+-----------+
>> | Column | Type | Modifiers |
>> +-------------+--------------+-----------+
>> | id | uuid | not null |
>> | sample_name | text | not null |
>> | marker_name | text | not null |
>> | allele1 | character(1) | |
>> | allele2 | character(1) | |
>> +-------------+--------------+-----------+
>> (0. id is a Primary Key)
>> (1. Take what you will from the table name.)
>> (2. I hadn't thought of "char" type at this point)
>> (3. Ultimately the names would become ids, RI included)
>> (4. We're loading 39 samples and ~950K markers)
>>
>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours
>> (800+ records/sec). Then I tried COPY and killed that after 11.25 hours
>> when I realised that I had added a non-unique index on the name fields
>> after the first load. By that point it was on line 28301887, so ~75% done,
>> which implies it would have taken ~15 hours to complete.
>>
>> Would the overhead of the index likely explain this decrease in
>> throughput?
>>
>
> Absolutely.
>
>
>>
>> Impatience got the better of me and I killed the second COPY. This time
>> it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
>> hours to do the whole thing.
>>
>
> Are you sure you actually dropped the indices? (And the primary key?)
>
> I get about 375,000 lines per second with no indexes, triggers,
> constraints.
>
> perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t",
> "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate
> oldstyle; copy oldstyle from stdin;'
>
> (More if I truncate it in the same transaction as the copy)
>
> If you can't drop the pk constraint, can you at least generate the
> values in sort-order?
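
A minimal Python sketch of that sorted-order generation, mirroring the perl one-liner above (the UUID prefix and column values are the same illustrative placeholders, not real data): because only the zero-padded final group changes, the UUIDs sort ascending, so primary-key inserts always land on the rightmost b-tree leaf.

```python
def sorted_rows(n):
    """Yield tab-separated COPY lines whose UUID primary keys are
    monotonically increasing (same shape as the perl one-liner)."""
    for i in range(1, n + 1):
        # 12-digit zero-padded counter as the last UUID group
        suffix = format(i, "012d")
        uuid = f"a0eebc99-9c0b-4ef8-bb6d-{suffix}"
        # id, sample_name, marker_name, allele1, allele2
        yield "\t".join([uuid, str(i), str(i), "A", "T"])

for line in sorted_rows(3):
    print(line)
```

Piped to `psql -c 'copy oldstyle from stdin;'` this feeds COPY the same way the perl version does, but with the rows pre-sorted on the key.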
>
> Cheers,
>
> Jeff
>
> I restarted the java-based loading. Still with pk in place, (and
> actually generating the UUIDs etc) I got ~1.1M rows in one minute. And one "LOG:
> checkpoints are occurring too frequently (24 seconds apart)" with
> checkpoint_segments now at 6 (v. default 3). In plotting the records v.
> time in that minute at least it's pretty linear. The time per batch is
> relatively constant with a few hiccups. (.5 sec per 10K lines). So I've
> improved things immensely but not entirely sure that the simple config
> change is the reason. If this continued I would be done inside 40 minutes.
>
> With copy interruptus I now get 2.9M records per minute, so the load would
> take only 12 or so minutes. I did get four reports of too-frequent
> checkpoints: two at 15 seconds apart and two at 9 seconds apart.
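
For reference, the checkpoint spacing complained about above is governed by settings like these (pre-9.5 syntax, matching the era of this thread; the values are illustrative for a bulk load, not taken from the thread):

```
# postgresql.conf -- illustrative bulk-load settings, PostgreSQL 9.x
checkpoint_segments = 32           # default 3; more WAL between checkpoints
checkpoint_completion_target = 0.9 # spread checkpoint I/O over the interval
checkpoint_warning = 30s           # triggers the "occurring too frequently" LOG
```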
>
> I'll need to let each one go to completion.
>
> If these numbers are at all accurate and realistic, I'm still impressed
> with jOOQ, though COPY is rightfully back to its proper place as fastest
> way to load.
>
jOOQ is probably hooking into the same API that COPY uses (or, if not, it
should be), so it isn't surprising that they perform similarly.
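
For comparison, the same COPY protocol is reachable from client drivers in other languages too; a minimal Python sketch with psycopg2 (connection string, table, and row values are assumptions for illustration):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize (id, sample_name, marker_name, allele1, allele2) tuples
    into COPY's tab-separated text format in an in-memory buffer."""
    buf = io.StringIO()
    for r in rows:
        buf.write("\t".join(str(c) for c in r) + "\n")
    buf.seek(0)
    return buf

buf = rows_to_copy_buffer([
    ("a0eebc99-9c0b-4ef8-bb6d-000000000001", "s1", "m1", "A", "T"),
])

# With a live connection, the buffer streams through the COPY protocol:
#   import psycopg2
#   conn = psycopg2.connect("dbname=test")   # assumed DSN
#   with conn.cursor() as cur:
#       cur.copy_from(buf, "oldstyle",
#                     columns=("id", "sample_name", "marker_name",
#                              "allele1", "allele2"))
#   conn.commit()
```

The point is the same as for jOOQ: batched INSERTs go row-by-row through the normal executor, while `copy_from` hands the whole stream to COPY, which is why the throughput gap closes once a driver uses it.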
Cheers,
Jeff