Re: Load experimentation

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>, Ben Brehmer <benbrehmer(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: Thom Brown <thombrown(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "craig_james(at)emolecules(dot)com" <craig_james(at)emolecules(dot)com>, "kbuckham(at)applocation(dot)net" <kbuckham(at)applocation(dot)net>, "scott(dot)lists(at)enterprisedb(dot)com" <scott(dot)lists(at)enterprisedb(dot)com>
Subject: Re: Load experimentation
Date: 2009-12-11 02:37:16
Message-ID: C746F05C.1B095%scott@richrelevance.com
Lists: pgsql-performance

On 12/10/09 3:29 PM, "Scott Carey" <scott(at)richrelevance(dot)com> wrote:

> On 12/7/09 11:12 AM, "Ben Brehmer" <benbrehmer(at)gmail(dot)com> wrote:
>
>> Thanks for the quick responses. I will respond to all questions in one email:
>>
>> COPY command: Unfortunately I'm stuck with INSERTs due to the way this
>> data was generated (Hadoop/MapReduce).
>
> If you have control over the MapReduce output, you can have that output
> result files in a format that COPY likes.
>
> If you don't have any control over that, it's more complicated. I use a
> final-pass, map-only Hadoop job to go over the output and insert into
> postgres directly from the job, using the
>
> INSERT INTO <table> VALUES (val1, val2, ...), (val1, val2, ...), ...
>
> insert style from Java, with about 80 rows per INSERT statement and a single
> transaction for about a thousand of these. This was faster than batch
> inserts.
>
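A minimal sketch of the multi-row INSERT batching described above (in Python for brevity; the original job was Java, and the function names are illustrative, not from any real library):

```python
def build_multirow_inserts(table, rows, rows_per_stmt=80):
    """Group rows into multi-row INSERT statements, ~80 rows each.

    `rows` is a list of already-escaped value tuples; real code must
    use proper quoting or parameter binding, not raw string pasting.
    """
    stmts = []
    for i in range(0, len(rows), rows_per_stmt):
        chunk = rows[i:i + rows_per_stmt]
        values = ", ".join(
            "(" + ", ".join(str(v) for v in row) + ")" for row in chunk
        )
        stmts.append(f"INSERT INTO {table} VALUES {values};")
    return stmts

def transaction_batches(stmts, stmts_per_txn=1000):
    """Wrap roughly a thousand INSERT statements in one transaction."""
    for i in range(0, len(stmts), stmts_per_txn):
        yield ["BEGIN;"] + stmts[i:i + stmts_per_txn] + ["COMMIT;"]
```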

I should mention that the approach I described above comes with an important
caveat: each of these individual tasks might run twice in Hadoop (only one
will finish -- speculative execution, plus retry on error). To deal with
this, run each task inside a single transaction so that a failure rolls
back, and you will likely want to turn off speculative execution.
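With the old mapred API of that era, speculative execution can be disabled in the job configuration; a sketch (property names from the pre-0.21 mapred namespace):

```xml
<!-- Disable speculative execution so each load task runs at most once
     (barring failure/retry, which the per-task transaction handles). -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```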

Another option is to run a map-only job (no reduce) for this sort of work,
in order to ensure duplicate data is not inserted. We first insert into a
temp table named uniquely per chunk (sometimes in parallel). Then, while
holding a postgres advisory lock, we do an INSERT INTO <destination>
SELECT * FROM <temp> type operation, which is fast.
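The per-chunk staging flow can be sketched as the statement sequence below (Python string-building only; the lock key, table names, and helper are illustrative, not from the original job):

```python
def staging_flush_sql(chunk_id, dest_table, lock_key=42):
    """SQL sequence for one chunk: load into a uniquely named temp
    table (chunks can do this in parallel), then serialize the final
    append with a postgres advisory lock so only one flush runs at
    a time."""
    tmp = f"load_tmp_{chunk_id}"  # unique per chunk
    return [
        f"CREATE TEMP TABLE {tmp} (LIKE {dest_table});",
        # ... COPY or multi-row INSERT the chunk's rows into tmp here ...
        f"SELECT pg_advisory_lock({lock_key});",
        f"INSERT INTO {dest_table} SELECT * FROM {tmp};",
        f"SELECT pg_advisory_unlock({lock_key});",
        f"DROP TABLE {tmp};",
    ]
```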

>
>>
>> On 07/12/2009 10:39 AM, Thom Brown wrote:
>>>
>>> 2009/12/7 Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
>>>
>>>>
>>>> Ben Brehmer <benbrehmer(at)gmail(dot)com> wrote:
>>>>
>>>>> -7.5 GB memory
>>>>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>>>> each)
>>>>> -64-bit platform
>>>>
>>>>
>>>> What OS?
>>>>
>>>>> (PostgreSQL 8.1.3)
>>>>
>>>> Why use such an antiquated, buggy version? Newer versions are
>>>> faster.
>>>>
>>>> -Kevin
>>>>
>>>
>>> I'd agree with trying to use the latest version you can.
>>>
>>> How are you loading this data? I'd make sure you haven't got any indices,
>>> primary keys, triggers or constraints on your tables before you begin the
>>> initial load; just add them after. Also use either the COPY command for
>>> loading, or prepared statements. Individual INSERT commands will just take
>>> way too long.
>>>
>>>
>>> Regards
>>>
>>> Thom
>>
>
