Re: Benchmark Data requested

From: NikhilS <nikkhils(at)gmail(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Benchmark Data requested
Date: 2008-02-06 07:38:34
Message-ID: d3c4af540802052338s7bd3649tafe1b53d3894b4e9@mail.gmail.com
Lists: pgsql-performance

Hi,

On Feb 6, 2008 9:05 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> On Tue, 5 Feb 2008, Simon Riggs wrote:
>
> > On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
> >>
> >> Even if it is a single core, the mere fact that the loading process
> will
> >> eventually wait for a read from the input file which cannot be
> >> non-blocking, the OS can timeslice it well for the second process to
> use
> >> those wait times for the index population work.
> >
> > If Dimitri is working on parallel load, why bother?
>
> pgloader is a great tool for a lot of things, particularly if there's any
> chance that some of your rows will get rejected. But the way things pass
> through the Python/psycopg layer made it uncompetitive (more than 50%
> slowdown) against the straight COPY path from a rows/second perspective
> the last time (V2.1.0?) I did what I thought was a fair test of it (usual
> caveat of "with the type of data I was loading"). Maybe there's been some
> gigantic improvement since then, but it's hard to beat COPY when you've
> got an API layer or two in the middle.
>

I think it's time we jazzed COPY up a bit to include all the discussed
functionality. Heikki's batch-indexing idea is pretty useful too. Another
thing that pg_bulkload does is load tuples directly into the relation: it
constructs the tuples and writes them straight to the physical file backing
the relation, bypassing the engine completely (of course, the limitations
that arise from this include no support for rules, triggers, constraints,
default expression evaluation, etc.). ISTM we could optimize the COPY code
to attempt direct loading too (not necessarily as done by pg_bulkload) to
speed it up further in certain cases.
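To make the fast path concrete: the speed of COPY comes partly from its
compact text wire format. A minimal sketch of serializing client-side rows
into that format (a simplified subset; the helper name and the client-side
approach are illustrative, not pg_bulkload's or pgloader's actual code):

```python
import io

def rows_to_copy_text(rows):
    """Serialize rows into COPY's text format: tab-separated columns,
    backslash-escaped, with \\N standing in for NULL (simplified subset)."""
    buf = io.StringIO()
    for row in rows:
        cols = []
        for val in row:
            if val is None:
                cols.append(r"\N")
            else:
                s = str(val)
                # Escape the characters COPY's text format treats specially.
                s = (s.replace("\\", "\\\\")
                      .replace("\t", "\\t")
                      .replace("\n", "\\n")
                      .replace("\r", "\\r"))
                cols.append(s)
        buf.write("\t".join(cols) + "\n")
    buf.seek(0)
    return buf
```

With a live connection, such a buffer could then be streamed in one shot
(e.g. psycopg2's cur.copy_from(rows_to_copy_text(rows), "target_table")),
keeping the data on the COPY path instead of going through row-by-row
INSERTs.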

Another thing we should add to COPY is the ability to continue a data load
across errors, as was discussed on -hackers some time back.
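One way to continue across errors, roughly what pgloader does today from the
client side, is to try the whole batch and bisect it on failure, so the good
rows still go in and the bad rows are isolated. A minimal sketch of that
idea (the loader callable is a hypothetical stand-in for an actual COPY; in
a real client each attempt would run in its own subtransaction or after a
savepoint so a failure does not abort the whole load):

```python
def load_with_retry(rows, loader):
    """Load rows via loader(batch); on failure, bisect the batch so the
    good rows are still loaded and the bad ones are collected and returned."""
    rejected = []

    def attempt(batch):
        if not batch:
            return
        try:
            loader(batch)  # e.g. one COPY per batch in a real client
        except Exception:
            if len(batch) == 1:
                rejected.append(batch[0])  # isolated the bad row
            else:
                mid = len(batch) // 2
                attempt(batch[:mid])
                attempt(batch[mid:])

    attempt(list(rows))
    return rejected
```

Doing this server-side inside COPY would avoid re-sending the good half of
each failed batch, which is part of why it keeps coming up on -hackers.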

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com
