Re: VLDB Features

From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: VLDB Features
Date: 2007-12-12 10:19:02
Message-ID: 200712121119.05192.dfontaine@hi-media.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Le mercredi 12 décembre 2007, Josh Berkus a écrit :
> > I'm curious what you feel is missing that pgloader doesn't fill that
> > requirement: http://pgfoundry.org/projects/pgloader/
>
> Because pgloader is implemented in middleware, it carries a very high
> overhead if you have bad rows. As little as 1% bad rows will slow down
> loading by 20% due to retries.

Not that much, in fact, I'd say.
pgloader allows its user to configure how large a COPY buffer to use (global
parameter as of now, could easily be a per-section configuration knob, just
didn't see any need for this yet).
It's the 'copy_every' parameter as seen on the man page here:
http://pgloader.projects.postgresql.org/#toc4

pgloader will obviously prepare a in-memory buffer of copy_every tuples to
give to COPY, and in case of error will cut it and retry. Classic dichotomy
approach, from initial implementation by Jan Wieck.

So you can easily balance the error recovery costs against the COPY bulk size.

Note also that the overall loading time with pgloader is not scaling the same
as the COPY buffer size, the optimal choice depends on the dataset --- and
the data massaging pgloader has to make on it ---, and I've experienced best
results with 10000 and 15000 tuples buffers so far.

FYI, now the pgloader topic is on the table, the next items I think I'm gonna
develop for it are configurable behavior on errors tuples (load to another
table when pk error, e.g.), and some limited ddl-partioning support.

I'm playing with the idea for pgloader to be able to read some partitioning
schemes (parsing CHECK constraint on inherited tables) and load directly into
the right partitions.
That would of course be done only when configured this way, and if constraints
are misread it would only result in a lot more rejected rows than expected,
and you still can retry using your insert trigger instead of pgloader buggy
smartness.

Comments welcome, regards,
--
dim

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2007-12-12 11:02:42 little correction
Previous Message Magnus Hagander 2007-12-12 09:55:49 Re: [hensa22@yahoo.es: Re: [pgsql-es-ayuda] SLL error 100% cpu]