Re: Benchmark Data requested

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Benchmark Data requested
Date: 2008-02-05 14:24:55
Message-ID: 1202221496.4252.680.camel@ebony.site
Lists: pgsql-performance

On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
> Hi,
>
> On Monday 4 February 2008, Jignesh K. Shah wrote:
> > Single stream loader of PostgreSQL takes hours to load data. (Single
> > stream load... wasting all the extra cores out there)
>
> I wanted to work on this at the pgloader level, so the CVS version of pgloader
> is now able to load data in parallel, with a Python thread per configured
> section (1 section = 1 data file = 1 table is often the case).
> Not configurable at the moment, but I plan on providing a "threads" knob which
> will default to 1, and could be -1 for "as many threads as sections".

That sounds great. I was just thinking of asking for that :-)
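
For readers of the archive, the proposed "threads" knob can be sketched in
Python as below. This is a minimal sketch and not pgloader's actual code:
resolve_workers, load_sections and the load_one callback are hypothetical
names, and capping the worker count at the number of sections is an assumption
the thread does not state.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(threads, n_sections):
    """Map the proposed "threads" knob to a worker count.

    1 (the default) keeps loading sequential; -1 means one thread per section.
    Capping at the number of sections is an assumption of this sketch.
    """
    if threads == -1:
        return max(1, n_sections)
    if threads < 1:
        raise ValueError("threads must be -1 or a positive integer")
    return min(threads, max(1, n_sections))


def load_sections(sections, load_one, threads=1):
    """Call load_one(section) for every section with up to `threads` workers.

    load_one is a hypothetical callback standing in for "COPY one data file
    into one table". Returns a dict mapping each section to its result.
    """
    sections = list(sections)
    workers = resolve_workers(threads, len(sections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with sections.
        results = list(pool.map(load_one, sections))
    return dict(zip(sections, results))
```

Calling load_sections(["customer", "orders", "lineitem"], load_one, threads=-1)
would run the three loads concurrently, one thread each.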

I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.

> > Multiple table loads (1 per table) spawned via script are a bit better
> > but hit WAL problems.
>
> pgloader will hit the WAL problem too, but it may still have its benefits, or
> at least we will soon (you can already if you take it from CVS) be able to
> measure whether parallel loading at the client side is a good idea perf-wise.

Should be able to reduce lock contention, but not overall WAL volume.

> [...]
> > I have not even started Partitioning of tables yet since with the
> > current framework, you have to load the data separately into each
> > table, which means for the TPC-H data you need "extra-logic" to take
> > that table data and split it into each partition child table. Not stuff
> > that many people want to do by hand.
>
> I'm planning to add ddl-partitioning support to pgloader:
> http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php
>
> The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion,
> pg_inherits and pg_constraint. If pgloader recognizes both the CHECK
> expression and the datatypes involved, and if we can implement the CHECK in
> Python without having to resort to querying PostgreSQL, then we can run a
> thread per partition, with as many COPY FROM running in parallel as there are
> partitions involved (when threads = -1).
>
> I'm not sure this will be quicker than relying on PostgreSQL triggers or rules
> as used for partitioning currently, but ISTM the paragraph quoted from Jignesh
> is about just that.

Much better than triggers and rules, but it will be hard to get it to
work.
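
To make the client-side idea concrete, here is a minimal Python sketch of
routing rows to child tables by evaluating a range CHECK on the client. It
assumes each child table's constraint has already been recognized as
CHECK (col >= lower AND col < upper); the RangePartition class, route_rows
and the table names in the usage note are hypothetical, and parsing the real
expressions out of pg_constraint (the hard part) is not shown.

```python
from collections import defaultdict
from datetime import date


class RangePartition:
    """Hypothetical Python form of CHECK (col >= lower AND col < upper)."""

    def __init__(self, table, lower, upper):
        self.table = table
        self.lower = lower
        self.upper = upper

    def accepts(self, value):
        # Half-open range, mirroring the assumed CHECK expression.
        return self.lower <= value < self.upper


def route_rows(rows, key_index, partitions):
    """Group rows by the child table whose CHECK they satisfy.

    Each resulting bucket could then be fed to its own COPY FROM, one
    thread per partition. Rows that match no partition raise ValueError.
    """
    buckets = defaultdict(list)
    for row in rows:
        value = row[key_index]
        for part in partitions:
            if part.accepts(value):
                buckets[part.table].append(row)
                break
        else:
            raise ValueError("no partition accepts %r" % (value,))
    return dict(buckets)
```

For example, with partitions orders_1992 and orders_1993 bounded at
date(1993, 1, 1), a row dated 1993-01-01 lands in orders_1993 because the
upper bound is exclusive.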

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
