Re: Benchmark Data requested --- pgloader CE design ideas

From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Benchmark Data requested --- pgloader CE design ideas
Date: 2008-02-06 12:36:51
Message-ID: 200802061336.53363.dfontaine@hi-media.com
Lists: pgsql-performance

On Wednesday 06 February 2008, Simon Riggs wrote:
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several
sections configured: do you want pgloader to serialize the loading of sections
(often there's one section per table, sometimes different sections target the
same table) but parallelise the loading within each section?

I'm thinking we should have a global max_threads knob *and* a per-section
thread limit if we want to go this way, but then multi-threaded sections
will somewhat fight against other sections (multi-threaded or not) for
threads to use.

So I'll also add a parameter to configure how many (max) sections to load in
parallel at any time.

We'll then have (default values presented):
max_threads = 1
max_parallel_sections = 1
section_threads = -1

The section_threads parameter could be overridden at section level but would
need to stay <= max_threads (if not, the value is discarded and a warning is
issued). When section_threads is -1, pgloader tries to use as many threads as
possible, still within the global max_threads limit.
If max_parallel_sections is -1, pgloader starts a new thread for each new
section, maxing out at max_threads, then waits for a thread to finish
before launching the loading of a new section.
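
To make the rule above concrete, here is a minimal Python sketch of how the
effective thread count for one section could be resolved. The function name
and the fallback are assumptions, not pgloader code: in particular, the text
only says an out-of-range value is discarded with a warning, and this sketch
assumes it then falls back to the global section_threads setting.

```python
import warnings


def effective_section_threads(max_threads=1, section_threads=-1, override=None):
    """Resolve how many threads one section may use (hypothetical helper).

    max_threads      global thread limit
    section_threads  global default for sections, -1 meaning "as many as possible"
    override         optional per-section value, same meaning
    """
    def valid(value):
        return value == -1 or 1 <= value <= max_threads

    # Assumption: an invalid global default is treated like -1.
    value = section_threads if valid(section_threads) else -1

    if override is not None:
        if valid(override):
            value = override
        else:
            # Per the text: the setting is discarded and a warning is issued.
            warnings.warn(
                "section_threads=%d not within max_threads=%d, ignored"
                % (override, max_threads)
            )

    # -1 means: take everything the global limit allows.
    return max_threads if value == -1 else value
```

With the defaults listed above (max_threads = 1, section_threads = -1) this
resolves to a single thread per section, which matches today's behaviour.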

If you have N max_threads and max_parallel_sections = section_threads = -1,
then we'll see some kind of a fight between new section threads and
in-section threads (the parallel non-routing COPY behaviour). But then it's
the user's choice.
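
One way to picture that fight is a single pool of max_threads slots that
every worker, whatever its section, has to take before doing any work. The
sketch below is only an illustration under that assumption: the names are
made up, time.sleep stands in for a COPY, and it starts more OS threads than
max_threads while only letting max_threads of them work at once.

```python
import threading
import time


def run_with_global_cap(max_threads, sections):
    """Run every section's workers under one global cap; return peak concurrency.

    sections maps a section name to the number of workers it would like.
    """
    slots = threading.BoundedSemaphore(max_threads)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def worker(section, index):
        with slots:  # blocks until one of the max_threads slots is free
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.01)  # stand-in for the actual COPY work
            with lock:
                state["running"] -= 1

    threads = [
        threading.Thread(target=worker, args=(name, i))
        for name, wanted in sections.items()
        for i in range(wanted)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

Which section gets the next free slot is left to the scheduler here, which
is exactly the "fight" described above.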

Adding Constraint_Exclusion support on top of that would not mess it up, but
it will only be of interest when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.

Quick tests showed me that Python's threading model makes it easy to share
objects between several threads, so I don't think I'll need to adjust my
reject code when going per-section multi-threaded. I'll just have to use a
semaphore object so that rows keep being rejected one line at a time. Not
that complex, if it proves reliable.
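
As a rough illustration of the semaphore idea, here is a sketch of a reject
object shared by several loader threads. The class, the in-memory buffers and
the "two fields separated by ;" validity check are all assumptions made for
the example, not the existing reject code.

```python
import io
import threading


class RejectLog:
    """Shared reject store; a semaphore lets one thread write at a time."""

    def __init__(self):
        self.data = io.StringIO()  # rejected input lines
        self.log = io.StringIO()   # matching error messages
        self._sem = threading.Semaphore(1)

    def reject(self, line, reason):
        # Both writes happen under the semaphore, so line N of the data
        # buffer always matches line N of the log buffer.
        with self._sem:
            self.data.write(line + "\n")
            self.log.write(reason + "\n")


def load_chunk(rows, rejects):
    for row in rows:
        if row.count(";") != 1:  # hypothetical check: exactly two fields
            rejects.reject(row, "bad field count: %r" % row)


def load_in_threads(chunks):
    rejects = RejectLog()
    threads = [
        threading.Thread(target=load_chunk, args=(chunk, rejects))
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rejects
```

The order of rejected lines across threads is not defined, but each rejected
line stays paired with its own error message.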

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard
to implement (thanks to Python), that sounds like a good enough reason for me
to provide it :)
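
For the record, the non-routing version could look roughly like the sketch
below: rows are handed out round-robin to n worker threads, each of which
would feed its own COPY. Everything here is hypothetical, including the
copy_batch callback that stands in for the COPY itself; it also assumes no
input row is None, since None is used as the end-of-data marker.

```python
import queue
import threading


def load_round_robin(rows, n, copy_batch):
    """Distribute rows round-robin to n workers.

    copy_batch(worker_index, batch) is a stand-in for one COPY statement.
    """
    queues = [queue.Queue() for _ in range(n)]

    def worker(index):
        batch = []
        while True:
            row = queues[index].get()
            if row is None:  # sentinel: no more rows for this worker
                break
            batch.append(row)
        copy_batch(index, batch)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()

    # Row k goes to worker k modulo n, with no look at its content.
    for k, row in enumerate(rows):
        queues[k % n].put(row)
    for q in queues:
        q.put(None)

    for t in threads:
        t.join()
```

A real version would stream rows into COPY instead of collecting a full
batch first; the batch only keeps the example short.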

I'll keep you (and the list) informed as soon as I have some code to play
with.
--
dim
