Re: Benchmark Data requested --- pgloader CE design ideas

From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Benchmark Data requested --- pgloader CE design ideas
Date: 2008-02-07 09:31:47
Message-ID: 200802071031.50121.dfontaine@hi-media.com
Lists: pgsql-performance

On Thursday 07 February 2008, Greg Smith wrote:
> On Wednesday 06 February 2008, Dimitri Fontaine wrote:
>> In other cases, a logical line is a physical line, so we start after first
>> newline met from given lseek start position, and continue reading after the
>> last lseek position until a newline.
>
> Now you're talking. Find a couple of split points that way, fine-tune the
> boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller
thread: just decide the split points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and let each reading thread fine-tune its own
boundaries by starting to process input only after the first newline it reads,
etc.
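
To make the idea concrete, here is a minimal Python sketch of byte-range
splitting where each reader adjusts its own boundaries. It assumes one logical
line per physical line; split_points and read_chunk are hypothetical names,
not existing pgloader functions.

```python
import os

def split_points(size, n):
    """Return n (start, end) byte ranges that together cover [0, size)."""
    return [(i * size // n, (i + 1) * size // n) for i in range(n)]

def read_chunk(path, start, end):
    """Yield every line whose first byte lies in [start, end).

    A reader that does not start at offset 0 seeks one byte back and
    discards up to the next newline, so it never emits a partial line.
    The last line it owns may extend past `end`; it is read in full.
    """
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # skip the tail of the previous line
        pos = f.tell()
        while pos < end:
            line = f.readline()
            if not line:          # end of file
                break
            yield line
            pos = f.tell()
```

The controller only needs os.path.getsize(path) to compute the ranges; no
file content is read there. Seeking to start - 1 rather than start means a
line that begins exactly on a split point is not lost.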

And while we're still at the design board, I'm also thinking of adding a
per-section parameter (with a global default value possible),
split_file_reading, which defaults to False and which you'll have to set to
True for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have to
manage several processing threads per section but only one file reading
thread, handing the input it reads to the processing threads in a round-robin
fashion. In the future the choice of processing thread could (another knob) be
smarter than that, as soon as we get CE support into pgloader.
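
As an illustration only, the single-reader case could look like the sketch
below: one loop plays the file reading thread and deals lines out to N worker
queues in turn. The function name round_robin_load and the process callable
are assumptions made for the example, not pgloader code.

```python
import itertools
import queue
import threading

def round_robin_load(lines, n_workers, process):
    """Deal lines out to n_workers processing threads in round-robin order.

    Returns one list of processed results per worker.
    """
    queues = [queue.Queue() for _ in range(n_workers)]
    results = [[] for _ in range(n_workers)]

    def worker(idx):
        while True:
            line = queues[idx].get()
            if line is None:      # sentinel: no more input for this worker
                break
            results[idx].append(process(line))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()

    # The single "file reading" loop: line k goes to worker k % n_workers.
    for line, q in zip(lines, itertools.cycle(queues)):
        q.put(line)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return results
```

A smarter dispatch policy would only need to replace the itertools.cycle
part with a function that picks the target queue from the line content.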

When split_file_reading = True and section_threads != 1, pgloader will have to
manage several processing threads per section, each one responsible for
reading its own part of the file, with processing boundaries discovered at
reading time. Adding CE support in this case means managing two separate
thread pools per section, one responsible for split file reading and another
responsible for data buffering and routing (COPY to the partition instead of
to the parent table).
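
The buffering and routing half could be sketched as below, assuming CE here
means constraint exclusion and that some function can tell which partition a
row belongs to. partition_for and flush are hypothetical callables: in a real
loader the first would be derived from the partition constraints and the
second would issue a COPY to the given partition.

```python
from collections import defaultdict

def route_rows(rows, partition_for, buffer_size, flush):
    """Buffer rows per partition and flush each buffer when it is full."""
    buffers = defaultdict(list)
    for row in rows:
        part = partition_for(row)
        buffers[part].append(row)
        if len(buffers[part]) >= buffer_size:
            # Hand the full buffer over and start a fresh one.
            flush(part, buffers.pop(part))
    # Flush whatever is left once the input is exhausted.
    for part, buf in buffers.items():
        if buf:
            flush(part, buf)
```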

In both cases, maybe pgloader would also need a separate thread for COPYing
the buffer to the server, allowing it to continue preparing the next buffer in
the meantime?
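
A rough sketch of that continue-reading-while-copying idea, using a queue of
size one so that at most one full buffer waits while the next one is being
filled. copy_to_server is a hypothetical callable standing in for the real
COPY; nothing here is existing pgloader code.

```python
import queue
import threading

def load_with_copy_thread(rows, buffer_size, copy_to_server):
    """Fill buffers in the calling thread while another thread COPYs them."""
    pending = queue.Queue(maxsize=1)
    errors = []

    def copier():
        while True:
            buf = pending.get()
            if buf is None:       # sentinel: producer is done
                break
            if errors:            # after a failure, just drain the queue
                continue
            try:
                copy_to_server(buf)
            except Exception as exc:
                errors.append(exc)

    t = threading.Thread(target=copier)
    t.start()

    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= buffer_size:
            pending.put(buf)      # blocks only if a buffer is already waiting
            buf = []
    if buf:
        pending.put(buf)
    pending.put(None)
    t.join()
    if errors:
        raise errors[0]
```

Whether this buys anything depends on how much time is spent preparing a
buffer compared to sending it, which is exactly what still has to be measured.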

This will need some re-architecting of pgloader, but it seems worth it (I'm
not entirely sold on the two thread pools idea, though, and this last
continue-reading-while-copying idea still has to be examined).
Some of the work that needs to be done is by now quite clear to me, but part
of it still needs its share of design time. As usual though, the really hard
part is knowing exactly what we want to get done, and we're showing good
progress here :)

Greg's behavior:
max_threads = N
max_parallel_sections = 1
section_threads = -1
split_file_reading = True

Simon's behaviour:
max_threads = N
max_parallel_sections = 1 # I don't think Simon wants parallel sections
section_threads = -1
split_file_reading = False

Comments?
--
dim
