Re: Benchmark Data requested --- pgloader CE design ideas

From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Benchmark Data requested --- pgloader CE design ideas
Date: 2008-02-07 09:31:47
Message-ID: 200802071031.50121.dfontaine@hi-media.com
Lists: pgsql-performance
On Thursday 07 February 2008, Greg Smith wrote:
>On Wednesday 06 February 2008, Dimitri Fontaine wrote:
>> In other cases, a logical line is a physical line, so we start after first
>> newline met from given lseek start position, and continue reading after the
>> last lseek position until a newline.
>
> Now you're talking.  Find a couple of split points that way, fine-tune the
> boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller 
thread: just decide the split points in bytes (0..ST_SIZE/4, 
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and let each reading thread fine-tune them by 
starting to process input only after it has read its first newline, and so on.
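
To make the idea concrete, here is a minimal sketch of byte-based splitting 
where only the readers look at the data. The function names (split_points, 
read_chunk, read_split) are made up for illustration and are not existing 
pgloader code. The sketch also assumes half-open byte ranges and that a 
logical line is a physical line.

```python
import os


def split_points(size, n):
    """Return n half-open byte ranges [start, end) covering 0..size."""
    return [(i * size // n, (i + 1) * size // n) for i in range(n)]


def read_chunk(path, start, end):
    """Return the complete lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Back up one byte: if start is exactly the beginning of a line,
            # this readline() consumes only the preceding newline; otherwise
            # it skips the partial line, which the previous reader owns.
            f.seek(start - 1)
            f.readline()
        # A line that begins before `end` is read to its newline, even if
        # that newline sits past `end`.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return lines


def read_split(path, n):
    """Sequential stand-in for n reader threads, one per byte range."""
    size = os.path.getsize(path)
    return [read_chunk(path, s, e) for s, e in split_points(size, n)]
```

With this convention every line is read by exactly one reader, and the 
controller never has to open the file, only stat it.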

And while we're still at the drawing board, I'm also thinking of adding a 
per-section parameter (with a possible global default value), 
split_file_reading, which defaults to False and which you'll have to set to 
True for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have 
to manage several processing threads per section but only one file reading 
thread, handing the input it reads to the processing threads in a round-robin 
fashion. In the future the choice of processing thread could become smarter 
than that (another knob), as soon as we get CE support into pgloader.
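
A rough sketch of that single-reader, round-robin case, using only the 
standard library. The names process and load_round_robin are hypothetical, 
and the worker body is a stand-in for the real per-line processing.

```python
import itertools
import queue
import threading


def process(q, out):
    """Hypothetical processing thread: drain the queue until None arrives."""
    while True:
        line = q.get()
        if line is None:
            break
        out.append(line)  # stand-in for parsing and buffering for COPY


def load_round_robin(lines, n_workers):
    """One reader (the caller) feeds n_workers threads in round-robin order."""
    queues = [queue.Queue(maxsize=1000) for _ in range(n_workers)]
    results = [[] for _ in range(n_workers)]
    workers = [
        threading.Thread(target=process, args=(q, r))
        for q, r in zip(queues, results)
    ]
    for w in workers:
        w.start()
    # The reader hands each line to the next queue in turn.
    for q, line in zip(itertools.cycle(queues), lines):
        q.put(line)
    for q in queues:
        q.put(None)  # tell each worker there is no more input
    for w in workers:
        w.join()
    return results
```

For example, load_round_robin(["a", "b", "c", "d", "e"], 2) gives the first 
worker "a", "c", "e" and the second "b", "d".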

When split_file_reading = True and section_threads != 1, pgloader will have to 
manage several processing threads per section, each one responsible for 
reading its own part of the file, with processing boundaries discovered at 
reading time. Adding CE support in this case means managing two separate 
thread pools per section, one responsible for split file reading and another 
responsible for data buffering and routing (COPY to the partition instead of 
to the parent table).
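
The routing half could look something like the sketch below. Everything here 
is an assumption made for illustration: the partition list, the table names 
and the idea that partitions are ranges on a single key. Nothing of the kind 
exists in pgloader yet.

```python
import bisect
from collections import defaultdict

# Hypothetical range partitions: (exclusive upper bound on the key, child
# table). The first partition has no lower bound in this sketch.
PARTITIONS = [(100, "t_p0"), (200, "t_p1"), (300, "t_p2")]
BOUNDS = [bound for bound, _ in PARTITIONS]


def route(rows, key=lambda row: row[0], parent="t"):
    """Group rows into one buffer per target table, ready to be COPYed."""
    buffers = defaultdict(list)
    for row in rows:
        i = bisect.bisect_right(BOUNDS, key(row))
        # Rows beyond the last bound fall back to the parent table.
        table = PARTITIONS[i][1] if i < len(PARTITIONS) else parent
        buffers[table].append(row)
    return dict(buffers)
```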

In both cases, maybe pgloader would also need a separate thread for COPYing 
the buffer to the server, allowing it to continue preparing the next buffer 
in the meantime?
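
Something along these lines, as a sketch only: a bounded queue sits between 
the thread filling buffers and a dedicated COPY thread. The send callable is a 
placeholder for whatever actually issues the COPY, and copy_worker, load and 
buffer_size are hypothetical names.

```python
import queue
import threading


def copy_worker(q, send):
    """Dedicated thread: send each buffer to the server until None arrives."""
    while True:
        buf = q.get()
        if buf is None:
            break
        send(buf)  # placeholder for the real COPY of this buffer


def load(lines, send, buffer_size=3):
    """Fill buffers in the caller while the previous one is being sent."""
    # maxsize=1: at most one full buffer waits while another is being sent,
    # so the producer cannot run arbitrarily far ahead of the server.
    q = queue.Queue(maxsize=1)
    t = threading.Thread(target=copy_worker, args=(q, send))
    t.start()
    buf = []
    for line in lines:
        buf.append(line)
        if len(buf) >= buffer_size:
            q.put(buf)
            buf = []
    if buf:
        q.put(buf)  # flush the last, partially filled buffer
    q.put(None)
    t.join()
```

Whether this buys anything depends on how much time is spent preparing a 
buffer compared to sending it, which is exactly what still has to be 
examined.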

This will need some re-architecting of pgloader, but it seems worth it 
(I'm not entirely sold on the two thread-pools idea, though, and this last 
continue-reading-while-copying idea still has to be examined).
Some of the work to be done is by now quite clear to me, but part of it still 
needs its share of design time. As usual though, the really hard part is 
knowing exactly what we want to get done, and we're making good progress 
here :)

Greg's behaviour:
max_threads           = N 
max_parallel_sections = 1
section_threads       = -1
split_file_reading    = True

Simon's behaviour:
max_threads           = N
max_parallel_sections = 1   # I don't think Simon wants parallel sections
section_threads       = -1
split_file_reading    = False

Comments?
-- 
dim
