We would like to start some work on improving the performance of
PostgreSQL in a multi-CPU environment. Dano Vojtek is a student at the
Faculty of Mathematics and Physics of Charles University in Prague
(http://www.mff.cuni.cz) and is going to cover this topic in his
master's thesis. He will investigate the possible methods, write up
the options, and then implement some of them for PostgreSQL.
We want to put together a serious proposal for this work after
collecting feedback and opinions and doing a more thorough investigation.
The topics that seem to be of interest, most of which were already
discussed at the developers meeting in Ottawa, are:
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
1.) and 2.)
Scaling with an increasing number of CPUs in 1.) and 2.) will hit the
I/O bottleneck at some point, and the benefit gained here should be
nearly the same as for 3.): the OS or disk can do a better job when
scheduling multiple reads from disk for the same query at the same time.
1.)
More merges could be executed on different CPUs. However, one N-way
merge on one CPU is probably better than two N/2-way merges on two CPUs
that share the work_mem limit between them. This is specific to sorting
and separate from 2.) or 3.); if something is implemented here, it could
probably share just the parallel infrastructure code.
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs, and the results of these subtrees could be requested
either synchronously or asynchronously.
3.)
The simplest possible approach is to change the scan nodes so that they
send out asynchronous I/O requests for the next blocks before they run
out of tuples in the block they are currently going through. A more
advanced form would arise from implementing 2.), which would then lead
to different scan nodes being executed on different CPUs at the same time.
4.) and 5.)
We do not want to focus on these here, since there are already
on-going projects covering them.
6.)
Currently, threads are not used in PostgreSQL (except in a few cases on
Windows). In general, using them would bring some problems:
a) different thread implementations on different OSes
b) a crash of the whole process if a problem happens in one thread.
Backends are isolated, and a problem in one backend leads to a
graceful shutdown of the other backends.
c) synchronization problems
* a) seems to be mostly an implementation concern. Is there any problem
with running multiple threads on any supported OS? For example, a
scheduling issue where all the threads of the same process end up
scheduled on the same CPU? Or something similar?
* b) is acceptable when using multiple threads to process the same query
in the same backend: if one thread crashes, the others can perform a
graceful shutdown.
* c) does not have to be solved in general, because the work of all the
threads will be synchronized and we can predict fairly well which data
are being accessed by which thread. Memory allocation has to be made
thread safe without hurting performance (is a separate memory context
per thread sufficient?). Other common code might need changes as well.
Possibly, the synchronization/critical-section exclusion could be done
in the executor, and only where needed.
* Using processes instead of threads makes other things more complex:
- sharing objects between processes may need much more coding
- more overhead during execution and synchronization
It seems to us that it makes sense to start working on 2.) and 3.), and
we would like to think about using multiple threads to process the same
query within one backend.
We appreciate feedback, comments and/or suggestions.