Re: Postgres with pthread

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Postgres with pthread
Date: 2017-12-07 11:55:29
Message-ID: 169c12a9-adb3-6ff0-dda9-86822cb077c7@postgrespro.ru
Lists: pgsql-hackers

I want to thank everybody for the feedback and the many useful remarks.
I am very pleased with the community's interest in this topic and will
continue research in this direction.
Some more comments from my side:

My original intention was to implement some kind of built-in connection
pooling for Postgres: to be able to execute several transactions in one
backend.
It requires the use of some kind of lightweight multitasking
(coroutines). The obvious candidate for it is libcore.

In this case we also need to solve the problem with static variables,
and __thread will not help here: thread-local storage is shared by all
coroutines running in the same thread. We have to collect all static
variables into some structure (context) and replace every reference to
such a variable with indirection through a pointer. This is much harder
to implement than annotating variable definitions with __thread: it
requires changing all accesses to the variables, so almost all Postgres
code has to be refactored.

Another problem with this approach is that we need asynchronous disk IO
for it. Unfortunately there is no good file AIO implementation for Linux.
Certainly we can spawn a dedicated IO thread (or threads) and queue IO
requests to it, but such an architecture becomes quite complex.
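
A rough sketch of such a dedicated IO thread, assuming POSIX threads and
pread(). The IoRequest/IoQueue types and the io_* functions are
hypothetical names made up for this illustration, and io_wait() simply
blocks where a real coroutine scheduler would switch to another session.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* One queued read request (hypothetical structure). */
typedef struct IoRequest
{
    int               fd;
    void             *buf;
    size_t            len;
    off_t             offset;
    ssize_t           result;   /* filled in by the IO thread */
    int               done;     /* set to 1 once result is valid */
    struct IoRequest *next;
} IoRequest;

typedef struct IoQueue
{
    pthread_mutex_t lock;
    pthread_cond_t  changed;    /* signalled on submit, completion, shutdown */
    IoRequest      *head, *tail;
    int             shutdown;
} IoQueue;

static void io_queue_init(IoQueue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->changed, NULL);
    q->head = q->tail = NULL;
    q->shutdown = 0;
}

/* Called by a session: enqueue the request and return immediately. */
static void io_submit(IoQueue *q, IoRequest *r)
{
    assert(r != NULL);
    r->done = 0;
    r->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->lock);
}

/* Wait for completion of one request and return its pread() result. */
static ssize_t io_wait(IoQueue *q, IoRequest *r)
{
    pthread_mutex_lock(&q->lock);
    while (!r->done)
        pthread_cond_wait(&q->changed, &q->lock);
    pthread_mutex_unlock(&q->lock);
    return r->result;
}

/* Ask the IO thread to exit after draining the queue. */
static void io_shutdown(IoQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->shutdown = 1;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->lock);
}

/* Body of the dedicated IO thread: pop a request, do the blocking read
 * outside the lock, then publish the result. */
static void *io_thread_main(void *arg)
{
    IoQueue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;)
    {
        while (q->head == NULL && !q->shutdown)
            pthread_cond_wait(&q->changed, &q->lock);
        if (q->head == NULL)
            break;              /* shutdown requested and queue is empty */

        IoRequest *r = q->head;
        q->head = r->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        ssize_t n = pread(r->fd, r->buf, r->len, r->offset);

        pthread_mutex_lock(&q->lock);
        r->result = n;
        r->done = 1;
        pthread_cond_broadcast(&q->changed);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}
```

Even this toy version needs a queue, two kinds of wakeup and a shutdown
protocol, and it handles neither writes, nor fsync, nor cancellation,
nor error reporting.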

Also, cooperative multitasking by itself cannot load all CPU cores, so
we need several physical processes/threads which execute coroutines.
In theory such an architecture should provide the best performance and
scalability (handling hundreds of thousands of client connections). But
in practice there are a lot of pitfalls:
1. Right now each backend has its own local relation, catalog and
prepared statement caches. For a large database these caches can be
quite large: several megabytes.
So such coroutines are really not "lightweight". The obvious solution is
to have global caches, or to combine global and local caches. But this
once again requires significant changes in Postgres.
2. A large number of sessions makes the current procarray approach
almost unusable: we need to provide some alternative implementation of
snapshots, for example CSN-based.
3. All locking mechanisms have to be rewritten.

So this approach almost excludes the possibility of evolving the
existing Postgres code base and requires a "revolution": rewriting most
Postgres components from scratch and refactoring almost all the other
Postgres code.
This is why I had to abandon this direction.

Replacing processes with threads can be considered just a first step,
and it also requires changes in many Postgres components if we really
want to get significant advantages from it.
But at least such work can be split into several phases, and for some
time it is possible to support both the multithreaded and the
multiprocess model in the same codebase.
Below I want to summarize the most important (from my point of view)
arguments for and against multithreading that I got from your feedback:

Pros:
1. Simplified memory model: no need for DSM, shm_mq, DSA, etc.
2. Efficient integration of PLs supporting multithreaded execution,
first of all Java
3. Smaller memory footprint, faster context switching, more efficient
use of the TLB

Cons:
1. Breaks compatibility with existing extensions and adds more
requirements for authors of new extensions
2. Problems with integration of single-threaded PLs: Python, Lua,...
3. Worse protection from programming errors, including errors in
extensions.
4. Lack of explicit separation of shared and private memory leads to
more synchronization errors.
Right now in Postgres there is a strict distinction between shared
memory and private memory, so it is clear to the programmer whether
(s)he is working with shared data and so needs some kind of
synchronization to avoid race conditions.
With pthreads all memory is shared and more care is needed when working
with it.
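
A small sketch of what this means in practice, assuming POSIX threads
and the GCC/Clang __thread extension. The counters and function names
are hypothetical. In a threaded build an ordinary global is implicitly
shared, and nothing in its declaration says that it needs a lock.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4
#define NITER    100000

/* An ordinary global: in a multithreaded process every thread sees the
 * same copy, so each update must be protected. */
static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Explicitly thread-local: each thread gets its own copy, no lock needed. */
static __thread long private_counter = 0;

static void *counter_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < NITER; i++)
    {
        private_counter++;                  /* private: safe without lock */

        pthread_mutex_lock(&counter_lock);  /* shared: must synchronize */
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Start NTHREADS workers, wait for them, and return the shared total. */
static long run_counter_workers(void)
{
    pthread_t tids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, counter_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return shared_counter;
}
```

Without the mutex the increments of shared_counter would race and the
final total would be unpredictable. In the multiprocess model the same
global is private to each backend by default, and shared state has to be
placed in shared memory explicitly.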

So pthreads can help to increase scalability, but still do not help much
in the implementation of built-in connection pooling, autonomous
transactions,...

The current 50% improvement of select speed for a large number of
connections certainly cannot be considered enough motivation for such
radical changes to the Postgres architecture.
But it is just a first step, and many more benefits can be obtained by
adapting Postgres to this model.
It is hard for me to estimate now the full complexity of switching to
the thread model and all the advantages we can get from it.
First of all I am going to repeat my benchmarks on SMP computers with a
large number of cores (so that 100 or more active backends can be really
useful even in the case of connection pooling).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
