Connection Pooling, a year later

From: Michael Owens <owensmk(at)earthlink(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Connection Pooling, a year later
Date: 2001-12-18 00:46:44
Message-ID: 200112180028.fBI0Sum06915@postgresql.org
Lists: pgsql-hackers

I was browsing through the TODO list and stumbled upon the connection pooling
item discussed over a year ago. Is it safe to assume that, since it's still
listed in the TODO, it's still a desirable feature? Although I don't claim to
know jack about the postgresql source (though I am trying to learn my way
around), it does seem like the initial idea proposed by Alfred Perlstein
could offer considerable improvements in postgresql's ability to handle
larger numbers of connections. Passing file descriptors between processes
aside, is it feasible to at least get the backend to where it could
theoretically handle this --- multiple, simultaneous clients?

Even if the descriptor passing were not the best (most portable) way to get
information from the client - postmaster to the backend, there still might be
another way by which data from the client socket can be transmitted to a
backend, thereby allowing the data from multiple clients to be sent to a
single backend. For instance, the postmaster could accept() and also
multiplex (select()/poll(), or use kqueue on BSD) across all file
descriptors, sending and receiving data to/from the clients. After calling
select(), it assigns/maps the client socket descriptor to one of the
backends, which it is connected to via domain socket. All subsequent data
coming from that client descriptor will be passed directly to that backend
via the domain socket. Information sent from the backend to the postmaster
through that socket includes the descriptor number of the client for whom
that information is intended. The postmaster would multiplex across
descriptors to
clients and backends alike, effectively connecting the M clients to N
backends, without having to pass descriptors.

I doubt that what I am going to say in this paragraph is new to any of you,
so don't think I'm being preachy, or that I think this is a novel idea. But
allowing multiple backends to process multiple clients, if only considered
from a mathematical standpoint, seems like a way to increase (perhaps even
dramatically) the maximum number of connections postgresql can service while
maintaining a fixed number of backends, and therefore a known performance
level. The central issue is the fact that the database server reaches the
point of diminishing returns at some number of active backend processes,
beyond which the overhead from shared resource contention, context
switching, etc., degrades performance. This ideal number, varying from
machine to machine, can
of course be controlled by setting the max number of connections. However,
once that limit is reached, other clients are locked out, and the real
question becomes whether or not the server is really being used to its full
potential. Most likely it is not --- because there is still a significant
amount of idle time in each backend. So even though the maximum number of
connections has been reached, the overall utilization is less than what the
machine is capable of, all the while locking other clients out. Then
the pendulum can swing the other way: you set the max number of connections
well beyond the ideal number (creating a buffer of sorts) to allow for this
scenario, so those blocked clients are let in. The problem is that this also
opens another worst-case scenario: what if all of the maximum connections are
active, sending the number of active backends well beyond the ideal limit?

By having the postmaster map multiple clients to a fixed number of backends,
you achieve the happy medium: You never exceed the ideal number of active
backends, and at the same time you are not limited to only accepting a fixed
number of connections. Accepting connections can now be based on load
(however you wish to define it), not number. You now make decisions based on
utilization.

If it were shown that even half of a backend's life consisted of idle time,
leasing out that idle time to another active connection would potentially
double the average number of simultaneous requests without (theoretically)
incurring any significant degradation in performance.

I have worked on code that uses this model, and would be glad to
adapt/contribute it to postgresql. Currently it does pass file descriptors
from a listening process to a queue process using sendmsg()/recvmsg(). I
have no trouble with it on Linux and BSD, but I don't pretend to know
anything about portability. However, even if this is an issue, I would be
willing to adapt the model mentioned above. Currently, the design I have is
to pass in
complete transactions, so that the M-to-N mapping can be achieved in an
application server. The downside is that state cannot be maintained within
the database (backend), as the backends are shared between clients on the
application server. Every request the server makes to postgresql must be a
standalone transaction.

So I guess my question is whether or not there is still interest in this, and
whether there are still great difficulties in making the necessary changes to
the backend so that it could handle multiple clients. If it does seem
possible, and there is interest, then I would be willing to take a stab at it
with the code I have developed.
