MPTCP - multiplexing many TCP connections through one socket to get better bandwidth

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: MPTCP - multiplexing many TCP connections through one socket to get better bandwidth
Date: 2025-09-04 10:56:06
Message-ID: CAKZiRmy6j9PBzDHZwdgwHavwKDzv5GWtRSWOTj6-jv6SCOZ=YA@mail.gmail.com
Lists: pgsql-hackers

Hi -hackers,

With the attached patch PostgreSQL could gain built-in MPTCP support,
allowing multiple kernel-based TCP streams to be multiplexed
(aggregated) into one MPTCP socket. This lets libpq transparently
bypass any "chokepoints" on the network, especially where *multiple*
TCP streams can achieve higher bandwidth than a single one. Think of
transparently aggregating bandwidth over multiple WAN links/tunnels
and so on. In short it works like this:
libpq_client <--MPTCP--> client_kernel <==multiple TCP connections==> server_kernel <--MPTCP--> postgres_server
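
(For reference, below is a minimal C sketch - not the patch itself,
just an illustration assuming a Linux kernel with CONFIG_MPTCP - of
what opting into MPTCP looks like from an application's point of view:
the only change versus plain TCP is the third argument to socket();
the kernel negotiates and schedules the subflows by itself. The
10.0.1.240 address is just the demo server used further down.)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262        /* older libc headers may not define it yet */
#endif

int
main(void)
{
    struct sockaddr_in sa;
    /* the only MPTCP-specific bit: IPPROTO_MPTCP instead of IPPROTO_TCP */
    int         fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    if (fd < 0)
    {
        perror("socket(IPPROTO_MPTCP)");   /* e.g. EPROTONOSUPPORT on old kernels */
        return 1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(5432);
    inet_pton(AF_INET, "10.0.1.240", &sa.sin_addr);   /* demo server address */

    if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0)
        perror("connect");

    close(fd);
    return 0;
}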

Without much rework of PostgreSQL, this means accelerating any
libpq-based use case. The most obvious beneficiaries would be
libpq-based heavy network transfers, especially in enterprise
networks. These come to mind:
- pg_basebackup (over e.g. WAN or multiple interfaces; one can also
think of using 2x 10GigE over LAN)
- streaming replication or logical replication [years ago I was able
to use MPTCP with colleagues in production to bypass the single TCP
stream limitation of streaming replication]
- COPY (both upload and download)
- postgres_fdw/dblink?

MPTCP is an IETF standard, has been included in Linux kernels for
some time (realistically 5.16+?), and it is *enabled* by default in
most modern distributions. One could use it via mptcpize (an
LD_PRELOAD wrapper that hijacks socket()), but that is not elegant
and would require altering systemd startup scripts (the same story as
with NUMA: practically nobody hacks those just to add numactl
--interleave there, or to adjust ulimits).

The patch right now just assumes IPPROTO_MPTCP is there, so it is not
portable, but not that many OSes support it at all -- I think an
#ifdef would be good enough for now. I don't have access to macOS to
develop this further there, nor do I think it would add much benefit
there, but I may be wrong. So as such the proposed patch is trivial
and Linux-only, although there is RFC 8684 [1][2]. I suspect it is
way easier and simpler to support it this way, rather than trying to
solve the same problem separately for each of the listed use cases.
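
To illustrate what I mean by "an #ifdef would be good enough" - a
rough sketch (a hypothetical helper, not the actual patch code) of
falling back to plain TCP when IPPROTO_MPTCP is missing at build time
or rejected by the running kernel:

#include <stdbool.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: try MPTCP when requested and available, else plain TCP. */
int
create_stream_socket(int family, bool want_mptcp)
{
    int         fd = -1;

#ifdef IPPROTO_MPTCP
    if (want_mptcp)
        fd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);
    /* fd < 0 here means e.g. EPROTONOSUPPORT: kernel built without MPTCP
     * or net.mptcp.enabled=0 - silently fall back to plain TCP below */
#endif
    if (fd < 0)
        fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
    return fd;
}

int
main(void)
{
    return (create_stream_socket(AF_INET, true) >= 0) ? 0 : 1;
}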

Simulation, basic-use and tests:

1. Strictly for demo purposes here, we need to ARTIFICIALLY limit
outbound bandwidth for each new flow (TCP connection) to 10 Mbit/s
using `tc` on the server where PostgreSQL is going to be running later
on (this simulates some chokepoints, multiple WAN paths):
DEV=enp0s31f6
tc qdisc add dev $DEV root handle 1: htb
tc class add dev $DEV parent 1: classid 1:1 htb rate 100mbit
for i in `seq 1 9`; do
  tc class add dev $DEV parent 1:1 classid 1:$i htb rate 10mbit ceil 10mbit
done
# see tc-flow(8) for details; classify each flow (by port) into a separate class (1:X)
tc filter add dev $DEV parent 1: protocol ip prio 1 handle 1 flow \
  hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1

2. From the client, verify that single-stream TCP bandwidth is really limited:
iperf3 -P 1 -R -c <server>  # verify you really get a bandwidth-limited single TCP stream instead of the full rate
iperf3 -P 8 -R -c <server>  # verify you really get more bandwidth than above

3. Check that MPTCP is enabled and configured on both sides:
uname -r  # at least 5.10+ according to [4] to get this balancing working, but 6.1+ LTS highly recommended (I've used 6.14.x)
sysctl net.mptcp.enabled  # should be 1 on both sides by default
ip mptcp limits set subflows 8 add_addr_accepted 8  # feel free to set up max limits

4. Configure MPTCP endpoints on the server (this registers dedicated
listening ports for MPTCP use, so there's no need to use multiple IP
aliases or PBR):
ps uaxw | grep -i mptcpd  # check whether the mptcp daemon (path manager) is running; it is NOT required in this case
ip addr ls  # let's assume 10.0.1.240 is my main IP on the eno1 device; no need to add new IPs thanks to the trick below
ip mptcp endpoint show  # check current state
#ip mptcp endpoint flush  # if necessary
# below registers ports 5202..5205 as LISTENing by the kernel and dedicated to MPTCP subflows
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5202 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5203 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5204 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5205 signal
ip mptcp endpoint show  # to verify

5. Configure the client:
ip addr ls  # here I got 10.0.1.250
ip mptcp endpoint show
ip mptcp endpoint add 10.0.1.250 dev enp0s31f6 subflow fullmesh  # not sure fullmesh is necessary, probably not
ip mptcp limits set add_addr_accepted 8 subflows 8

6. Verify that MPTCP works; rerun the tests with mptcpize, e.g.:
on server: mptcpize run iperf3 -s
on client: mptcpize run -d iperf3 -P 1 -R -c <server>  # should get better bandwidth while using just 1 MPTCP connection
on server: run PostgreSQL with listen_mptcp='on'
on server: ss -Mtlnp sport 5432  # mptcp should be displayed
on client: run basebackup/psql/..
Sample results for an 82MB table copy; it's ~3x faster:
$ time PGMPTCP=0 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
COPY 500000
real 0m42.123s

$ time PGMPTCP=1 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
enabling MPTCP client
COPY 500000
real 0m14.416s

Sample results for pg_basebackup of a DB created with pgbench -i -s 5
(~1076MB total due to WALs):
$ time /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
[..]
pg_basebackup: base backup completed
real 1m26.786s

With PGMPTCP=1 set, it gets ~3x faster:
$ time PGMPTCP=1 /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
enabling MPTCP client
pg_basebackup: initiating base backup, waiting for checkpoint to complete
[..]
pg_basebackup: starting background WAL receiver
enabling MPTCP client
[..]
pg_basebackup: base backup completed
real 0m30.460s

Because in the above case we advertised 4 server IP address/port
combinations to the client, we got this bump on a single socket
(note: which HTB class each flow ends up hashed into is random,
depending on the ports used, so you usually get somewhere between 2x
and 4x here). Also, as pg_basebackup opens two independent
application-level connections (data transfer + WAL streaming), both
get multiplexed (each with 4 subflows). If I added more ip mptcp
ports on the server side, we could of course squeeze out even more,
but that assumes one has that many paths. Some more advanced setups
are possible, including separate policy-based-routed (ip rule)
things, and stuff like keeping the TCP connection highly available
even across ISP/interface (WiFi?) outages. It works transparently
with SSL/TLS too - tested. Of course it won't remove the single-CPU
limitation of the tools involved (that's a completely different
problem).

If this sounds interesting, I was thinking about adding to the patch
something like contrib/mptcpinfo (a pg_stat_mptcp view to mimic
pg_stat_ssl). Also, as for the patch, there are some other places
where a socket() is created (the libpq cancel packet), but I think
there's no purpose in adding MPTCP there.
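
For such a view, the kernel already exposes per-connection MPTCP
counters via getsockopt(SOL_MPTCP, MPTCP_INFO) (struct mptcp_info
from <linux/mptcp.h>, available since roughly 5.16). A rough sketch
of reading it on an already-connected MPTCP socket - just an
illustration of the kernel API, not actual patch code, and the exact
set of mptcpi_* fields depends on the kernel headers:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mptcp.h>           /* struct mptcp_info, MPTCP_INFO */

#ifndef SOL_MPTCP
#define SOL_MPTCP 284              /* from <linux/socket.h>; older libc headers may lack it */
#endif

/* Print a couple of MPTCP counters for an already-connected MPTCP socket. */
void
print_mptcp_info(int fd)
{
    struct mptcp_info info;
    socklen_t   len = sizeof(info);

    if (getsockopt(fd, SOL_MPTCP, MPTCP_INFO, &info, &len) == 0)
        printf("subflows=%u token=%u\n",
               (unsigned) info.mptcpi_subflows, (unsigned) info.mptcpi_token);
    else
        perror("getsockopt(MPTCP_INFO)");   /* e.g. plain-TCP fallback or old kernel */
}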

It is important to mention that there are two implementations of
MPTCP on Linux, so when someone goes googling there's lots of
conflicting information:
1) The earlier one, which required kernel patching (up to <= 5.6),
had an "ndiffports" multiplexer built in and worked mostly out of the
box.
2) The newer one [3], the one already merged into today's kernels, is
a little different and does not come with a built-in ndiffports path
manager. With this newer one, as shown above, some more manual steps
(ip mptcp endpoints) may be required, but the mptcpd daemon which
manages (sub)flows seems to be evolving as usage of this protocol
rises. So I hope that in the future all of those ip mptcp commands
will become optional.

Thoughts?

-Jakub Wartak.

[1] - https://en.wikipedia.org/wiki/Multipath_TCP
[2] - https://www.rfc-editor.org/rfc/rfc8684.html
[3] - https://www.mptcp.dev/
[4] - https://github.com/multipath-tcp/mptcp_net-next/wiki/#changelog

Attachment: v1-0001-Add-MPTCP-protocol-support-to-server-and-libpq-on.patch (application/octet-stream, 7.4 KB)
