pg_dump: use threads for parallel workers on all platforms

From: Bryan Green <dbryan(dot)green(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: pg_dump: use threads for parallel workers on all platforms
Date: 2026-07-02 16:30:39
Message-ID: 8c712d76-ecf7-4749-a6d8-dddc01f298ec@gmail.com

pg_dump runs parallel workers in two completely different ways. On
non-Windows platforms, they're forked processes communicating with the
leader over pipes-- which is what pipes and processes are for. Windows
has no fork(), so there the workers are threads. But instead of
coordinating those threads as threads, the Windows port runs the same
process-based protocol on top of them, unchanged. Each worker's channel
to the leader is a loopback TCP socketpair on 127.0.0.1, opened when the
worker starts. So to tell a worker "dump table 1234," the leader
serializes the command to a string and writes it down that socket; the
worker reads it back a byte at a time, and the leader watches the
sockets with select() to see who's done. All of it to hand work to a
thread a few megabytes away in the same address space.

And because the leader waits on the workers with WaitForMultipleObjects,
which takes at most 64 handles, you can't run more than 64 jobs on
Windows. The parallelism limit there is a limit of the wait call. (The
non-Windows side doesn't have this limit-- it reaps workers with wait()
rather than WaitForMultipleObjects, so PG_MAX_JOBS is INT_MAX.)

None of this is broken; it works. It's threads pretending to be
processes because the code was written for processes, and the port kept
the protocol rather than rethinking it. I'd like to stop.

What I'm proposing is one model everywhere: threads on all platforms,
coordinated by an in-process work queue-- a mutex and a couple of
condition variables-- instead of two worker models bridged by an
inter-process protocol.
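
As a rough sketch of what such a queue could look like (this is not the code in the attached patches; POSIX threads are an assumption made so the example runs anywhere, and WorkQueue, queue_push, queue_pop, queue_close and run_demo are all made-up names, with plain ints standing in for dump IDs):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8             /* hypothetical fixed capacity */
#define MAX_WORKERS 16          /* hypothetical cap for the demo only */

typedef struct WorkQueue
{
    pthread_mutex_t lock;
    pthread_cond_t not_empty;   /* signalled when an item is added */
    pthread_cond_t not_full;    /* signalled when an item is removed */
    int         items[QUEUE_CAP];   /* stand-in for dump IDs */
    size_t      head;
    size_t      count;
    bool        closed;
} WorkQueue;

void
queue_init(WorkQueue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
    q->head = q->count = 0;
    q->closed = false;
}

/* Leader side: block while the ring is full, then enqueue one item. */
void
queue_push(WorkQueue *q, int item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Leader side: announce that no more work is coming; wake all waiters. */
void
queue_close(WorkQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = true;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Worker side: returns false once the queue is closed and drained. */
bool
queue_pop(WorkQueue *q, int *item)
{
    bool        got = false;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0)
    {
        *item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        got = true;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

typedef struct WorkerArg { WorkQueue *q; long sum; } WorkerArg;

void *
worker_main(void *p)
{
    WorkerArg  *arg = p;
    int         id;

    while (queue_pop(arg->q, &id))
        arg->sum += id;         /* stand-in for "dump the entry with this ID" */
    return NULL;
}

/* Push IDs 1..nitems through the queue; return the sum the workers saw. */
long
run_demo(int nworkers, int nitems)
{
    WorkQueue   q;
    pthread_t   threads[MAX_WORKERS];
    WorkerArg   args[MAX_WORKERS];
    int         started = 0;
    long        total = 0;

    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    queue_init(&q);
    for (int i = 0; i < nworkers; i++)
    {
        args[i].q = &q;
        args[i].sum = 0;
        if (pthread_create(&threads[i], NULL, worker_main, &args[i]) != 0)
            break;
        started++;
    }
    for (int id = 1; started > 0 && id <= nitems; id++)
        queue_push(&q, id);     /* skipped if no worker could be started */
    queue_close(&q);
    for (int i = 0; i < started; i++)
    {
        pthread_join(threads[i], NULL);
        total += args[i].sum;
    }
    return started > 0 ? total : -1;
}
```

Calling run_demo(4, 100) pushes the IDs 1 through 100 to four workers; each ID is popped exactly once, so the per-worker sums add up to the sum of the IDs regardless of how the work was split.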

To be clear, the unification is on the queue, not on what Windows does
today. Teaching the non-Windows side to talk to its own threads over a
socket, a byte at a time, would just be the same trick on more
platforms-- that's the part worth deleting, not copying.

I've done the Windows half, both to prove it out and because it's the
coordination layer the non-Windows side would adopt. The socket protocol
is gone; the leader hands work to a worker in memory instead of down a
loopback connection. The 64-job cap is gone. The unchecked
_beginthreadex return-- which on failure recorded a thread that didn't
exist as an idle worker-- is fixed. Dump and restore are byte-for-byte
identical to stock from -j2 through -j250. The non-Windows port to
threads isn't written yet, and I won't start it until the direction is
settled.
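
For illustration, the shape of the thread-creation check looks roughly like this. It is sketched with pthread_create so that it runs anywhere, which makes it the POSIX analogue rather than the patch itself: _beginthreadex reports failure by returning 0, whereas pthread_create returns a nonzero error code. WorkerSlot, SlotStatus and start_worker are invented names, not pg_dump's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical slot states; pg_dump's real worker states differ. */
typedef enum { SLOT_NOT_STARTED, SLOT_IDLE } SlotStatus;

typedef struct WorkerSlot
{
    pthread_t   thread;
    SlotStatus  status;
} WorkerSlot;

/* Returns 0 on success, or the pthread_create error code on failure. */
int
start_worker(WorkerSlot *slot, void *(*entry) (void *), void *arg)
{
    int         rc;

    assert(slot != NULL && entry != NULL);
    slot->status = SLOT_NOT_STARTED;
    rc = pthread_create(&slot->thread, NULL, entry, arg);
    if (rc != 0)
        return rc;              /* no thread exists: never mark it idle */
    slot->status = SLOT_IDLE;   /* only now may the leader hand it work */
    return 0;
}

/* Trivial worker body so the success path can be exercised. */
void *
noop_worker(void *arg)
{
    return arg;
}
```

The point is only the ordering: the slot becomes idle after creation is known to have succeeded, so a failed start can never be handed work.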

One piece I deliberately left for the unified version: the queue still
passes the command as a string-- "DUMP 1234"-- and the worker parses it
back and looks the ID up to recover the TocEntry it already came from.
In one address space that's ceremony; the queue could carry a {
T_Action, TocEntry * } and drop the serialize/parse/lookup entirely. I
didn't do it on Windows because the non-Windows side still forks, and a
pointer is meaningless across a process boundary. There, the string is
the serialization that path needs, so converting only the Windows side
would add a second message format instead of removing one. With threading in
place across Windows and non-Windows, we can pass the work item directly
and delete buildWorkerCommand/parseWorkerCommand and the matching
response pair, plus the dumpId lookup, on every platform at once.
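
To make the difference concrete, here is a small sketch of both forms side by side. The types are heavily simplified stand-ins (the real TocEntry has far more fields), and build_command, parse_command and WorkItem are hypothetical names for illustration, not the functions or types in pg_dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for illustration only. */
typedef enum { ACT_DUMP, ACT_RESTORE } T_Action;
typedef struct TocEntry { int dumpId; const char *tag; } TocEntry;

/* String form, step 1: the leader serializes the action and the ID. */
int
build_command(char *buf, size_t len, T_Action act, const TocEntry *te)
{
    assert(buf != NULL && te != NULL);
    return snprintf(buf, len, "%s %d",
                    act == ACT_DUMP ? "DUMP" : "RESTORE", te->dumpId);
}

/*
 * String form, steps 2 and 3: the worker parses the text, then searches
 * the table of contents for the entry the leader already had in hand.
 * Returns NULL on a malformed command or an unknown ID.
 */
TocEntry *
parse_command(const char *msg, TocEntry *toc, size_t ntoc, T_Action *act)
{
    char        verb[16];
    int         id;

    if (sscanf(msg, "%15s %d", verb, &id) != 2)
        return NULL;
    if (strcmp(verb, "DUMP") == 0)
        *act = ACT_DUMP;
    else if (strcmp(verb, "RESTORE") == 0)
        *act = ACT_RESTORE;
    else
        return NULL;
    for (size_t i = 0; i < ntoc; i++)
        if (toc[i].dumpId == id)
            return &toc[i];
    return NULL;
}

/*
 * Direct form: valid only when leader and worker share an address space.
 * Nothing to serialize, parse, or look up.
 */
typedef struct WorkItem
{
    T_Action    act;
    TocEntry   *te;
} WorkItem;
```

With threads on both sides, the queue element can simply be a WorkItem, and the round trip through text disappears along with its error paths.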

The cost is crash isolation. fork() gives each non-Windows worker its
own address space, so one that segfaults can't corrupt the leader or its
siblings; threads give that up. What that isolation buys today is narrow.
The moment any worker dies, the leader pg_fatal()s and the whole dump
comes down, so processes don't give you recovery-- only the guarantee
that a corrupt worker can't scribble on a sibling's output before
everyone exits. Windows has run without even that for years. It's an
acceptable trade for a single implementation, but it's the real cost.

I'll say plainly that this fixes no user-visible bug, and nothing is
broken today. It's consolidation. There are two implementations of
parallel dump right now, and they drift: the Windows one grew the socket
emulation and the 64-job cap out of running a process protocol on
threads, and it carried bugs the non-Windows side never had, like the
_beginthreadex one above. One model means a fix or a feature lands once,
on a path exercised on every platform, instead of twice, with a seam
down the middle. Windows already shows the threaded model works here,
and threads are the half both sides can share, since Windows can't fork.
The Windows port originally kept the process shape to stay consistent
with the other platforms; this keeps that same goal, on the model that
actually ports.

The strongest evidence that the worker-reachable code is thread-safe is
that Windows has run that path with threads for years. With the thread
rework, fmtId's static return value is now _Thread_local. The global
state a worker reads is built during the single-threaded catalog phase,
before any worker exists.
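
A minimal sketch of why the _Thread_local storage class matters for a function that returns a pointer to its own static buffer. quote_ident_sketch is a made-up simplification (it always quotes and uses a fixed-size array), not fmtId itself, and POSIX threads are assumed for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns a pointer to a per-thread buffer.  With plain "static" every
 * thread would share one buffer, and a call in one thread could rewrite
 * a result another thread is still reading.
 */
const char *
quote_ident_sketch(const char *name)
{
    static _Thread_local char buf[128];

    assert(name != NULL);
    snprintf(buf, sizeof buf, "\"%s\"", name);
    return buf;
}

/* Runs in a second thread; overwrites only that thread's copy of buf. */
void *
other_thread(void *out)
{
    strcpy((char *) out, quote_ident_sketch("worker_table"));
    return NULL;
}

/* Returns 1 if the caller's result is intact after another thread's call. */
int
leader_result_survives(void)
{
    const char *mine = quote_ident_sketch("leader_table");
    char        theirs[128];
    pthread_t   t;

    if (pthread_create(&t, NULL, other_thread, theirs) != 0)
        return -1;
    pthread_join(t, NULL);
    return strcmp(mine, "\"leader_table\"") == 0 &&
        strcmp(theirs, "\"worker_table\"") == 0;
}
```

Within a single thread the usual caveat still applies: a second call overwrites the first result, so callers that need both must copy.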

Patches are attached-- 0001 and 0002 are independent and can be
committed separately; 0004 depends on 0003.

--
Bryan Green
EDB: https://www.enterprisedb.com

Attachment Content-Type Size
v1-0001-pg_dump-check-for-_beginthreadex-failure-in-paral.patch text/plain 1.1 KB
v1-0002-Give-fmtId-s-temporary-buffer-thread-local-storag.patch text/plain 4.1 KB
v1-0003-pg_dump-dispatch-parallel-workers-in-process-on-W.patch text/plain 13.3 KB
v1-0004-pg_dump-allow-more-than-64-parallel-jobs-on-Windo.patch text/plain 2.6 KB
