On a legacy 4-socket NUMA box, the simplest query, `select sum(aid) from pgbench_accounts_1` (a single partition, ~400MB), driven by one pgbench client (-c 1) against a small shared_buffers of 4GB (so s_b fits on a single NUMA node), gets this plan:

Finalize Aggregate  (cost=72150.70..72150.71 rows=1 width=8) (actual time=252.817..254.409 rows=1.00 loops=1)
  Output: sum(aid)
  Buffers: shared hit=54798
  ->  Gather  (cost=72150.49..72150.69 rows=2 width=8) (actual time=252.319..254.381 rows=3.00 loops=1)
        Output: (PARTIAL sum(aid))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=54798
        ->  Partial Aggregate  (cost=71150.49..71150.49 rows=1 width=8) (actual time=247.262..247.264 rows=1.00 loops=3)
              Output: PARTIAL sum(aid)
              Buffers: shared hit=54798
              Worker 0:  actual time=245.116..245.118 rows=1.00 loops=1
                Buffers: shared hit=19185
              Worker 1:  actual time=244.797..244.800 rows=1.00 loops=1
                Buffers: shared hit=19021
              ->  Parallel Seq Scan on public.pgbench_accounts_1  (cost=0.00..67879.99 rows=1308199 width=4) (actual time=0.011..136.895 rows=1041666.67 loops=3)
                    Output: aid, bid, abalance, filler
                    Buffers: shared hit=54798
                    Worker 0:  actual time=0.010..133.393 rows=1094053.00 loops=1
                      Buffers: shared hit=19185
                    Worker 1:  actual time=0.009..134.281 rows=1084465.00 loops=1
                      Buffers: shared hit=19021

Observe the "lat" (latency) metric: it depends primarily on where the processes are placed.

no PQ, default behavior:

progress: 57.0 s, 2.0 tps, lat 515.312 ms stddev 0.053, 0 failed
progress: 58.0 s, 2.0 tps, lat 609.009 ms stddev 92.653, 0 failed
progress: 59.0 s, 2.0 tps, lat 515.414 ms stddev 0.423, 0 failed
progress: 60.0 s, 2.0 tps, lat 515.530 ms stddev 0.138, 0 failed
progress: 61.0 s, 2.0 tps, lat 515.370 ms stddev 0.181, 0 failed
progress: 62.0 s, 2.0 tps, lat 515.396 ms stddev 0.351, 0 failed
progress: 63.0 s, 1.0 tps, lat 515.484 ms stddev NaN, 0 failed
progress: 64.0 s, 2.0 tps, lat 515.428 ms stddev 0.035, 0 failed
progress: 65.0 s, 2.0 tps, lat 515.357 ms stddev 0.089, 0 failed
progress: 66.0 s, 2.0 tps, lat 608.885 ms stddev 92.633, 0 failed
progress: 67.0 s, 2.0 tps, lat 515.828 ms stddev 0.501, 0 failed
progress: 68.0 s, 1.0 tps, lat 603.824 ms stddev NaN, 0 failed

no PQ, manual pinning (above, latency fluctuates a lot due to migrations):

progress: 318.0 s, 3.0 tps, lat 321.714 ms stddev 0.089, 0 failed <-- backend bound to a CPU on the same NUMA node as s_b; ~1.4GB/s DRAM read on that whole socket per the OS counters
progress: 319.0 s, 3.0 tps, lat 321.630 ms stddev 0.254, 0 failed
progress: 320.0 s, 3.0 tps, lat 321.655 ms stddev 0.131, 0 failed
progress: 321.0 s, 3.0 tps, lat 321.555 ms stddev 0.132, 0 failed
progress: 322.0 s, 3.0 tps, lat 322.027 ms stddev 0.964, 0 failed
progress: 323.0 s, 2.0 tps, lat 426.245 ms stddev 49.599, 0 failed
progress: 324.0 s, 2.0 tps, lat 519.873 ms stddev 0.160, 0 failed <-- backend bound to a CPU on a different NUMA node (traffic hits the interconnect); only ~0.8GB/s read on the socket holding s_b
progress: 325.0 s, 2.0 tps, lat 519.854 ms stddev 0.488, 0 failed
progress: 326.0 s, 2.0 tps, lat 519.604 ms stddev 0.285, 0 failed
progress: 327.0 s, 2.0 tps, lat 519.626 ms stddev 0.190, 0 failed
progress: 328.0 s, 2.0 tps, lat 520.172 ms stddev 0.064, 0 failed
progress: 329.0 s, 3.0 tps, lat 409.836 ms stddev 41.466, 0 failed
progress: 330.0 s, 3.0 tps, lat 325.236 ms stddev 0.248, 0 failed <-- backend bound to a CPU on the same NUMA node again
progress: 331.0 s, 3.0 tps, lat 325.190 ms stddev 0.261, 0 failed
progress: 332.0 s, 3.0 tps, lat 325.485 ms stddev 0.751, 0 failed
progress: 333.0 s, 3.0 tps, lat 325.026 ms stddev 0.065, 0 failed

520/325 ms = 1.6x; IPC jumps from 1.23 to ~2.00 just from switching from remote to local DRAM access (latency).
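For reference, the manual pinning above is plain Linux CPU-affinity juggling; a minimal sketch, assuming node 0 holds s_b, the CPU lists are just examples for this box, and <backend_pid> is a placeholder:

# which CPUs belong to which node
numactl --hardware

# find the backend serving the pgbench connection
psql -c "SELECT pid, application_name FROM pg_stat_activity WHERE backend_type = 'client backend';"

# bind it to CPUs of the node holding s_b  -> the ~321-325 ms lines
taskset -pc 0-15 <backend_pid>

# bind it to CPUs of a remote node (goes over the interconnect) -> the ~520 ms lines
taskset -pc 16-31 <backend_pid>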
no PQ, numa=on (s_b interleaved): STABLE (low stddev). Note the optimal placement above was ~321 ms while interleaved gets ~463 ms; yes, it is worse (321 -> 463 ms), but the latency is uniform instead of varying wildly with where the scheduler happens to place the backend:

progress: 10.0 s, 2.0 tps, lat 463.378 ms stddev 0.181, 0 failed
progress: 11.0 s, 2.0 tps, lat 463.422 ms stddev 0.016, 0 failed
progress: 12.0 s, 2.0 tps, lat 463.121 ms stddev 0.052, 0 failed
progress: 13.0 s, 2.0 tps, lat 463.198 ms stddev 0.148, 0 failed
progress: 14.0 s, 3.0 tps, lat 463.823 ms stddev 0.854, 0 failed
progress: 15.0 s, 2.0 tps, lat 462.937 ms stddev 0.199, 0 failed

OK, let's enable PQ: 2 workers and leader participating, default placement:

progress: 30.0 s, 7.0 tps, lat 134.692 ms stddev 4.644, 0 failed
progress: 31.0 s, 8.0 tps, lat 129.489 ms stddev 1.510, 0 failed
progress: 32.0 s, 7.0 tps, lat 130.506 ms stddev 3.393, 0 failed
progress: 33.0 s, 8.0 tps, lat 133.785 ms stddev 1.921, 0 failed
progress: 34.0 s, 7.0 tps, lat 134.561 ms stddev 6.965, 0 failed
progress: 35.0 s, 8.0 tps, lat 135.438 ms stddev 5.286, 0 failed
progress: 36.0 s, 7.0 tps, lat 141.760 ms stddev 15.878, 0 failed
progress: 37.0 s, 7.0 tps, lat 136.749 ms stddev 5.937, 0 failed
progress: 38.0 s, 7.0 tps, lat 138.258 ms stddev 8.666, 0 failed
progress: 39.0 s, 7.0 tps, lat 140.263 ms stddev 8.834, 0 failed

2 workers and leader participating, postmaster CPU affinity (and therefore that of any newly launched PQ workers) restricted to CPUs of the node holding s_b, with the leader CPU-pinned there too:

progress: 342.0 s, 9.0 tps, lat 117.306 ms stddev 0.152, 0 failed
progress: 343.0 s, 8.0 tps, lat 117.265 ms stddev 0.216, 0 failed
progress: 344.0 s, 9.0 tps, lat 117.216 ms stddev 0.221, 0 failed
progress: 345.0 s, 9.0 tps, lat 117.233 ms stddev 0.155, 0 failed

postmaster's CPU affinity (and so that of the 2 PQ workers) forced to the remote NUMA node:

progress: 346.0 s, 6.0 tps, lat 142.502 ms stddev 12.621, 0 failed
progress: 347.0 s, 6.0 tps, lat 168.971 ms stddev 20.585, 0 failed
progress: 348.0 s, 7.0 tps, lat 157.017 ms stddev 9.333, 0 failed
progress: 349.0 s, 6.0 tps, lat 157.810 ms stddev 15.124, 0 failed
progress: 350.0 s, 7.0 tps, lat 150.623 ms stddev 1.246, 0 failed
progress: 351.0 s, 6.0 tps, lat 156.901 ms stddev 15.012, 0 failed

2 PQ workers on the NUMA node local to s_b + leader process forced to the remote NUMA node:

progress: 554.0 s, 7.0 tps, lat 130.552 ms stddev 0.256, 0 failed
progress: 555.0 s, 8.0 tps, lat 130.488 ms stddev 0.144, 0 failed
progress: 556.0 s, 8.0 tps, lat 130.795 ms stddev 0.171, 0 failed
progress: 557.0 s, 7.0 tps, lat 130.759 ms stddev 0.250, 0 failed
progress: 558.0 s, 8.0 tps, lat 130.601 ms stddev 0.217, 0 failed
progress: 559.0 s, 8.0 tps, lat 130.606 ms stddev 0.151, 0 failed
progress: 560.0 s, 7.0 tps, lat 130.503 ms stddev 0.217, 0 failed

2 PQ workers on the NUMA node remote to s_b + leader process on the local one:

progress: 627.0 s, 6.0 tps, lat 150.555 ms stddev 0.147, 0 failed
progress: 628.0 s, 7.0 tps, lat 150.675 ms stddev 0.484, 0 failed
progress: 629.0 s, 7.0 tps, lat 150.617 ms stddev 0.324, 0 failed
progress: 630.0 s, 6.0 tps, lat 150.410 ms stddev 0.087, 0 failed
progress: 631.0 s, 7.0 tps, lat 150.493 ms stddev 0.150, 0 failed
progress: 632.0 s, 7.0 tps, lat 150.566 ms stddev 0.095, 0 failed
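Roughly how the PQ placement variants above were arranged, again as a sketch (CPU lists are examples; it relies on the fact that new parallel workers are started by the postmaster and inherit its affinity mask, and <leader_pid>/<worker_pid> are placeholders):

# restrict the postmaster, and therefore any PQ workers it launches from now on,
# to the CPUs of the node holding s_b
taskset -pc 0-15 $(head -n 1 "$PGDATA/postmaster.pid")

# pin the leader backend to the same node (or to a remote one for the mixed cases)
taskset -pc 0-15 <leader_pid>

# already-running parallel workers can also be re-pinned individually while the
# query executes; their PIDs are visible in pg_stat_activity
psql -c "SELECT pid, leader_pid FROM pg_stat_activity WHERE backend_type = 'parallel worker';"
taskset -pc 16-31 <worker_pid>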
With PQ (2 workers and leader participating) and numa=on (interleaved s_b), default CPU affinity: stddev fluctuates between 2 and 23 ms:

progress: 39.0 s, 6.0 tps, lat 175.806 ms stddev 10.968, 0 failed
progress: 40.0 s, 6.0 tps, lat 172.568 ms stddev 10.002, 0 failed
progress: 41.0 s, 5.0 tps, lat 175.535 ms stddev 9.343, 0 failed
progress: 42.0 s, 6.0 tps, lat 170.761 ms stddev 6.165, 0 failed
progress: 43.0 s, 6.0 tps, lat 164.163 ms stddev 1.575, 0 failed
progress: 44.0 s, 6.0 tps, lat 175.584 ms stddev 13.247, 0 failed
progress: 45.0 s, 6.0 tps, lat 162.442 ms stddev 0.175, 0 failed

... but we have 4x (the number of sockets) more CPU power available at the same latency/bandwidth.

So if anything:
- latency-wise: it would be best to place the leader + all PQ workers close to s_b, provided s_b fits into that NUMA node's shared/huge page memory and you won't need more CPU than that node has (assuming e.g. hosting 4 DBs on a 4-socket box, each on its own node, it would be best to pin everything there: the shm, but the PQ workers too; see the sketch below)
- capacity/TPS-wise, or when s_b > a single NUMA node: just interleave, to maximize bandwidth and get uniform CPU performance out of it
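A sketch of what those two policies can look like purely from the OS side (the numa=on runs above interleave s_b from inside the server; numactl below is the coarser external knob, applied to the whole postmaster including its shared memory; node numbers and paths are made up):

# latency-oriented: one instance per socket, everything confined to that socket
# (shared memory, backends and PQ workers all stay node-local)
numactl --cpunodebind=0 --membind=0 pg_ctl -D /pgdata/db0 -l db0.log start
numactl --cpunodebind=1 --membind=1 pg_ctl -D /pgdata/db1 -l db1.log start

# bandwidth/uniformity-oriented: s_b bigger than one node, so spread allocations
# across all nodes and accept the averaged (but stable) latency
numactl --interleave=all pg_ctl -D /pgdata/big -l big.log start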