Re: Flushing large data immediately in pqcomm

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Flushing large data immediately in pqcomm
Date: 2024-02-02 22:38:27
Lists: pgsql-hackers


On 2024-02-01 15:02:57 -0500, Robert Haas wrote:
> On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> There was probably a better way to phrase this email ... the sentiment
> is sincere, but there was almost certainly a way of writing it that
> didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...

> > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > While not perfect - e.g. because networks might use jumbo packets / large MTUs
> > > and we don't know how many outstanding bytes there are locally, I think a
> > > decent heuristic could be to always try to send at least one packet worth of
> > > data at once (something like ~1400 bytes), even if that requires copying some
> > > of the input data. It might not be sent on its own, but it should make it
> > > reasonably unlikely to end up with tiny tiny packets.
> >
> > I think that COULD be a decent heuristic but I think it should be
> > TESTED, including against the ~3 or so other heuristics proposed on
> > this thread, before we make a decision.
> >
> > I literally mentioned the Ethernet frame size as one of the things
> > that we should test whether it's relevant in the exact email to which
> > you're replying, and you replied by proposing that as a heuristic, but
> > also criticizing me for wanting more research before we settle on
> > something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to send
out data immediately instead of waiting for further data to be sent). Without
that it'd be a lot less problematic to occasionally send data in small
increments in between larger sends. Nor would packet sizes be as relevant.
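
To make the TCP_NODELAY point concrete, here is a minimal sketch of what
setting that option on a socket looks like. The helper names disable_nagle()
and nagle_disabled() are made up for illustration and are not PostgreSQL
functions; only setsockopt()/getsockopt() and TCP_NODELAY are real.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP_NODELAY for fd, which disables Nagle's
 * algorithm so that small writes are sent out rather than coalesced.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
static int
disable_nagle(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/*
 * Hypothetical helper: report whether TCP_NODELAY is set on fd.
 * Returns 1 if set, 0 if not, -1 if getsockopt fails.
 */
static int
nagle_disabled(int fd)
{
    int       val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) != 0)
        return -1;
    return val != 0;
}
```

With the option set, every short send() can end up as its own small packet,
which is why the size of individual sends matters more here than it would
with Nagle's algorithm enabled.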

> > Are we just supposed to assume that your heuristic is better than the
> > others proposed here without testing anything, or, like, what? I don't
> > think this needs to be a completely exhaustive or exhausting process, but
> > I think trying a few different things out and seeing what happens is
> > smart.

I wasn't trying to say that my heuristic necessarily is better. What I was
trying to get at - and expressed badly - was that I doubt that testing can get
us all that far here. It's not too hard to test the effects of our buffering
with regards to syscall overhead, but once you actually take network effects
into account it gets quite hard. Bandwidth, latency, the specific network
hardware and operating systems involved all play a significant role. Given
how, uh, naive our current approach is, I think analyzing the situation from
first principles and then doing some basic validation of the results makes
more sense.

Separately, I think we shouldn't aim for perfect here. It's obviously
extremely inefficient to send a larger amount of data by memcpy()ing and
send()ing it in 8kB chunks. As mentioned by several folks upthread, we can
improve upon that without having worse behaviour than today. Medium-long term
I suspect we're going to want to use asynchronous network interfaces, in
combination with zero-copy sending, which requires larger changes. Not that
relevant for things like query results, quite relevant for base backups etc.
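
As a rough illustration of the difference, the sketch below compares copying
everything through an 8kB buffer with flushing the pending data and then
handing a large payload to the send routine directly. It is a toy model under
stated assumptions, not the patch discussed in this thread: Sink, OutBuf,
sink_send(), put_copy_all() and put_bypass_large() are all hypothetical names,
and sink_send() only counts calls instead of calling send().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEND_BUF_SIZE 8192      /* models the 8kB chunking mentioned above */

/* Hypothetical stand-in for send(): records call count and byte total. */
typedef struct Sink
{
    size_t calls;
    size_t bytes;
} Sink;

static void
sink_send(Sink *s, const char *data, size_t len)
{
    (void) data;                /* payload is ignored in this model */
    s->calls++;
    s->bytes += len;
}

typedef struct OutBuf
{
    char   data[SEND_BUF_SIZE];
    size_t used;
    Sink  *sink;
} OutBuf;

/* Send whatever is pending in the buffer, if anything. */
static void
outbuf_flush(OutBuf *b)
{
    assert(b->used <= SEND_BUF_SIZE);
    if (b->used > 0)
    {
        sink_send(b->sink, b->data, b->used);
        b->used = 0;
    }
}

/* Copy-everything style: all data goes through the small buffer. */
static void
put_copy_all(OutBuf *b, const char *src, size_t len)
{
    while (len > 0)
    {
        size_t room = SEND_BUF_SIZE - b->used;
        size_t n = len < room ? len : room;

        memcpy(b->data + b->used, src, n);
        b->used += n;
        src += n;
        len -= n;
        if (b->used == SEND_BUF_SIZE)
            outbuf_flush(b);
    }
}

/* Sketch: flush pending data, then send a large payload without copying. */
static void
put_bypass_large(OutBuf *b, const char *src, size_t len)
{
    if (len >= SEND_BUF_SIZE)
    {
        outbuf_flush(b);
        sink_send(b->sink, src, len);
    }
    else
        put_copy_all(b, src, len);
}
```

In this model a 1MB payload costs 128 send calls plus 1MB of memcpy() on the
copy-everything path, versus a single send call and no copy on the bypass
path. Note that the bypass path sends the pending data on its own first; the
variant benchmarked below instead combines the pending data with the start of
the large payload.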

It's perhaps also worth mentioning that the small send buffer isn't great for
SSL performance, the encryption overhead increases when sending in small
chunks.

I hacked up Melih's patch to send the pending data together with the first bit
of the large "to be sent" data and also added a patch to increase
SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for
    pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null

                        time via
test                    unix      tcp       tcp+ssl
master                  6.305s    9.436s    15.596s
master-larger-buffer    6.535s    9.453s    15.208s
patch                   5.900s    7.465s    13.634s
patch-larger-buffer     5.233s    5.439s    11.730s

The improvement when using tcp is pretty darn impressive. If I had remembered
in time to disable manifest checksums, the win would have been even bigger.

The bottleneck for SSL is that it still ends up with ~16kB sends, not sure
why.


Andres Freund
