From: Andres Freund <andres(at)anarazel(dot)de>
To: Artemiy Ryabinkov <getlag(at)ya(dot)ru>
Subject: Re: Why does backend send buffer size hardcoded at 8KB?
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why backend send buffer use exactly 8KB? (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
> I had this question when I try to measure the speed of reading data. The
> bottleneck was a read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm, that network packets have size 8192 (https://pastebin.com/FD8abbiA)
Well, in most setups you can't have frames that large. The most common
limit is 1500 +- some overheads. Using jumbo frames isn't that uncommon,
but it has enough problems that I don't think it's that widely used.
> So, with well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of Postgres write buffer.
Well, jumbo frames are limited to 9000 bytes.
But the reason you're seeing 8192-sized packets isn't just that we have
an 8KB buffer, I think it's also that we unconditionally set
    on = 1;
    if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
                   (char *) &on, sizeof(on)) < 0)
        elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
With 8KB send size, we'll often unnecessarily send some smaller packets
(both for 1500 and 9000 MTUs), because 8kB doesn't neatly divide into
the MTU. Looking e.g. at the IP packet sizes for a query returning a
larger result, there are dips where our 8KB buffer + disabling Nagle
implies a packet boundary.
I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c.
It might be better to just use larger send sizes however. I think most
kernels are going to be better than us knowing how to chop up the send
size. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for windows), and I don't recall seeing any
problem reports about that.
OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send buffer size.
> Does it make sense to make this parameter configurable?
I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.