Re: Why does backend send buffer size hardcoded at 8KB?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Artemiy Ryabinkov <getlag(at)ya(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Why does backend send buffer size hardcoded at 8KB?
Date: 2019-07-27 20:52:43
Message-ID: 20190727205243.l5xyhsg2y7aasgdd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why backend send buffer use exactly 8KB? (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
>
>
> I had this question when I try to measure the speed of reading data. The
> bottleneck was a read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm, that network packets have size 8192 (https://pastebin.com/FD8abbiA)

Well, in most setups, you can't have that large frames. The most common
limit is 1500 +- some overheads. Using jumbo frames isn't that uncommon,
but it has enough problems that I don't think it's that widely used with
postgres.

> So, with well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes.

But the reason you're seeing 8192 sized packages isn't just that we have
an 8kb buffer, I think it's also that that we unconditionally set
TCP_NODELAY:

#ifdef TCP_NODELAY
on = 1;
if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
(char *) &on, sizeof(on)) < 0)
{
elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
return STATUS_ERROR;
}
#endif

With 8KB send size, we'll often unnecessarily send some smaller packets
(both for 1500 and 9000 MTUs), because 8kB doesn't neatly divide into
the MTU. Here's e.g. the ip packet sizes for a query returning maybe
18kB:

1500
1500
1500
1500
1500
1004
1500
1500
1500
1500
1500
1004
1500
414

the dips are because that's where our 8KB buffer + disabling nagle
implies a packet boundary.

I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c

It might be better to just use larger send sizes however. I think most
kernels are going to be better than us knowing how to chop up the send
size. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for windows), and I don't recall seeing any
problem reports about that.

OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packages regardless of the
send() size.

> Does it make sense to make this parameter configurable?

I'd much rather not. It's goign to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andres Freund 2019-07-27 21:08:50 Re: Why does backend send buffer size hardcoded at 8KB?
Previous Message Peter J. Holzer 2019-07-27 17:18:53 Re: Default ordering option