Re: Proposal: PqSendBuffer removal

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Aleksei Ivanov <iv(dot)alekseii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: PqSendBuffer removal
Date: 2020-03-11 03:44:37
Message-ID: CAMsr+YGTYrNYSV=iTeVDrMeYSmy_dMJfabqHzoNUtOMz5b9e9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

faifaiOn Sat, 7 Mar 2020 at 02:45, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv(dot)alekseii(at)gmail(dot)com> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is
> >copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.

Good to know.

> >Besides, if we already have a ready message packet to be sent why
> >should we
> >wait?
>
> In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.

Right. We presently set `TCP_NODELAY` to disable Nagle's algorithm, so
we'd tend to generate these messages as separate packets as well as
seperate syscalls, making it doubly undesirable.

We can dynamically twiddle TCP_NODELAY like many other applications do
to optimise the wire packets, but not the syscalls. It'd actually cost
extra syscalls.

> But in some paths/workloads the copy is quite noticable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.
>
> Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.
>
> I can imagine that bring useful in a number of places. And because there only would be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.

That sounds pretty sensible as it'd be minimally intrusive.

I've wondered whether we can further optimise some uses by having
libpq-be manage an iovec for us instead, much like we support iovec
for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a
libpq-managed struct. libpq would reserve the first few entries in the
managed iovec for the message header. Variants of pq_sendint64(...)
etc would add entries to the iovec and could be inlined since they'd
just be convenience routines. The message would be flushed by having
libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so we could have libpq keep a
cursor for the next entry in the iovec. For libpq-fe it'd also contain
a map of which iovec entries need to be free()d; for libpq-be we'd
probably palloc(), maybe with a child memory context. To support a
stack-allocated iovec for when we know all the message fields in
advance we'd have an init function that takes the address of the
preallocated iovec and its length limit.

We could support We could also support a libpq-wrapped iovec where
libpq can realloc it if it fills.
stack-allocated iovec - so the caller would be responsible for
managing the max size

That way we can do zero-copy scatter-gather I/O for messages that
don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also official bless the
usage pattern where we wrap a buffer in a StringInfo so we can use
pq_getmsgint64 etc on it. Add a initConstStringInfo(StringInfo si,
const char * buf, size_t buflen) or something that assigns the
StringInfo values and sets maxlen = -1. The only in-core usage I see
for this so far is in src/backend/replication/logical/worker.c but
it's used extremely heavily in pglogical etc. It'd just be a
convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why
should we bother? Are we solving an actual problem here? PostgreSQL is
immensely memory-allocation-happy and copy-happy. Shouldn't we be more
interested in things like reducing the cost of multiple copies and
transform passes of Datum values? Especially since that's an actual
operational pain point when you're working with multi-hundred-megabyte
bytea or text fields.

Can you come up with some profiling/performance numbers that track
time spent on memory copying in the areas you propose to target, plus
malloc overheads? With a tool like systemtap or perf it should not be
overly difficult to do so by making targeted probes that filter based
on callstack, or on file / line-range or function.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-03-11 04:10:02 Re: Add an optional timeout clause to isolationtester step.
Previous Message Andy Fan 2020-03-11 03:44:23 Re: Index Skip Scan