Re: Why is pq_begintypsend so slow?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Fetter <david(at)fetter(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is pq_begintypsend so slow?
Date: 2020-01-14 22:45:20
Message-ID: 20200114224520.gh4qktr5kuhu5pyi@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-01-13 15:18:04 -0800, Andres Freund wrote:
> On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> > I wrote:
> > > I saw at this point that the remaining top spots were
> > > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > > with inlining them (again, see patch below). That *did* move
> > > the needle: down to 72691 ms, or 20% better than HEAD.
> >
> > Oh ... marking the test in the inline part of enlargeStringInfo()
> > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> > Might be over-optimizing for this particular case, perhaps
> >
> > (But ... I'm not finding these numbers to be super reproducible
> > across different ASLR layouts. So take it with a grain of salt.)
>
> FWIW, I've also observed, in another thread (the node func generation
> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> allows the compiler to sometimes optimize away the strlen. But if
> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> unconditionally, successive appends cannot optimize away memory accesses
> for ->len/->data.

With a set of patches doing so, int4send itself is not a significant
factor for my test benchmark [1] anymore. The assembly looks about as
good as one could hope, I think:

# save rbx on the stack
0x00000000004b7f90 <+0>: push %rbx
0x00000000004b7f91 <+1>: sub $0x20,%rsp
# store integer to be sent into rbx
0x00000000004b7f95 <+5>: mov 0x20(%rdi),%rbx
# palloc length argument
0x00000000004b7f99 <+9>: mov $0x9,%edi
0x00000000004b7f9e <+14>: callq 0x5d9aa0 <palloc>
# store integer in buffer (ebx is 4 byte portion of rbx)
0x00000000004b7fa3 <+19>: movbe %ebx,0x4(%rax)
# store varlena header
0x00000000004b7fa8 <+24>: movl $0x20,(%rax)
# restore stack and rbx registerz
0x00000000004b7fae <+30>: add $0x20,%rsp
0x00000000004b7fb2 <+34>: pop %rbx
0x00000000004b7fb3 <+35>: retq

All the $0x20 constants are a bit confusing, but they just happen to be
the same for int4send. It's the size of the stack frame,
offset for FunctionCallInfoBaseData->args[0], the varlena header (and then the stack
frame again) respectively.

Note that I had to annotate palloc with __attribute__((malloc)) to make
the compiler understand that palloc's returned value will not alias with
anything problematic (e.g. the potential of aliasing with fcinfo
prevents optimizing to the above without that annotation). I think such
annotations would be a good idea anyway, precisely because they allow
the compiler to optimize code significantly better.

These together yields about a 1.8x speedup for me. The profile shows
that the overhead now is overwhelmingly elsewhere:
+ 26.30% postgres postgres [.] CopyOneRowTo
+ 13.40% postgres postgres [.] tts_buffer_heap_getsomeattrs
+ 10.61% postgres postgres [.] AllocSetAlloc
+ 9.26% postgres libc-2.29.so [.] __memmove_avx_unaligned_erms
+ 7.32% postgres postgres [.] SendFunctionCall
+ 6.02% postgres postgres [.] palloc
+ 4.45% postgres postgres [.] int4send
+ 3.68% postgres libc-2.29.so [.] _IO_fwrite
+ 2.71% postgres postgres [.] heapgettup_pagemode
+ 1.96% postgres postgres [.] AllocSetReset
+ 1.83% postgres postgres [.] CopySendEndOfRow
+ 1.75% postgres libc-2.29.so [.] _IO_file_xsputn@@GLIBC_2.2.5
+ 1.60% postgres postgres [.] ExecStoreBufferHeapTuple
+ 1.57% postgres postgres [.] DoCopyTo
+ 1.16% postgres postgres [.] memcpy(at)plt
+ 1.07% postgres postgres [.] heapgetpage

Even without using the new pq_begintypesend_ex()/initStringInfoEx(), the
generated code is still considerably better than before, yielding a
1.58x speedup. Tallocator overhead unsurprisingly is higher:
+ 24.93% postgres postgres [.] CopyOneRowTo
+ 17.10% postgres postgres [.] AllocSetAlloc
+ 10.09% postgres postgres [.] tts_buffer_heap_getsomeattrs
+ 6.50% postgres libc-2.29.so [.] __memmove_avx_unaligned_erms
+ 5.99% postgres postgres [.] SendFunctionCall
+ 5.11% postgres postgres [.] palloc
+ 3.95% postgres libc-2.29.so [.] _int_malloc
+ 3.38% postgres postgres [.] int4send
+ 2.54% postgres postgres [.] heapgettup_pagemode
+ 2.11% postgres libc-2.29.so [.] _int_free
+ 2.06% postgres postgres [.] MemoryContextReset
+ 2.02% postgres postgres [.] AllocSetReset
+ 1.97% postgres libc-2.29.so [.] _IO_fwrite
+ 1.47% postgres postgres [.] DoCopyTo
+ 1.14% postgres postgres [.] ExecStoreBufferHeapTuple
+ 1.06% postgres libc-2.29.so [.] _IO_file_xsputn@@GLIBC_2.2.5
+ 1.04% postgres libc-2.29.so [.] malloc

Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead of
appendBinaryStringInfo in CopySend* gains another 1.05x.

This does result in some code growth, but given the size of the
improvements, and that the improvements are significant even without
code changes to callsites, that seems worth it.

before:
text data bss dec hex filename
8482739 172304 204240 8859283 872e93 src/backend/postgres
after:
text data bss dec hex filename
8604300 172304 204240 8980844 89096c src/backend/postgres

Regards,

Andres

[1]
CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL, c06 int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL);
INSERT INTO lotsaints4 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints4;
COPY lotsaints4 TO '/dev/null' WITH binary;

CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL, c06 int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL);
INSERT INTO lotsaints8 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints8;
COPY lotsaints8 TO '/dev/null' WITH binary;

Attachment Content-Type Size
v1-0001-stringinfo-Move-more-functions-inline-provide-ini.patch text/x-diff 13.1 KB
v1-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch text/x-diff 14.1 KB
v1-0003-pqformat-Move-functions-for-type-sending-inline-a.patch text/x-diff 4.3 KB
v1-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch text/x-diff 1.3 KB
v1-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch text/x-diff 1.3 KB
v1-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch text/x-diff 1.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-01-14 22:47:30 Re: Avoid full GIN index scan when possible
Previous Message Tom Lane 2020-01-14 22:42:51 Re: Add FOREIGN to ALTER TABLE in pg_dump