Binary COPY IN size reduction

From: Lőrinc Pap <lorinc(at)gradle(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Binary COPY IN size reduction
Date: 2020-04-24 10:53:00
Message-ID: CAMyrAsfDqStKEaywjNWrs2GAyMFoCQd_EoAXNOVS5JfVTRhncg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey,

Our application sends millions of rows to the database every hour
using the COPY
IN protocol.
We've switched recently from TEXT based COPY to the BINARY one.
We've noticed a slight performance increase, mostly because we don't need
to escape the content anymore.

Unfortunately the binary protocol's output ended up being slightly bigger
than the text one (e.g. for one payload it's *373MB* now, was *356MB* before
)
We would like to share our thoughts on how we may be able to improve that,
if you're open to suggestions.

It's possible our request is related to what the doc already refers to as:

> It is anticipated that a future extension might add a header field that
> allows per-column format codes to be specified.

----

Currently every row in BINARY defines the number of columns (2 bytes) and
every column defines its size (4 bytes per column) - see
https://www.postgresql.org/docs/12/sql-copy.html#id-1.9.3.55.9.4.6.
NULL values are currently sent as a two byte -1 value.

Given that BINARY can't do any type conversion anyway, we should be able to
deduce the expected size of most columns - while keeping the size prefixes
for the dynamic ones (e.g. BYTEA or TEXT).

The extension part of the header (
https://www.postgresql.org/docs/12/sql-copy.html#id-1.9.3.55.9.4.5:~:text=Header%20extension%20area%20length)
would allow us to keep this backwards compatible by switching between the
two versions.
If we don't want to use this part of the header for the BINARY format,
maybe we could add a FIXED modifier to the COPY IN sql definition?

Or alternatively if we don't want to deduce their counts and sizes for some
reason, could we get away with just sending it once and having every row
follow the single header?

----

By skipping the column count and sizes for every row, in our example this
change would reduce the payload to *332MB* (most of our payload is binary,
lightweight structures consisting of numbers only could see a >*2x*
decrease in size).

For dynamic content, where we have to provide the size in advance we could
send that in variable length encoding
<https://en.wikipedia.org/wiki/Variable-length_code> instead (e.g. the sign
bit could signal whether the next byte is still part of the size). Variable
length sizes would allow us to define a special NULL character as well.
In our case this change would reduce our payload further to *317MB.*

In summary, these proposed changes would allow us to reduce the payload
size by roughly *15% -* but would expect even greater gains in general.

Thanks,
* Lőrinc Pap*

--
Lőrinc Pap
Senior Software Engineer
<https://gradle.com/>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-04-24 10:56:31 Re: Setting min/max TLS protocol in clientside libpq
Previous Message Jehan-Guillaume de Rorthais 2020-04-24 10:15:26 Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators