pg_dump / copy bugs with "big lines" ?

From: Ronan Dunklau <ronan(dot)dunklau(at)dalibo(dot)com>
To: 'pgsql-hackers' <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_dump / copy bugs with "big lines" ?
Date: 2015-03-30 10:46:47
Message-ID: 1836813.YmyOrS99PX@ronan.dunklau.fr
Lists: pgsql-hackers

Hello hackers,

I've tried my luck on pgsql-bugs before, with no success, so I'm reporting these
problems here.

The documentation mentions the following size limits:

Maximum Field Size 1 GB
Maximum Row Size 1.6 TB

However, it seems like rows bigger than 1GB can't be COPYed out:

ro=# create table test_text (c1 text, c2 text);
CREATE TABLE
ro=# insert into test_text (c1) VALUES (repeat('a', 536870912));
INSERT 0 1
ro=# update test_text set c2 = c1;
UPDATE 1

Then, trying to dump or copy that results in the following error:

ro=# COPY test_text TO '/tmp/test';
ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 536870913 bytes by 536870912
more bytes.

In fact, the same thing happens when using a simple SELECT:

ro=# select * from test_text ;
ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 536870922 bytes by 536870912
more bytes.

In the case of COPY, the server uses a StringInfo buffer to build each output
row. The problem is that a StringInfo is capped at MaxAllocSize (1GB - 1),
while a row should be able to hold much more than that.
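
For reference, the check that raises this error seems to be in
enlargeStringInfo() (src/backend/lib/stringinfo.c), which refuses to grow the
buffer past MaxAllocSize. A rough paraphrase of it (from memory, not the exact
upstream code):

/* Paraphrased sketch of src/backend/lib/stringinfo.c, not the exact source */
#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/memutils.h"     /* MaxAllocSize */

void
enlargeStringInfo(StringInfo str, int needed)
{
    /*
     * The buffer may never grow past MaxAllocSize (1GB - 1), which is what
     * produces the "Cannot enlarge string buffer" error above, regardless
     * of how large a row as a whole is allowed to be.
     */
    if (((Size) needed) >= (MaxAllocSize - (Size) str->len))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("out of memory"),
                 errdetail("Cannot enlarge string buffer containing %d bytes by %d more bytes.",
                           str->len, needed)));

    /* ... otherwise the buffer is repalloc'd to a larger size ... */
}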

So, is this a bug? Or is there a caveat I've missed in the documentation?

We also hit a second issue, this time related to bytea encoding.

This test case is a bit more complicated, since I had to use an external
(client) program to insert my data. It involves inserting one string that fits
into 1GB when encoded in escape format but is larger than that in hex, and
another string that fits into 1GB in hex format but is larger than that in
escape:

from psycopg2 import connect
from io import BytesIO

conn = connect(dbname="ro")
cur = conn.cursor()
fullcontent = BytesIO()

# Write a binary string that weighs less
# than 1 GB when escape encoded, but more than
# that if hex encoded
for i in range(200):
    fullcontent.write(b"aaa" * 1000000)
fullcontent.seek(0)
cur.copy_from(fullcontent, "test_bytea")

fullcontent.seek(0)
fullcontent.truncate()

# Write another binary string that weighs
# less than 1GB when hex encoded, but more than
# that if escape encoded
cur.execute("SET bytea_output = 'hex'")
fullcontent.write(b"\\\\x")
for i in range(300):
    fullcontent.write(b"00" * 1000000)
fullcontent.seek(0)
cur.copy_from(fullcontent, "test_bytea")

cur.execute("COMMIT;")
cur.close()

I couldn't find an invocation of pg_dump that would allow me to dump both
rows:

ro(at)ronan_laptop /tmp % PGOPTIONS="-c bytea_output=escape" pg_dump -Fc >
/dev/null
pg_dump: Dumping the contents of table "test_bytea" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR: invalid memory alloc request size
1200000001
pg_dump: The command was: COPY public.test_bytea (c1) TO stdout;
ro(at)ronan_laptop /tmp % PGOPTIONS="-c bytea_output=hex" pg_dump -Fc >
/dev/null
pg_dump: Dumping the contents of table "test_bytea" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR: invalid memory alloc request size
1200000003
pg_dump: The command was: COPY public.test_bytea (c1) TO stdout;

Using COPY with the binary format works (presumably because the raw bytes,
about 600MB and 300MB per row here, are sent without any text encoding and so
stay under the limit):

ro=# COPY test_bytea TO '/tmp/test' WITH BINARY;

There seems to be a third issue, with regard to the escape encoding: COPY
escapes the backslash character by adding another backslash. This means that a
field whose escape-encoded size is just under 1GB can no longer be output once
the backslashes themselves are escaped.

For example, let's consider a string consisting of 300000000 '\' characters:

ro=# select length(c1) from test_bytea;
length
-----------
300000000
(1 row)

ro=# select length(encode(c1, 'escape')) from test_bytea ;
length
-----------
600000000
(1 row)

ro=# set bytea_output to escape;
SET
ro=# copy test_bytea to '/tmp/test.csv' ;

ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 1073741822 bytes by 1 more
bytes.
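
To spell out the arithmetic as I understand it (a throwaway illustration, not
PostgreSQL code):

/* Illustrative arithmetic only: why this escape-encoded field cannot be COPYed */
#include <stdio.h>

int main(void)
{
    long long stored        = 300000000LL;       /* '\' bytes stored in the bytea field     */
    long long escape_output = stored * 2;        /* bytea escape output: '\' becomes "\\"   */
    long long copy_text     = escape_output * 2; /* COPY text format escapes each '\' again */
    long long max_alloc     = 0x3fffffffLL;      /* MaxAllocSize = 1GB - 1                  */

    printf("escape-encoded field: %lld bytes\n", escape_output); /*  600000000 */
    printf("COPY text output:     %lld bytes\n", copy_text);     /* 1200000000 */
    printf("MaxAllocSize:         %lld bytes\n", max_alloc);     /* 1073741823 */
    return 0;
}

So the field needs roughly 1.2GB in the output buffer, well past MaxAllocSize,
even though the stored value and its escape encoding are both under 1GB.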

I think pg_dump should not error out on any data that was valid upon
insertion. It seems the fix would be non-trivial, since StringInfo structures
rely on the MaxAllocSize limit. Or am I missing something?

Thank you.


--
Ronan Dunklau
http://dalibo.com - http://dalibo.org
