From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(at)paquier(dot)xyz>, Sutou Kouhei <kou(at)clear-code(dot)com> |
Subject: | confusing / inefficient "need_transcoding" handling in copy |
Date: | 2024-02-06 02:05:04 |
Message-ID: | 20240206020504.edijzczkgd25ek6z@awork3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
Looking at the profiles in [1], and similar profiles locally, made me wonder
why a basic COPY TO shows pg_server_to_any() and the strlen() to compute the
length of the to-be-converted string so heavily in profiles. Example
profile, for [2]:
-   88.11%    12.02%  postgres  postgres  [.] CopyOneRowTo
   - 76.09% CopyOneRowTo
      - 37.24% CopyAttributeOutText
         + 14.25% __strlen_evex
         + 2.76% pg_server_to_any
         + 0.03% 0xffffffff82a00c86
      + 31.82% OutputFunctionCall
      + 2.98% CopySendEndOfRow
      + 2.75% appendBinaryStringInfo
      + 0.58% MemoryContextReset
      + 0.02% 0xffffffff82a00c86
   + 12.01% standard_ExecutorRun
   + 0.02% PostgresMain
In the basic cases the client and server encoding should be the same after
all, so why do we need to do any conversion?
The code has a comment about this:
    /*
     * Set up encoding conversion info. Even if the file and server encodings
     * are the same, we must apply pg_any_to_server() to validate data in
     * multibyte encodings.
     */
    cstate->need_transcoding =
        (cstate->file_encoding != GetDatabaseEncoding() ||
         pg_database_encoding_max_length() > 1);
I don't really understand why we need to validate anything during COPY TO?
Which is good, because it turns out that we don't actually validate anything,
as pg_server_to_any() returns without doing anything if the encoding matches:
    if (encoding == DatabaseEncoding->encoding ||
        encoding == PG_SQL_ASCII)
        return unconstify(char *, s); /* assume data is valid */
This means that the strlen() we do in the call to pg_server_to_any(), which on
its own takes 14.25% of the cycles, computes something that will never be
used.
Unsurprisingly, only doing transcoding when encodings differ yields a sizable
improvement, about 18% for [2].
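To make the idea concrete, here is a minimal, self-contained sketch of the two
rules side by side. It is not the actual patch: every sketch_* name is a
hypothetical stand-in for the corresponding PostgreSQL code, the encoding ids
are made up, and the conversion itself is stubbed out. It only illustrates
that with matching encodings the current rule still pays for a strlen() per
attribute, while a rule based purely on "encodings differ" does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding ids, not PostgreSQL's real pg_enc values. */
typedef enum
{
    SKETCH_SQL_ASCII = 0,
    SKETCH_UTF8,
    SKETCH_LATIN1
} sketch_encoding;

/* Counts how many times the sketch computes a string length. */
static size_t sketch_strlen_calls = 0;

/*
 * Stand-in for pg_server_to_any(): returns the input untouched when the
 * target encoding matches the server encoding or is SQL_ASCII, mirroring
 * the early return quoted above. Real conversion is omitted (assumption).
 */
static const char *
sketch_server_to_any(const char *s, size_t len,
                     sketch_encoding dest, sketch_encoding server)
{
    (void) len;                 /* never looked at on the early-return path */
    if (dest == server || dest == SKETCH_SQL_ASCII)
        return s;
    return s;                   /* a real implementation would convert here */
}

/* Rule as currently written: also true for any multibyte server encoding. */
static bool
sketch_need_transcoding_current(sketch_encoding file_enc,
                                sketch_encoding server, int max_len)
{
    return file_enc != server || max_len > 1;
}

/* Rule suggested for COPY TO: transcode only when the encodings differ. */
static bool
sketch_need_transcoding_proposed(sketch_encoding file_enc,
                                 sketch_encoding server)
{
    return file_enc != server;
}

/* Stand-in for the start of the per-attribute output path. */
static const char *
sketch_output_attribute(const char *s, bool need_transcoding,
                        sketch_encoding file_enc, sketch_encoding server)
{
    assert(s != NULL);
    if (need_transcoding)
    {
        sketch_strlen_calls++;  /* the length is computed before the call */
        return sketch_server_to_any(s, strlen(s), file_enc, server);
    }
    return s;
}
```

With UTF8 on both sides and a maximum character length of 4, the current rule
sets the flag and every attribute goes through strlen() plus a no-op call,
whereas the proposed rule skips both.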
I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.
Greetings,
Andres Freund
[1] https://www.postgresql.org/message-id/ZcGE8LrjGW8pmtOf%40paquier.xyz
[2] COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 1000000::int4)) TO '/dev/null';