Re: pg_dump/restore encoding woes

From: Amit Khandekar <amit(dot)khandekar(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_dump/restore encoding woes
Date: 2013-09-25 07:19:55
Message-ID: CACoZds2K+imiraAZNx85eb4ibxnK2EJ3mNJJ=5RP=XAh2HaWGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27 August 2013 20:06, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 26.08.2013 18:59, Tom Lane wrote:
>
>> Heikki Linnakangas<hlinnakangas(at)**vmware(dot)com <hlinnakangas(at)vmware(dot)com>>
>> writes:
>>
>> The pg_dump -E option just sets client_encoding, but I think it would be
>>> better for -E to only set the encoding used in the dump, and
>>> PGCLIENTENCODING env variable (if set) was used to determine the
>>> encoding of the command-line arguments. Opinions?
>>>
>>
>> I think this is going to be a lot easier said than done, but feel free
>> to see if you can make it work. (As you point out, we don't have
>> any client-side encoding conversion infrastructure, but I don't see
>> how you're going to make this work without it.)
>>
>
> First set client_encoding to PGCLIENTENCODING (ie. let libpq do its
> default thing), and run the queries that fetch the OIDs of any matching
> tables. Only after doing that, set client_encoding to that specified by -E.
> That's actually pretty easy to do, as pg_dump already issues all the
> queries that include user-given strings first.
>
> There's one small wrinkle in that: if the dump fails because the server
> sends an error, the error would come from the server in the -E encoding,
> because that's used as the client_encoding after the initial queries.
> Logically, the errors should be printed in PGCLIENTENCODING. But I think we
> can live with that.
>
> The pg_restore part is more difficult, as pg_restore needs to work without
> a database connection at all. There the conversion has to be done
> client-side.
>
>
> A second issue is whether we should divorce -E and PGCLIENTENCODING like
>> that, when they have always meant the same thing. You mentioned the
>> alternative of looking at pg_dump's locale environment to determine the
>> command line encoding --- would that be better?
>>
>
> Hmm. I wasn't thinking of looking at the locale environment as an
> alternative to the divorce, just as a way to determine the default for the
> client encoding if PGCLIENTENCODING is not set.
>
> It would be good for pg_dump to be consistent with other tools (reindexdb
> etc.). If we say that LC_CTYPE determines the encoding of command-line
> options, regardless of PGCLIENTENCODING, we should do the same in other
> tools, too. And that would be weird for the other tools. So no, I don't
> think we should do that.
>
> My plan is to assume that the command-line options are in the "client
> encoding", and the client encoding is determined by the following things,
> in this order:
>
> 1. client_encoding setting given explicitly in the command line, as part
> of the connection string.
> 2. PGCLIENTENCODING
> 3. LC_CTYPE
> 4. same as server_encoding
>
> The encoding to be used in the pg_dump output is a different concept, and
> there are reasons to *not* want the dump to be in the client encoding.
> Hence using server_encoding for that would be a better default than the
> current client encoding. This isn't so visible today, as client_encoding
> defaults to server_encoding anyway, but it will be if we set
> client_encoding='auto' (ie. look at LC_CTYPE) by default.
>
> There are certainly cases where you'd want to use the client encoding as
> the dump output encoding. For example if you pipe the pg_dump output to
> some custom script that mangles it. Or if you just want to look at the
> output, ie. if the output goes to a terminal. But for the usual case of
> taking a backup, using the server encoding makes more sense. One option is
> to check if the output is a tty (like psql does), and output the dump in
> client or server encoding depending on that (if -E is not given).
>
> So yeah, I think we should divorce -E and PGCLIENTENCODING. It feels that
> using PGCLIENTENCODING to specify the output encoding was more accidental
> than on purpose. Looking at the history, the implementation has been that
> way forever, but the docs were adjusted by commit b524cb36 in 2005 to
> document that fact.
>
> Here is a set of three patches that I've been working on:
>
> 0001-Divorce-pg_dump-E-option-**from-PGCLIENTENCODING.patch
>
> Separates pg_dump -E from PGCLIENTENCODING.
>

Since AH->encoding is now used only to represent dump encoding, we should
rename it as AH->dump_encoding.

-----

The standard_conforming_strings parameter is currently set in
setup_connection. You have moved it in setup_dump_encoding(). Is it in any
way related to encoding ? If not, I think we should keep it in
setup_connection().

-----

If a user supplies pg_dump --role=ROLENAME, we do the SET-ROLE in
setup_connection. The role name is in client encoding. In case of parallel
pg_dump, the setup_connection() in the parent process does it right, but in
the worker processes, the client_encoding is already set to the
dump_encoding when -E is given (parent has already updated it's
client_encoding to dump_encoding and I think the child inherits
client_encoding from parent), and then in the child when SET-ROLE is called
through setup_connection, it does not work because role name is in client
encoding, not dump encoding.

$ pg_dump -t "pöö" -E LATIN1 --role=uöö postgres
# Above succeeds

$ `which pg_dump` -t "pöö" -E LATIN1 -j 5 -f dumpdir -Fd postgres
--role=uöö
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [parallel archiver] query was: SET ROLE "uöö"
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist
pg_dump: [archiver (db)] query failed: ERROR: role "uöö" does not exist

>
> 0002-Set-client_encoding-auto-**in-all-client-utilities.patch

> Set client_encoding='auto' in all the client utilities, like we already
> did in psql. This fixes e.g "reindexdb -t" with a table name with non-ASCII
> chars
>
>
This patch looks good, I don't have any issues with this.

> 0003-Convert-object-names-to-**archive-encoding-before-matc.**patch
>
> Use iconv(3) in pg_restore to do encoding conversion in the client. This
> involves a lot of autoconf changes that I'm not 100% sure about, other than
> that it's pretty straightforward.

I haven't looked into this one yet.

>
>
> - Heikki
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc Mamin 2013-09-25 07:33:02 invalid regexp crashes the server on windows or 9.3
Previous Message Erik Rijkers 2013-09-25 06:39:55 Re: Minmax indexes