Re: TODO item -- Improve psql's handling of multi-line

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, pgsql-patches(at)postgresql(dot)org
Subject: Re: TODO item -- Improve psql's handling of multi-line
Date: 2006-02-12 03:46:58
Message-ID: 200602120346.k1C3kwI19249@candle.pha.pa.us
Lists: pgsql-patches


Oh, that seems like a serious problem. I don't think all our encodings
guarantee that the bytes after the first byte of a multibyte character
are non-control characters. Some of the Chinese encodings come to mind.

Here is a comment from copy.c:

* Multi-byte encodings: all supported client-side encodings encode multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is TRUE
* when we have to do it the hard way.

Note that a client-side encoding can use ASCII byte values as the
non-first bytes of a multi-byte character. I think that is the problem,
and you can see how copy.c uses pg_encoding_mblen() to skip over any
control characters embedded in a multi-byte sequence.
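
As an illustration of the false-match problem the copy.c comment describes,
here is a minimal sketch. It does not use the real pg_encoding_mblen();
toy_mblen() is a hypothetical stand-in for a made-up two-byte encoding whose
trailing byte may hold any value, and find_byte_naive() and find_byte_mb()
are names invented for this example.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_encoding_mblen(): a toy two-byte encoding in
 * which a lead byte has the high bit set and the trailing byte may be any
 * value, including one in the ASCII range. Not a real PostgreSQL encoding.
 */
static int
toy_mblen(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/* Naive scan: compares every byte, so it can match a trailing byte. */
static long
find_byte_naive(const unsigned char *buf, size_t len, unsigned char target)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == target)
            return (long) i;
    return -1;
}

/* Encoding-aware scan: steps over whole characters, never into the middle. */
static long
find_byte_mb(const unsigned char *buf, size_t len, unsigned char target)
{
    size_t i = 0;

    while (i < len)
    {
        int clen = toy_mblen(buf + i);

        /* Only a single-byte character can be a genuine match. */
        if (clen == 1 && buf[i] == target)
            return (long) i;
        i += (size_t) clen;
    }
    return -1;
}
```

For the bytes 0x41 0x8F 0x01 0x42 0x01, the naive scan for 0x01 stops at
offset 2, which is the trailing byte of a two-byte character, while the
encoding-aware scan reports offset 4, the only standalone 0x01.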

I don't think there is any safe byte value in every multi-byte case
except NUL.

FYI, I see it broken now if I exit psql and restart it and look at the
history.
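
The breakage is consistent with the explanation in the quoted message below:
anything that treats a history entry as an ordinary C string stops at the
first zero byte. A minimal sketch of that effect, using only strlen() and no
readline calls; visible_after_encoding() is a name made up for this example:

```c
#include <stddef.h>
#include <string.h>

/*
 * Replace each '\n' in buf[0..len) with the separator byte sep, in place.
 * buf must be NUL-terminated at buf[len].  Returns the number of bytes a
 * C-string consumer (strlen, fputs, ...) would still see afterwards.
 */
static size_t
visible_after_encoding(char *buf, size_t len, char sep)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            buf[i] = sep;
    return strlen(buf);
}
```

With the 16-byte entry "select 1\nfrom t;", a '\0' separator leaves only
the 8 bytes of "select 1" visible, whereas a '\x01' separator keeps all 16.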

Can we use 0x01 and prefix the history with some kind of tag which
indicates if 0x01 appeared in the original string and, if so, suppress
the \n conversion?
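
One way such a tag could look, as a rough sketch only. Everything here is an
assumption made for illustration: the per-entry one-byte tag, the 'E' and 'R'
tag values, and the function names hist_encode() and hist_decode(). None of
it is taken from the patch.

```c
#include <stdlib.h>
#include <string.h>

#define HIST_SEP    '\x01'  /* byte substituted for '\n' */
#define TAG_ENCODED 'E'     /* hypothetical tag: '\n' was converted */
#define TAG_RAW     'R'     /* hypothetical tag: entry stored unchanged */

/*
 * Build the stored form of a history entry: one tag byte, then the text.
 * If the entry already contains HIST_SEP anywhere (even as the trailing byte
 * of a multi-byte character), conversion is suppressed and the entry is
 * stored raw.  Returns malloc'd memory, or NULL on allocation failure.
 */
static char *
hist_encode(const char *entry)
{
    size_t  len = strlen(entry);
    int     raw = (strchr(entry, HIST_SEP) != NULL);
    char   *out = malloc(len + 2);

    if (out == NULL)
        return NULL;
    out[0] = raw ? TAG_RAW : TAG_ENCODED;
    memcpy(out + 1, entry, len + 1);
    if (!raw)
        for (char *p = out + 1; *p; p++)
            if (*p == '\n')
                *p = HIST_SEP;
    return out;
}

/* Reverse hist_encode().  Returns malloc'd memory, or NULL on failure. */
static char *
hist_decode(const char *stored)
{
    size_t  len;
    char   *out;

    if (stored[0] == '\0')
        return NULL;            /* no tag byte: not a valid stored entry */
    len = strlen(stored + 1);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, stored + 1, len + 1);
    if (stored[0] == TAG_ENCODED)
        for (char *p = out; *p; p++)
            if (*p == HIST_SEP)
                *p = '\n';
    return out;
}
```

A raw entry keeps its literal newlines, so in this sketch it would still
span several lines in the history file; only entries free of 0x01 get the
single-line treatment.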

---------------------------------------------------------------------------

Sergey E. Koposov wrote:
> On Sat, 11 Feb 2006, Bruce Momjian wrote:
> >
> > Modified patch attached and applied. Thanks.
> >
> > I adjusted based on Tom's comments to use a zero byte, and to clean up
> > the formatting. I didn't see any extra non-readline overhead, just
> > calls to functions that are no-ops in non-readline cases.
>
> Thank you, Bruce, for modifying and applying the patch (for two months I
> didn't find time to make those formatting corrections myself).
>
> > Tom Lane wrote:
> > > "Sergey E. Koposov" <math(at)sai(dot)msu(dot)ru> writes:
> > > > On Wed, 7 Dec 2005, Andrew Dunstan wrote:
> > > >> A zero byte is probably a pretty bad choice. Some other low valued byte
> > > >> (e.g. \x01 ) would probably work better.
> > >
> > > > Currently I replace '\n' with the '\x01' as Andrew suggested.
> > >
> > > Won't this get confused by some of the Far Eastern encodings we support?
> > > The zero-byte approach is at least proof against that. But what we need
> > > to ask is whether we can expect readline to cope with either.
>
> But as for your zero-byte change, it currently just breaks
> everything (as I expected, which is why I didn't implement it). The
> problem with using a zero byte is that it breaks the readline functions
> read_history and write_history. Those functions deal with ordinary C
> strings, so putting a zero byte inside them just truncates everything
> (that's exactly what occurs with the psql from CVS).
>
> So, I don't know. There are two alternatives. One is to use the 0x01 byte
> instead. (I don't really agree with Tom's comments about possible
> problems with using 0x01 with some exotic encodings. For example, I did a
> test like this:
> cat ./pgsql/src/backend/utils/mb/Unicode/*.map |grep 01
> cat ./pgsql/src/backend/utils/mb/Unicode/*.map |grep '0x1'
> and it didn't produce any output, so it seems to me that the 0x01 byte is
> not used by any encoding, and the UTF encodings certainly do not use the
> 0x01 byte.)
> The second alternative is to write our own implementations of the
> read_history and write_history functions, instead of using the standard
> readline implementations, to deal with zero bytes in the strings. But that
> seems to be a rather bad solution....
>
> Regards,
> Sergey
>
> *****************************************************
> Sergey E. Koposov
> Max Planck Institute for Astronomy
> Web: http://lnfm1.sai.msu.ru/~math
> E-mail: math(at)sai(dot)msu(dot)ru
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
