Combining chars in psql (pre-patch)

From:	Patrice Hédé <phede-ml(at)islande(dot)org>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Combining chars in psql (pre-patch)
Date:	2001-09-26 18:23:33
Message-ID:	20010926202333.P1316@idf.net
Lists:	pgsql-hackers

Hi,

I have been working a bit on a patch for the combining-character
display problem in psql. The patch is far from ready for inclusion;
it's just for comments...

By the way, can someone tell me how to generate nice patches showing
the difference between my version and the CVS code I downloaded ?
I'm new to this (I've only used CVS for personal projects so far, and
I don't need to send patches to myself ;) ).

The good things in this patch :

- it works for me :)

- I've used Markus Kuhn's implementation of wcwidth.c : it is locale
independent, and is in the public domain. :) [if we keep it, I'll
have to tell him, though !]

- No dependency on the local libc's UTF-8-awareness ;) [I've seen that
psql has no such dependency, at least in print.c, so I haven't added
any]. Actually, the change is completely self-contained.

- I've made my own UTF-8 -> UCS converter, since I couldn't find one
without a copyright notice yesterday. It checks for invalid and
non-optimal (overlong) UTF-8 sequences, as required by Unicode 3.0.1
(or 3.1, I don't remember).

- it works for Japanese (and, I believe, other "full-width" characters).

- if MULTIBYTE is not defined, the code doesn't change from the
committed version.
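
To illustrate the kind of checks mentioned above, here is a minimal
sketch of a UTF-8 -> UCS decoder that rejects invalid and overlong
("non-optimal") sequences. It is not the code from the attached
pg_mb_utf8.c: the name utf8_decode and its signature are hypothetical,
and it assumes the modern four-byte limit (U+10FFFF) and rejects
surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): decode one UTF-8 sequence
 * starting at s, with at most len bytes available.  On success, store
 * the code point in *out and return the number of bytes consumed (1-4).
 * Return 0 for an invalid, truncated, or overlong sequence.
 */
static int
utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t cp;   /* code point being assembled */
    uint32_t min;  /* smallest value allowed for this sequence length */
    int n, i;

    assert(s != NULL && out != NULL);

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        n = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        n = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        n = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or 0xF8-0xFF */
    }

    if (len < (size_t) n)
        return 0;                           /* truncated sequence */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a 10xxxxxx continuation */
        cp = (cp << 6) | (uint32_t) (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;                           /* overlong ("non-optimal") form */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* surrogates are not valid here */
    if (cp > 0x10FFFF)
        return 0;                           /* beyond the Unicode range */

    *out = cp;
    return n;
}
```

A caller would loop over the string, advancing by the returned length
and stopping (or truncating) at the first 0.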

The not so good things :

- I've made my own UTF-8 -> UCS converter... It seems to work fine,
but it has not been tested well enough, so it may not be very robust.

- printf( "%*s", width, utfstr) doesn't work as expected, because the
field width counts bytes rather than display columns, so I had to work
around it with printf( "%*s%s", width - utfstrwidth, "", utfstr);

- everything is inside #ifdef MULTIBYTE/#endif . Since there is no
dependency on anything else (including the rest of the multibyte
implementation - which I haven't had the time to look at in detail),
the code doesn't actually need MULTIBYTE to be defined.

- I get this (for each call to pg_mb_utfs_width) and I don't know why :

print.c:265: warning: passing arg 1 of `pg_mb_utfs_width' discards
qualifiers from pointer target type

- If pg_mb_utfs_width finds an invalid UTF-8 string, it truncates it.
I suppose that's what we want to do, but that's probably not the
best place to do it.
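
The padding workaround above can be sketched as follows. This is only
an illustration under simplifying assumptions: utf8_columns_simple and
format_right_aligned are hypothetical names, and the width function
just counts code points (one column each), whereas the patch uses a
wcwidth-based width. Both functions take a const char *, which is the
usual way to avoid a "discards qualifiers" warning when the caller
holds a const pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Simplified stand-in for a real width function: count code points by
 * skipping UTF-8 continuation bytes (10xxxxxx).  Assumes valid UTF-8 and
 * one column per code point (no combining or full-width handling).
 */
static int
utf8_columns_simple(const char *s)
{
    int cols = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            cols++;
    return cols;
}

/*
 * Right-align s in a field of `width` display columns, writing the
 * result into buf.  The padding is printed as an empty string with an
 * explicit field width, followed by the UTF-8 string itself.
 */
static int
format_right_aligned(char *buf, size_t bufsize, int width, const char *s)
{
    int pad = width - utf8_columns_simple(s);

    /* A negative '*' width would be read as left-justification. */
    if (pad < 0)
        pad = 0;
    return snprintf(buf, bufsize, "%*s%s", pad, "", s);
}
```

For example, "álíta" is 5 columns but 7 bytes in UTF-8, so a plain
"%6s" adds no padding at all, while format_right_aligned with a width
of 6 adds the one missing space.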

The bad things :

- If MULTIBYTE is defined, the strings must be in UTF-8; the code
doesn't check the encoding at all.

- it is not integrated at all with the rest of the MB code.

- it doesn't respect the indentation policy ;)

To do :

- integrate better with the rest of the MB code (client-side encoding),
and with the rest of the code in print.c .

- seriously verify the robustness of the utf8-to-ucs conversion.

- make the code visually nicer :)

- find better function names.

And possibly :

- consolidate the code, in order to remove the need for the #ifdef's
in many places.

- make it work with other multi-width encodings (but then, I
don't know anything about these encodings myself !).

- also check the UTF-8 stream at input time, so that no invalid UTF-8
is sent to the backend (at least from psql - the backend will also
need strict checking of UTF-8).

- add nice UTF-8 borders as an option :)

- add a command-line parameter to treat Unicode Ambiguous characters
(characters which can be narrow or wide, depending on the terminal)
as wide characters, as seems to be the case on CJK terminals (as per
TR#11).

- What else ?
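
As a rough idea of how the Ambiguous-width option in the list above
could look, here is a small sketch. Everything in it is hypothetical:
the names, the ambiguous_is_wide flag, and the choice to pass in the
width already computed by a wcwidth-style function. The table is
deliberately tiny (Greek capitals only); a real one would cover every
East Asian Ambiguous range from TR#11.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative subset only: Greek capital letters, which are classed
 * as East Asian Ambiguous.  A complete table would be much longer.
 */
static bool
is_ambiguous_sample(uint32_t ucs)
{
    return (ucs >= 0x0391 && ucs <= 0x03A1) ||
           (ucs >= 0x03A3 && ucs <= 0x03A9);
}

/*
 * base_width is the width from a wcwidth-style function (assumed to be
 * computed elsewhere).  When the hypothetical ambiguous_is_wide option
 * is set, narrow Ambiguous characters are reported as 2 columns.
 */
static int
column_width(uint32_t ucs, int base_width, bool ambiguous_is_wide)
{
    assert(base_width >= -1 && base_width <= 2);

    if (ambiguous_is_wide && base_width == 1 && is_ambiguous_sample(ucs))
        return 2;
    return base_width;
}
```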

BTW, here is the table I had in the first mail. I would have shown the
one with all the weird Unicode characters, but my mutt is configured
with iso-8859-15, and I doubt many of you have utf-8 as a default yet
:)

+------+-------+--------+
| lang | text  | text   |
+------+-------+--------+
| isl  | álíta | áleit  |
| isl  | álíta | álitum |
| isl  | álíta | álitið |
| isl  | maður | mann   |
| isl  | maður | mönnum |
| isl  | maður | manna  |
| isl  | óska  | -aði   |
+------+-------+--------+

The attached files :
- a diff for pgsql/src/bin/psql/print.c
- a diff for pgsql/src/bin/psql/Makefile
- two new files :
  pgsql/src/bin/psql/pg_mb_utf8.c
  pgsql/src/bin/psql/pg_mb_utf8.h

Have fun !

Patrice

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----

Attachment	Content-Type	Size
pg_mb_utf8.c	text/x-csrc	6.7 KB
pg_mb_utf8.h	text/x-chdr	617 bytes

Responses

Re: Combining chars in psql (pre-patch) at 2002-02-22 18:07:49 from Bruce Momjian
Re: Combining chars in psql (pre-patch) at 2002-03-06 21:16:52 from Bruce Momjian