Combining chars in psql (pre-patch)

From: Patrice Hédé <phede-ml(at)islande(dot)org>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Combining chars in psql (pre-patch)
Date: 2001-09-26 18:23:33
Message-ID: 20010926202333.P1316@idf.net
Lists: pgsql-hackers
Hi,

I have been working a bit on a patch for that problem in psql. The
patch is far from ready for inclusion; it's just for comments...

By the way, can someone tell me how to generate nice patches showing
the difference between one's own version and the cvs code that was
downloaded ? I'm new to this (I've only used cvs for personal projects
so far, and I don't need to send patches to myself ;) ).

The good things in this patch :

- it works for me :)

- I've used Markus Kuhn's implementation of wcwidth.c : it is locale
  independent, and is in the public domain. :) [if we keep it, I'll
  have to tell him, though !]

- No dependency on the local libc's UTF-8-awareness ;) [I've seen that
  psql has no such dependency, at least in print.c, so I haven't added
  any]. Actually, the change is completely self-contained.

- I've made my own UTF-8 -> UCS converter, since I couldn't find one
  without a copyright notice yesterday. It checks for invalid and
  non-optimal (overlong) UTF-8 sequences, as required by Unicode 3.0.1
  (or 3.1, I don't remember).

- it works for Japanese (and I believe other "full-width" characters).

- if MULTIBYTE is not defined, the code doesn't change from the
  committed version.
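
To make the "invalid and non-optimal sequences" check concrete, here is a
minimal sketch of such a decoder. It is not the code from the attached
pg_mb_utf8.c : the name utf8_decode, its signature, and the choice to accept
at most 4-byte sequences and to reject surrogates are assumptions made for
this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most len bytes available. On success, store the code point in *out and
 * return the number of bytes consumed (1 to 4). Return 0 if the sequence
 * is invalid, truncated, or overlong ("non-optimal").
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t cp;    /* code point being assembled */
    uint32_t min;   /* smallest value allowed for this sequence length */
    size_t   need;  /* total number of bytes in the sequence */
    size_t   i;

    if (len == 0)
        return 0;

    if (s[0] < 0x80)
    {
        *out = s[0];            /* plain ASCII */
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        need = 2; cp = s[0] & 0x1F; min = 0x80;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        need = 3; cp = s[0] & 0x0F; min = 0x800;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        need = 4; cp = s[0] & 0x07; min = 0x10000;
    }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (len < need)
        return 0;               /* truncated sequence */

    for (i = 1; i < need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (uint32_t) (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;               /* overlong encoding, e.g. C0 80 for U+0000 */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid in UTF-8 */
    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode range */

    *out = cp;
    return need;
}
```

A caller would walk a string by advancing by the returned length, and treat a
return value of 0 as the point where the string stops being valid.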

The not so good things :

- I've made my own utf-8 -> ucs converter... It seems to work fine,
  but it's not tested well enough, it may not be so robust.

- printf( "%*s", width, utfstr) doesn't work as expected (the padding
  is computed from the byte count, not the display width), so I had
  to work around it with printf( "%*s%s", width - utfstrwidth, "", utfstr);

- everything is inside #ifdef MULTIBYTE/#endif, even though the code
  has no dependency on anything else (including the rest of the
  multibyte implementation, which I haven't had the time to look at
  in detail).

- I get this (for each call to pg_mb_utfs_width) and I don't know why :

  print.c:265: warning: passing arg 1 of `pg_mb_utfs_width' discards
  qualifiers from pointer target type

- If pg_mb_utfs_width finds an invalid UTF-8 string, it truncates it.
  I suppose that's what we want to do, but that's probably not the
  best place to do it.
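
The printf workaround mentioned in the list above can be sketched as follows.
This is only an illustration : utf8_columns and format_right_aligned are
hypothetical names, and the width function simply counts code points (one
column each, valid UTF-8 assumed), whereas the patch relies on wcwidth so that
full-width and combining characters are measured properly.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a real width function: count the bytes that
 * are not UTF-8 continuation bytes, i.e. the number of code points.
 * Assumes valid UTF-8 and one terminal column per code point.
 */
size_t
utf8_columns(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/*
 * Write str right-aligned in a field of `width` columns into buf.
 * The padding is printed separately as an empty string with an explicit
 * field width, so it is based on columns rather than on bytes.
 * Returns what snprintf returns (the number of bytes required).
 */
int
format_right_aligned(char *buf, size_t bufsize, int width, const char *str)
{
    int pad = width - (int) utf8_columns(str);

    /* a negative field width would mean "left-justify" to printf */
    if (pad < 0)
        pad = 0;

    return snprintf(buf, bufsize, "%*s%s", pad, "", str);
}
```

With "álíta" (7 bytes, 5 columns) and a width of 8, this produces 3 spaces of
padding, where a plain "%8s" would produce only 1. The parameters are declared
const char * on purpose : GCC's "discards qualifiers" message typically appears
when a const-qualified pointer is passed to a parameter declared without
const, so a const parameter would avoid it, assuming that is the cause of the
warning quoted above.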

The bad things :

- If MULTIBYTE is defined, the strings must be in UTF-8, it doesn't
  check any encoding.

- it is not integrated at all with the rest of the MB code.

- it doesn't respect the indentation policy ;)


To do :

- integrate better with the rest of the MB (client-side encoding), and
  with the rest of the code of print.c .

- verify utf8-to-ucs robustness seriously.

- make the code visually nicer :)

- find better function names.

And possibly :

- consolidate the code, in order to remove the need for the #ifdef's
  in many places.

- make it work with some other multi-width encodings (but then, I
  don't know anything about these encodings myself !).

- also check the UTF-8 stream at input time, so that no invalid UTF-8
  is sent to the backend (at least from psql; the backend will also
  need strict UTF-8 checking).

- add nice UTF-8 borders as an option :)

- add a command-line parameter to treat Unicode Ambiguous characters
  (characters which can be narrow or wide, depending on the terminal)
  as wide characters, as seems to be the case on CJK terminals (as
  per TR#11).

- What else ?


BTW, here is the table I had in the first mail. I would have shown the
one with all the weird Unicode characters, but my mutt is configured
with iso-8859-15, and I doubt many of you have utf-8 as a default yet
:)

+------+-------+--------+
| lang | text  |  text  |
+------+-------+--------+
| isl  | álíta | áleit  |
| isl  | álíta | álitum |
| isl  | álíta | álitið |
| isl  | maður | mann   |
| isl  | maður | mönnum |
| isl  | maður | manna  |
| isl  | óska  | -aði   |
+------+-------+--------+


The attached files :
- a diff for pgsql/src/bin/psql/print.c
- a diff for pgsql/src/bin/psql/Makefile
- two new files :
  pgsql/src/bin/psql/pg_mb_utf8.c
  pgsql/src/bin/psql/pg_mb_utf8.h

Have fun !

Patrice

-- 
Patrice HÉDÉ ------------------------------- patrice à islande org -----
  --  Isn't it weird  how scientists  can imagine  all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
  -- What would _you_ call the creation of the universe ?
  -- "The HORRENDOUS SPACE KABLOOIE !"               - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----

Attachment: pg_mb_utf8.h
Description: text/x-chdr (617 bytes)
Attachment: pg_mb_utf8.c
Description: text/x-csrc (6.7 KB)
