Re: Error while loading sql file

From: Bèrto ëd Sèra <berto(dot)d(dot)sera(at)gmail(dot)com>
To: Adarsh Sharma <adarsh(dot)sharma(at)orkash(dot)com>
Cc: Alban Hertroys <haramrae(at)gmail(dot)com>, pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Error while loading sql file
Date: 2011-12-27 12:20:53
Message-ID: CAKwGa_96Y6c9GPNX3Ed7GQjH3jH5PmTm8_g-YUQgKhBsECiNsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>
> HI Ardash!
>
> INSERT INTO conceptnet_frame
> VALUES(3884,'ja','{1}は{2}を持っている。',16,3,2140,NULL,NULL,NULL);
> ....
>

Can you still access the database that produced the dump? If so, you may
want to produce a number of dumps for distinct language values. Japanese,
in particular, is a very complicated issue, as Japanese change alphabet to
produce the same effect Latin writers get by using bold and italic. So text
searches are quite a nightmare, since DOG is written in different alphabets
(Latin included), depending on what the author had in mind, in terms of
underlining this specific word/expression.

If you could manage to insulate single languages things would definitely
get less complex, as you can track down specific problems related to a
single language (or maybe to an external source that inserted weird
encodings). If the original DB is not available you could still filter the
dump file for such values (although this is a very boring experience).

One of the good things with this approach (in case you need to do heavy
text searches on your data) is that (as of PG 9.1) you can model structures
like:

CREATE COLLATION "de_DE.utf8" (
LC_COLLATE = "de_DE.utf8",
LC_CTYPE = "de_DE.utf8"
);
CREATE COLLATION "en_GB.utf8" (
LC_COLLATE = "en_GB.utf8",
LC_CTYPE = "en_GB.utf8"
);
# NOTE!! all these locales must pre-exist on your box, you are simply
importing them into PG with this CREATE phase.

CREATE TABLE conceptnet_frame_root (
id BIGINT NOT NULL PRIMARY KEY,
ISO693_1_code CHAR(2) NOT NULL
);

CREATE TABLE conceptnet_frame_en (
linguistic_content TEXT COLLATE "en_GB.utf8" NOT NULL
) INHERITS (conceptnet_frame_root);

CREATE TABLE conceptnet_frame_de (
linguistic_content TEXT COLLATE "de_DE.utf8" NOT NULL
) INHERITS (conceptnet_frame_root);

Then have a view pick up the entire linguistic bouquet, if needed. Sadly
you cannot leave 'linguistic_content' in the root table, because you won't
be able to use an alter table on it later, to alter the collation at single
inherited table level, so you are responsible of ensuring type consistency
on your own. It does add complication to the model, however, if you are
into serious conflicts among different languages this is the best solution
I could come up with, thus far.

You may want to have a look at:
http://www.postgresql.org/docs/9.1/static/collation.html

BTW, ISO 639-1 is a dangerous standard to use, if you are to make extensive
language coverage, see
http://en.wikipedia.org/wiki/ISO_639

Hope this helps
Bèrto

--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Filip Rembiałkowski 2011-12-27 12:27:52 Re: Show <parameter> in psql does any calculations?
Previous Message robins.tharakan 2011-12-27 10:47:11 Re: Why does index not use for CTE query?