Re: Latin vs non-Latin words in text search parsing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latin vs non-Latin words in text search parsing
Date: 2007-10-22 14:36:04
Message-ID: 6225.1193063764@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> I like the "aword" name more than "lword", BTW. If we change the meaning
>> of the classes, surely we can change the name as well, right?

> I'm not very familiar with the use case here. Is there a good reason to want
> to abbreviate these names? I think I would expect "ascii", "word", and "token"
> for the three categories Tom describes.

Please look at the first nine rows of the table here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
It's not clear to me where we'd go with the names for the
hyphenated-word and hyphenated-word-part categories. Also, ISTM that
we should use related names for these three categories, since they are
all considered valid parts of hyphenated words.

Another point: "token" is probably unreasonably confusing as a name for
a token type. "Is that a token token or a word token?"

Maybe "aword", "word", and "numword"?

regards, tom lane
