Re: Latin vs non-Latin words in text search parsing

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Latin vs non-Latin words in text search parsing
Date:	2007-10-22 14:26:44
Message-ID:	6046.1193063204@sss.pgh.pa.us
Lists:	pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Alvaro Herrera wrote:
>> lword   Entirely letters per iswalpha, with at least one ASCII
>> nlword  Entirely letters per iswalpha
>> word    Entirely alphanumeric per iswalnum, but not nlword

> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.

Right. That's not an objection in itself, since you can just add the
same dictionary mappings to both token types, but the question is when
would such a distinction actually be useful? AFAICS the only case where
it'd make sense to put different mappings on lword and nlword with the
above definitions is when dealing with Russian or similar languages,
where the entire alphabet is non-ASCII. However, my proposal (pure
ASCII vs not pure ASCII) seems to work just as well for that case as
this proposal does.
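
To make the comparison concrete, here is a minimal Python sketch of the two categorizations. It is an illustration only, not the parser's code: Python's Unicode-based str.isalpha() and str.isalnum() stand in for the locale-dependent iswalpha and iswalnum, and the labels returned by classify_pure_ascii ("ascii-only", "not-ascii-only", "alnum") are hypothetical names, since the message does not name the token types of that proposal.

```python
def classify_quoted(token):
    """Categorization quoted above: lword / nlword / word."""
    if token.isalpha():
        # lword needs at least one ASCII letter; nlword has none.
        if any(ch.isascii() for ch in token):
            return "lword"
        return "nlword"
    if token.isalnum():
        return "word"
    return None  # not a word-like token in this sketch


def classify_pure_ascii(token):
    """Alternative: pure ASCII vs. not pure ASCII (labels are hypothetical)."""
    if token.isalpha():
        return "ascii-only" if token.isascii() else "not-ascii-only"
    if token.isalnum():
        return "alnum"
    return None


if __name__ == "__main__":
    for tok in ["hello", "привет", "naïve", "beta1"]:
        print(tok, classify_quoted(tok), classify_pure_ascii(tok))
```

Under these assumptions the two schemes agree on an all-Cyrillic token and on a pure-ASCII token. They differ only on mixed tokens such as "naïve", which the quoted scheme puts in lword and the pure-ASCII scheme puts in the non-ASCII category.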

> ... I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there's any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.

Well, that's exactly why we *should* distinguish words-with-digits;
it's unlikely that any standard dictionary will do sane things with
them, so if you want to index them they need to go down a different
dictionary chain.
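
The idea of a per-token-type dictionary chain can be sketched as follows. This is a toy model in Python, assuming that each dictionary either returns a list of lexemes or returns None to pass the token on to the next dictionary. Every name in it (toy_language_dict, simple_dict, CHAINS, and the token type labels "letters" and "alnum") is hypothetical and chosen for illustration.

```python
STOP_WORDS = {"the", "a"}


def toy_language_dict(token):
    """Hypothetical language dictionary: only understands alphabetic tokens."""
    if not token.isalpha():
        return None  # not recognized, let the next dictionary try
    lowered = token.lower()
    if lowered in STOP_WORDS:
        return []  # recognized as a stop word: index nothing
    # Crude plural stripping as a stand-in for real stemming.
    if lowered.endswith("s") and len(lowered) > 3:
        return [lowered[:-1]]
    return [lowered]


def simple_dict(token):
    """Hypothetical catch-all dictionary: lowercases and accepts anything."""
    return [token.lower()]


# Each token type gets its own chain; words with digits skip the
# language dictionary entirely.
CHAINS = {
    "letters": [toy_language_dict, simple_dict],
    "alnum": [simple_dict],
}


def normalize(token_type, token):
    """Return the lexemes from the first dictionary that recognizes the token."""
    for dictionary in CHAINS[token_type]:
        result = dictionary(token)
        if result is not None:
            return result
    return []


if __name__ == "__main__":
    print(normalize("letters", "Parsers"))
    print(normalize("alnum", "Beta1"))
```

Keeping a separate token type for words with digits is what makes it possible to give "Beta1" a chain of its own instead of feeding it to a dictionary built for natural-language words.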

A more drastic change would be to not treat a string like "beta1"
as a single token at all, so that the alphanumeric-word category
would go away entirely. However I'm disinclined to tinker with
the parser that much. It's seen enough use in the contrib module
that I'm prepared to grant that the design is generally useful.
I'm just worried that the subcategories of "word" need a bit of
adjustment for languages other than Russian and English.

regards, tom lane

In response to

Re: Latin vs non-Latin words in text search parsing at 2007-10-22 08:58:03 from Heikki Linnakangas
