Re: Bunch of tsearch fixes and cleanup

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Patches" <pgsql-patches(at)postgresql(dot)org>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Subject:	Re: Bunch of tsearch fixes and cleanup
Date:	2007-08-24 11:39:52
Message-ID:	46CEC388.1050301@enterprisedb.com
Lists:	pgsql-patches

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Something that was annoying me yesterday was that it was not clear
>> whether we had fixed every single place that uses a tsearch config file
>> to assume that the file is in UTF8 and should be converted to database
>> encoding. So I was thinking of hardwiring the "recode" part into
>> readstopwords, and using wordop just for the "lowercase" part, which
>> seemed to me like a saner division of labor. That is, UTF8 is a policy
>> that we want to enforce globally, but lowercasing maybe not, and this
>> still leaves the door open for more processing besides lowercasing.
>
> I think we also want to always run input files through pg_verify_mbstr.
> We do it for stopwords, and synonym files (though incorrectly), but not
> for thesaurus files or ispell files. It's probably best to do that
> within the recode-function as well.

Ok, here's an updated version of the patch.

- ispell initialization crashed on empty dictionary file
- ispell initialization crashed on affix file with prefixes but no suffixes
- stop words file was run through pg_verify_mbstr with the database
encoding, even though it's later interpreted as UTF-8. It is now
verified as UTF-8, regardless of database encoding.

- introduces new t_readline function that reads a line from a file,
verifies that it's valid UTF-8, and converts it to database encoding.
Modified all places that read tsearch config files to use this function
instead of fgets directly.

- readstopwords now sorts the stop words after loading them. Removed the
separate sortstopwords function.

- moved the wordop input parameter from the StopList struct to a direct
argument of readstopwords. That seems cleaner to me: the struct is now
purely an output of readstopwords, not mixed input/output.
readstopwords now recodes the input implicitly using t_readline.

- bunch of comments added, typos fixed, and other cleanup
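
To illustrate the t_readline item above, here is a minimal standalone sketch of the "read a line, verify it is UTF-8" step. It is not the code from the patch: the names is_valid_utf8 and read_utf8_line are hypothetical, the return-code convention is an assumption, and the conversion to database encoding is only marked by a comment because it depends on backend routines that are not available outside the server.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return true if the first len bytes of s are well-formed UTF-8. */
bool
is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        n;    /* number of continuation bytes expected */
        unsigned int  cp;   /* decoded code point */
        unsigned int  min;  /* smallest code point allowed for this length */

        if (c < 0x80)
        {
            i++;            /* plain ASCII */
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;   /* stray continuation byte or invalid lead byte */

        if (len - i <= n)
            return false;   /* sequence is cut off at the end of the input */

        for (size_t k = 1; k <= n; k++)
        {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += n + 1;
    }
    return true;
}

/*
 * Hypothetical stand-in for a t_readline-like helper.
 * Returns 1 if a valid line was read into buf, 0 at end of file,
 * and -1 if the line is not valid UTF-8.
 */
int
read_utf8_line(FILE *fp, char *buf, size_t bufsize)
{
    if (fgets(buf, (int) bufsize, fp) == NULL)
        return 0;
    if (!is_valid_utf8((const unsigned char *) buf, strlen(buf)))
        return -1;
    /* A backend version would convert buf to the database encoding here. */
    return 1;
}
```

One caveat of this sketch: a line longer than the buffer is split by fgets, and the split can land in the middle of a multibyte character, which would then be reported as invalid.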
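
The two readstopwords items above (sorting after load, and wordop as a direct argument with the struct as pure output) can be sketched together. This is an illustration under assumptions, not the patch itself: StopListSketch, wordop_fn, load_stop_words, is_stop_word and the other names are hypothetical, input comes from an in-memory array rather than a file, lowercasing is ASCII-only, and using binary search for lookups is my assumption about why the list is kept sorted.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Output-only result; it carries no input parameters. */
typedef struct
{
    size_t  len;
    char  **stop;
} StopListSketch;

/* Hypothetical per-word hook; returns a newly allocated string or NULL. */
typedef char *(*wordop_fn)(const char *word);

char *
copy_word(const char *word)
{
    size_t n = strlen(word);
    char  *out = malloc(n + 1);

    if (out != NULL)
        memcpy(out, word, n + 1);
    return out;
}

char *
lowercase_ascii(const char *word)
{
    char *out = copy_word(word);

    if (out != NULL)
        for (char *p = out; *p != '\0'; p++)
            *p = (char) tolower((unsigned char) *p);
    return out;
}

static int
compare_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

void
free_stop_words(StopListSketch *list)
{
    for (size_t i = 0; i < list->len; i++)
        free(list->stop[i]);
    free(list->stop);
    list->stop = NULL;
    list->len = 0;
}

/*
 * Fill *list from the given lines, applying wordop to each word
 * (NULL keeps words unchanged), then sort the result.
 * Returns 0 on success and -1 on allocation failure.
 */
int
load_stop_words(const char *const *lines, size_t nlines,
                wordop_fn wordop, StopListSketch *list)
{
    list->len = 0;
    list->stop = calloc(nlines > 0 ? nlines : 1, sizeof(char *));
    if (list->stop == NULL)
        return -1;

    for (size_t i = 0; i < nlines; i++)
    {
        char *w = wordop != NULL ? wordop(lines[i]) : copy_word(lines[i]);

        if (w == NULL)
        {
            free_stop_words(list);
            return -1;
        }
        list->stop[list->len++] = w;
    }

    /* Sorting inside the loader removes the need for a separate sort call. */
    qsort(list->stop, list->len, sizeof(char *), compare_words);
    return 0;
}

/* Binary search over the sorted list. */
bool
is_stop_word(const StopListSketch *list, const char *word)
{
    return bsearch(&word, list->stop, list->len,
                   sizeof(char *), compare_words) != NULL;
}
```

With this shape a caller passes the hook at the call site, for example load_stop_words(lines, n, lowercase_ascii, &list), and the struct never has to be pre-populated before loading.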

PS. It's a bank holiday here in the UK on Monday, so I won't be around
until Tuesday if something comes up.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Re: Bunch of tsearch fixes and cleanup at 2007-08-23 20:30:05 from Heikki Linnakangas

Responses

Re: Bunch of tsearch fixes and cleanup at 2007-08-24 11:40:50 from Heikki Linnakangas
Re: Bunch of tsearch fixes and cleanup at 2007-08-24 15:36:31 from Tom Lane
