Re: Bunch of tsearch fixes and cleanup

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Patches" <pgsql-patches(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Bunch of tsearch fixes and cleanup
Date: 2007-08-24 11:40:50
Message-ID: 46CEC3C2.50100@enterprisedb.com
Lists: pgsql-patches

And here's the attachment I forgot.

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Something that was annoying me yesterday was that it was not clear
>>> whether we had fixed every single place that uses a tsearch config file
>>> to assume that the file is in UTF8 and should be converted to database
>>> encoding. So I was thinking of hardwiring the "recode" part into
>>> readstopwords, and using wordop just for the "lowercase" part, which
>>> seemed to me like a saner division of labor. That is, UTF8 is a policy
>>> that we want to enforce globally, but lowercasing maybe not, and this
>>> still leaves the door open for more processing besides lowercasing.
>> I think we also want to always run input files through pg_verify_mbstr.
>> We do it for stopwords, and synonym files (though incorrectly), but not
>> for thesaurus files or ispell files. It's probably best to do that
>> within the recode-function as well.
>
> Ok, here's an updated version of the patch.
>
> - ispell initialization crashed on empty dictionary file
> - ispell initialization crashed on affix file with prefixes but no suffixes
> - stop words file was run through pg_verify_mbstr using the database
> encoding, but it's later interpreted as being UTF-8. Now verifies that
> it's UTF-8, regardless of database encoding.
>
>
> - introduces new t_readline function that reads a line from a file,
> verifies that it's valid UTF-8, and converts it to database encoding.
> Modified all places that read tsearch config files to use this function
> instead of fgets directly.
>
> - readstopwords now sorts the stop words after loading them. Removed the
> separate sortstopwords function.
>
> - moved the wordop-input parameter from StopList struct to a direct
> argument to readstopwords. Seems cleaner to me that way, the struct is
> now purely an output of readstopwords, not mixed input/output.
> readstopwords now recodes the input implicitly using t_readline.
>
> - bunch of comments added, typos fixed, and other cleanup
>
> PS. It's bank holiday here in the UK on Monday, so I won't be around
> until Tuesday if something comes up.
>
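For anyone reading along without the patch open, the t_readline idea described above (read a line, check that it is valid UTF-8, then recode) could be sketched roughly as follows. This is a standalone illustration, not the code in the patch: utf8_is_valid and read_utf8_line are hypothetical names, the validator is hand-written instead of calling pg_verify_mbstr, and the conversion to database encoding is only marked by a comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8. */
bool
utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        size_t        need;   /* continuation bytes expected after the lead */
        unsigned long cp;     /* code point decoded so far */
        unsigned long min;    /* smallest code point legal for this length */

        if (c < 0x80)
        {
            i++;              /* plain ASCII */
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;     /* stray continuation byte or invalid lead byte */

        if (len - i <= need)
            return false;     /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= need; k++)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* reject overlong forms, surrogates and values above U+10FFFF */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += need + 1;
    }
    return true;
}

/*
 * Hypothetical stand-in for a t_readline-style function.
 * Reads one line into out (at most outsz - 1 bytes), strips the line
 * terminator and validates the result as UTF-8.
 * Returns 1 on success, 0 at end of file, -1 if the line is not valid UTF-8.
 * Lines longer than the buffer are not handled in this sketch.
 */
int
read_utf8_line(FILE *fp, char *out, size_t outsz)
{
    size_t len;

    if (fgets(out, (int) outsz, fp) == NULL)
        return 0;

    len = strcspn(out, "\r\n");
    out[len] = '\0';

    if (!utf8_is_valid((const unsigned char *) out, len))
        return -1;

    /* A real implementation would convert to the database encoding here. */
    return 1;
}
```

The point of funnelling every config-file reader through one such function is that the "files are UTF-8" policy is enforced in a single place, rather than being repeated (or forgotten) next to each fgets call.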

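The readstopwords change described in the quoted text (wordop passed as a direct argument, the struct used purely as output, and sorting done at load time) might look something like the sketch below. All names here are hypothetical stand-ins, not the patch's actual definitions; in particular the wordop signature is assumed to be an in-place transformation, and reading from the file is left out so the example stays self-contained.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the StopList struct: output only. */
typedef struct
{
    size_t len;    /* number of entries in stop */
    char **stop;   /* sorted array of words, owned by the caller */
} StopListSketch;

/* Comparator for qsort/bsearch over an array of char pointers. */
int
cmp_word(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Example wordop: lowercase ASCII letters in place. */
void
lowercase_ascii(char *w)
{
    for (; *w != '\0'; w++)
        *w = (char) tolower((unsigned char) *w);
}

/*
 * Fill *s from an already-read array of words.  wordop is a plain
 * argument (may be NULL) and is applied to every word before sorting,
 * so the list is ready for binary search as soon as loading returns.
 */
void
stoplist_load(StopListSketch *s, char **words, size_t n,
              void (*wordop) (char *))
{
    if (wordop != NULL)
    {
        for (size_t i = 0; i < n; i++)
            wordop(words[i]);
    }

    s->stop = words;
    s->len = n;

    if (n > 1)
        qsort(s->stop, s->len, sizeof(char *), cmp_word);
}

/* Binary search; relies on stoplist_load having sorted the array. */
bool
stoplist_contains(const StopListSketch *s, const char *word)
{
    if (s->len == 0)
        return false;
    return bsearch(&word, s->stop, s->len, sizeof(char *), cmp_word) != NULL;
}
```

Sorting inside the loader means no caller can forget the separate sort step before doing lookups, which is presumably the motivation for removing sortstopwords.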
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
tsearch-fixes-2.patch text/x-diff 50.4 KB
