Re: HTML tags and tsearch2

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Joanna Sharman <Joanna(dot)Sharman(at)ed(dot)ac(dot)uk>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: HTML tags and tsearch2
Date: 2008-06-26 12:05:09
Message-ID: Pine.LNX.4.64.0806261602120.11363@sn.sai.msu.ru
Lists: pgsql-general

On Thu, 26 Jun 2008, Joanna Sharman wrote:

> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.

Two options: write your own parser that handles the HTML tags, or
preprocess the text (strip the tags) before calling to_tsvector.
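
As a rough illustration of the preprocessing route, here is a minimal
Python sketch that could run on the client before the text is passed to
to_tsvector. The helper name strip_tags and the list of inline tags are
assumptions made for this example; nothing here is part of tsearch2.
Inline tags are deleted outright so the surrounding characters join into
one word, while every other tag becomes a space so that neighbouring
words stay separate.

```python
import re

# Assumed set of tags that can occur inside a word; adjust to your data.
INLINE_TAGS = {"sub", "sup", "b", "i", "em", "strong", "span"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_tags(text):
    """Remove HTML tags: inline tags vanish, all other tags become a space."""
    def replace(match):
        return "" if match.group(1).lower() in INLINE_TAGS else " "
    return TAG_RE.sub(replace, text)


if __name__ == "__main__":
    print(strip_tags("K<sub>ir</sub>"))
```

The cleaned string, rather than the raw HTML, is then what gets indexed
and what the search term 'kir' is matched against. A regular expression
is not a full HTML parser, so comments, scripts, and malformed markup
would need separate handling.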

>
> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary, or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
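
To show the transformation being asked for, here is a small Python
sketch that splits a token wherever letters and digits meet. It is
neither dict_regex nor a tsearch2 dictionary, only an illustration of
the rule such a dictionary (or a client-side preprocessing step) would
apply; the function name split_alnum is made up for this example.

```python
import re

# A run of letters, or a run of digits; anything else acts as a separator.
PIECE_RE = re.compile(r"[^\W\d_]+|\d+")


def split_alnum(token):
    """Split a token into alternating letter and digit runs."""
    return PIECE_RE.findall(token)


if __name__ == "__main__":
    print(split_alnum("TM8"))
```

For 'TM8' this yields the pieces 'TM' and '8', which could be joined
with spaces before the text is handed to to_tsvector.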

>
> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)

Consider upgrading to 8.3, where full-text search is part of the core server.

>
> Many thanks in advance,
> Joanna
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
