Re: english parser in text search: support for multiple words in the same position

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: english parser in text search: support for multiple words in the same position
Date: 2010-12-23 05:35:30
Message-ID: AANLkTin+XiewXD396WMqr-Pnk9QOHday3OTTM3MyS7SR@mail.gmail.com
Lists: pgsql-hackers

Just a reminder that this patch is about how to break URLs, emails, etc.
into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> [ sorry for not responding on this sooner, it's been hectic the last
> couple weeks ]
>
> Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> >> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do. Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
>
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part. I don't
> have the details in my head anymore, though I recall having traced
> through it in the past. Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.
>

I did look around for compound word support in postgres. In particular, I
read the documentation and code in tsearch/spell.c that seems to implement
the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags
2. Specify a flag file that provides prefix/suffix operations on those flags
3. Flag z indicates that a word in the dictionary can participate in
compound word splitting
4. When a token matches words specified in the dictionary (after applying
prefix/suffix operations), the matching words are emitted as sub-words of
the token (i.e., the compound word)
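
To make the steps above concrete, here is a toy sketch of dictionary-driven
splitting. It is only an illustration of my understanding, not the code in
tsearch/spell.c: the dictionary contents and the function name are made up,
and prefix/suffix handling is left out entirely.

```python
# Toy illustration only: this is NOT the algorithm in tsearch/spell.c.
# The dictionary contents and function name below are hypothetical.

# Each dictionary word maps to its set of flags; "z" marks words that
# may take part in compound word splitting (step 3 above).
DICTIONARY = {
    "foot": {"z"},
    "ball": {"z"},
    "game": set(),  # a known word, but not allowed inside compounds
}


def split_compound(token, dictionary=DICTIONARY):
    """Return the sub-words of token, or None if the token cannot be
    covered completely by dictionary words carrying the "z" flag."""
    if not token:
        return []
    # Try every possible head; recurse on the remainder of the token.
    for end in range(1, len(token) + 1):
        head = token[:end]
        if "z" in dictionary.get(head, set()):
            rest = split_compound(token[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None
```

With this dictionary, "football" splits into "foot" and "ball", while a
token containing any piece that is not a flagged dictionary word is not
split at all.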

If my above understanding is correct, then I think it will not be possible
to implement url/email splitting using the compound word support.

The main reason is that compound word support requires a PREDETERMINED
dictionary of words. So to split a url/email we would need to provide a
list of *all possible* host names and user names. I do not think that is
feasible.
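
As a rough illustration of the difference, the sketch below splits a
url/email purely on its delimiters, with no word list involved. This is
not the code from the patch; the function name and the choice of
"anything that is not a letter or digit" as the delimiter set are
assumptions made just for this example.

```python
import re

# Hypothetical helper for illustration; not taken from the patch.
# Assumption: any run of non-alphanumeric characters is a delimiter.
DELIMITERS = re.compile(r"[^A-Za-z0-9]+")


def split_on_delimiters(token):
    """Split a url/email token into components using only its
    punctuation, so no dictionary of host or user names is needed."""
    return [part for part in DELIMITERS.split(token) if part]
```

For example, split_on_delimiters("sushant354@gmail.com") yields
"sushant354", "gmail" and "com", none of which would have to appear in
any dictionary beforehand.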

Please correct me if I have misunderstood something.

-Sushant.
