Re: english parser in text search: support for multiple words in the same position

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: sushant354(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: english parser in text search: support for multiple words in the same position
Date: 2010-09-29 05:29:47
Message-ID: AANLkTimOL8U68putCUUyhTnABbXu4pGpx740T4ENb-Wf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Any updates on this?

On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354(at)gmail(dot)com>wrote:

> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with compound word implementation and so I am not sure
> how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how to
> identify components of a token. Does a compound word split by matching
> with a list of words? If yes, then we will not be able to use that as we
> do not know all the words that can appear in a url/host/email/file.
>
> I think another approach can be to use the dict_regex dictionary
> support. However, we will have to match the regex with something that
> parser is doing.
>
> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser. For example,
> when the current parser finds a URL it runs a special handler called
> SpecialFURL which resets the parser position to the start of token to
> find hostname. After finding the host it moves to finding the path. So
> you first get the URL and then the host and finally the path.
>
> Similarly, we are resetting the parser to the start of the token on
> finding a url to output url parts. Then before entering the state that
> can lead to a url we output the url part. The state machine modification
> is similar for other tokens like file/email/host.
>
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors. In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things. Why are
> > there so many diffs in examples that include no URLs at all?
> >
>
> I think some of the difference is coming from the fact that now pos
> starts with 0 and it used to be 1 earlier. That is easily fixable
> though.
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation as I could not find any place in the
> code to provide the documentation (that was just a modification of state
> machine). Should I do a separate write-up to explain the desired output
> and the changes to achieve it?
>
> >
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful. Part of
> > what? And none of the other token type names include the word "token",
> > so that's not a good decision either. Possibly "url_part" would be a
> > suitable name.
> >
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is an agreement over the rest of the issues. So let me know if I
> should go ahead with this.
>
> -Sushant.
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2010-09-29 06:44:35 Re: ask for review of MERGE
Previous Message Itagaki Takahiro 2010-09-29 05:25:38 Re: I: About "Our CLUSTER implementation is pessimal" patch