Re: english parser in text search: support for multiple words in the same position

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: sushant354(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: english parser in text search: support for multiple words in the same position
Date: 2010-09-21 17:17:25
Message-ID: 1285089445.4454.31.camel@yoffice
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> inventing a brand new mechanism to do something the ts parser can
> already do. Why didn't you code the url-part mechanism using the
> existing support for compound words?

I am not familiar with compound word implementation and so I am not sure
how to split a url with compound word support. I looked into the
documentation for compound words and that does not say much about how to
identify components of a token. Does a compound word split by matching
with a list of words? If yes, then we will not be able to use that as we
do not know all the words that can appear in a url/host/email/file.

I think another approach can be to use the dict_regex dictionary
support. However, we will have to match the regex with something that
parser is doing.

The current patch is not inventing any new mechanism. It uses the
special handler mechanism already present in the parser. For example,
when the current parser finds a URL it runs a special handler called
SpecialFURL which resets the parser position to the start of token to
find hostname. After finding the host it moves to finding the path. So
you first get the URL and then the host and finally the path.

Similarly, we are resetting the parser to the start of the token on
finding a url to output url parts. Then before entering the state that
can lead to a url we output the url part. The state machine modification
is similar for other tokens like file/email/host.

> The changes made to parsetext()
> seem particularly scary: it's not clear at all that that's not breaking
> unrelated behaviors. In fact, the changes in the regression test
> results suggest strongly to me that it *is* breaking things. Why are
> there so many diffs in examples that include no URLs at all?
>

I think some of the difference is coming from the fact that now pos
starts with 0 and it used to be 1 earlier. That is easily fixable
though.

> An issue that's nearly as bad is the 100% lack of documentation,
> which makes the patch difficult to review because it's hard to tell
> what it intends to accomplish or whether it's met the intent.
> The patch is not committable without documentation anyway, but right
> now I'm not sure it's even usefully reviewable.

I did not provide any explanation as I could not find any place in the
code to provide the documentation (that was just a modification of state
machine). Should I do a separate write-up to explain the desired output
and the changes to achieve it?

>
> In line with the lack of documentation, I would say that the choice of
> the name "parttoken" for the new token type is not helpful. Part of
> what? And none of the other token type names include the word "token",
> so that's not a good decision either. Possibly "url_part" would be a
> suitable name.
>

I can modify it to output url-part/host-part/email-part/file-part if
there is an agreement over the rest of the issues. So let me know if I
should go ahead with this.

-Sushant.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2010-09-21 17:18:20 Re: .gitignore files, take two
Previous Message Robert Haas 2010-09-21 17:14:56 Re: Serializable snapshot isolation error logging