Re: english parser in text search: support for multiple words in the same position

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: english parser in text search: support for multiple words in the same position
Date: 2010-09-01 06:42:04
Message-ID: 1283323324.2084.22.camel@dragflick
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have attached a patch that emits parts of a host token, a url token,
an email token and a file token. Further, it makes sure that a
host/url/email/file token and the first part-token are at the same
position in tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one. Special handlers were
already used for extracting host and urlpath from a full url. So this is
more of an extension of the same idea.

2. Position changes: We do not advance position when we encounter a
host/url/email/file token. As a result the first part of that token
aligns with the token itself.

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch wrt cvs head

Currently, the patch output parts of the tokens as normal tokens like
WORD, NUMWORD etc. Tom argued earlier that this will break
backward-compatibility and so it should be outputted as parts of the
respective tokens. If there is an agreement over what Tom says, then the
current patch can be modified to output subtokens as parts. However,
before I complicate the patch with that, I wanted to get feedback on any
other major problem with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354(at)gmail(dot)com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
> host wikipedia.org
> host-part wikipedia
> host-part org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
> regards, tom lane

Attachment Content-Type Size
tokens_output.txt text/plain 3.7 KB
tokens_v1.patch text/x-patch 37.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thom Brown 2010-09-01 06:56:58 Re: array_agg() NULL Handling
Previous Message David E. Wheeler 2010-09-01 05:45:05 array_agg() NULL Handling