Re: 9.6 phrase search distance specification

From: Ryan Pedela <rpedela(at)datalanche(dot)com>
To: obartunov(at)gmail(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.6 phrase search distance specification
Date: 2016-08-11 16:50:01
Message-ID: CACu89FRfR9VTWcx=H0Ro9MgBOtuoWp5bW_LVLY2jZ-_kkSs+bw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 11, 2016 at 10:42 AM, Ryan Pedela <rpedela(at)datalanche(dot)com>
wrote:

> On Thu, Aug 11, 2016 at 9:27 AM, Oleg Bartunov <obartunov(at)gmail(dot)com>
> wrote:
>
>> On Tue, Aug 9, 2016 at 9:59 PM, Ryan Pedela <rpedela(at)datalanche(dot)com>
>> wrote:
>> >
>> >
>>
>> > I would say that it is worth it to have a "phrase slop" operator (Apache
>> > Lucene terminology). Proximity search is extremely useful for improving
>> > relevance and phrase slop is one of the tools to achieve that.
>> >
>>
>> It'd be great if you explain what is "phrase slop". I assume it's not
>> about search, but about relevance.
>>
>
> Sure. An exact phrase query has slop = 0 which means find all terms in the
> exact positions relative to each other. Phrase query with slop > 0 means
> find all terms within <slop> positions relative to each other. If slop =
> 10, find all terms within 10 positions of each other. Here is a concrete
> example from my current work searching SEC filings.
>
> Bill Gates' full legal name is William H. Gates, III. In the SEC database
> [1], his name is GATES WILLIAM H III. If you are searching the records of
> people within the SEC database and you want to find Bill Gates, most users
> will type "bill gates". Since there are many people with the first name
> Bill (William) and the last name Gates, Bill Gates most likely won't be the
> first result with a standard keyword query. Likewise an exact phrase query
> (slop = 0) will not find him either because the first and last names are
> transposed. What you need is a phrase query with slop = 2, which will
> match "William Gates", "William H Gates", "Gates William", etc. There is
> still the issue of Bill vs William, but that can be solved with synonyms
> and is a different topic.
>
> 1. https://www.sec.gov/cgi-bin/browse-edgar?CIK=902012&owner=exclude&action=getcompany&Find=Search
>

One more thing. In that trivial example, an AND query would probably do a
great job too. However, if you are searching for Bill Gates in large text
documents rather than a list of names, an AND query will not give you very
good results, because the words "bill" and "gates" are both common and can
appear far apart in a document that has nothing to do with him.
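
To make the difference between slop and AND concrete, here is a rough sketch
in Python. The function names (min_slop, phrase_matches, and_matches) are
made up for illustration, and the slop arithmetic is only loosely modeled on
Lucene-style sloppy phrase matching. That is an assumption on my part, not
Lucene's actual implementation and not anything PostgreSQL provides.

```python
from itertools import product


def min_slop(doc_tokens, query_terms):
    """Smallest slop at which the query phrase matches, or None if a term is missing.

    Illustrative assumption: slop is the spread of (document position - query
    position) over the chosen occurrences, so an exact phrase costs 0 and two
    adjacent transposed terms cost 2.
    """
    occurrences = []
    for term in query_terms:
        hits = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not hits:
            return None  # a query term does not occur at all
        occurrences.append(hits)

    best = None
    for combo in product(*occurrences):
        if len(set(combo)) < len(combo):
            continue  # do not let two query terms share one document position
        offsets = [pos - idx for idx, pos in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        if best is None or spread < best:
            best = spread
    return best


def phrase_matches(text, query, slop):
    """True if every query term occurs and the phrase fits within `slop`."""
    needed = min_slop(text.lower().split(), query.lower().split())
    return needed is not None and needed <= slop


def and_matches(text, query):
    """Plain AND query: every term occurs somewhere, position ignored."""
    tokens = set(text.lower().split())
    return all(term in tokens for term in query.lower().split())


name = "GATES WILLIAM H III"
long_text = "the bill was debated while protesters waited outside the gates"

print(phrase_matches(name, "william gates", 0))    # False: names are transposed
print(phrase_matches(name, "william gates", 2))    # True
print(and_matches(long_text, "bill gates"))        # True: both words occur
print(phrase_matches(long_text, "bill gates", 2))  # False: words are too far apart
```

With this definition "William Gates" needs slop 0, "William H Gates" needs
slop 1, and "Gates William" needs slop 2, which is why slop = 2 covers all
three forms in the SEC example.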
