[PATCH] Phrase search ported to 9.6

From: Dmitry Ivanov <d(dot)ivanov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PATCH] Phrase search ported to 9.6
Date: 2016-02-01 11:21:03
Message-ID: 33828354.WrrSMviC7Y@abook
Lists: pgsql-hackers

Hi Hackers,

Although PostgreSQL is capable of performing some FTS (full text search)
queries, there is still room for improvement. Phrase search support could
become a great addition to the existing set of features.

Introduction
============

It is no secret that one can make Google search for an exact phrase (instead
of an unordered lexeme set) simply by enclosing it within double quotes. This
is a really nice feature which helps to save the time that would otherwise be
spent on annoying result filtering.

One weak spot of the current FTS implementation is that there is no way to
specify the desired lexeme order (at present it would not make any difference
at all). In the end, the search engine looks for each lexeme individually,
which means that a hypothetical end user has to discard, by hand, the
documents that do not contain the search phrase. This problem is solved by the
patch below (should apply cleanly to 61ce1e8f1).

Problem description
===================

The problem comes from the lack of a lexeme ordering operator. Consider the
following example:

select q @@ plainto_tsquery('fatal error') from
unnest(array[to_tsvector('fatal error'), to_tsvector('error is not fatal')])
as q;
?column?
----------
t
t
(2 rows)

Clearly the latter match is not the best result if we want to find exactly
the "fatal error" phrase. That is where the need for a lexeme ordering
operator arises:

select q @@ to_tsquery('fatal ? error') from
unnest(array[to_tsvector('fatal error'), to_tsvector('error is not fatal')])
as q;
?column?
----------
t
f
(2 rows)
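
To make the difference between the two queries concrete, here is a toy model
in Python. It is only a sketch and not part of the patch: the helper names
(positions, matches_and, matches_adjacent) are hypothetical, tokenization is
plain whitespace splitting, and there is no stemming or stop word handling.
It only illustrates that an unordered match ignores lexeme positions while an
ordered match has to compare them.

```python
from collections import defaultdict

def positions(text):
    """Map each lowercased word to the list of its 1-based positions."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        index[word].append(pos)
    return index

def matches_and(doc, words):
    """Unordered match: every word occurs somewhere in the document."""
    index = positions(doc)
    return all(w in index for w in words)

def matches_adjacent(doc, left, right):
    """Ordered match: some occurrence of `right` directly follows `left`."""
    index = positions(doc)
    return any(p + 1 in index.get(right, []) for p in index.get(left, []))

docs = ["fatal error", "error is not fatal"]
unordered = [matches_and(d, ["fatal", "error"]) for d in docs]
ordered = [matches_adjacent(d, "fatal", "error") for d in docs]
print(unordered)  # [True, True]
print(ordered)    # [True, False]
```

Both documents pass the unordered check, but only the first one passes the
ordered check, which mirrors the two result sets above.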

Implementation
==============

The ? (FOLLOWED BY) binary operator takes the form of "?" or "?[N]" where
0 <= N < ~16K. If N is provided, the distance between the left and right
operands must be no greater than N. For example:

select to_tsvector('postgres has taken severe damage') @@
to_tsquery('postgres ? (severe ? damage)');
?column?
----------
f
(1 row)

select to_tsvector('postgres has taken severe damage') @@
to_tsquery('postgres ?[4] (severe ? damage)');
?column?
----------
t
(1 row)
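
The distance rule can be sketched with another toy model in Python. This is
not the patch's C implementation, and it rests on several assumptions that
the text above does not confirm: a bare "?" is treated as distance 1, the
right operand must come strictly after the left one, and a nested phrase is
represented by the position of its last lexeme. The names lexeme and
followed_by are hypothetical.

```python
def lexeme(doc, word):
    """Positions (1-based) at which `word` occurs in the document."""
    return {i for i, w in enumerate(doc.lower().split(), start=1) if w == word}

def followed_by(left, right, n=1):
    """Positions of `right` lying at most `n` positions after some `left`."""
    return {r for r in right for l in left if 0 < r - l <= n}

doc = "postgres has taken severe damage"

# (severe ? damage): "damage" directly follows "severe"
phrase = followed_by(lexeme(doc, "severe"), lexeme(doc, "damage"))

# postgres ? (...): the phrase ends 4 positions after "postgres", so no match
strict = followed_by(lexeme(doc, "postgres"), phrase)

# postgres ?[4] (...): a distance of 4 is now allowed, so it matches
relaxed = followed_by(lexeme(doc, "postgres"), phrase, n=4)

print(bool(strict), bool(relaxed))  # False True
```

Under these assumptions the model reproduces the two results shown above:
"postgres" is at position 1 and the phrase ends at position 5, so the match
needs a distance of at least 4.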

The new function phraseto_tsquery([ regconfig, ] text) takes advantage of the
"?[N]" operator in order to facilitate phrase search:

select to_tsvector('postgres has taken severe damage') @@
phraseto_tsquery('severely damaged');
?column?
----------
t
(1 row)

This patch was originally developed by Teodor Sigaev and Oleg Bartunov in
2009, so all credit goes to them. Any feedback is welcome.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
phrase_search.patch text/x-patch 124.4 KB
