Rethinking our fulltext phrase-search implementation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Rethinking our fulltext phrase-search implementation
Date: 2016-12-17 18:36:48
Message-ID: 28215.1481999808@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been thinking about how to fix the problem Andreas Seltenreich
reported at
https://postgr.es/m/87eg1y2s3x.fsf@credativ.de

The core of that problem is that the phrase-search patch attempts to
restructure tsquery trees so that there are no operators underneath a
PHRASE operator, except possibly other PHRASE operators. The goal is
to not have to deal with identifying specific match locations except
while processing a PHRASE operator. Well, that's an OK idea if it can
be implemented reasonably, but there are several problems with the code
as it stands:

1. The transformation is done by normalize_phrase_tree(), which is
currently invoked (via cleanup_fakeval_and_phrase()) in a rather
ad-hoc set of places including tsqueryin() and the various variants
of to_tsquery(). This leaves lots of scope for errors of omission,
which is exactly the proximate cause of Andreas' bug: ts_rewrite() is
neglecting to re-normalize its result tsquery. I have little faith
that there aren't other similar errors of omission today, and even
less that we won't introduce more in future.

2. Because the transformation is invoked as early as tsqueryin(),
it's user-visible:

regression=# select 'a <-> (b | c)'::tsquery;
tsquery
---------------------------
'a' <-> 'b' | 'a' <-> 'c'
(1 row)

This is confusing to users, and I do not think it's a good idea to
expose what's basically an optimization detail this way. For stored
tsqueries, we're frozen into this approach forever even if we later
decide that checking for 'a' twice isn't such a hot idea.

3. Worse, early application of the transformation creates problems for
operations such as ts_rewrite: a rewrite might fail to match, or produce
surprising results, because the query trees actually being operated on
aren't what the user probably thinks they are. At a minimum, to get
consistent results I think we'd have to re-normalize after *each step*
of ts_rewrite, not only at the end. That would be expensive, and it's
far from clear that it would eliminate all the surprises.

4. The transformations are wrong anyway. The OR case I showed above is
all right, but as I argued in <24331(dot)1480199636(at)sss(dot)pgh(dot)pa(dot)us>, the AND
case is not:

regression=# select 'a <-> (b & c)'::tsquery;
tsquery
---------------------------
'a' <-> 'b' & 'a' <-> 'c'
(1 row)

This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
matched at different places in that text; but it seems highly unlikely to
me that that's what the writer of such a query wanted. (If she did want
that, she would write it that way to start with.) NOT is not very nice
either:

regression=# select '!a <-> !b'::tsquery;
tsquery
------------------------------------
!'a' & !( !( 'a' <-> 'b' ) & 'b' )
(1 row)

If you dig through that, you realize that the <-> test is pointless;
the query can't match any vector containing 'a', so certainly the <->
condition can't succeed, making this an expensive way to spell "!a & !b".
And that's not the right semantics anyway: if 'x y' matches this query,
which it does and should, why doesn't 'x y a' match?

5. The case with only one NOT under <-> looks like this:

regression=# select '!a <-> b'::tsquery;
tsquery
------------------------
!( 'a' <-> 'b' ) & 'b'
(1 row)

This is more or less in line with the naive view of what its semantics
should be, although I notice that it will match a 'b' at position 1,
which might be a surprising result. We're not out of the woods though:
this will (and should) match, eg, 'c b a'. But that means that '!a & b'
is not a safe lossy approximation to '!a <-> b', which is an assumption
that is wired into a number of places. Simple testing shows that indeed
GIN and GIST index searches get the wrong answers, different from what
you get in a non-indexed search, for queries like this.

So we have a mess here, which needs to be cleaned up quite aside from the
fact that it's capable of provoking Assert failures and/or crashes.

I thought for awhile about moving the normalize_phrase_tree() work
to occur at the start of tsquery execution, rather than in tsqueryin()
et al. That addresses points 1..3, but doesn't by itself do anything
for points 4 or 5. Also, it would be expensive because right now
execution works directly from the flat tsquery representation; there's no
conversion to an explicit tree structure on which normalize_phrase_tree()
could conveniently be applied. Nor do we know at the start whether the
tsquery contains any PHRASE operators.

On the whole, it seems like the best fix would be to get rid of
normalize_phrase_tree() altogether, and instead fix the TS_execute()
engine so it can cope with regular operators underneath phrase operators.
We can still have the optimization of not worrying about lexeme locations
when no phrase operator has been seen, but we have to change
TS_phrase_execute to cope with plain AND/OR/NOT operators and calculate
proper location information for the result of one.

Here is a design for doing that. The hard part seems to be dealing with
NOT: merging position lists during AND or OR is pretty clear, but how do
we represent the positions where a lexeme isn't? I briefly considered
explicitly negating the position list, eg [2,5,7] goes to [1,3,4,6,8,...],
but the problem is where to stop. If we knew the largest position present
in the current document, we could stop there, but AFAICS we generally
don't know that --- and even if we did, this approach could be quite
expensive for large documents. So my design is based on keeping the
original position list and adding a "negate" flag that says these are
the positions where the query pattern does NOT occur.

Hence, I propose adding a field to ExecPhraseData:

typedef struct ExecPhraseData
{
int npos; /* number of positions reported */
bool allocated; /* pos points to palloc'd data? */
+ bool negate; /* positions are where value is NOT */
WordEntryPos *pos; /* ordered, non-duplicate lexeme positions */
} ExecPhraseData;

It's already the case that TS_phrase_execute is always responsible for
initializing this struct, so adding this field and initializing it to
false doesn't break any existing TSExecuteCallback functions (it's even
ABI-compatible because the field is going into a padding hole). I've
worked out the following specification for the meaning of this field
given that a TSExecuteCallback function, or recursive execution of
TS_phrase_execute, has returned true or false:

if npos > 0:
func result = false, negate = false: disallowed
func result = true, negate = false: asserts that query is matched at
specified position(s) (and only those positions)
func result = false, negate = true: disallowed
func result = true, negate = true: asserts that query is matched at all
positions *except* specified position(s)

if npos == 0:
func result = false, negate = false: asserts that query is not matched
anywhere
func result = true, negate = false: query is (possibly) matched, matching
position(s) are unknown
func result = false, negate = true: disallowed
func result = true, negate = true: asserts that query is matched at all
positions

The negate = false cases agree with the existing semantics, so that
TSExecuteCallback functions do not need to know about the field; there
is no case where they'd need to set it to true.

Given this definition, the tsquery operators can be implemented as follows
in TS_phrase_execute:

OP_NOT:

if npos > 0 (implying its subquery returned true), invert the negate
flag and return true, keeping the same list of positions
if npos == 0, the possible cases are:
subquery result subquery's negate return result negate
false false true true
true false true false
true true false false
The correctness of these rules can be seen from the specification of
the flag meanings. Notice that NOT atop NOT is a no-op, as expected.

OP_AND, OP_OR:

"not" notations here indicate that L or R input has the negate flag set:

a & b: emit positions listed in both inputs
a & !b: emit positions listed in a but not b
!a & b: emit positions listed in b but not a
!a & !b: treat as !(a | b), ie emit positions listed in either input,
then set negate flag on output
a | b: emit positions listed in either input
a | !b: treat as !(!a & b), ie emit positions listed in b but not a,
then set negate flag on output
!a | b: treat as !(a & !b), ie emit positions listed in a but not b,
then set negate flag on output
!a | !b: treat as !(a & b), ie emit positions listed in both inputs,
then set negate flag on output

AND/OR function result is always true when output negate flag is set,
else it is true if output npos > 0

OP_PHRASE:

Works like OP_AND except we allow for a possibly-nonzero offset between
L and R positions, ie we compare an R position of X to an L position of
X minus offset while deciding whether to emit position X. Note in
particular that <0> acts *exactly* like OP_AND.

We could accept a match only if X minus offset is greater than zero, so
that "!a <-> b" doesn't match b at start of document. But I see no way
to do the equivalent for "a <-> !b" at the end of the document, so I'm
inclined not to do that, leaving the existing semantics for these cases
alone.

The above rules for AND/OR/PHRASE work only when we have position
information for both inputs (ie, neither input returned true with npos = 0
and negate = false); otherwise we just do the dumb thing and return
true/false with npos = 0, negate = false, to indicate either "there might
be a match" or "there definitely is not a match".

It's worth noting that with these rules, phrase searches will act as
though "!x" always matches somewhere; for instance "!a <-> !b" will match
any tsvector. I argue that this is not wrong, not even if the tsvector is
empty: there could have been adjacent stopwords matching !a and !b in the
original text. Since we've adjusted the phrase matching rules to treat
stopwords as unknown-but-present words in a phrase, I think this is
consistent. It's also pretty hard to assert this is wrong and at the same
time accept "!a <-> b" matching b at the start of the document.

This sounds like it would be a lot of code, but I think that all of the
AND/OR/PHRASE cases can be implemented by a single subroutine that's
told whether to emit a position depending on whether it finds that
position in both inputs/neither input/left only/right only. That
subroutine otherwise won't be much more complicated than the existing
position-finding loop for OP_PHRASE. I'm guessing that it'll end up
being roughly a wash once you allow for removing normalize_phrase_tree().

I haven't yet looked into point 5 (wrong GIN/GIST search results),
but I'm hopeful that removing the assumption that <-> approximates
as AND will fix it. In any case we need to make the base tsquery
engine right before we try to fix the index approximations to it.

Thoughts, objections? Anybody see errors in the logic?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-12-17 18:52:19 Re: PSQL commands: \quit_if, \quit_unless
Previous Message Steve Singer 2016-12-17 17:34:32 Re: Logical Replication WIP