Mishandling of right-associated phrase operators in FTS

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-bugs(at)postgreSQL(dot)org
Subject: Mishandling of right-associated phrase operators in FTS
Date: 2016-12-18 18:54:10
Message-ID: 26706.1482087250@sss.pgh.pa.us
Lists: pgsql-bugs

What do you think a tsquery like 'x <-> (y <-> z)' should mean?
I find it hard to assign it any meaning other than the same thing
as '(x <-> y) <-> z', i.e., it should match a 3-lexeme sequence 'x y z'.

Right now, the execution engine gets this wrong:

regression=# select to_tsvector('x y z') @@ to_tsquery('x <-> y <-> z');
?column?
----------
t -- okay
(1 row)

regression=# select to_tsvector('x y z') @@ to_tsquery('x <-> (y <-> z)');
?column?
----------
f -- not so okay
(1 row)

This happens because the lower (righthand) <-> operator returns the
position of its righthand-side input ('z'), but that's two away from
where the 'x' is, so the upper phrase operator doesn't think there
is a match.
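
To make the failure concrete, here is a toy Python model of the behavior described above. It is not the PostgreSQL code: the names positions and phrase_positions_naive, and the tuple representation of query nodes, are invented for this sketch. The only thing it assumes is that every phrase node reports the position of its righthand operand.

```python
# Toy model only; names and data layout are hypothetical, not PostgreSQL's.

def positions(text):
    """Map each word to the set of its 1-based positions, like a tiny tsvector."""
    pos = {}
    for i, word in enumerate(text.split(), start=1):
        pos.setdefault(word, set()).add(i)
    return pos

def phrase_positions_naive(node, pos):
    """Return match positions, always reporting the righthand operand's position.

    A node is either a lexeme (str) or ("phrase", left, right, distance).
    """
    if isinstance(node, str):
        return set(pos.get(node, set()))
    _, left, right, distance = node
    lpos = phrase_positions_naive(left, pos)
    rpos = phrase_positions_naive(right, pos)
    # A match needs the right operand exactly `distance` after the left one.
    return {r for r in rpos if r - distance in lpos}

vec = positions("x y z")
left_assoc = ("phrase", ("phrase", "x", "y", 1), "z", 1)   # (x <-> y) <-> z
right_assoc = ("phrase", "x", ("phrase", "y", "z", 1), 1)  # x <-> (y <-> z)

left_matches = bool(phrase_positions_naive(left_assoc, vec))
right_matches = bool(phrase_positions_naive(right_assoc, vec))
```

In this model the inner 'y <-> z' reports position 3, so the upper operator compares 3 against the 'x' at position 1 and rejects the match, while the left-associated form succeeds.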

I considered trying to fix this by forcing right-associated cases into
left-associated form during tsquery parsing, but that has all the same
problems that I pointed out with respect to normalize_phrase_tree().
Really it'd be best to fix this by making the executor cope properly.
I think what we want is to pass down a flag telling recursive invocations
of TS_phrase_execute whether to return the position of the left-side or
right-side argument of a phrase match, which we would set according to
whether we are within the right or left argument of the most closely
nested upper phrase operator. I propose to incorporate that fix into
the TS_phrase_execute rewrite I'm working on.
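
A minimal sketch of the flag idea, again as a toy Python model rather than the actual TS_phrase_execute. The function name phrase_positions, the want parameter, and the tuple node layout are assumptions made for illustration. A subtree evaluated as the right argument of a phrase operator is asked for the position of its left side, and a subtree evaluated as the left argument is asked for its right side.

```python
# Toy model only; names and data layout are hypothetical, not PostgreSQL's.

def positions(text):
    """Map each word to the set of its 1-based positions, like a tiny tsvector."""
    pos = {}
    for i, word in enumerate(text.split(), start=1):
        pos.setdefault(word, set()).add(i)
    return pos

def phrase_positions(node, pos, want="right"):
    """Return match positions for node.

    want="right" reports the right operand's position of each match,
    want="left" reports the left operand's position.
    """
    if isinstance(node, str):
        return set(pos.get(node, set()))
    _, left, right, distance = node
    # The left argument must tell us where it ends, the right one where it starts.
    lpos = phrase_positions(left, pos, want="right")
    rpos = phrase_positions(right, pos, want="left")
    matches = {(l, l + distance) for l in lpos if l + distance in rpos}
    return {l if want == "left" else r for l, r in matches}

vec = positions("x y z")
left_assoc = ("phrase", ("phrase", "x", "y", 1), "z", 1)   # (x <-> y) <-> z
right_assoc = ("phrase", "x", ("phrase", "y", "z", 1), 1)  # x <-> (y <-> z)

left_matches = bool(phrase_positions(left_assoc, vec))
right_matches = bool(phrase_positions(right_assoc, vec))
```

With the flag, both associations match 'x y z' in this model. The sketch is only exercised on the two queries from this message; it makes no claim about deeper mixed nesting such as '(a <-> (b <-> c)) <-> d', where a single reported position may not be enough.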

A related problem appears in clean_fakeval_intree()'s attempts to adjust
phrase-operator distances when it removes a stopword. For example, 'a'
is a stopword, so we get:

regression=# select to_tsquery('(b <-> a) <-> c');
to_tsquery
-------------
'b' <2> 'c'
(1 row)

That's fine, but I don't think this answer is right:

regression=# select to_tsquery('b <-> (a <-> c)');
to_tsquery
-------------
'b' <-> 'c'
(1 row)

It should be 'b' <2> 'c', same as the other one.

I haven't worked this out in detail, but I think a similar solution
would work for clean_fakeval_intree: pass down a flag indicating if
we're within the left or right argument of a <-> op, and return the
appropriate adjustment distance based on that.
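
The distance adjustment can be illustrated with a toy Python model. This is not clean_fakeval_intree, and it deliberately swaps in a different bookkeeping scheme from the single flag suggested above: each subtree returns how many positions were removed at its left edge and at its right edge, and the parent operator adds whichever edge faces it. The names clean, clean_query and render, the tuple node layout, and the restriction to phrase operators only are all assumptions of the sketch.

```python
# Toy model only; names and data layout are hypothetical, not PostgreSQL's.

def clean(node, stopwords):
    """Remove stopwords; return (node_or_None, left_pad, right_pad).

    A node is a lexeme (str) or ("phrase", left, right, distance).
    When the whole subtree is removed, both pads hold its total span.
    """
    if isinstance(node, str):
        return (None if node in stopwords else node), 0, 0
    _, left, right, distance = node
    l, l_lpad, l_rpad = clean(left, stopwords)
    r, r_lpad, r_rpad = clean(right, stopwords)
    gap = l_rpad + distance + r_lpad  # distance across the removed lexemes
    if l is None and r is None:
        return None, gap, gap
    if l is None:
        return r, gap, r_rpad       # removed words now pad r's left edge
    if r is None:
        return l, l_lpad, gap       # removed words now pad l's right edge
    return ("phrase", l, r, gap), l_lpad, r_rpad

def clean_query(node, stopwords):
    """Top-level entry: pads at the outer edges have nothing to attach to."""
    cleaned, _, _ = clean(node, stopwords)
    return cleaned

def render(node):
    """Format a cleaned tree roughly the way to_tsquery output looks."""
    if isinstance(node, str):
        return "'%s'" % node
    _, left, right, distance = node
    op = "<->" if distance == 1 else "<%d>" % distance
    rtext = render(right)
    if not isinstance(right, str):
        rtext = "( " + rtext + " )"
    return "%s %s %s" % (render(left), op, rtext)

stop = {"a"}
left_form = render(clean_query(("phrase", ("phrase", "b", "a", 1), "c", 1), stop))
right_form = render(clean_query(("phrase", "b", ("phrase", "a", "c", 1), 1), stop))
```

In this model both '(b <-> a) <-> c' and 'b <-> (a <-> c)' come out as 'b' <2> 'c', because the position freed by the stopword is credited to the operator that ends up joining 'b' and 'c' regardless of which side it was nested on.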

regards, tom lane
