Re: BUG #15689: Stemming of negation/not operator

From: Ivan Viragine <ivanviragine(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15689: Stemming of negation/not operator
Date: 2019-03-13 12:38:04
Message-ID: CAOWkBR+AdOer2mWX5ahKsPUgozZ7-0s-FRN6+ENhjmtG8mZqyQ@mail.gmail.com
Lists: pgsql-bugs

Hi, Tom!

Thanks for the reply.
Surely there are many cases where stemming would be nice, but from the
user's perspective, when someone writes a complex query with NOTs, they
usually know what they are doing and want to match certain specific cases.
Stemming the NOT clause "removes" their control.
Also, I think it is better to return more results for the stemmed words and
have the user add new clauses to filter them out than to lose some
correct results without the user even knowing why, or whether anything is
being lost at all (a priori, the user does not know that the NOT clause
was stemmed).
If you try this on Elasticsearch, it works as (I) expected.
The idea is not to exempt particular words, but to not stem certain clauses
of the query. That is, the query parser knows which parts are in the NOT
clause, so it should parse them and dynamically add them to the set of
words that are not stemmed.
About the index / token being "car" for the word "cars": sure it will, as
long as we use the same parser / tokenizer. That's why the recheck you
mentioned would be necessary.
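
To make the recheck idea concrete, here is a rough sketch of what it could
look like today, using a second match against the unstemmed 'simple'
configuration. The inline VALUES list and the column name "body" are
made up for illustration only; this is not a proposed implementation.

```sql
-- Sketch only: "docs" and "body" are hypothetical names for illustration.
SELECT body
FROM (VALUES ('a red car'), ('two red cars')) AS docs(body)
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'car')          -- stemmed match
  AND NOT (to_tsvector('simple', body) @@ to_tsquery('simple', 'cars'));    -- unstemmed recheck
```

With these two sample rows, only 'a red car' should come back, because the
second condition rejects the document that literally contains "cars". The
cost is a second tsvector built with a different configuration.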

About the lexemes: we do not use prefix match here. But I see your point.
It falls almost in the same category: doing things under the hood that the
user may not be aware of.

The usual way to explicitly avoid stemming something would be to use quotes.
Normally, quotes mean "match this exactly".
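
For comparison, a small sketch of two existing ways to keep a query term
unstemmed: the 'simple' configuration, and a direct cast to tsquery, which
skips normalization. This is only an illustration of current behaviour as I
understand it, not the quoting syntax suggested above.

```sql
-- 'english' stems the negated term; 'simple' leaves it alone.
SELECT to_tsquery('english', 'car & !cars');  -- 'car' & !'car'
SELECT to_tsquery('simple',  'car & !cars');  -- 'car' & !'cars'

-- A direct cast does no normalization at all.
SELECT 'car & !cars'::tsquery;
```

Neither form helps by itself against a tsvector built with 'english', since
the stored lexeme is already "car".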

Regards,

--
Ivan Nicola Viragine

On Tue, Mar 12, 2019 at 7:34 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > When using to_tsquery function it is stemming negation/not parts of the
> > query, where it probably shouldn't.
> > Some examples:
>
> > SELECT to_tsquery('english', 'car & !cars');
> > to_tsquery
> > ----------------
> > 'car' & !'car'
>
> I'm not exactly convinced by this argument, because it seems like
> you're only thinking about a corner case. There are probably at
> least as many examples where you *do* want stemming on a negated term.
>
> Another issue is that even if we changed the tsquery input function
> to not stem particular words, I doubt that it would do anything useful,
> because what it will be comparing to is tsvector entries that have
> certainly been stemmed. That is, even if the original document said
> "cars", what's going to be in the tsvector is just "car", so that
> forbidding a match to "cars" isn't going to do anything. (Maybe
> what this says is that there should be a less-lossy recheck against
> the original document after the tsvector match, but that'd have to
> be done by an additional, explicit operator I think. Or possibly
> the recheck just requires tsquery match with a different stemming
> configuration.)
>
> A related problem that's bothered me for some time is that lexemes
> get stemmed even if there is a "*" (prefix match) marker on them,
> causing them to possibly match much more than the user expected.
> But again, it's not real obvious how to make that better given the
> match-to-tsvector context --- not stemming could easily remove
> desired matches to stemmed tsvector entries.
>
> If we could think of a way for it to do something useful, my inclination
> would be to allow an explicit "don't stem" marker on lexemes, rather
> than trying to drive it off whether the context is a negation or not.
>
> regards, tom lane
>
