Re: citext LIKE search bug

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: citext LIKE search bug
Date: 2019-09-22 22:48:30
Message-ID: CAKqnccj44OKMzf8hL0oGbUP04DiHj7iKtvG8VrkjHbTxk1ssFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Thanks for the answer and clarification. Yes, I'm using trigram index now,
and it's working well. I love n-grams for fuzzy matching, and the Postgres
implementation is a thing of beauty. It is a bit funny in that you have to
use LIKE to hint the planner to use the index on a = search.

I've been working with Postgres for a couple of years now, coming from
another environment. In our case, Postgres is not a front-line system, it's
a secondary system for aggregation, analysis and pushing things through to
a visualization platform. Postgres is just awesome but, whenever you're
shifting to a new context, it's easy to come in with bad priors. That's me
and citext, all the way down.

Postgres has case-sensitive text searches, as you all know. I guess that
becomes a bit more optional in PG 12 with case and/or diacritical-blind
collations...but we're on PG 11 deployed on RDS, so we're unlikely to see
PG 12 a lot before ~March 2020. We started on 9.5 or 9.6. What I was used
to are case-blind searches with B-trees of up to the first 1024 bytes of
character data (it's all in UTF-16) that work on = <> and starts_with%
without any functions or castings. So, Bad Priors. In Postgres:

-- Searches on text are case-sensitive (I figured that out early.)

-- The planner only uses a function index if the query uses *exactly* the
same query. This I found out *last week*.

It's that second bit that was the killer, and that makes me question the
point of citext at all. For background, I dug through my notes and
assembled a chronological list of days when users asked me for
case-sensitive search in their database over the past 30+ years:

....

Exactly never. My goal was to make case-blind searches automatic and
transparent...no special goo in the query:

select foo from bar where lower(text) = lower('the search string');

But, as it turns out, you lose the indexes unless you do exactly something
like that with citext. Like if you use the clever (not my clever) idea of
B-treeing a hash of the text:

select *
from record_changes_log_detail
where hashtext(old_value) = hashtext('Gold Kerrison Neuro') and -- The
first clause uses the index to find records with matching hash codes.
old_value = 'Gold Kerrison Neuro'; -- The second clause makes sure
that the records are real matches.

That works perfectly for = searches, but there's a whole lot of complexity
in the query syntax for someone used to this to do the same thing:

select *
from record_changes_log_detail
where old_value = 'Gold Kerrison Neuro';

Is there, or could there be, any way for the query planner to guess that an
expression-based index should be searched other than recapitulating the
entire expression in the query? For example, my sample data has about 8M
rows with a citext field with old values (a sort of audit-log thing). The
length of these values are typically very small. 6 records in the entire
data set have long values. If I index the first 128 characters, I think
that gives me the original value for 99.9999+% of the rows. So, something
like this - or the same for text:

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
ON record_changes_log_detail
USING btree ((left(old_value,128)::citext) citext_pattern_ops);

For the index to be used today, the search needs to be

select * from record_changes_log_detail
where left(old_value,128)::citext = 'Gold Kerrison Neuro';

But here's the thing, Postgres _knows_ that the expression index is on the
first 128 characters of the field. If the query value is within that value
on an = search, then the B-tree of the substring could be used, and would
return 100% accurate results.

Is that sort of query planner optimization conceivable in Postgres? I'd
offer to help but, well, I'm absolutely unqualified. I knew a guy who gave
up his PhD on mapping five-dimensional math into four-dimensional space. He
said "I was smart enough to ask the question, but not even close to smart
enough to answer it." That's me on my query planner idea/question.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Fahar Abbas 2019-09-23 08:16:54 Re: Postgres 11.5.1 failed installation
Previous Message Tom Lane 2019-09-22 21:47:49 Re: BUG #16013: Unexpected results from bit field query