Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled

From: Dan Book <grinnz(at)gmail(dot)com>
To: a(dot)zakirov(at)postgrespro(dot)ru
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15277: ts_headline strips things that look like HTML tags and it cannot be disabled
Date: 2018-07-12 15:33:52
Message-ID: CABMkAVUjc7Bh4WWTnF_US95_t8L6hpPFV8yJQJ51YQmWjG=Spg@mail.gmail.com
Lists: pgsql-bugs

On Thu, Jul 12, 2018 at 5:22 AM Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
wrote:

> Hello,
>
> On Thu, Jul 12, 2018 at 07:59:40AM +0000, PG Bug reporting form wrote:
> > I have text that is not HTML and contains things that look like HTML tags.
> > The headlines are HTML escaped when output. It is very odd to have this text
> > missing from the resulting headlines and no way to control the behavior.
>
> <b> and </b> are recognized as "tag" tokens, which are ignored by
> default. You need to modify an existing configuration or create a new one:
>
> =# CREATE TEXT SEARCH CONFIGURATION english_tag (COPY = english);
> =# ALTER TEXT SEARCH CONFIGURATION english_tag
>      ADD MAPPING FOR tag WITH simple;
>
> Then tags aren't skipped:
>
> =# select * from ts_debug('english_tag', 'query <b>test</b>');
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | query | {english_stem} | english_stem | {queri}
>  blank     | Space symbols   |       | {}             | (null)       | (null)
>  tag       | XML tag         | <b>   | {simple}       | simple       | {<b>}
>  asciiword | Word, all ASCII | test  | {english_stem} | english_stem | {test}
>  tag       | XML tag         | </b>  | {simple}       | simple       | {</b>}
>
> But even in this case ts_headline will skip tags, because that behaviour
> is hardcoded [1].
>
> I don't think it is a good idea to change the behaviour for existing
> versions of PostgreSQL. But there is a workaround, if it is appropriate
> for someone: create your own text search parser extension (example: [2])
> and change
>
> #define HLIDREPLACE(x) ( (x)==TAG_T )
>
> to
>
> #define HLIDREPLACE(x) ( false )
>
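
For anyone reading along, here is a toy sketch in plain C of what that
one-line macro change amounts to. It is not the PostgreSQL source: the
token ids, the Token struct, and the functions replace_tags,
replace_nothing and build_headline are all invented for illustration,
and it assumes that a token flagged for replacement is emitted as a
single space instead of its text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token type ids, used only in this sketch. */
enum { WORD_TOK = 1, SPACE_TOK = 2, TAG_TOK = 3 };

typedef struct {
    int         type;
    const char *text;
} Token;

/* Plays the role of the quoted default: tag tokens are replaced. */
static bool replace_tags(int type) { return type == TAG_TOK; }

/* Plays the role of the quoted workaround: nothing is replaced. */
static bool replace_nothing(int type) { (void) type; return false; }

/*
 * Concatenate the token texts into out. A token for which replace()
 * returns true contributes a single space instead of its text.
 * Output is truncated if it does not fit in outsz bytes.
 */
static void build_headline(const Token *toks, size_t n,
                           bool (*replace)(int),
                           char *out, size_t outsz)
{
    size_t len = 0;

    assert(out != NULL && outsz > 0);
    for (size_t i = 0; i < n; i++) {
        const char *piece = replace(toks[i].type) ? " " : toks[i].text;
        size_t      plen = strlen(piece);

        if (plen > outsz - 1 - len)
            plen = outsz - 1 - len;     /* truncate to the space left */
        memcpy(out + len, piece, plen);
        len += plen;
    }
    out[len] = '\0';
}
```

Fed the five tokens of 'query <b>test</b>', build_headline with
replace_tags yields "query  test " (each tag collapsed to a space),
while replace_nothing yields the input text unchanged.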

Thanks for the response. It's good to know this is possible but defining a
custom parser is not ideal.

-Dan
