Html parsing and inline elements

From: Marcelo Zabani <mzabani(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Html parsing and inline elements
Date: 2016-04-13 13:44:57
Message-ID: CACgY3QZ0_TX4LBC8=RRCRGM2Mgos6S8jj8AhxYMP6P5EM2M4yQ@mail.gmail.com
Lists: pgsql-hackers

Hi everyone,

I was wondering whether HTML parsing should split tokens that are not
separated by spaces in the original text, but only by the tags of an
inline element. Let me show you an example:

SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
<strong>n</strong>i<em>ce</em>');
Results: 'ce':7 'hello':1 'n':5 'neighbor':2

"Hello" and "neighbor" should really be separated, because <p> is a block
element, but "nice" should be a single word there, since there is no visual
separation when rendered (<em> and <strong> are inline elements).
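
To make the proposed rule concrete, here is a minimal sketch in Python. It is not how PostgreSQL's text search parser works; it only illustrates the behavior described above, using the standard library's html.parser. The INLINE_TAGS set is a hypothetical, incomplete list, and the names TextExtractor and visible_words are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete list of inline elements, for illustration only.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong"}


class TextExtractor(HTMLParser):
    """Collects text, treating every non-inline tag as a word break."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self._break_unless_inline(tag)

    def handle_endtag(self, tag):
        self._break_unless_inline(tag)

    def handle_data(self, data):
        # Text is kept verbatim; only tags decide where breaks are added.
        self.parts.append(data)

    def _break_unless_inline(self, tag):
        # Inline tags add nothing, so the text around them stays joined.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")


def visible_words(html):
    """Return the whitespace-separated words a reader would see."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()
```

Fed the string from the query above, this keeps "Hello" and "neighbor" apart, because <p> is not in the inline set, and yields "nice" as a single word, because <strong> and <em> contribute no separator.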

Sorry if this has been asked before, but I couldn't find it anywhere.

Thanks in advance,
Marcelo.
