Hi everyone,
I was wondering whether HTML parsing should separate tokens that are
not separated by spaces in the original text, but only by an inline
element. Here is an example:
SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
<strong>n</strong>i<em>ce</em>');
Result: "'ce':7 'hello':1 'n':5 'neighbor':2"
"Hello" and "neighbor" really should be separated, because <p> is a block
element, but "nice" should be a single word, since its parts are not
visually separated when rendered (<em> and <strong> are inline elements).
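To make the expected behavior concrete, here is a minimal Python sketch of
stripping the tags before the text reaches to_tsvector. It is only an
illustration of the block/inline distinction, not how the built-in parser
works. The names INLINE_TAGS, InlineAwareStripper and strip_tags are made up
for this example, and the list of inline elements is an assumption and is
not complete.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive set of inline elements; every other tag is
# treated as a token boundary.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "span",
               "strong", "sub", "sup", "u"}


class InlineAwareStripper(HTMLParser):
    """Hypothetical helper: drops tags, keeping inline-wrapped text joined."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        # Non-inline tags become a space; inline tags vanish entirely.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return plain text with runs of whitespace collapsed to one space."""
    parser = InlineAwareStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_tags("Hello<p>neighbor</p>, you are\n"
                     "<strong>n</strong>i<em>ce</em>"))
    # prints: Hello neighbor , you are nice
```

With the text cleaned this way, "nice" reaches the parser as one word while
"Hello" and "neighbor" stay separate.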
Sorry if this has been asked before, but I couldn't find it anywhere.
Thanks in advance,
Marcelo.