Re: Html parsing and inline elements

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Marcelo Zabani <mzabani(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Html parsing and inline elements
Date:	2016-04-13 14:09:49
Message-ID:	21258.1460556589@sss.pgh.pa.us
Lists:	pgsql-hackers

Marcelo Zabani <mzabani(at)gmail(dot)com> writes:
> I was here wondering whether HTML parsing should separate tokens that are
> not separated by spaces in the original text, but are separated by an
> inline element. Let me show you an example:

> *SELECT to_tsvector('english', 'Helloneighbor, you are
> nice')*
> *Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*

> "Hello" and "neighbor" should really be separated, because ** is a block
> element, but "nice" should be a single word there, since there is no visual
> separation when rendered (** and ** are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML.
It doesn't, really, even have license to assume that its input *is*
HTML. So even if you see things that look like <foo> and </foo> in the
string, it could easily be XML or SGML or some other SGML-like markup
format with different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the
HTML-specific behavior was broken out into a separate function.
(Or maybe it could be done within to_tsvector as a separate parser
or separate dictionary?) But I don't think it should be part of
the default behavior.
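
To illustrate the "separate function" idea, here is a minimal sketch of an
HTML-aware preprocessing step that runs before the text reaches to_tsvector.
It is only an assumption about how such a step might look, not anything that
exists in PostgreSQL: the name html_to_text, the BLOCK_TAGS set, and the
sample markup in the usage note are all hypothetical. Block-level tags become
a space (a token boundary), and every other tag is dropped so the text on
either side of it is joined.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of block-level HTML elements.
# A real implementation would need a complete, agreed-upon list.
BLOCK_TAGS = {
    "p", "div", "br", "hr", "li", "ul", "ol", "pre", "blockquote",
    "table", "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
}


class _TextExtractor(HTMLParser):
    """Collects text, inserting a space wherever a block-level tag occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _boundary(self, tag):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags: block tags separate words, inline tags do not."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text buffered at the end of the input
    # Collapse runs of whitespace so added boundaries stay single spaces.
    return " ".join("".join(parser.parts).split())
```

For example, html_to_text('Hello<p>neighbor</p>, you are n<b>i</b>ce') keeps
"Hello" and "neighbor" apart while leaving "nice" as one word. The result
could then be passed to to_tsvector as an ordinary text parameter, which
leaves the default parser untouched for input that is not HTML.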

regards, tom lane

In response to

Html parsing and inline elements at 2016-04-13 13:44:57 from Marcelo Zabani

Responses

Re: Html parsing and inline elements at 2016-04-13 15:57:19 from Marcelo Zabani
