Re: Html parsing and inline elements

From:	Ryan Pedela <rpedela(at)datalanche(dot)com>
To:	Marcelo Zabani <mzabani(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Html parsing and inline elements
Date:	2016-05-01 18:32:10
Message-ID:	CACu89FSEhvJ451pRymAJb9ij-449o1GW9dvsu5hUPg8xGygZtg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Apr 13, 2016 at 9:57 AM, Marcelo Zabani <mzabani(at)gmail(dot)com> wrote:

> Hi, Tom,
>
> You're right, I don't think one can argue that the default parser should
> know HTML.
> How about your suggestion of there being an HTML parser, is it feasible? I
> ask this because I think that a lot of people store HTML documents these
> days, and although there probably aren't lots of HTML with words written
> along multiple inline elements, it would certainly be nice to have a proper
> parser for these use cases.
>
> What do you think?
>

I recommend using Apache Tika [1] for plain text extraction from HTML.
There are so many weird edge cases when parsing HTML that it is easier to
use something that is already mature than to reinvent the wheel.

1. https://tika.apache.org/
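
To make the inline-element problem from this thread concrete, here is a minimal sketch of the pre-processing step: strip the markup before the text reaches the full-text parser, so that a word split across inline tags stays one word. This is not Tika and not anything PostgreSQL ships. It uses only Python's standard html.parser module, and the names TextExtractor, html_to_text, BLOCK_TAGS and SKIP_TAGS are hypothetical. The tag lists are an assumption and deliberately incomplete, which is exactly the kind of edge case a mature extractor handles for you.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that should separate words.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "table", "tr", "td", "th",
    "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
}
# Contents of these elements are not visible text and are dropped.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text; inline tags (b, i, span, ...) add no whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>Hel<b>lo</b> world</p><p>Second</p>"))
# prints: Hello world Second
```

The resulting string can then be passed to to_tsvector as ordinary text. The sketch ignores malformed markup, hidden elements, comments inside text, and encoding detection.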

Thanks,
Ryan Pedela

In response to

Re: Html parsing and inline elements at 2016-04-13 15:57:19 from Marcelo Zabani
