Re: Replacement for Oracle Text

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Josh berkus <josh(at)agliodbs(dot)com>
Cc: s d <daku(dot)sandor(at)gmail(dot)com>, Postgresql General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Replacement for Oracle Text
Date: 2016-02-19 19:23:34
Message-ID: CAF4Au4zq=GLioTWF9byiQ_iSY3TYcbEaM31za5X2OKQu5yQJfA@mail.gmail.com
Lists: pgsql-general

On Fri, Feb 19, 2016 at 8:28 PM, Josh berkus <josh(at)agliodbs(dot)com> wrote:

> On 02/19/2016 05:49 AM, s d wrote:
>
>> On 19 February 2016 at 14:19, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>
>> I wonder if PLPerl could be used to extract the words from a PDF
>> document and create a tsvector column from it.
>>
>>
>> I don't know about PLPerl (though I'm pretty sure it could be used for
>> this purpose). On the other hand, I've written code for this in Python,
>> which should be easy to adapt for PLPython if necessary.
>>
>
> I'd swear someone already built something to do this. All you need is a
> library which reads PDF and transforms it into text, and then you can FTS
> it. I know there's a module for OpenOffice docs somewhere as well, but
> heck if I can remember where.
>

I used pdftotext for that.
I think it would be useful to have extension(s) that can convert
anything to text. I remember people indexing chemical formulae, TeX/LaTeX,
and DOC files.
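
For what it's worth, here is a rough Python sketch of the pdftotext route:
run pdftotext with "-" as the output file so the text goes to stdout, then
hand that text to to_tsvector(). The table "documents" and its columns
(id, body, body_tsv) are made up for illustration, and the 'english'
configuration is only an assumption; any driver that uses %s placeholders
(psycopg2, for example) could execute the statement.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file asks pdftotext to write the text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, run=subprocess.run):
    # `run` is injectable so the function can be exercised without
    # pdftotext installed; by default it is subprocess.run.
    result = run(pdftotext_command(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical schema: documents(id, body text, body_tsv tsvector).
UPDATE_SQL = (
    "UPDATE documents "
    "SET body = %s, body_tsv = to_tsvector('english', %s) "
    "WHERE id = %s"
)


def index_params(doc_id, text):
    # Parameters in the order UPDATE_SQL expects them.
    return (text, text, doc_id)
```

With a DB-API cursor this would be something like
cur.execute(UPDATE_SQL, index_params(doc_id, extract_text(path))). The same
two functions should be easy to wrap in a PL/Python function if the
extraction has to happen inside the database.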

>
> --
> Josh Berkus
> Red Hat OSAS
> (any opinions are my own)
