Re: PDF Parsing and Indexing

From: Doug McNaught <doug(at)wireboard(dot)com>
To: "Raymond" <support(at)bigriverinfotech(dot)com>
Cc: "PostgreSQL General Listserver" <pgsql-general(at)postgresql(dot)org>
Subject: Re: PDF Parsing and Indexing
Date: 2001-06-15 23:33:42
Message-ID: m3bsnpjp2h.fsf@belphigor.mcnaught.org
Lists: pgsql-general

"Raymond" <support(at)bigriverinfotech(dot)com> writes:

> I need to parse / index Adobe PDF content and store both the document and
> index in Postgres.
>
> Has anybody had experience in doing this?

I can give you some information that may get you started.

PDF is a (mostly) open standard. The spec is available from Adobe's
website, and there are libraries out there (some free, some
commercial) to help you work with it. That said, there are several
gotchas:

* It is possible to both compress and encrypt PDF content. You need
the proper data filters to handle documents of these types, and some
may only be available commercially.

* PDF is a page description language like PostScript (except it does
not include a Turing-complete programming language as well). It
provides for arbitrary placement of each glyph on the page. So the
word "this" might be encoded in the file as something like:

moveto(100, 200)
draw("t")
moveto(105, 200)
draw("h")
moveto(112, 200)
draw("i")
moveto(115, 200)
draw("s")

You can see that it would be hard to index something like this in any
kind of useful way.
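
To make the problem concrete, here is a rough Python sketch of how an
indexer might stitch positioned glyphs back into words. The (x, y, char)
tuples, the group_glyphs name, and the max_gap threshold are all made up
for illustration. A real extractor would use the font metrics in the PDF
instead of a fixed gap, so treat this as a toy, not a working parser.

```python
def group_glyphs(glyphs, max_gap=10.0):
    """Group (x, y, char) glyphs into words.

    Assumption: glyphs on the same baseline (same y) whose x positions
    are at most max_gap apart belong to the same word. max_gap is a
    hypothetical tuning knob, not something taken from the PDF spec.
    """
    words = []
    current = []
    last = None
    # PDF y coordinates grow upward, so sort top-to-bottom, then left-to-right.
    for x, y, ch in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if last is not None and (y != last[1] or x - last[0] > max_gap):
            words.append("".join(current))
            current = []
        current.append(ch)
        last = (x, y)
    if current:
        words.append("".join(current))
    return words


# The glyphs from the example above, followed by a second word further right.
glyphs = [
    (100, 200, "t"), (105, 200, "h"), (112, 200, "i"), (115, 200, "s"),
    (140, 200, "i"), (143, 200, "s"),
]
print(group_glyphs(glyphs))
```

Even this simple heuristic falls apart with kerning, rotated text, or
multiple columns, which is why indexing arbitrary PDFs is hard.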

PDF files are binary and can be arbitrarily large, so I would probably
store them in Postgres as large objects.
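
As a minimal sketch of the large-object approach, assuming the Python
psycopg2 driver (my choice for illustration, nothing above requires it):
the store_pdf helper and the pdf_documents(name text, content oid) table
are hypothetical names. The connection is passed in, so the snippet
itself needs no imports.

```python
def store_pdf(conn, name, pdf_bytes, chunk_size=65536):
    """Write pdf_bytes into a new large object and record its OID.

    Assumptions: conn is an open psycopg2 connection, and a table
    pdf_documents(name text, content oid) already exists. Both the
    table and this helper are hypothetical.
    """
    # oid=0 asks the server to allocate a new large object.
    lobj = conn.lobject(0, "wb")
    # Write in chunks so a very large file is not sent in one call.
    for start in range(0, len(pdf_bytes), chunk_size):
        lobj.write(pdf_bytes[start:start + chunk_size])
    oid = lobj.oid
    lobj.close()
    # Keep the OID in an ordinary table so the document can be found again.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pdf_documents (name, content) VALUES (%s, %s)",
            (name, oid),
        )
    conn.commit()
    return oid
```

Any extracted text or index data would then live in regular columns or
tables that refer back to the stored OID.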

I recommend you download and at least skim the PDF spec (a 500-page
PDF, natch) to get an idea of what you're in for in the general case.

-Doug
--
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time... --Dylan
