Re: tsvector limitations

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tim" <elatllat(at)gmail(dot)com>,<pgsql-admin(at)postgresql(dot)org>
Subject: Re: tsvector limitations
Date: 2011-06-14 22:19:04
Message-ID: 4DF79808020000250003E649@gw.wicourts.gov
Lists: pgsql-admin

Tim <elatllat(at)gmail(dot)com> wrote:

> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000)) );
>
> and I got this:
> #ERROR: string is too long for tsvector (30990860 bytes, max
> 1048575 bytes)

Your test data (whatever it is that you used) doesn't seem typical of
English text. The entire PostgreSQL documentation in HTML form, with
all the HTML files concatenated, is 11424165 bytes (11MB), and the
tsvector of that is 364410 bytes (356KB). I don't suppose you know of
some publicly available file on the web that I could use to reproduce
your problem?
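
For what it's worth, a measurement like that can be taken right in
psql, along these lines (the file name here is only a placeholder;
pg_read_file() is superuser-only and resolves the path relative to
the data directory):

  -- 'docs.txt' is a stand-in for whatever concatenated text you test
  SELECT octet_length(txt) AS raw_bytes,
         length(to_tsvector(txt)) AS lexemes,
         pg_column_size(to_tsvector(txt)) AS tsvector_bytes
  FROM (SELECT pg_read_file('docs.txt', 0, 100000000) AS txt) AS f;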

> The year is 2011; I don't think searching a 2MB text file is too
> much to expect.

Based on the ratio for the PostgreSQL docs, it seems possible to
index documents considerably larger than that. Without the markup
(as in the case of a PDF), I bet it would take a lot less text than
what I saw for the docs to hit the limit. A printed or typewritten
page usually holds about 2KB of text. I used pdftotext to extract
the text of a 119-page technical book about database technology,
and it came to 235KB of text. I made a tsvector for that, and it
was 99KB. So, at *that* rate you'd need about 10 books that size,
totaling well over 1,200 pages of text, in a single document to hit
the limit you showed. Well, probably more than that, because some
of the words would be repeated from one book to another.
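
Spelling out that back-of-the-envelope estimate with the figures
above (a rough check only, since real text will vary):

  -- ~99KB of tsvector per 119-page book versus the 1048575-byte limit
  SELECT 1048575.0 / (99 * 1024) AS books_to_reach_limit,              -- about 10.3
         round(119 * 1048575.0 / (99 * 1024)) AS pages_to_reach_limit; -- about 1230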

So, I'm back to wondering what problem you're trying to solve where
this is actually a limitation for you.

-Kevin
