Re: tsvector limitations

From: Tim <elatllat(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsvector limitations
Date: 2011-06-14 23:04:36
Message-ID: BANLkTikWX0dTrU0zAdK=0Oi_kUxr7=f6Zg@mail.gmail.com
Lists: pgsql-admin

Hi Kevin,

My test input was indeed atypical vocabulary; it was a dictionary file.
I was intentionally trying to hit the limit to find out where it was,
because the documentation did not directly address it.
I am mainly trying to find out whether this will actually be a limitation for me.
Thank you for the data points on the PostgreSQL docs
(text=11MB, tsvector=0.356MB) and the unnamed technical book (text=0.2MB,
tsvector=0.1MB).
It seems that as long as a document uses a small percentage of a language's
potential vocabulary it would be hard to hit the limit.
On the other hand, a dictionary, or a file using more than about 4% of a
vocabulary, would need special handling.
(I guess this would be more likely with larger vocabularies, such as the
Japanese writing systems.)
The 4% number is probably low due to misspellings, slang, etc.
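
For what it's worth, the ratio from your docs measurement gives a rough
ceiling for ordinary prose. This is just back-of-the-envelope arithmetic,
assuming a similar text-to-tsvector ratio:

-- rough ceiling for docs-like prose: the 364410-byte tsvector came from
-- 11424165 bytes of text, and my error message showed a 1048575-byte cap
SELECT pg_size_pretty((1048575 / (364410.0 / 11424165))::bigint)
    AS approx_indexable_text;
-- roughly 31 MB of similar text before TO_TSVECTOR() errors out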

Anyway, I now have an approximate answer to the original question of where
the limit is, and for my goals it is probably safe to just check and warn when
a file can't be indexed.
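
Here is a rough sketch of the check-and-warn wrapper I have in mind
(untested; it assumes the "string is too long for tsvector" error is
reported under the program_limit_exceeded condition, and the
try_to_tsvector name is just an example):

CREATE OR REPLACE FUNCTION try_to_tsvector(doc TEXT) RETURNS TSVECTOR AS $$
BEGIN
    -- TO_TSVECTOR() raises an error when the resulting tsvector would
    -- exceed its internal limit (max 1048575 bytes, as in my error above)
    RETURN TO_TSVECTOR(doc);
EXCEPTION
    -- assumption: the tsvector size error maps to the
    -- program_limit_exceeded condition (SQLSTATE class 54)
    WHEN program_limit_exceeded THEN
        RAISE WARNING 'document of % bytes is too large to index; storing NULL',
                      OCTET_LENGTH(doc);
        RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- usage, so an oversized file does not abort the whole INSERT:
-- INSERT INTO test VALUES ('text.docx', LO_IMPORT('text.docx'),
--     try_to_tsvector(pg_read_file('text.txt', 0, 100000000)));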

On Tue, Jun 14, 2011 at 6:19 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> Tim <elatllat(at)gmail(dot)com> wrote:
>
> > So I ran this test:
> > unzip -p text.docx word/document.xml | perl -p -e
> > 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> > ls -hal ./text.*
> > #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> > #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> > mv /tmp/text.* /var/lib/postgresql/9.0/main/
> > cd ~/;psql -d postgres
> > #psql (9.0.4)
> > CREATE DATABASE test;
> > \q
> > cd ~/;psql -d test
> > CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> > INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> > TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000)) );
> >
> > and I got this:
> > #ERROR: string is too long for tsvector (30990860 bytes, max
> > 1048575 bytes)
>
> Your test (whatever data it is that you used) doesn't seem typical of
> English text. The entire PostgreSQL documentation in HTML form,
> when all the HTML files are concatenated, is 11424165 bytes (11MB),
> and the tsvector of that is 364410 bytes (356KB). I don't suppose you
> know of some publicly available file on the web that I could use to
> reproduce your problem?
>
> > The year is 2011; I don't think searching a 2MB text file is too
> > much to expect.
>
> Based on the ratio for the PostgreSQL docs, it seems possible to
> index documents considerably larger than that. Without the markup
> (as in the case of a PDF), I bet it would take a lot less than what
> I saw for the docs. A printed or typewritten page usually has about
> 2KB of text. I used pdftotext to extract the text contents
> of a 119-page technical book about database technology, and it came
> to 235KB of text. I made a tsvector for that, and it was 99KB. So,
> at *that* rate you'd need about 10 books that size, totaling roughly
> 1,200 pages of text in a single document to hit the limit you showed.
> Well, probably more than that, because some of the words might be
> repeated from one book to another.
>
> So, I'm back to wondering what problem you're trying to solve where
> this is actually a limitation for you.
>
> -Kevin
>
