Re: tsvector limitations

From: Tim <elatllat(at)gmail(dot)com>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsvector limitations
Date: 2011-06-14 22:33:41
Message-ID: BANLkTi=Rz9imi7eKF7DmQigjut5pgQJYSg@mail.gmail.com
Lists: pgsql-admin

Hi Craig,

Thanks for writing.

If one were to try to work around the tsvector limit (I'm not sure I need
to yet; this thread is mainly to determine that), then instead of a
solution involving a "vocabulary" file, one would probably be better off
discarding tsvectors: make a vocabulary table, then link it to documents
with a (dict_id, hit_count, word_id) table.
That would be faster, smaller, and more accurate, because it would not
carry the now-useless position information, while it would preserve the
otherwise-lost word count information.
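For concreteness, the kind of schema I have in mind might look something
like this (a rough sketch only; table and column names are placeholders):

```sql
-- Sketch of the proposed design; all names are hypothetical.
CREATE TABLE vocabulary (
    word_id  SERIAL PRIMARY KEY,
    word     TEXT NOT NULL UNIQUE
);

CREATE TABLE document (
    dict_id  SERIAL PRIMARY KEY,
    title    VARCHAR(256),
    data     OID
);

-- One row per (document, word): position information is dropped,
-- but the word count that tsvector discards is kept.
CREATE TABLE document_word (
    dict_id    INTEGER NOT NULL REFERENCES document,
    word_id    INTEGER NOT NULL REFERENCES vocabulary,
    hit_count  INTEGER NOT NULL,
    PRIMARY KEY (dict_id, word_id)
);
```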

I wonder if anyone has any insight into the inner workings of
ORACLE/MSSQL/etc. FTS.
Maybe there is a common design pattern I/we can use to handle text files
with a non-trivial vocabulary while preserving position information.
I'm not sure a (dict_id, position, word_id) table would be queryable in a
useful way, because as far as I can think at the moment there is no good
way to compare the positions in different rows with SQL.
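For what it's worth, the only SQL I can come up with for a positional
query over such a table is a self-join, e.g. finding places where one word
is immediately followed by another (a sketch against the hypothetical
(dict_id, position, word_id) table; whether it would perform acceptably on
a large corpus is exactly the open question):

```sql
-- Hypothetical phrase search: 'full' immediately followed by 'text'.
-- word_pos is the assumed (dict_id, position, word_id) table.
SELECT a.dict_id, a.position
FROM   word_pos a
JOIN   word_pos b
  ON   b.dict_id  = a.dict_id
 AND   b.position = a.position + 1
WHERE  a.word_id = (SELECT word_id FROM vocabulary WHERE word = 'full')
  AND  b.word_id = (SELECT word_id FROM vocabulary WHERE word = 'text');
```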

I will collect a few more sample files to see how much I really need this
.... multi-language files are probably the worst offenders.

On Tue, Jun 14, 2011 at 5:12 PM, Craig James <craig_james(at)emolecules(dot)com>wrote:

> On 6/14/11 1:42 PM, Tim wrote:
>
> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000)) );
>
> and I got this:
> #ERROR: string is too long for tsvector (30990860 bytes, max 1048575
> bytes)
>
> Doing the math:
> echo "scale=3;29/(30990860/1048575)"|bc #==0.981
>
> indicates a (worst-case) limit of searching only about the first 1 MB of
> a text file before you need to start building your own search, maybe on
> top of tsvector.
> The year is 2011; I don't think searching a 2 MB text file is too much to
> expect.
>
> The novel "Hawaii" at 960 pages is roughly 1MB. tsvector was intended for
> documents (web pages, news articles, corporate memos, ...), not for books.
> What you're asking for is interesting, but you can't complain that an
> open-source project that was designed for a different purpose doesn't meet
> your needs.
>
> So how am I to use the PGSQL FTS as a "full text search" when the above
> example can only handle a "small or partial text search"?
> If I'm not missing anything, maybe the documentation should be adjusted
> accordingly.
>
> Maybe a better question is: how am I to use PGSQL FTS as a "massively
> huge text search" when it was designed for nothing bigger than a "huge
> text search"?
>
> Any thoughts or alternatives are most welcome.
>
> I'm curious how tsvector could be useful on a 29 MB document. That's
> roughly one whole encyclopedia set. A document that size should have a huge
> vocabulary, and tsvector's index would be saturated.
>
> However, if the vocabulary in this 29 MB document isn't that big, then you
> might consider creating a smaller "document." You could write a Perl script
> that scans the document and creates a dictionary which it writes out as a
> secondary "vocabulary" file that's a list of the unique words in your
> document. Create an auxiliary column in your database to hold this
> vocabulary for each document, and use tsvector to index that. The Perl
> program would be trivial, and tsvector would be happy.
>
> Craig
>
>
