Re: Dumb question involving to_tsvector and a view

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Jasen Betts <jasen(at)xnet(dot)co(dot)nz>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Dumb question involving to_tsvector and a view
Date: 2013-02-24 16:56:54
Message-ID: 1361725014.4245.YahooMailNeo@web162905.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Jasen Betts <jasen(at)xnet(dot)co(dot)nz> wrote:
>On 2013-02-23, Raymond C. Rodgers <sinful622(at)gmail(dot)com> wrote:

>> On 02/23/2013 05:26 AM, Tom Lane wrote:
>>> A "virtual" tsvector like that is probably going to be useless for
>>> searching as soon as you get a meaningful amount of data, because the
>>> only way the DB can implement a search is to compute the tsvector
>>> value for each table row and then examine it for the target word(s).
>>> What you want is a GIST or GIN index on the contents of the tsvector.
>
>> I think the only real advantage to using something like this would be a
>> space savings in terms of storing the tsvector data, but I don't see
>> that being a significant enough reason to go ahead and use this idea in
>> a production situation. As mentioned [by pretty much all of us], once
>> the table size is sufficiently large there would be a performance
>> penalty by to_tsvector being executed on every record in the table.
>
>Unless the plan comes out as a table scan the index will be used
>instead ot to_tsvector()
>
>When there is a table scan to_tsvector will be used instead of reading
>from disk, I don't know how fast to_tsvector is compared to disk, but
>usually computing a result is faster than reading it from disk.
>
>Storing the tsvector in the table is likely to be faster only when a
>tablescan is done and the table is fully cached in ram.

I guess I was being dumb in assuming that it was obvious that a GIN
or GiST index would be needed for decent performance at scale.
Without that, a scan of the whole table (or at least all rows
matching other search criteria) is needed, which is going to hurt.
The benchmarks I mentioned were for a GIN index on the results of
the function which generated the tsvector, versus a GIN index on
the stored tsvector.  In our case, a typical scan for document text
against years of accumulated court documents was about 300 ms
versus about 1.5 seconds.  It may matter that we weren't just
looking for matches, but the top K matches based on the ranking
function.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Ray Stell 2013-02-24 19:37:05 Re: [JDBC] can't access through SSL
Previous Message Alban Hertroys 2013-02-24 11:52:16 Re: autoanalyze criteria