Tsearch docs question

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-docs(at)postgresql(dot)org
Subject: Tsearch docs question
Date: 2007-10-26 18:25:36
Message-ID: 1193423136.7624.56.camel@dogma.ljc.laika.com
Lists: pgsql-docs

The Tsearch docs, under the GiST and GIN section, say:

"Lossiness [of GiST] causes serious performance degradation since random
access of heap records is slow and limits the usefulness of GiST
indexes."

The docs do go into some detail, but I think this statement also causes
some confusion.

Let me digress to state how I understand the relationship between GIN,
GiST, and RECHECK:

The benefit of avoiding RECHECK is to avoid the need to re-evaluate the
predicate after finding the entry in the index. This can be valuable in
tsearch, because the functions are much more expensive than (for
example) integer equality. We (currently) have to visit the heap anyway,
to see the visibility information. So avoiding a RECHECK clause doesn't
do anything to prevent random heap I/O (although a less lossy index
will, by definition, have fewer false positives).
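
To make that distinction concrete, here is a toy Python model. It is not
PostgreSQL code; the names index_scan, expensive_match and ScanStats are
made up for illustration. It assumes that every candidate returned by the
index costs one heap visit (for visibility), and that RECHECK only adds a
predicate evaluation on top of that visit.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    heap_visits: int = 0
    predicate_evals: int = 0


def expensive_match(doc_words, query_word):
    # Stand-in for an expensive tsearch function; here just a set lookup.
    return query_word in doc_words


def index_scan(heap, candidates, query_word, recheck):
    """Return (matching row ids, stats) for a list of index candidates."""
    stats = ScanStats()
    rows = []
    for row_id in candidates:
        # Assumption: the heap is visited for every candidate,
        # with or without RECHECK, to read visibility information.
        stats.heap_visits += 1
        doc_words = heap[row_id]
        if recheck:
            # RECHECK re-evaluates the predicate against the heap tuple.
            stats.predicate_evals += 1
            if not expensive_match(doc_words, query_word):
                continue
        rows.append(row_id)
    return rows, stats


heap = {1: {"cat", "dog"}, 2: {"fish"}, 3: {"cat"}}

# A non-lossy index returns exactly the matching rows.
exact_rows, exact_stats = index_scan(heap, [1, 3], "cat", recheck=False)

# A lossy index also returns row 2 as a false positive, so it needs RECHECK.
lossy_rows, lossy_stats = index_scan(heap, [1, 2, 3], "cat", recheck=True)
```

In this model both scans return the same rows. Skipping RECHECK saves the
predicate evaluations, while the heap visits differ only by the number of
false positives the lossy index produced.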

GIN (as used with tsearch) is lossy for more sophisticated tsqueries
(those involving labels) and non-lossy for simpler tsqueries. There's
only one tsquery type, so PostgreSQL has no way of differentiating
between these two cases.

GiST (as used with tsearch) is lossy for large tsvectors or tsqueries
containing labels; and non-lossy for small tsvectors matched against a
tsquery that contains no labels. PostgreSQL can't differentiate between
these two cases.
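
As a rough illustration of where the lossiness comes from, the sketch
below hashes each word to one bit of a fixed-length signature, which is my
understanding of how large tsvectors are represented in the GiST index.
The 8-bit width and the byte-sum hash are assumptions chosen only to make
a collision easy to see; they are not what PostgreSQL uses.

```python
SIG_BITS = 8  # deliberately tiny so that collisions are easy to produce


def word_bit(word):
    # Toy deterministic hash: sum of the bytes modulo the signature width.
    return sum(word.encode("utf-8")) % SIG_BITS


def signature(words):
    # OR together one bit per word to build the document signature.
    sig = 0
    for word in words:
        sig |= 1 << word_bit(word)
    return sig


def index_may_match(doc_sig, query_word):
    # True means "possibly present"; only False is a definite answer.
    return bool(doc_sig & (1 << word_bit(query_word)))


doc = {"ab"}
doc_sig = signature(doc)

true_hit = index_may_match(doc_sig, "ab")        # word really is in doc
false_positive = index_may_match(doc_sig, "ba")  # same bit, not in doc
definite_miss = index_may_match(doc_sig, "ac")   # different bit
```

Here "ba" hashes to the same bit as "ab", so the signature alone cannot
rule it out, and only re-evaluating the predicate against the real
document gives the right answer.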

So, for GiST the operator always RECHECKs (so you're always sure to get
the right result), and for GIN the default operator does not RECHECK (for
performance). If you suspect that you might be using labels in your
tsqueries, though, you need to use a special RECHECKing operator, "@@@",
to be accurate.

Is the above accurate?

Back to the docs: I think the docs could clear this issue up somewhat.
The current wording suggests that GIN performs better because it avoids
a trip to the heap, when in reality it seems the benefit is avoiding the
need to re-evaluate the expensive tsearch functions (which might need to
access TOASTed data).

There's also a related issue: I think a RECHECK would be less costly if
you have the tsvectors materialized in the table (using triggers) and
index that. Maybe that could be a tip for using GiST indexes.

Regards,
Jeff Davis
