Speaking of Indexing... (Text indexing)

From: Poet/Joshua Drake <poet(at)linuxports(dot)com>
To: <lockhart(at)fourpalms(dot)org>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Speaking of Indexing... (Text indexing)
Date: 2001-04-10 22:41:31
Message-ID: Pine.LNX.4.30.0104101541140.10612-100000@crazypenguins.commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Good day,

I've been experimenting a bit with Full Text Indexing in PostgreSQL. I
have found several conflicting sites various places on the net pertaining
to whether or not PostgreSQL supports FTI, and I was hoping I could find
an authoritative answer here - I tried searching the website's archives,
but the search seems to be having some problems.

At any rate, I am running a CVS snapshot of 7.1, and I have been trying to
create a full text index on a series of resumes. Some of these exceed 8k
in size, which is no longer a storage problem of course with 7.1, but I
seem to have run into the wicked 8k once again. Specifically:

ERROR: index_formtuple: data takes 9344 bytes, max is 8191

Furthermore, after trying to just index on a 8191-character long substring
of the resume, I run into the following:

ERROR: btree: index item size 3948 exceeds maximum 2713

The only way I could actually get the index created was to substring the
body of the resumes down to 2k. I also later tried using HASH rather than
BTREE, which worked, but none of these solutions really appreciably
increased performance in the way we were hoping.

Are these known and accepted limitations of the current 7.1
implementation, or am I doing something terribly wrong? ;)
On Tue, 10 Apr 2001, Thomas Lockhart wrote:

>> I have the following table, containing about 570000 Rows, but some
>> indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the
>> same at least in 7.1RC1
>> CREATE TABLE access_log(
>> access_time timestamp NOT NULL DEFAULT NOW(),
>> method_num int2 NOT NULL,
>> url_id int4 NOT NULL REFERENCES urls(id),
>> );
>> CREATE INDEX method_idx ON access_log(method_num);
>> CREATE INDEX url_idx ON access_log(url_id);
>> url_idx seems OK:
>> But the others not:
>> logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
>> Seq Scan on access_log (cost=0.00..16443.71 rows=559371 width=89)
>
>The parser does not know that your int4 constant "0" can be represented
>as an int2. Try
>
> SELECT * FROM access_log WHERE method_num = int2 '0';
>
>(note the type coersion on the constant; there are other ways of
>specifying the same thing).
>
>For the other cases, PostgreSQL is estimating the query cost to be lower
>with a sequential scan. For the "SELECT 1" subselect case, it may be
>that the optimizer does not cheat and determine that there will be only
>one row returned, or that the query can be reformulated to use a simple
>constant.
>
>HTH
>
> - Thomas
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>

--
--
<COMPANY>CommandPrompt - http://www.commandprompt.com </COMPANY>
<PROJECT>OpenDocs, LLC. - http://www.opendocs.org </PROJECT>
<PROJECT>LinuxPorts - http://www.linuxports.com </PROJECT>
<WEBMASTER>LDP - http://www.linuxdoc.org </WEBMASTER>
--
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
--

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter Eisentraut 2001-04-10 23:03:55 Re: LOs and pg_dump, restore, vacuum for 7.1
Previous Message Joel Burton 2001-04-10 21:20:30 Re: Converting from access to pgsql..questions...

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2001-04-10 23:09:21 Re: Re: HOLD THE PRESSES!! ... pg_dump from v7.0.3 can't import to v7.1?
Previous Message Kyle VanderBeek 2001-04-10 21:08:22 Re: Large Object problems (was Re: JDBC int8 hack)