Quick Links

Re: advice on indexing email

From:	Marc Tardif <intmktg(at)CAM(dot)ORG>
To:	Maarten Boekhold <maarten(dot)boekhold(at)tibcofinance(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: advice on indexing email
Date:	2000-04-28 15:45:47
Message-ID:	Pine.LNX.4.10.10004281122560.31860-100000@Gloria.CAM.ORG
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Regarding your first suggestion, I cannot use a trigger since the actual
email contents will not be inserted in the database because of the 8K
limit on text fields. Instead, I will be keeping the messages as a
seperate file and only keep the pathname in the database.

As for your second suggestion, which of the following methods should I use
to actually enter the words into the database:
- INSERT for each word, which will probably be slow considering an email
message will most likely contain >50 words.
- BEGIN before and END after inserting every word, not sure if that makes
much difference considering there will still be the overhead of single
inserts.
- creating a temporary file and using COPY, which also has it's own share
of overhead considering the additional i/o of creating a file.

PS. Thanks for the fti crontrib, much appreciated. I have already used
your example to create a few other indexing alternatives of my own. I have
also created a daily maintenance script to cluster and rebuild the indices
of my fti tables. Let me know if that could be of interest to you, perhaps
it could be added to your contrib directory.

> > My problem is how to create the full word index. The actual code to
> > seperate the email into seperate words isn't a problem, but should I be
> > using INSERT, BEGIN/END or COPY? In this last case, I would have to create
> > a temporary file holding each word of the email and then use COPY... all
> > of which also has it's fair share of overhead.
>
> You can use one of 2 ways.
>
> 1. the fti stuff in contrib uses triggers, so every time you
> insert/update/delete something in/from the 'fti-ed' table, the full text index
> is also updated. If you're coding abilities are OK, you can just replace the
> word breakup code in contrib/fti with your own one.
>
> 2. if you have to insert large amounts of data, it is probably faster to *not*
> create the triggers at first, bulk load all your data, write a little perl
> script that reads the data from your table, does the word breakup and inserts
> those words into the full text index table. Using a 'sort' on the output of
> the perl script will help performance as the fti data will now already be
> pre-sorted in the database (you could also use CLUSTER on the fti table after
> the index has been created). I think I described this somewhat better in the
> README in contrib/fti. If you take this approach, don't forget to create the
> triggers after the bulk load of the fti table!

In response to

Re: advice on indexing email at 2000-04-28 12:04:50 from Maarten Boekhold

Browse pgsql-general by date

	From	Date	Subject
Next Message	Loïc TREGOUËT	2000-04-28 15:52:15	How tu put/read a composite type in a table?
Previous Message	Bruce Momjian	2000-04-28 14:48:34	Re: Revisited: Transactions, insert unique.