N-grams

From: Anthony Gentile <asgentile(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: N-grams
Date: 2011-01-13 02:37:42
Message-ID: AANLkTi=Gs8obcr_suRmEOUYUXpYRVNGzO9s2TWWMqn2m@mail.gmail.com
Lists: pgsql-hackers

Hello,

Today I was reading a blog post from a coworker
http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/ and
started experimenting with the trigram contrib package for Postgres, trying
it with different word dictionaries for English and German. I wanted to see
how well particular queries would perform if SIGLENINT in trgm.h were
adjusted to the average character length of the words in a particular
dictionary:

http://packages.ubuntu.com/dapper/wamerican
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM
english_words;
avg
--------------------
8.4498980409662267

vs

http://packages.ubuntu.com/dapper/wngerman
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM words; -- German
avg
---------------------
11.9518056504365566

(Unsurprisingly, German words are on average longer than English ones.)

Effectively, I want the trigram package to behave like a general n-gram
package in which N is set explicitly at build time. I am, however, not very
proficient in C, and I doubt that SIGLENINT is the only change needed to
convert the trigram contrib module to n-grams: after changing SIGLENINT to
12 in trgm.h, I still get trigram results from show_trgm(). I was hoping
someone familiar with the code could outline what would need to change for
the trigram implementation to behave as an n-gram one. Thanks for your time;
I appreciate any advice.
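
To make the goal concrete, the desired behaviour can be sketched in Python.
This is only an illustration, not the contrib module's C code. The function
name show_ngrams is hypothetical, the word splitting is a rough approximation,
and the padding rule (N-1 leading blanks, one trailing blank) is an assumption
that matches the trigram padding only for N = 3.

```python
import re


def show_ngrams(text, n=3):
    """Return the sorted, de-duplicated n-grams of `text`.

    Hypothetical helper: for n=3 it imitates the trigram padding (two
    leading blanks, one trailing blank per word). For any other n the
    n-1 leading blanks are an assumption made for this sketch.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    grams = set()
    # Lowercase the input and treat every run of letters/digits as a word.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = " " * (n - 1) + word + " "
        # Slide a window of width n across the padded word.
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return sorted(grams)
```

With this sketch, show_ngrams("cat", 3) returns ['  c', ' ca', 'at ', 'cat'],
while show_ngrams("cat", 4) returns ['   c', '  ca', ' cat', 'cat '].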

Anthony Gentile
