Tsearch vs Snowball, or what's a source file?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Tsearch vs Snowball, or what's a source file?
Date: 2007-06-02 20:05:48
Lists: pgsql-hackers

While looking at the tsearch-in-core patch I was distressed to notice
that a good fraction of it is derived files, bearing notices such as

/* This file was generated automatically by the Snowball to ANSI C compiler */
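
As a rough illustration of how files carrying such a notice could be
spotted mechanically, here is a minimal C sketch that looks for the
notice text near the top of a file.  The helper names
has_generated_notice and file_is_derived, the marker substring, and the
line limit are assumptions made for this example; none of them come from
the patch or from Snowball.

```c
#include <stdio.h>
#include <string.h>

/* Marker substring taken from the notice quoted above (assumed stable). */
static const char GENERATED_MARKER[] = "This file was generated automatically";

/* Hypothetical helper: 1 if the text contains the marker, else 0. */
static int
has_generated_notice(const char *text)
{
    return text != NULL && strstr(text, GENERATED_MARKER) != NULL;
}

/*
 * Hypothetical helper: scan at most max_lines reads from the start of a file.
 * Returns 1 if the marker is found, 0 if not, -1 if the file can't be opened.
 * Lines longer than the buffer are read in pieces, each counted as one read.
 */
static int
file_is_derived(const char *path, int max_lines)
{
    char  line[1024];
    FILE *fp = fopen(path, "r");
    int   found = 0;
    int   i;

    if (fp == NULL)
        return -1;

    for (i = 0; i < max_lines && fgets(line, sizeof line, fp) != NULL; i++)
    {
        if (has_generated_notice(line))
        {
            found = 1;
            break;
        }
    }

    fclose(fp);
    return found;
}
```

Calling file_is_derived(path, 5) on each .c file in a directory would
flag the ones whose first few lines carry the notice.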

Our normal policy is "no derived files in CVS", so I went looking to
see if we couldn't avoid that.  I now see that contrib/tsearch2 has been
doing the same thing for a while, and it's risen up to bite us before, eg

I had not previously known anything about Snowball, but after perusing
their website for a bit, I believe the following is an accurate summary:

1. The original word-stemming algorithms are written in a special
language "Snowball".  You can get both the Snowball compiler and the
original ".sbl" source files off the Snowball site, but the files in the
patch are not those.

2. The Snowball people also distribute a "pre-compiled" version of their
stuff, ie, the results of generating ANSI C code from all the stemming
algorithms.  They call this distribution "libstemmer".

3. What we've been distributing in contrib/tsearch2/snowball is a
severely cut-back subset of libstemmer, ie, just the English and Russian
stemmers.  This accounts for the occasional complaints in the mailing
lists from people who were trying to add other stemmers from the
libstemmer distribution (and running into version-skew problems, because
the version we're using is not very up-to-date).

4. The proposed tsearch-in-core patch includes a larger subset of
libstemmer, but it's still not the whole thing, and it still seems to be
a modified copy rather than an exact one.

There isn't any part of this that seems to me to be a good idea.
Arguably we should be relying on the original .sbl files, but that would
make the Snowball compiler a required tool for building distributions,
which is a dependency I for one don't want to add.  In any case there's
probably not a lot of practical difference between relying on the
Snowball project's .sbl files and relying on their libstemmer
distribution.  Either way, we are importing someone else's sources.
(At least they're BSD-license sources...)

What I definitely *don't* like is that we've whacked the fileset around
in ways that make it hard for someone to drop in a newer version of the
upstream sources.  The filenames don't match, the directory layout
doesn't match, and to add insult to injury we've plastered our copyright
on their files.

Following the precedent of the zic timezone files would suggest dropping
an *unmodified* copy of the libstemmer distro into its own subdirectory
of our CVS, and doing whatever we have to do to compile it without any
changes, so that we can drop in updates later without creating problems.
(This is, in fact, what the Snowball people recommend for incorporating
their code into a larger application.)

OTOH, keeping our copy of the zic files up-to-date has proven to be a
significant pain in the neck, and so I'm not sure I care to follow that
precedent exactly.  The Snowball files may not change as often as
politicians invent new timezone laws, but they seem to change regularly
enough --- the libstemmer tarball I just downloaded from their website
seems to have been generated barely a week ago, and no, it doesn't match
what's in the patch now.

Is there a reasonable way to treat libstemmer as an external library?

			regards, tom lane


