Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David Rowley" <dgrowley(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Peter Eisentraut'" <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-06 22:21:01
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"David Rowley" <dgrowley(at)gmail(dot)com> writes:
> I've made and attached the changes Heikki recommended.

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle. Isn't that a bug? Was the same bug present in the tests you
made to determine the best table sizes?

Stylistic issues:

I really dislike having two copies of the heuristic size-setting code.
I think it's worth introducing a second test of use_wchar in order to
arrange text_position_setup like this:

... existing code ...

choose skiptablesize

initialize skip table (this loop doesn't depend on char width)

if (!state->use_wchar)
process char needle
process wide-char needle

Also it might be worth having local-variable copies of skiptablesize and
len2 for this initialization loop:

for (ai = 0; ai <= state->skiptablesize; ai++)
state->skiptable[ai] = state->len2;

I'm not at all sure whether a C compiler will think that it's allowed to
avoid fetching these values again on each iteration, seeing that the
loop is storing into other integer members of *state. Given that this
loop is exactly the overhead we're trying to control by adjusting
skiptablesize, making it as tight as possible seems worthwhile.

> Now that the skip table is a member of TextPositionState, I was not quite
> sure if I should #define a size for it. It would certainly look neater, only
> the code that defines the skip table size in text_position_start assumes 256
> elements.

I tend to agree that a #define is pointless there, since there's no easy
way to make the relevant code do anything with it. It would be
worthwhile to point out adjacent to the code what the max length is,
though. I had gotten as far as rewriting your introductory comment like

* Prepare the skip table for Boyer-Moore-Horspool searching. First we
* must determine how big a skip table to use. The declaration of
* TextPositionState allows up to 256 elements, but with haystack lengths
* of only a few characters we don't really want to have to initialize so
* many elements --- it would take too long in comparison to the actual
* search time. So we choose a useful skip table size based on the
* haystack length.

before noticing that what I was writing wasn't what the code actually
did. Another point that doesn't seem to be included in your comments is
that the skip table size *must* be a power of 2 because we use bit
masking. (In fact state->skiptablesize might better be named skiptablemask
since it's really 1 less than the table size.)

BTW, try to avoid introducing so much random vertical space. IMHO it
does little for readability and much to make routines too long to fit in
one screenful. Generally the PG code doesn't use double blank lines
anywhere inside a function, nor blank lines before a right brace line.
pg_indent will clean this up to some extent, but it would be better if
the code looked more like the project style in the first place.

regards, tom lane

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2008-09-06 22:30:37 Re: reducing statistics write overhead
Previous Message Bruce Momjian 2008-09-06 21:19:20 Re: [PATCH] "\ef <function>" in psql

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2008-09-06 22:50:47 Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Previous Message Tom Lane 2008-09-06 20:57:57 Re: [PgFoundry] Unsigned Data Types [1 of 2]