Quick Links

TODO item: Implement Boyer-Moore searching (First time hacker)

From:	"David Rowley" <dgrowley(at)gmail(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	TODO item: Implement Boyer-Moore searching (First time hacker)
Date:	2008-08-30 23:21:11
Message-ID:	D558A7BC159740A1BEF36EE8997B64DD@amd64
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-patches

Reference: Bruce Momjian writes: ->
http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? ->
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-exa
mple.html

As a first time hacker of postgresql I've decided to attempt something that
does not go too deep in to the heart of the software.

Before I make an attempt at writing a patch, I've decided to start
benchmarking outside of Postgres. On reading the link underneath the Boyer
Moore item in the in the new TODO list wiki, I learned that someone else had
a shot at speeding up strpos around this time last year using KMP algotithm.
On looking a little deeper into Boyer Moore's algorithm it's quite obvious
that there is some overhead involved with preprocessing the search key
(needle). My target has been to not have any trade offs with this possibly
to be patch. I didn't want smaller searches to suffer.

I'd like to receive some feedback on the various functions in the attached
before I go ahead and work on this any more. I've written 8 different
versions of the function. Version 8 seems to be the fastest out of them,
however I've not tested with multi byte chars.

Please note that these are just some initial ideas. For example startpos is
not yet implemented, but there is very little overhead to that anyway.

For comparision sake I mocked up the current postgresql 8.3's strpos() from
varlena.c. I've no idea how accuratly this will be to the real version.
Probably find out once I write the patch. I replaced strncmp with my own
version as I was having problems with my compilers optimizer, it optimized
too much and I was unable to benchmark. I felt running without and
optimization was unfair as strncmp is. Perhaps if someone else wants to
benchmark with -O2 on gcc they should first replace pg_strncmp with strncmp.

Currently I have a single file which can be compiled itself that contains
benchmarks for the functions I've tested so far. All of these implement a
stripped out version of the Boyer Moore algotithm. I've done away with one
of the character maps in a bid to reduce the overhead involved creating
them.

The position calculation will likely need changed for multi byte characters.
I'm suspecting I'm not meant to be doing length calculations on pointers.
Can someone confirm?

To keep this email short I've put most of the information required in as
comments into the source of the program.

I've uploaded some benchmark results using the attached program. The
benchmarking was done on windows.

Please see http://www.unixbeast.com/~fat/test1.html there are 10 tests in
total, in each I attempt to exploit weaknesses, some with my functions and
some with the existing function.

My method for version 8 of the function probably looks quite unusual as it
does some simple cost based optimisation

I look forward to receiving feedback on this.

David.

Attachment	Content-Type	Size
BMStrpos.tar	application/x-tar	38.4 KB

Responses

Re: TODO item: Implement Boyer-Moore searching (First time hacker) at 2008-09-02 11:59:37 from Peter Eisentraut

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stephen R. van den Berg	2008-08-30 23:29:59	Re: Attaching error cursor position to invalid constant values
Previous Message	Tom Lane	2008-08-30 20:18:57	Attaching error cursor position to invalid constant values

Browse pgsql-patches by date

	From	Date	Subject
Next Message	Simon Riggs	2008-08-31 09:27:47	rmgr hooks and contrib/rmgr_hook
Previous Message	Brendan Jurd	2008-08-30 01:39:04	to_date() validation