Skip site navigation (1) Skip section navigation (2)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Rowley <dgrowley(at)gmail(dot)com>
Cc: 'Peter Eisentraut' <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-09-04 22:31:15
Message-ID: 48C061B3.1090709@enterprisedb.com (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-patches
After reading the wikipedia article on Boyer-Moore search algorithm, it 
looks to me like this patch actually implements the simpler 
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's 
probably fine, as it ought to be faster on small needles and haystacks 
because it requires less effort to build the lookup tables, even though 
the worst-case performance is worse. It should still be faster than what 
we have now.

The skip table really should be constructed only once in 
text_position_start and stored in TextPositionState. That would make a 
big difference to the performance of those functions that call 
text_position_next repeatedly: replace_text, split_text and text_to_array.

David Rowley wrote:
> I've done some more revisions to the patch. This has mostly just involved
> tuning the skip table size based on the size of the search. This has
> basically involved lots of benchmarks with different strings and calculating
> the best size of table to use. The reason for this is to maintain fast
> searches for smaller strings. The overhead of initialising a 256 element
> array would probably out weigh the cost of the search if this were not done.
> The size of the skip table increases with longer strings, or rather the size
> that is utilised.
> 
> Performance:
> For smaller searches performance of the patch and existing version are very
> similar. The patched version starts to out perform the existing version when
> the needle and haystack become larger.
> 
> The patch wins hands down with searches that leads the existing function in
> to dead ends, for example:
> 
> SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');
> 
> When searching for very small strings, say just a single character in a
> sentence the existing function marginally beats the patched version.
> 
> Outside of Postgres I've done benchmarks where I've searched for every
> combination of the search string in the search string. Like:
> 
> test     | t      |
> test     | te     |
> test     | tes    |
> test     | test   |
> test     | e      |
> test     | es     |
> test     | est    |
> test     | s      |
> test     | st     |
> test     | t      |
> 
> 
> I felt this was fair for both versions. The patched version beat the
> unpatched version. The test I carried out was a string of 934 characters.


-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

In response to

Responses

pgsql-hackers by date

Next:From: Alex HunsakerDate: 2008-09-04 23:10:22
Subject: Re: Need more reviewers!
Previous:From: Michelle CaisseDate: 2008-09-04 22:23:43
Subject: Re: code coverage patch

pgsql-patches by date

Next:From: Tom LaneDate: 2008-09-04 23:11:28
Subject: Re: hash index improving v3
Previous:From: Tom LaneDate: 2008-09-04 20:28:34
Subject: Re: hash index improving v3

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group