- Naïve: simple and least inefficient way to search in strings, it basically checks every position in the haystack then every position of the needle. The advantage of using this algorithm is that it needs no initial overhead such as auxiliary tables creation. It has O(mn) complexity, where m is the needle length and n the haystack length.
- Boyer-Moore: Efficient searching string algorithm that preprocesses the needle and doesn’t check every position in the haystack but rather skips some of them on each unsuccessful attempt. It has O(n) complexity with O(3n) in the worst case.
- Boyer-Moore-Horspool: It’s a simplification of Boyer-Moore’s algorithm with less overhear during initial needle preprocessing. It also has O(n) complexity but O(mn) in the worst case.
Algorithms by engines
|SpiderMonkey||Gecko||Firefox up to 3.0.*||Boyer-Moore-Horspool|
|TraceMonkey||Gecko||Firefox from 3.1.*||Boyer-Moore-Horspool|
|SquirrelFish||WebKit||Safari from 4.*||Naïve|
|V8||WebKit||Chrome||Strategy: Naïve, Boyer-Moore-Horspool and Boyer-Moore|
|Linear B||Presto||Opera 7.0 - 9.50[||?|
|Futhark||Presto Core 2||Opera from 9.50||?|
String.indexOf in C with some verifications in string lengths prior to run BMH algorithm in order to avoid long searching for relatively small strings.
Source code available at: ftp://ftp.mozilla.org/pub/mozilla.org/firefox/releases/3.0.13/source/firefox-3.0.13-source.tar.bz2
It has exactly the same
String.indexOf implementation as SpiderMonkey but in C++.
Source code available at: ftp://ftp.mozilla.org/pub/mozilla.org/firefox/releases/3.5.2/source/firefox-3.5.2-source.tar.bz2
The main part of the naïve implementation of
/* ... */ for (const UChar* c = data_ + pos; c <= end; c++) if (c->uc == fchar && !memcmp(c + 1, fdata, fsizeminusone)) return (c - data_); return -1;
* taken from KDE 4.0 API reference
Files related to
- string_object.cpp: defines the prototype for String object where
indexOfis in a switch case statement and calls
- ustring.cpp: defines the
findfunction where the naïve algorithm is implemented.
Browse the source code online: http://api.kde.org/4.0-api/kdelibs-apidocs/kjs/html/files.html
Files related to
indexOfmethod is defined and call
A very smart strategy is applied to the string searching in order to choose the best algorithm based on the length of the needle:
- First of all it checks if there is a non-ASCII needle for an ASCII haystack and bail out if there is.
- Checks if the needle length is less than 5 then uses a naïve solution called
simpleIndexOf, because the max shift of Boyer-Moore on such needle length doesn’t compensate for the overhead. This
simpleIndexOffunction never bails out, it means that the needle will be checked for a match in the whole haystack.
- If the needle length is greater than or equals to 5 another
simpleIndexOffunction is called. This one considers how much work have been done (unsuccessful matches) in order to stop trying and switch for a better algorithm. This is called the “badness count” which once reached the max, stop the search and returns the index in the haystack where the next algorithm should continue from.
- The next algorithm in the strategy chain is Boyer-Moore-Horspool which also consider the badness count prior to jump to the next algorithm.
- The last one is Boyer-Moore which has some initial overhead when creating good and bad shift tables.
Source code available at: http://code.google.com/p/v8/wiki/Source?tm=4
Rhino runs on top of Java Virtual Machine and uses the
java.lang.String.indexOf from Java language implemented for such JVM. Interestingly there is a comment saying:
“Uses Java String.indexOf(). OPT to add - BMH searching from jsstr.c”.
java.lang.String.indexOf implementation is much worse than that.
Source code available at: http://www.mozilla.org/rhino/download.html
By running a simple test across some browsers we can have an idea how fast/slow is
The test consists of the average of a 100 times running a search for the word ”foobar” in the middle of a ~1200 length ”ipsum lorem” text iterating 1 million times each search. Try it yourself.
The results in the follow table were taken by running this test on the same machine (Pentium 4HT, 3GHz, 1Gb RAM, Windows XP SP3).
|Futhark||Opera||10.00 beta 2||1549.06|
* running on the same machine with Ubuntu 9.04 live cd
** running on a VM on a different computer
*** running on Sun JRE 6 - 1.6.0_14
Again, these results don’t prove which algorithm is the best due to different browser performances, however it is worth noting that Firefox 3.0.13 performed better than Firefox 3.5.2 on this benchmark. Internet Explorer had the worst results, it can be either the algorithm or the browser performance itself or even both. :-)