Posted: 2011-11-01

Category: Programming

Three String Matching Algorithms in C++

During this post, I will show you my C++ implementation of three string algorithms:

a silly naive solution
the Rabin-Karp algorithm
the Knuth-Morris-Pratt algorithm

I've recently been fooling around with a ton of different programming projects (OpenCV, F#, C#, and of course my usual screen scraping antics...) but one topic that caught my interest is string matching algorithms. One of my recent classes in school was Algorithms I (more verbosely, the Design and Analysis of Algorithms I). Unfortunately, most of the class was spent on a brushing-over of universal hashing and more than one experimentation project concerning the choosing of pivots when implementing Quicksort. Big woop.

For the class, we were required to purchase the renowned Introduction to Algorithms... which we never opened throughout the entire class. In fact, my professor just taught off of Wikipedia for basically every lecture. Bummer. So I decided to look through it on my own time. Well there's an entire chapter dedicated to string matching! Cool! So I decided to throw together an implementation or two in C++.

Throughout this post, I used the following two terms frequently:

needle: this is the string that you are searching for. So if you are searching for "Emma" in the string "Hermione Granger is Emma Watson", then "Emma" is the needle.
haystack: this is the string that you are searching in. In the above example, "Hermione Granger is Emma Watson" would be the haystack.

To ease back into C++, I wrote a naive implementation which works perfectly well, but isn't very efficient (as there is absolutely no pre-processing step, which is where other algorithms get their speedup). It returns a vector of all of the indices where needle exists in haystack. Very intuitive. No fancy tricks.

#include <vector>
#include <string>
using namespace std;

vector<size_t> naiveSearch(const string & needle, const string & haystack)
{
    vector<size_t> matches;

    // thanks Jeff!
    if(needle.size() > haystack.size())
        return matches;

    size_t needleSize = needle.size();
    size_t maximumIndex = haystack.size() - needleSize;
    
    size_t needleIndex;
    for(size_t haystackIndex = 0; haystackIndex <= maximumIndex; haystackIndex++)
    {
        for(needleIndex = 0; needleIndex < needleSize && needle[needleIndex] == haystack[haystackIndex + needleIndex]; needleIndex++);
        
        if(needleIndex == needleSize)
            matches.push_back(haystackIndex);
    }

    return matches;
}

The next one I implemented was the Rabin-Karp string searching algorithm. Basically the algorithm works by using a specific kind of hashing. For the pre-processing step, a hash of the needle is generated. Then a hash of the first m characters of the haystack is generated, where m is the length of the needle. If those two hashes match, then we know (or are relatively certain) that the haystack starts with an instance of the needle. We then shift forward one character in the haystack, calculate the new hash for the new position, then compare the hashes. The reason this is faster than the naive search is because the hashing algorithm used in Rabin-Karp is designed to be very cheap to change one character in the hashed value. So moving the bounds of the candidate string in the haystack forward one character is cheaper than rechecking the whole string, character-by-character.

The prime used for the hashing algorithm is the largest prime less than number values expressible in your hash data type (in my case, a 64-bit integer - 2⁶⁴) divided by your alphabet size (in my case, a char - 2⁸). Wolfram Alpha is magic and tells me the prime number I can use.

The offSetMatch function just does a character-by-character check for a needle and haystack at a given offset in the haystack. Note that this function does no bounds checking!

#include <vector>
#include <string>
using namespace std;

typedef unsigned long long uint64;

bool offsetMatch(const string & needle, const string & haystack, size_t offset)
{
    size_t needleCount = needle.size();
    size_t index;
    
    for(index = 0; index < needleCount && needle[index] == haystack[offset + index]; index++);

    return index == needleCount;
}

vector<size_t> rabinKarpMatcher(const string & needle, const string & haystack)
{
    static const uint64 radixLength = 256ULL;
    static const uint64 prime = 72057594037927931ULL;

    vector<size_t> matches;

    size_t needleLength = needle.size();
    size_t haystackLength = haystack.size();
    size_t lastIndex = haystackLength - needleLength;

    uint64 differenceHash = pow(radixLength, (uint64)(needleLength - 1)) % prime;

    size_t needleHash = 0;
    size_t firstHaystackHash = 0;

    size_t index;

    // preprocessing
    for(index = 0; index < needleLength; index++)
    {
        needleHash = (radixLength * needleHash + needle[index]) % prime;
        firstHaystackHash = (radixLength * firstHaystackHash + haystack[index]) % prime;
    }

    vector<uint64> haystackHashes;
    haystackHashes.reserve(lastIndex + 1);
    haystackHashes.push_back(firstHaystackHash);

    // matching
    for(index = 0; index <= lastIndex; index++)
    {
        if(needleHash == haystackHashes[index])
            if(offsetMatch(needle, haystack, index))
                matches.push_back(index);

        if(index < lastIndex)
        {
            uint64 newHaystackHash = (radixLength * (haystackHashes[index] - haystack[index] * differenceHash) + haystack[index + needleLength]) % prime;
            haystackHashes.push_back(newHaystackHash);
        }
    }

    return matches;
}

The last searching algorithm is the Knuth-Morris-Pratt Algorithm. Like Rabin-Karp, it also does a pre-processing step on the needle before starting the search routine. It uses an array called the partial match table. I don't want to butcher the explanation, so I would recommend reading this article if you want to understand what's going on. Also, here's a neat little script that breaks down the steps a KMP execution. Great for seeing what's going on!

Here's my implementation.

#include <vector>
#include <string>
using namespace std;

vector<size_t> knuthMorrisPrattTable(const string & needle)
{
    vector<size_t> table(needle.size() + 1, -1);
    for(size_t index = 1; index <= needle.size(); index++)
    {
        size_t position = table[index - 1];
        
        while(position != -1 && needle[position] != needle[index - 1])
            position = table[position];

        table[index] = position + 1;
    }

    return table;
}

vector<size_t> knuthMorrisPrattSearch(const string & needle, const string & haystack, const vector<size_t> & table)
{
    vector<size_t> matches;
    size_t haystackIndex = 0;
    size_t needleIndex = 0;

    size_t haystackSize = haystack.size();
    size_t needleSize = needle.size();

    while(haystackIndex < haystackSize)
    {
        while(needleIndex != -1 && (needleIndex == needleSize || needle[needleIndex] != haystack[haystackIndex]))
            needleIndex = table[needleIndex];

        needleIndex++;
        haystackIndex++;

        if(needleIndex == needleSize)
            matches.push_back(haystackIndex - needleSize);
    }
 
    return matches;
}

Enjoy!