In a previous blog article, I presented the beginnings of a Keyword Lookup class. This article takes that work, turns it into a template and makes it useful for longest match lookups.
I'll have to compare this lookup library with what a C++ map or unordered_map does for
lookup speed. I'm hoping that when a comparison is made, that this routine is indeed
faster. I'm thinking it might be because, even though a map will do a binary search through
it's map, complete strings are compared at each step. With this library, keyword patterns
are added to a rooted tree of characters, possibly reducing the amount of time spent
matching.
I must admit that at each match step, there is a linear search performed through a list
of likely character candidates. To improve the search, I've been thinking that once the
pattern tree has been
created, it could be sorted so that each step search can be done with a binary search. This
will be something for next time.
In the meantime, this routine does work for finding maximum matches. For example when
performing a long distance rate lookup, this will find the most specific rate code from a
list of various length candidates, something which is difficult to do with a map.
This keyword match algorithm is template based. 'class T' is the type to be returned
upon a successful match. It can be an index, a pointer, a number, or anything else
suitable. The class constructor requires an initializer... basically a value to be given
the equivalent meaning of NULL. On no match, this value is returned. Use 'AddPattern' to
add patterns and their associated 'meanings'. Use FindMatch to perform the lookup.
// this is kind of a subset of Aho Corasick algorithm
// only full keyword matching, no text searches
// no on failure coding
#ifndef CKEYWORDMATCH_H_
#define CKEYWORDMATCH_H_
#include <string>
#include <vector>
#include <stdexcept>
#include <iostream>
template<class T> class CKeyWordMatch {
public:
explicit CKeyWordMatch<T>( T initializer, size_t size );
virtual ~CKeyWordMatch(void);
void ClearPatterns( void );
void AddPattern( const std::string &sPattern, T object );
T FindMatch( const std::string &sMatch );
size_t size( void ) { return m_vNodes.size(); };
protected:
T m_Initializer;
struct structNode {
size_t ixLinkToNextLevel; // next letter of same word
size_t ixLinkAtSameLevel; // look for other letters at same location
T object; // upon match, (returned when keyword found)
char chLetter; // the letter at this node
explicit structNode( T initializer ) : ixLinkToNextLevel( 0 ), ixLinkAtSameLevel( 0 ),
object( initializer ), chLetter( 0 ) {};
};
std::vector<structNode> m_vNodes;
private:
};
template<class T> CKeyWordMatch<T>::CKeyWordMatch( T initializer, size_t size )
: m_Initializer( initializer )
{
m_vNodes.reserve( size );
ClearPatterns();
}
template<class T> CKeyWordMatch<T>::~CKeyWordMatch(void) {
m_vNodes.clear();
}
template<class T> void CKeyWordMatch<T>::ClearPatterns() {
m_vNodes.clear();
structNode node( m_Initializer );
m_vNodes.push_back( node ); // root node with nothing
}
template<class T> void CKeyWordMatch<T>::AddPattern(
const std::string &sPattern, T object ) {
std::string::const_iterator iter = sPattern.begin();
if ( sPattern.end() == iter ) {
throw std::invalid_argument( "zero length pattern" );
}
size_t ixNode = 0;
size_t ix;
bool bDone = false;
while ( !bDone ) {
char ch = *iter;
ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
if ( 0 == ix ) { // end of chain, so add letter
structNode node( m_Initializer );
node.chLetter = ch;
m_vNodes.push_back( node );
ix = m_vNodes.size() - 1;
m_vNodes[ ixNode ].ixLinkToNextLevel = ix;
ixNode = ix;
}
else { // find letter at this level
bool bLevelDone = false;
size_t ixLevel = ix; // set from above
while ( !bLevelDone ) {
if ( ch == m_vNodes[ ixLevel ].chLetter ) {
// found matching character
ixNode = ixLevel;
bLevelDone = true;
}
else {
// move onto next node at this level to find character
size_t ixLinkAtNextSameLevel
= m_vNodes[ ixLevel ].ixLinkAtSameLevel;
if ( 0 == ixLinkAtNextSameLevel ) {
// add a new node at this level
structNode node( m_Initializer );
node.chLetter = ch;
m_vNodes.push_back( node );
ix = m_vNodes.size() - 1;
m_vNodes[ ixLevel ].ixLinkAtSameLevel = ix;
ixNode = ix;
bLevelDone = true;
}
else {
// check the new node, nothing to do here
// check next in sequence
ixLevel = ixLinkAtNextSameLevel;
}
}
}
}
++iter;
if ( sPattern.end() == iter ) {
if ( m_Initializer != m_vNodes[ ixNode ].object ) {
throw std::domain_error( "Pattern already present" );
}
m_vNodes[ ixNode ].object = object; // assign and finish
bDone = true;
}
}
}
template<class T> T CKeyWordMatch<T>::FindMatch( const std::string &sPattern ) {
// traverse structure looking for matches, object at longest match is returned
std::string::const_iterator iter = sPattern.begin();
if ( sPattern.end() == iter ) {
throw std::runtime_error( "zero length pattern" );
}
T object = m_Initializer;
size_t ixNode = 0;
size_t ix;
bool bDone = false;
while ( !bDone ) {
char ch = *iter;
ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
if ( 0 == ix ) {
bDone = true; // no more matches to be found so exit
}
else {
// compare characters at this level
bool bLevelDone = false;
size_t ixLevel = ix; // set from above
while ( !bLevelDone ) {
if ( ch == m_vNodes[ ixLevel ].chLetter ) {
if ( m_Initializer != m_vNodes[ ixLevel ].object )
object = m_vNodes[ ixLevel ].object;
ixNode = ixLevel;
bLevelDone = true;
}
else {
ixLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel;
if ( 0 == ixLevel ) { // no match so end
bLevelDone = true;
bDone = true;
}
}
}
}
++iter;
if ( sPattern.end() == iter ) {
bDone = true;
}
}
return object;
}
#endif /*CKEYWORDMATCH_H_*/