There are a number of well known algorithms out there for taking in a set of keywords and matching them against test. Aho and Corasick comes to mind, as does the Wu Manber algorithm (the latter I've implemented, and the code resides elsewhere on this site).
For another project, I didn't need something quite so fancy. Actually two projects come to mind. One is that I
have a input comma separated value file which includes stock symbols, a description, and the associated exchange. I
wanted to keep statitics on what is read in on an exchange basis. My first kick at the can on this was to implement a
string look up table using
After being reminded of Aho Corasick in another context, I decided to take another stab at improving efficiency. I
ended up implementing half of Aho Corasick's algorithm, the 'go' function. Since I'm always matching from the
beginning of the string, and am not stepping through text, the exclusion of the 'fail' function works quite well, and
keeps the code smaller.
At some point in time the code can be coverted to something using templates in order to accept various string
types, various element matches, and return codes.
Since the 'fail' function isn't implemented, the lookup table can be updated dynamically without being rebuilt from
scratch.
The library is quite easy to use. With AddPattern, add new keywords along with some sort of index or object. When
using FindMatch with the target keyword, the appropriate object will be returned on a successful match. If no match
is found NULL will be returned. A modification to the class would add a default object for use when no matches are
found.
Here is the header file.
#pragma once
// Copyright (2008) Ray Burkholder
// Matching against multiple keywords
#include <vector>
#include <string>
class CKeyWordMatch {
public:
CKeyWordMatch(void);
virtual ~CKeyWordMatch(void);
void ClearPatterns( void );
void AddPattern( const std::string
&sPattern, void *object );
void *FindMatch( const std::string
&sMatch );
protected:
struct structNode {
size_t ixLinkToNextLevel; // next letter of same word
size_t ixLinkAtSameLevel; // look for other letters at same location
void *object; // upon match, (returned when
keyword found)
char chLetter; // the letter at this node
structNode() : ixLinkToNextLevel( 0 ), ixLinkAtSameLevel( 0 ),
object( NULL ), chLetter( 0 ) {};
};
std::vector<structNode> m_vNodes;
private:
};
Here is the code file:
#include "StdAfx.h"
#include "KeyWordMatch.h"
// Copyright (2008) Ray Burkholder
// Matching against multiple keywords
#include <stdexcept>
CKeyWordMatch::CKeyWordMatch(void) {
ClearPatterns();
}
CKeyWordMatch::CKeyWordMatch(size_t size) {
m_vNodes.reserve( size );
ClearPatterns();
}
CKeyWordMatch::~CKeyWordMatch(void) {
m_vNodes.clear();
}
void CKeyWordMatch::ClearPatterns() {
m_vNodes.clear();
structNode node;
m_vNodes.push_back( node ); // root node with nothing
}
void CKeyWordMatch::AddPattern(
const std::string &sPattern, void *object ) {
std::string::const_iterator iter = sPattern.begin();
if ( sPattern.end() == iter ) {
throw std::runtime_error( "zero length pattern" );
}
size_t ixNode = 0;
size_t ix;
bool bDone = false;
while ( !bDone ) {
char ch = *iter;
ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
if ( 0 == ix ) { // end of chain, so add letter
structNode node;
node.chLetter = ch;
m_vNodes.push_back( node );
ix = m_vNodes.size() - 1;
m_vNodes[ ixNode ].ixLinkToNextLevel = ix;
ixNode = ix;
}
else { // find letter at this level
//ix = m_vNodes[ ixNode ].ixLinkToNextLetter; // already set
bool bLevelDone = false;
size_t ixLevel = ix; // set from above
while ( !bLevelDone ) {
if ( ch == m_vNodes[ ixLevel ].chLetter ) {
// found matching character
ixNode = ixLevel;
bLevelDone = true;
}
else {
// move onto next node at this level to find character
size_t ixLinkAtNextSameLevel
= m_vNodes[ ixLevel ].ixLinkAtSameLevel;
if ( 0 == ixLinkAtNextSameLevel ) {
// add a new node at this level
structNode node;
node.chLetter = ch;
m_vNodes.push_back( node );
ix = m_vNodes.size() - 1;
m_vNodes[ ixLevel ].ixLinkAtSameLevel = ix;
ixNode = ix;
bLevelDone = true;
}
else {
// check the new node, nothing to do here
// check next in sequence
ixLevel = ixLinkAtNextSameLevel;
}
}
}
}
++iter;
if ( sPattern.end() == iter ) {
if ( NULL != m_vNodes[ ixNode ].object ) {
std::runtime_error( "Pattern already present" );
}
m_vNodes[ ixNode ].object = object; //
assign and finish
bDone = true;
}
}
}
void *CKeyWordMatch::FindMatch( const std::string &sPattern ) {
// traverse structure looking for matches
std::string::const_iterator iter = sPattern.begin();
if ( sPattern.end() == iter ) {
throw std::runtime_error( "zero length pattern" );
}
void *object = NULL;
size_t ixNode = 0;
size_t ix;
bool bOpFound = true;
bool bDone = false;
while ( !bDone ) {
char ch = *iter;
ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
if ( 0 == ix ) {
bDone = true; // no more matches to be found so exit
}
else {
// compare characters at this level
bool bLevelDone = false;
size_t ixLevel = ix; // set from above
while ( !bLevelDone ) {
if ( ch == m_vNodes[ ixLevel ].chLetter ) {
ixNode = ixLevel;
bLevelDone = true;
}
else {
ixLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel;
if ( 0 == ixLevel ) { // no match so end
bLevelDone = true;
bDone = true;
}
}
}
}
++iter;
if ( sPattern.end() == iter ) {
object = m_vNodes[ ixNode ].object;
bDone = true;
}
}
return object;
}
In a previous blog article, I presented the beginnings of a Keyword Lookup class. This article takes that work, turns it into a template and makes it useful for longest match lookups. I'll have to compare this lookup library with what a C++ map or u
Tracked: Nov 30, 14:20