January 21, 2013

Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe

It is odd that the base libraries of most programming languages do not let you search for regular expressions or substrings in streams or partial reads. We have modified the Knuth-Morris-Pratt (KMP) algorithm so that it accepts the text as a virtually unbounded sequence of partial strings. The code is written in Haxe, so it can be compiled to several programming languages.
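To make the idea concrete, here is a minimal sketch in Python of how KMP can be fed one partial string at a time. It is an illustration under our own assumptions, not the Haxe code from this post: the names failure_table, StreamSearcher, and feed are hypothetical. The only state that has to survive between chunks is the number of pattern characters matched so far and the total number of characters consumed.

```python
from typing import List


def failure_table(pattern: str) -> List[int]:
    """Standard KMP failure function.

    fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


class StreamSearcher:
    """Hypothetical helper: KMP whose match state persists across chunks."""

    def __init__(self, pattern: str) -> None:
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        self.fail = failure_table(pattern)
        self.matched = 0   # pattern characters matched so far
        self.consumed = 0  # characters seen across all chunks

    def feed(self, chunk: str) -> List[int]:
        """Consume one chunk and return the absolute start offsets
        of every match that ends inside this chunk."""
        hits = []
        for ch in chunk:
            # Fall back through the failure table on a mismatch.
            while self.matched > 0 and ch != self.pattern[self.matched]:
                self.matched = self.fail[self.matched - 1]
            if ch == self.pattern[self.matched]:
                self.matched += 1
            self.consumed += 1
            if self.matched == len(self.pattern):
                hits.append(self.consumed - len(self.pattern))
                # Keep the longest border so overlapping matches are found.
                self.matched = self.fail[self.matched - 1]
        return hits
```

For example, with the pattern "abab", feeding the chunks "xxab", "ab", and "ab!" in that order returns [], [2], and [4]: the first match starts in one chunk and ends in the next, and no earlier chunk has to be kept in memory.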

Streams are important when working with data that does not fit in main memory, such as large files, or with data that is still being transferred. There are a few implementations of regular-expression and substring matching that work on streams. One is Jakarta Regexp, now retired and resting in the Apache Attic: the “match” method of its RE class takes a CharacterIterator as a parameter. In C++, Boost.Regex implements partial matches.

Our code is implemented in Haxe, so the same source can target JavaScript, ActionScript, Flash SWF, NekoVM, PHP, C++, C#, and Java. We really like the concept of writing code once and deploying it to a variety of platforms with minimal effort. There are excellent libraries in specific environments that would work perfectly well in other environments, but porting a library from one programming language to another is tedious. For example, the amazing NetworkX graph library, implemented in Python, could benefit a broader audience if it were available in C#.

Code

Prerequisites

  1. Haxe (tested on version 2.10)
  2. For C++: hxcpp (run haxelib install hxcpp)
  3. For Java: hxjava (run haxelib install hxjava)
  4. For Mono/C#: hxcs (run haxelib install hxcs)

The source code is available on GitHub.

See Also

  1. Parsing S-Expressions in C# using OMeta
  2. Esoteric Queue Scheduling Disciplines

Resources

  1. Knuth-Morris-Pratt string matching
  2. Text Searching: Theory and Practice
  3. Boyer–Moore–Horspool algorithm
  4. Rabin–Karp algorithm
  5. Aho–Corasick string matching algorithm
  6. Lexicographically minimal string rotation
  7. Efficient way to search a stream for a string