Harald Kirsch

genug Unfug.

2015-11-01

Monq.jfa performance comparison with java.util.regex

Recently I found the time to revive the Java package for deterministic finite automata (DFA), that I wrote many years ago at the EBI: Monq.jfa.

The software dates from times when Java was version 1.4 and generics where not yet available, but I noticed that it is quite complete with regard to functionality, and the test coverage shows that most of it is exercised in tests.

My biggest surprise though came from a performance comparison with java.util.regex. While java.util.regex uses "only" nondeterministic finite automata (NFA) and should therefore naturally be slower than a DFA implementation, I expected that implementation tricks and algorithmic developments in the last 10 years had improved its speed such that my DFA implementation provides no real gain anymore. But, as is often case: it depends.

I downloaded Wikicorpus 1.0, and combined 10 of the files into one file of around 210 MB with 35 million words, according to wc -w. The task was to count the appearance of a number of words. The gist of the implementation with java.util.regex is as follows:

Matcher matcher = Pattern.compile("word1|word2|... ").matcher(...);
while (/*input available*/) {
  if (matcher.find(start)) {
    Counter cnt = result.get(matcher.group());
    cnt.increment();
  }
  /* advance start */
}

The class Counter is merely a wrapper around an int with the increment() method. For brevity I leave out the details of buffering and looping over the input which is read in chunks.

For monq.jfa the central code looks like

  CounterAction cntAction = new CounterAction();
  Nfa nfa = new Nfa().or("word1").or("word2")...;
  Dfa dfa = nfa.compile(...);
  new DfaRun(dfa).setIn(...).filter();

The CounterAction is a callback which is called inside the filter() method for each matche found in order to count it.

I tested with an increasing number of randomly picked words, where the was one of them with a total count of 2.2 million times and the most frequent. The table below shows running times with an increasing number of words.

(all times in seconds)
no. words 3 4 11 17 26 35
$\Delta t$ regex 9 12 28 42 68 82
$\Delta t$ monq 40 47 49 54 55 56

When I started the measurements with three words, I was disappointed to see that all the lines of code seem to be useles: java.util.regex is 4 times faster. But then I remembered why monq.jfa was implemented in the first place: to filter text for whole dictionaries of strings or even regular expressions. Therefore I started adding words and, hey, the table shows the trend. As soon as your dictionary exceeds around 20 words, the DFA shows its strength.

In fact, starting from 17 words and onwards, the DFA implementation exhibits nearly the theoretical limit behavior, whereby the running time depends on the input length only, not on the complexity of the regular expression. Given that at the EBI we were using dictionaries with several thousand words, it seems monq.jfa was just the right thing to use. ☺