Harald Kirsch

genug Unfug.


Monq.jfa faster again

The implementation of a deterministic finite automaton (DFA) has one central component: the transition table. Each node of the DFA contains one. It maps an incoming character to an outgoing edge, or rather to another node of the DFA, namely the target node of the transition on that character. The fastest implementation is an array indexed by the incoming character. For a character ch and a transition table trans, this is just one array access:
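A minimal sketch of such a full-array table, with hypothetical names (the actual monq code differs in detail):

```java
// Illustrative full-array transition table: one slot per possible
// char value, so the lookup is a single array access.
public class FullTable {
    private final int[] next = new int[1 << 16]; // all 2^16 char values

    FullTable() {
        java.util.Arrays.fill(next, -1); // -1 means "no transition"
    }

    void set(char ch, int target) {
        next[ch] = target;
    }

    // the lookup described in the text: just one array access
    int step(char ch) {
        return next[ch];
    }

    public static void main(String[] args) {
        FullTable trans = new FullTable();
        trans.set('a', 7);
        System.out.println(trans.step('a')); // 7
        System.out.println(trans.step('b')); // -1, no transition
    }
}
```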


Many transition tables are, however, extremely sparse, with sometimes only a handful of characters leading out of a node. In this case, using the full array of $2^{16}$ character values is an extreme waste of main memory. From the beginning, monq optimized these array-based transition tables such that they only contain entries for the range of characters between the first and the last character used. Retrieving the transition then looks like
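a sketch along these lines, again with assumed names rather than monq's actual code:

```java
// Illustrative range-limited table: the array covers only the
// characters between the first and last character used; the lookup
// is one bounds check plus one array access.
public class RangeTable {
    private final char first;
    private final int[] next; // index 0 corresponds to 'first'

    RangeTable(char first, char last) {
        this.first = first;
        this.next = new int[last - first + 1];
        java.util.Arrays.fill(next, -1); // -1 means "no transition"
    }

    void set(char ch, int target) {
        next[ch - first] = target;
    }

    int step(char ch) {
        int i = ch - first;
        if (i < 0 || i >= next.length) return -1; // outside the range
        return next[i];
    }

    public static void main(String[] args) {
        RangeTable trans = new RangeTable('a', 'f');
        trans.set('c', 3);
        System.out.println(trans.step('c')); // 3
        System.out.println(trans.step('z')); // -1, outside the range
    }
}
```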


which is only minimally slower than storing the full array. But when only a few of the characters between the first and the last actually have a transition, this is still a big waste of main memory. Therefore there has always been a second implementation of the transition table that stores only the available transitions and finds the requested one with a binary search.
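The binary-search variant can be pictured like this, again as an illustrative sketch rather than monq's actual implementation:

```java
import java.util.Arrays;

// Illustrative sparse transition table: a sorted array of the
// characters that actually have a transition, plus a parallel array
// of target nodes; the lookup is a binary search.
public class SparseTable {
    private final char[] chars;  // sorted characters with transitions
    private final int[] targets; // targets[i] belongs to chars[i]

    SparseTable(char[] chars, int[] targets) {
        this.chars = chars;
        this.targets = targets;
    }

    int step(char ch) {
        int i = Arrays.binarySearch(chars, ch);
        return i >= 0 ? targets[i] : -1; // -1 means "no transition"
    }

    public static void main(String[] args) {
        SparseTable trans =
            new SparseTable(new char[]{'a', 'k', 'z'}, new int[]{1, 2, 3});
        System.out.println(trans.step('k')); // 2
        System.out.println(trans.step('b')); // -1, no transition
    }
}
```

Memory usage now grows only with the number of transitions, but each lookup costs $O(\log n)$ comparisons instead of one array access.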

For every node, the decision between the two implementations was based on the estimated memory footprint: the smaller one was always chosen.
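Such a size-based choice could look roughly as follows. The byte counts are my own rough assumptions (4 bytes per int slot, 2 bytes per char), not monq's exact accounting:

```java
// Illustrative footprint comparison between the two table layouts.
public class TableChooser {
    static boolean useBinarySearch(char first, char last, int numTransitions) {
        long arrayBytes = 4L * (last - first + 1);     // one int per slot in the range
        long searchBytes = (2L + 4L) * numTransitions; // a char plus a target each
        return searchBytes < arrayBytes;
    }

    public static void main(String[] args) {
        // dense: 3 transitions in the range 'a'..'c' -> the array wins
        System.out.println(useBinarySearch('a', 'c', 3)); // false
        // sparse: 3 transitions spread over 'a'..'z' -> binary search wins
        System.out.println(useBinarySearch('a', 'z', 3)); // true
    }
}
```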

In version 1.4 of monq I introduced a parameter that allows trading memory usage for speed. If it is set high enough, no transition table uses a binary search. And the good news: my performance tests show that, for the use case of searching for a number of words in a long text, a speedup between 20% and 40% is possible.

(all times in seconds)

no. words                                     3    4   11   17   26   35
$\Delta t$ regex                              9   12   28   42   68   82
$\Delta t$ monq                              40   47   49   54   55   56
$\Delta t$ monq with StringBuilder           24   24   26   27   29   30
$\Delta t$ monq trading memory for speed     21   22   23   25   25   25