Let’s build our own REGEX library!

Most of us use language library functions that provide things like dynamic arrays (ArrayLists) or sorting functions, yet as time goes on programming languages have obtained an ever increasing number of layers of abstraction and now most programmers don’t even consider how such functions came to exist. While it is nice to be able to use such powerful functions without worrying about understanding their implementation, other times it is nice to be able to understand the underlying mechanisms behind such things; thus allowing us to build derivative modules or make extensions for a custom purpose.  Today, we’re going to build our own Regular Expressions library. Most developers are familiar with the use of regular expressions, but few are privy to the fact that they are equivalent to Deterministic Finite Automata.

What is a Deterministic Finite Automaton (DFA)?

Well, informally, a DFA can be represented in a graph akin to the one below:

In the context of Deterministic Finite Automata, the nodes above are referred to as states. The states with two circles q and s are known as accept states. In addition to being an accept state, q is also the starting state as indicated by the incoming arrow from no state. The directed edges depicted above are known as transitions. The edge from state q to state r indicates that the DFA transitions from state q to state r upon reading input symbol 1. Likewise, the arrow that loops on state q indicates that the DFA transitions from state q to state q upon reading any number of 0’s. Any Deterministic Finite Automaton can be converted into an equivalent  regular expression through a process commonly referred to as the State Elimination Algorithm. However, the version of this algorithm that we will employ differs in some respects from the original. To convert a DFA G into a Regular expression we first convert G into a ‘GNFA’. Let for example G be the DFA above in the image that we discussed earlier.

The process of converting a DFA to a GNFA is as follows:
  • Add a new start state with an epsilon transition to the original start state.
  • Add a new accept state and add epsilon transitions from every original accept state to the new one, then make all the original accept states into normal states.
This is the resulting GNFA:

Then we remove each state in-between the new start state and the new accept state one at a time, adjusting the graph to maintain correctness. The process works as follows: Let x, y, and z be states in our DFA. Additionally, the transitions are as follows x->y on input a, y->y on input b and y->z on input c. Say we want to remove y. For every transition from some node n to y and for every transition from y to m, we must add a new transition n->m. The transition from n to m would be the content of the transition from n to y, followed by the content of transition y->y with a kleene star followed by the content of the transition from y->m. In this case x->y on a, y->y on b and y->z on c, after removing state y, there would be a transition from x->z on a(b*)c.

So in our DFA, after removing state q we get:

After removing state r, we get:

Finally, after removing state s we are left with:

This is a regular expression representation of our DFA. 

Likewise, any regular expression can be converted into an equivalent DFA. One way to accomplish this is by first converting the regular expression into a non-deterministic finite automaton(NFA) using Thompson’s Construction Algorithm, then converting the NFA into an equivalent DFA using the Powerset Construction.

Now that we’ve seen how deterministic finite automata work and how they can be converted into regular expressions, we can start building our own “REGEX” library in C. Our library implementation will allow us to create, modify and run a variety of algorithms on DFA. We can then use the discussed above to convert user entered regular expressions into DFA for our program to run. 

In the next part of this series, we will start to write up the C code for our library.