Generating Functions and Singularities
Gödel’s Lost Letter and P=NP 2023-10-10
Sharpening an analytic tool for regular languages
IEEE source for Book, Even, Ott photosRonald Book, Shimon Even, Sheila Greibach, and Gene Ott proved a fact that may escape attention nowadays. Everyone knows the equivalence of finite automata (of all major kinds) and regular expressions. But do you know that the regular expressions can be made unambiguous?
Today we explore consequences of this fact for analyzing automata via generating functions.
A regular expression is unambiguous if:
- wherever occurs inside , ;
- wherever occurs inside , every string can be written in exactly one way as with and ;
- wherever occurs inside , every can be written in exactly one way as a concatenation of strings in .
The last clause entails in particular that the empty string cannot match when appears in . Since is always in , it would have multiple “parses” if matched . The only parse we want to allow is that is a concatenation of zero strings.
Is there an easy way to convert any given regular expression into one that is unambiguous? In senses that can be made precise in computational complexity terms, the answer is no. But we can get an equivalent unambiguous regular expression by different means.
DFAs Give Unambiguous Regexps
If we have a deterministic finite automaton where has multiple final states, let stand for the machine with the same code but only as final state. Then
and moreover this union is disjoint: Because is deterministic, every string goes to exactly one of the final states .
Hence to build an unambiguous regular expression for , it suffices to build one when there is exactly one final state. First suppose that is the start state, which we number and the other states . Initialize the matrix with if has a transition on from state to state ; otherwise. The initial condition is that with , every entry in the principal minor of is an unambiguous regular expression. Now execute the following code:
for (k = n; k > 2; k--) //bypass and eliminate state k for (i = 1; i < k; i++) //obviate incoming arc (i,c,k) for (j = 1; j < k; j++) //update transition from i to j A[i,j] = A[i,j] + A[i,k] A[k,k]* A[k,j];
The loop maintains three invariants:
- Computations on strings matching involve no intermediate states .
- All strings that take from state to state without going through states match .
- is a of zero or more terms having different leading characters.
If has multiple terms, we can distribute them over to preserve the second invariant. (For a binary alphabet this is not needed.) Now we verify that the regular expression remains unambiguous because:
- The previous involved no intermediate states , so and the character taking it there in are new.
- is unambiguous because, by the second invariant, substrings matching can be uniquely parsed.
- The concatenation is uniquely parsable at front and back, and does not match , since and are unambiguous and do not match .
On loop exit, the regular expression denotes and is unambiguous for the same reason argued for with . If the final state is not the start state, we can number it , and then the answer is . This finishes the proof.
The paper proves a stronger statement about preserving ambiguities. The above looks like a polynomial-time (indeed, cubic time) algorithm but is not—because the sizes of the parts in particular can compound with each step of .
Generating Functions
Let us fix the binary alphabet . Given any language over , we define the bivariate generating function
where the number of strings of length in that have exactly s and s.
Note that if we identify , then is the more-familiar univariate generating function, whose coefficients of count the numbers of strings of length in . Philippe Flajolet and Bob Sedgewick wrote a whole free book titled Analytic Combinatorics in which these generating functions figure prominently. The main fact about is a simple extension of the well-known one about :
Theorem 1 If is regular, then there are two polynomials and such that
That is, the generating function of a regular language is a rational function. The converse does not hold, because there are regular and non-regular languages that have the same “census counts”
To prove this theorem, we start by simply associating to every regular expression over the algebraic function given by the following recursive rules, in which the right-hand quantities are numerical:
- ,
Clearly the function thus defined is rational. The observation that finishes the proof is:
If is unambiguous, then equals where is the language of .
That every regular language has an unambiguous regular expression finishes the proof: The equality clearly holds in the base cases and for the inductive case. The checks for the and cases are straightforward from the definition of unambiguity for .
Machines and Invariant Denominators
Given any -state DFA that recognizes , we can get directly from . Form the matrix along lines of the regular expression matrix above, but this time put an for every 0-arc and a for every -arc. Make every other entry . If has more than one accepting state , make a separate matrix for each such and sum the results. Number the start state and (or if is the start state). Then run the same triple nested loop as before, but with the body
The result is the generating function for the language of strings processed from to . Adding them up gives . The result is invariant both under the ordering of states of to eliminate and the choice of DFA recognizing —it need not be the canonical minimum one. If is strongly connected, meaning that every state can reach every other state by some string (i.e., ), then there is a further notable invariant:
Theorem 2 If a DFA is strongly connected then for all states and , the bivariate generating function of is given by
where and the numerator has no common divisor with .
The proof only requires analyzing the two-state case. If is strongly connected, then all elimination sequences will leave a strongly connected two-state machine, meaning that both and are nonzero in the following diagram:
The generating functions shown for , , , and have the same denominator, and by , do not reduce further. Because is the same whichever state of is used as “,” we can “walk” the diagram to represent any while preserving the denominator and the irreducibility.
Thus is common to the whole host of DFAs defined from by changing the start state and/or the set of final states. It is an invariant of the two-sorted graph , whose edge sets of -arcs and -arcs each have out-degree .
Singular Sets
The zeroes of the polynomial form an algebraic set and are singularities of the generating function. Adopting the looser nomenclature by which a “variety” is allowed to be algebraically reducible, we propose calling the zero set the singular variety of . It is really an invariant of the two-sorted graph . Here are pictures for all automata of one or two nodes. The pictures are plotted in the real plane using the Desmos Graphing Calculator.
What does represent? Intuitively, it captures information about the cycle structure of . Whether it captures enough information is perhaps doubted by the last example, whose is the same as for the one-state machine. The full generating function for is computed via
Some three-state automata exhibit other interesting behaviors. The two in the middle should be plotted in the complex plane.
Intersecting with the line gives the singular points of , which figure highly in analytical results collected in the book by Flajolet and Sedgewick and in some more recent references. We anticipate higher returns from analysis of by bringing in more techniques from algebraic geometry.
Open Problems
Have you seen this two-variable generating function for regular languages? What can be done with it?