Generating Functions and Singularities

Gödel’s Lost Letter and P=NP 2023-10-10

Sharpening an analytic tool for regular languages

IEEE source for Book, Even, Ott photos

Ronald Book, Shimon Even, Sheila Greibach, and Gene Ott proved a fact that may escape attention nowadays. Everyone knows the equivalence of finite automata (of all major kinds) and regular expressions. But do you know that the regular expressions can be made unambiguous?

Today we explore consequences of this fact for analyzing automata via generating functions.

A regular expression ${\alpha}$ is unambiguous if:

wherever ${\beta + \gamma}$ occurs inside ${\alpha}$ , ${L(\beta) \cap L(\gamma) = \emptyset}$ ;
wherever ${\beta \cdot \gamma}$ occurs inside ${\alpha}$ , every string ${z \in L(\beta \cdot \gamma)}$ can be written in exactly one way as ${z = xy}$ with ${x \in L(\beta)}$ and ${y \in L(\gamma)}$ ;
wherever ${\beta^*}$ occurs inside ${\alpha}$ , every ${z \in L(\beta^*)}$ can be written in exactly one way as a concatenation of strings in ${L(\beta)}$ .

The last clause entails in particular that the empty string cannot match ${\beta}$ when ${\beta^*}$ appears in ${\alpha}$ . Since ${\epsilon}$ is always in ${L(\beta^*)}$ , it would have multiple “parses” if ${\epsilon}$ matched ${\beta}$ . The only parse we want to allow is that ${\epsilon}$ is a concatenation of zero strings.

Is there an easy way to convert any given regular expression ${\alpha}$ into one that is unambiguous? In senses that can be made precise in computational complexity terms, the answer is no. But we can get an equivalent unambiguous regular expression by different means.

DFAs Give Unambiguous Regexps

If we have a deterministic finite automaton ${M = (Q,\Sigma,\delta,s,F)}$ where ${F = \{q_1,q_2,\dots,q_k\}}$ has multiple final states, let ${M_j}$ stand for the machine with the same code but only ${q_j}$ as final state. Then

$\displaystyle L(M) = L(M_1) \cup L(M_2) \cup \cdots \cup L(M_k),$

and moreover this union is disjoint: Because ${M}$ is deterministic, every string ${x \in L(M)}$ goes to exactly one of the final states ${q_j}$ .

Hence to build an unambiguous regular expression for ${M}$ , it suffices to build one when there is exactly one final state. First suppose that is the start state, which we number ${1}$ and the other states ${2,\dots,n}$ . Initialize the ${n \times n}$ matrix ${A}$ with ${A[i,j] = c}$ if ${M}$ has a transition on ${c}$ from state ${i}$ to state ${j}$ ; ${A[i,j] = \emptyset}$ otherwise. The initial condition is that with ${k=n}$ , every entry in the principal ${k \times k}$ minor of ${A}$ is an unambiguous regular expression. Now execute the following code:

for (k = n; k > 2; k--) //bypass and eliminate state k for (i = 1; i < k; i++) //obviate incoming arc (i,c,k) for (j = 1; j < k; j++) //update transition from i to j A[i,j] = A[i,j] + A[i,k] A[k,k]* A[k,j];

The loop maintains three invariants:

Computations on strings matching ${A[i,j]}$ involve no intermediate states ${< k}$ .
All strings that take ${M}$ from state ${i}$ to state ${j}$ without going through states ${< k}$ match ${A[i,j]}$ .
${A[i,j]}$ is a ${+}$ of zero or more terms having different leading characters.

If ${A[i,k]}$ has multiple terms, we can distribute them over ${A[k,k]^*\cdot A[k,j]}$ to preserve the second invariant. (For a binary alphabet this is not needed.) Now we verify that the regular expression ${A[i,j]}$ remains unambiguous because:

The previous ${A[i,j]}$ involved no intermediate states ${< k+1}$ , so ${k}$ and the character taking it there in ${A[i,k]}$ are new.
${A[k,k]^*}$ is unambiguous because, by the second invariant, substrings matching ${A[k,k]}$ can be uniquely parsed.
The ${A[i,k]\cdot A[k,k]^*\cdot A[k,j]}$ concatenation is uniquely parsable at front and back, and does not match ${\epsilon}$ , since ${A[i,k]}$ and ${A[k,j]}$ are unambiguous and do not match ${\epsilon}$ .

On loop exit, the regular expression ${A[1,1]^*}$ denotes ${L(M)}$ and is unambiguous for the same reason argued for ${A[k,k]^*}$ with ${k \geq 2}$ . If the final state is not the start state, we can number it ${2}$ , and then the answer is ${A[1,1]^*\cdot A[1,2] \cdot A[2,2]^*}$ . This finishes the proof. ${\Box}$

The paper proves a stronger statement about preserving ambiguities. The above looks like a polynomial-time (indeed, cubic time) algorithm but is not—because the sizes of the ${A[k,k]}$ parts in particular can compound with each step of ${k}$ .

Generating Functions

Let us fix the binary alphabet ${\Sigma = \{0,1\}}$ . Given any language ${L}$ over ${\Sigma}$ , we define the bivariate generating function

$\displaystyle f_L(x,y) = \sum_{i,j = 0}^{\infty} b_{i,j}x^i y^j,$

where ${b_{i,j} =}$ the number of strings of length ${i+j}$ in ${L}$ that have exactly ${i}$ ${0}$ s and ${j}$ ${1}$ s.

Note that if we identify ${y = x}$ , then ${g_L(x) = f_L(x,x)}$ is the more-familiar univariate generating function, whose coefficients ${a_n}$ of ${x^n}$ count the numbers of strings of length ${n}$ in ${L}$ . Philippe Flajolet and Bob Sedgewick wrote a whole free book titled Analytic Combinatorics in which these generating functions figure prominently. The main fact about ${f_L(x,y)}$ is a simple extension of the well-known one about ${g_L(x)}$ :

Theorem 1 If ${L}$ is regular, then there are two polynomials ${h(x,y)}$ and ${d(x,y)}$ such that
$\displaystyle f_L(x,y) = \frac{h(x,y)}{d(x,y)}.$

That is, the generating function of a regular language is a rational function. The converse does not hold, because there are regular and non-regular languages that have the same “census counts” ${b_{i,j}.}$

To prove this theorem, we start by simply associating to every regular expression ${\alpha}$ over ${\Sigma}$ the algebraic function ${f_\alpha = f_\alpha(x,y)}$ given by the following recursive rules, in which the right-hand quantities are numerical:

${f_{\emptyset} = 0}$
${f_{\epsilon} = 1}$
${f_0 = x}$ , ${f_1 = y}$
${f_{\alpha+\beta} = f_\alpha + f_\beta}$
${f_{\alpha\cdot\beta} = f_\alpha \cdot f_\beta}$
${f_{\alpha^*} = \frac{1}{1 - f_\alpha} = 1 + f_\alpha + f_\alpha^2 + f_\alpha^3 + \cdots}$

Clearly the function thus defined is rational. The observation that finishes the proof is:

If ${\alpha}$ is unambiguous, then ${f_\alpha}$ equals ${f_L}$ where ${L}$ is the language of ${\alpha}$ .

That every regular language ${L}$ has an unambiguous regular expression finishes the proof: The equality clearly holds in the base cases and for the ${+}$ inductive case. The checks for the ${\cdot}$ and ${*}$ cases are straightforward from the definition of unambiguity for ${\alpha}$ .

Machines and Invariant Denominators

Given any ${n}$ -state DFA ${M}$ that recognizes ${L}$ , we can get ${f_L}$ directly from ${M}$ . Form the ${n \times n}$ matrix ${A_M}$ along lines of the regular expression matrix ${A}$ above, but this time put an ${x}$ for every 0-arc and a ${y}$ for every ${1}$ -arc. Make every other entry ${0}$ . If ${M}$ has more than one accepting state ${t}$ , make a separate matrix for each such ${q}$ and sum the results. Number the start state ${s = 1}$ and ${t = 2}$ (or ${t=1}$ if ${t}$ is the start state). Then run the same triple nested loop as before, but with the body

$\displaystyle A_M[i,j] = A_M[i,j] + \frac{A_M[i,k]\cdot A_M[k,j]}{1 - A_M[k,k]}.$

The result is the generating function for the language ${L_{s,t}}$ of strings processed from ${s}$ to ${t}$ . Adding them up gives ${f_L}$ . The result ${f_L}$ is invariant both under the ordering of states of ${M}$ to eliminate and the choice of DFA ${M}$ recognizing ${L}$ —it need not be the canonical minimum one. If ${M}$ is strongly connected, meaning that every state ${p}$ can reach every other state ${q}$ by some string (i.e., ${L_{p,q} \neq \emptyset}$ ), then there is a further notable invariant:

Theorem 2 If a DFA ${M}$ is strongly connected then for all states ${p}$ and ${q}$ , the bivariate generating function of ${L_{p,q}}$ is given by
$\displaystyle \frac{h_{p,q}(x,y)}{d(x,y)},$
where ${d(x,y) = \det(I - A_m)}$ and the numerator has no common divisor with ${d(x,y)}$ .

The proof only requires analyzing the two-state case. If ${M}$ is strongly connected, then all elimination sequences will leave a strongly connected two-state machine, meaning that both ${\beta = \beta(x,y)}$ and ${\eta = \eta(x,y)}$ are nonzero in the following diagram:

The generating functions shown for ${L_{s,s}}$ , ${L_{s,t}}$ , ${L_{t,t}}$ , and ${L_{t,s}}$ have the same denominator, and by ${\beta,\eta \neq 0}$ , do not reduce further. Because ${L_{s,s}}$ is the same whichever state of ${M}$ is used as “ ${t}$ ,” we can “walk” the diagram to represent any ${L_{p,q}}$ while preserving the denominator and the irreducibility. ${\Box}$

Thus ${d(x,y)}$ is common to the whole host of DFAs defined from ${M}$ by changing the start state and/or the set of final states. It is an invariant of the two-sorted graph ${G_M = (V,E_0,E_1)}$ , whose edge sets of ${0}$ -arcs and ${1}$ -arcs each have out-degree ${1}$ .

Singular Sets

The zeroes of the polynomial ${d(x,y)}$ form an algebraic set and are singularities of the generating function. Adopting the looser nomenclature by which a “variety” is allowed to be algebraically reducible, we propose calling the zero set ${S_M}$ the singular variety of ${M}$ . It is really an invariant of the two-sorted graph ${G_M}$ . Here are pictures for all automata of one or two nodes. The pictures are plotted in the real plane using the Desmos Graphing Calculator.

What does ${S_M}$ represent? Intuitively, it captures information about the cycle structure of ${M}$ . Whether it captures enough information is perhaps doubted by the last example, whose ${S_M}$ is the same as for the one-state machine. The full generating function ${f_L}$ for ${L = L_{s,s} = (1 + 00^*1)^*}$ is computed via

$\displaystyle \frac{1}{1 - \left(y + \frac{xy}{1-x}\right)} = \frac{1-x}{1 - x - y +xy - xy} = \frac{1-x}{1 - x - y}.$

Some three-state automata exhibit other interesting behaviors. The two in the middle should be plotted in the complex plane.

Intersecting ${S_M}$ with the line ${x = y}$ gives the singular points of ${g_L(x)}$ , which figure highly in analytical results collected in the book by Flajolet and Sedgewick and in some more recent references. We anticipate higher returns from analysis of ${f_L(x,y)}$ by bringing in more techniques from algebraic geometry.

Open Problems

Have you seen this two-variable generating function for regular languages? What can be done with it?