Automata Theory, by Jeffrey Ullman and John Hopcroft

Introduction to Automata Theory

Summary of Chapter 1

Finite Automata: Finite automata involve states and transitions among states in response to inputs. They are useful for building several different kinds of software, including the lexical analysis component of a compiler and systems for verifying the correctness of circuits or protocols, for example.

Regular Expressions: These are a structural notation for describing the same patterns that can be represented by finite automata. They are used in many common types of software, including tools to search for patterns in text or in file names, for instance.

Context-Free Grammars: These are an important notation for describing the structure of programming languages and related sets of strings; they are used to build the parser component of a compiler.

Turing Machines: These are automata that model the power of real computers. They allow us to study decidability, the question of what can or cannot be done by a computer. They also let us distinguish tractable problems - those that can be solved in polynomial time - from the intractable problems - those that cannot.

Deductive Proofs: This basic method of proof proceeds by listing statements that are either given to be true, or that follow logically from some of the previous statements.

Proving If-Then Statements: Many theorems are of the form "if (something) then (something else)." The statement or statements following the "if" are the hypothesis, and what follows "then" is the conclusion. Deductive proofs of if-then statements begin with the hypothesis, and continue with statements that follow logically from the hypothesis and previous statements, until the conclusion is proved as one of the statements.

Proving If-And-Only-If Statements: There are other theorems of the form "(something) if and only if (something else)." They are proved by showing if-then statements in both directions. A similar kind of theorem claims the equality of the sets described in two different ways; these are proved by showing that each of the two sets is contained in the other.

Proving the Contrapositive: Sometimes, it is easier to prove a statement of the form "if H then C" by proving the equivalent statement: "if not C then not H." The latter is called the contrapositive of the former.

Proof by Contradiction: Other times, it is more convenient to prove the statement "if H then C" by proving "if H and not C then (something known to be false)." A proof of this type is called proof by contradiction.

Counterexamples: Sometimes we are asked to show that a certain statement is not true. If the statement has one or more parameters, then we can show it is false as a generality by providing just one counterexample, that is, one assignment of values to the parameters that makes the statement false.

Inductive Proofs: A statement that has an integer parameter n can often be proved by induction on n. We prove the statement is true for the basis, a finite number of cases for particular values of n, and then prove the inductive step: that if the statement is true for values up to n, then it is true for n + 1.

Structural Inductions: In some situations, including many in this book, the theorem to be proved inductively is about some recursively defined construct, such as trees. We may prove a theorem about the constructed objects by induction on the number of steps used in their construction. This type of induction is referred to as structural.

Alphabets: An alphabet is any finite set of symbols.

Strings: A string is a finite-length sequence of symbols.

Languages and Problems: A language is a (possibly infinite) set of strings, all of which choose their symbols from some one alphabet. When the strings of a language are to be interpreted in some way, the question of whether a string is in the language is sometimes called a problem.

Summary of Chapter 2

Deterministic Finite Automata: A DFA has a finite set of states and a finite set of input symbols. One state is designated the start state, and zero or more states are accepting states. A transition function determines how the state changes each time an input symbol is processed.

Transition Diagrams: It is convenient to represent automata by a graph in which the nodes are the states, and arcs are labeled by input symbols, indicating the transitions of that automaton. The start state is designated by an arrow, and the accepting states by double circles.

Language of an Automaton: The automaton accepts strings. A string is accepted if, starting in the start state, the transitions caused by processing the symbols of that string one-at-a-time lead to an accepting state. In terms of the transition diagram, a string is accepted if it is the label of a path from the start state to some accepting state.
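As a concrete illustration, a DFA and its acceptance condition can be simulated in a few lines. The sketch below is my own encoding, not notation from the text; the example automaton accepts binary strings with an even number of 0's.

```python
def dfa_accepts(delta, start, accepting, string):
    """Run the transition function delta from the start state and
    report whether the string leads to an accepting state."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in accepting

# States 'even' and 'odd' record the parity of the 0's seen so far.
delta = {
    ('even', '0'): 'odd',  ('even', '1'): 'even',
    ('odd',  '0'): 'even', ('odd',  '1'): 'odd',
}

print(dfa_accepts(delta, 'even', {'even'}, '1001'))  # True: two 0's
print(dfa_accepts(delta, 'even', {'even'}, '10'))    # False: one 0
```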

Nondeterministic Finite Automata: The NFA differs from the DFA in that the NFA can have any number of transitions (including zero) to next states from a given state on a given input symbol.

The Subset Construction: By treating sets of states of an NFA as states of a DFA, it is possible to convert any NFA to a DFA that accepts the same language.
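The construction can be sketched directly: each DFA state is a set of NFA states, and we explore only the sets reachable from the start set. The encoding below is my own (it assumes an NFA without ε-transitions); the example NFA accepts strings ending in 01.

```python
from itertools import chain

def subset_construction(nfa_delta, start, accepting, alphabet):
    """Convert an NFA (no ε-transitions) to an equivalent DFA.
    nfa_delta maps (state, symbol) -> set of next states; each DFA
    state is a frozenset of NFA states."""
    dfa_start = frozenset([start])
    dfa_delta, worklist, seen = {}, [dfa_start], {dfa_start}
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            # Union of the moves of every NFA state in S on symbol a.
            T = frozenset(chain.from_iterable(
                nfa_delta.get((q, a), set()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_delta, dfa_start, dfa_accepting

# NFA guessing that the last two symbols are 01:
nfa = {('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'},
       ('q1', '1'): {'q2'}}
delta, start, acc = subset_construction(nfa, 'q0', {'q2'}, '01')
state = start
for ch in '1101':
    state = delta[(state, ch)]
print(state in acc)  # True: "1101" ends in 01
```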

ε-Transitions: We can extend the NFA by allowing transitions on an empty input, i.e., no input symbol at all. These extended NFA’s can be converted to DFA’s accepting the same language.

Text-Searching Applications: Nondeterministic finite automata are a useful way to represent a pattern matcher that scans a large body of text for one or more keywords. These automata are either simulated directly in software or are first converted to a DFA, which is then simulated.

Summary of Chapter 3

Regular Expressions: This algebraic notation describes exactly the same languages as finite automata: the regular languages. The regular-expression operators are union, concatenation (or "dot"), and closure (or "star").

Regular Expressions in Practice: Systems such as UNIX and various of its commands use an extended regular-expression language that provides shorthands for many common expressions. Character classes allow the easy expression of sets of symbols, while operators such as one-or-more-of and at-most-one-of augment the usual regular-expression operators.
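The extended operators mentioned above can be seen in Python's re module, which implements a UNIX-style extended notation (the particular patterns below are my own examples):

```python
import re

# A character class expresses a set of symbols compactly; '*' is the
# usual closure operator.  This pattern matches C-style identifiers:
ident = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')
print(bool(ident.fullmatch('count_1')))  # True
print(bool(ident.fullmatch('1count')))   # False: cannot start with a digit

# '+' is one-or-more-of, '?' is at-most-one-of:
number = re.compile(r'-?[0-9]+')         # an optionally signed integer
print(bool(number.fullmatch('-42')))     # True
```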

Equivalence of Regular Expressions and Finite Automata: We can convert a DFA to a regular expression by an inductive construction in which expressions for the labels of paths allowed to pass through increasingly larger sets of states are constructed. Alternatively, we can use a state-elimination procedure to build the regular expression for a DFA. In the other direction, we can construct recursively an NFA from regular expressions, and then convert the ε-NFA to a DFA, if we wish.

The Algebra of Regular Expressions: Regular expressions obey many of the algebraic laws of arithmetic, although there are differences. Union and concatenation are associative, but only union is commutative. Concatenation distributes over union. Union is idempotent.

Testing Algebraic Identities: We can tell whether a regular-expression equivalence involving variables as arguments is true by replacing the variables by distinct constants and testing whether the resulting languages are the same.
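The substitution test can be approximated in code: replace each variable by a distinct constant symbol and compare the resulting concrete languages. The sketch below only compares strings up to a length bound rather than deciding equality of the constant expressions exactly, so it is an illustration of the idea, not the full test.

```python
import re
from itertools import product

def language_up_to(regex, alphabet, n):
    """All strings over alphabet of length <= n matched by regex."""
    pat = re.compile(regex)
    return {''.join(w)
            for k in range(n + 1)
            for w in product(alphabet, repeat=k)
            if pat.fullmatch(''.join(w))}

# Test the identity (L + M)* = (L* M*)* by substituting the constants
# a and b for the variables L and M:
lhs = language_up_to('(a|b)*', 'ab', 5)
rhs = language_up_to('(a*b*)*', 'ab', 5)
print(lhs == rhs)  # True

# The bogus "identity" L* = L already fails on the empty string:
print(language_up_to('a*', 'ab', 5) == language_up_to('a', 'ab', 5))  # False
```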

Summary of Chapter 4

The Pumping Lemma: If a language is regular, then every sufficiently long string in the language has a nonempty substring that can be "pumped," that is, repeated any number of times while the resulting strings are also in the language. This fact can be used to prove that many different languages are not regular.
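The pigeonhole argument behind the lemma can be demonstrated on a concrete DFA: a string at least as long as the number of states must visit some state twice, and the loop between the two visits is the pumpable substring. The code below is an illustrative sketch of this argument (the DFA and names are my own), not a proof procedure.

```python
def find_pump(delta, start, w):
    """Follow w through the DFA, find the first repeated state, and
    split w = x y z so that the loop y can be pumped.  Assumes
    len(w) >= number of states, so a repeat must occur."""
    state, trace = start, [start]       # states visited along w
    for ch in w:
        state = delta[(state, ch)]
        trace.append(state)
    first = {}
    for i, s in enumerate(trace):
        if s in first:                  # same state visited twice: a loop
            j = first[s]
            return w[:j], w[j:i], w[i:]
        first[s] = i

# A two-state DFA accepting strings with an even number of 0's:
delta = {('even', '0'): 'odd',  ('even', '1'): 'even',
         ('odd',  '0'): 'even', ('odd',  '1'): 'odd'}

def accepts(w):
    s = 'even'
    for ch in w:
        s = delta[(s, ch)]
    return s == 'even'

x, y, z = find_pump(delta, 'even', '0000')
# The loop y = '00' can be repeated any number of times:
print(all(accepts(x + y * i + z) for i in range(5)))  # True
```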

Operations That Preserve the Property of Being a Regular Language: There are many operations that, when applied to regular languages, yield a regular language as a result. Among these are union, concatenation, closure, intersection, complementation, difference, reversal, homomorphism (replacement of each symbol by an associated string), and inverse homomorphism.

Testing Emptiness of Regular Languages: There is an algorithm that, given a representation of a regular language, such as an automaton or regular expression, tells whether or not the represented language is the empty set.

Testing Membership in a Regular Language: There is an algorithm that, given a string and a representation of a regular language, tells whether or not the string is in the language.

Testing Distinguishability of States: Two states of a DFA are distinguishable if there is an input string that takes exactly one of the two states to an accepting state. By starting with only the fact that pairs consisting of one accepting and one nonaccepting state are distinguishable, and trying to discover additional pairs of distinguishable states by finding pairs whose successors on one input symbol are distinguishable, we can discover all pairs of distinguishable states.

Minimizing Deterministic Finite Automata: We can partition the states of any DFA into groups of mutually indistinguishable states. Members of two different groups are always distinguishable. If we replace each group by a single state, we get an equivalent DFA that has as few states as any DFA for the same language.
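The table-filling step that discovers distinguishable pairs can be sketched as follows (my own encoding; the example DFA has two states, A and B, that turn out to be indistinguishable and could be merged during minimization):

```python
from itertools import combinations

def distinguishable_pairs(states, alphabet, delta, accepting):
    """Table-filling algorithm: mark (accepting, nonaccepting) pairs
    first, then repeatedly mark pairs whose successors on some symbol
    form an already-marked pair."""
    marked = {frozenset(p) for p in combinations(states, 2)
              if (p[0] in accepting) != (p[1] in accepting)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            for a in alphabet:
                if frozenset((delta[(p, a)], delta[(q, a)])) in marked:
                    marked.add(frozenset((p, q)))
                    changed = True
                    break
    return marked

delta = {('A', '0'): 'B', ('A', '1'): 'C',
         ('B', '0'): 'A', ('B', '1'): 'C',
         ('C', '0'): 'C', ('C', '1'): 'C'}
pairs = distinguishable_pairs('ABC', '01', delta, {'C'})
print(frozenset('AB') in pairs)  # False: A and B are indistinguishable
print(frozenset('AC') in pairs)  # True
```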

Summary of Chapter 5

Context-Free Grammars: A CFG is a way of describing languages by recursive rules called productions. A CFG consists of a set of variables, a set of terminal symbols, and a start variable, as well as the productions. Each production consists of a head variable and a body consisting of a string of zero or more variables and/or terminals.

Derivations and Languages: Beginning with the start symbol, we derive terminal strings by repeatedly replacing a variable by the body of some production with that variable in the head. The language of the CFG is the set of terminal strings we can derive; it is called a context-free language.

Leftmost and Rightmost Derivations: If we always replace the leftmost (resp. rightmost) variable in a string, then the resulting derivation is a leftmost (resp. rightmost) derivation. Every string in the language of a CFG has at least one leftmost and at least one rightmost derivation.

Sentential Forms: Any step in a derivation is a string of variables and/or terminals. We call such a string a sentential form. If the derivation is leftmost (resp. rightmost), then the string is a left- (resp. right-) sentential form.
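A leftmost derivation and its sentential forms can be traced mechanically. The toy grammar below, for palindromes over {0, 1}, is my own illustrative example: P -> 0P0 | 1P1 | 0 | 1 | "".

```python
def leftmost_step(sent, head, body):
    """Replace the leftmost occurrence of the variable `head` in the
    sentential form `sent` by the production body."""
    i = sent.index(head)
    return sent[:i] + body + sent[i + 1:]

form = 'P'
for body in ['0P0', '1P1', '']:   # the production chosen at each step
    form = leftmost_step(form, 'P', body)
    print(form)
# Prints the sentential forms of the derivation P => 0P0 => 01P10 => 0110
```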

Parse Trees: A parse tree is a tree that shows the essentials of a derivation. Interior nodes are labeled by variables, and leaves are labeled by terminals or ε. For each interior node, there must be a production such that the head of the production is the label of the node, and the labels of its children, read from left to right, form the body of that production.

Equivalence of Parse Trees and Derivations: A terminal string is in the language of a grammar if and only if it is the yield of at least one parse tree. Thus, the existence of leftmost derivations, rightmost derivations, and parse trees are equivalent conditions that each define exactly the strings in the language of a CFG.

Ambiguous Grammars: For some CFG’s, it is possible to find a terminal string with more than one parse tree, or equivalently, more than one leftmost derivation or more than one rightmost derivation. Such a grammar is called ambiguous.

Eliminating Ambiguity: For many useful grammars, such as those that describe the structure of programs in a typical programming language, it is possible to find an unambiguous grammar that generates the same language. Unfortunately, the unambiguous grammar is frequently more complex than the simplest ambiguous grammar for the language. There are also some context-free languages, usually quite contrived, that are inherently ambiguous, meaning that every grammar for that language is ambiguous.

Parsers: The context-free grammar is an essential concept for the implementation of compilers and other programming-language processors. Tools such as YACC take a CFG as input and produce a parser, the component of a compiler that deduces the structure of the program being compiled.

Document Type Definitions: The emerging XML standard for sharing information through Web documents has a notation, called the DTD, for describing the structure of such documents, through the nesting of semantic tags within the document. The DTD is in essence a context-free grammar whose language is a class of related documents.

Summary of Chapter 6

Pushdown Automata: A PDA is a nondeterministic finite automaton coupled with a stack that can be used to store a string of arbitrary length. The stack can be read and modified only at its top.

Moves of a Pushdown Automaton: A PDA chooses its next move based on its current state, the next input symbol, and the symbol at the top of its stack. It may also choose to make a move independent of the input symbol and without consuming that symbol from the input. Being nondeterministic, the PDA may have some finite number of choices of move; each is a new state and a string of stack symbols with which to replace the symbol currently on top of the stack.

Acceptance by Pushdown Automata: There are two ways in which we may allow the PDA to signal acceptance. One is by entering an accepting state; the other by emptying its stack. These methods are equivalent, in the sense that any language accepted by one method is accepted (by some other PDA) by the other method.

Instantaneous Descriptions: We use an ID consisting of the state, remaining input, and stack contents to describe the ``current condition’’ of a PDA. A transition function between ID’s represents single moves of a PDA.
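The stack discipline described above is easy to see in a small simulation. The sketch below is my own encoding of a deterministic PDA for {0^n 1^n : n >= 1}, accepting by empty stack (only the bottom marker Z remains):

```python
def pda_accepts(w):
    """Push an X for each 0; after the first 1, pop one X per 1.
    Accept if the input is exhausted exactly when all X's are popped."""
    stack, state = ['Z'], 'push'      # Z is the bottom-of-stack marker
    for ch in w:
        if state == 'push' and ch == '0':
            stack.append('X')
        elif stack[-1] == 'X' and ch == '1':
            state = 'pop'             # once popping starts, no more 0's
            stack.pop()
        else:
            return False              # no move possible: reject
    return state == 'pop' and stack == ['Z']

print(pda_accepts('000111'))  # True
print(pda_accepts('0101'))    # False
```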

Pushdown Automata and Grammars: The languages accepted by PDA’s, either by final state or by empty stack, are exactly the context-free languages.

Deterministic Pushdown Automata: A PDA is deterministic if it never has a choice of move for a given state, input symbol (including ε), and stack symbol. Also, it never has a choice between making a move using a true input symbol and a move using ε as the input.

Acceptance by Deterministic Pushdown Automata: The two modes of acceptance - final state and empty stack - are not the same for DPDA’s. Rather, the languages accepted by empty stack are exactly those of the languages accepted by final state that have the prefix property: no string in the language is a prefix of another word in the language.

The Languages Accepted by DPDA’s: All the regular languages are accepted (by final state) by DPDA’s, and there are nonregular languages accepted by DPDA’s. The DPDA languages are context-free languages, and in fact are languages that have unambiguous CFG’s. Thus, the DPDA languages lie strictly between the regular languages and the context-free languages.

Summary of Chapter 7

Eliminating Useless Symbols: A variable can be eliminated from a CFG unless it derives some string of terminals and also appears in at least one string derived from the start symbol. To correctly eliminate such useless symbols, we must first test whether a variable derives a terminal string, and eliminate those that do not, along with all their productions. Only then do we eliminate variables that are not derivable from the start symbol.

Eliminating ε-Productions and Unit Productions: Given a CFG, we can find another CFG that generates the same language, except possibly for the empty string, yet has no ε-productions (those with body ε) or unit productions (those with a single variable as the body).

Chomsky Normal Form: Given a CFG that derives at least one nonempty string, we can find another CFG that generates the same language, except possibly for ε, and is in Chomsky Normal Form: there are no useless symbols, and every production body consists of either two variables or one terminal.

The Pumping Lemma: In any CFL, it is possible to find, in any sufficiently long string of the language, a short substring such that the two ends of that substring can be "pumped" in tandem; i.e., each can be repeated any desired number of times. The strings being pumped are not both ε. This lemma allows us to prove many languages not to be context-free.

Operations That Preserve Context-Free Languages: The CFL’s are closed under substitution, union, concatenation, closure (star), reversal, and inverse homomorphisms. CFL’s are not closed under intersection or complementation, but the intersection of a CFL and a regular language is always a CFL.

Testing Emptiness of a CFL: Given a CFG, there is an algorithm to tell whether it generates any strings at all. A careful implementation allows this test to be conducted in time that is proportional to the size of the grammar itself.

Testing Membership in a CFL: The Cocke-Younger-Kasami algorithm tells whether a given string is in a given context-free language. For a fixed CFL, this test takes time O(n^3), if n is the length of the string being tested.
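The CYK table can be built directly from a grammar in Chomsky Normal Form. The sketch below uses my own encoding of the rules (each body is a single terminal or a pair of variables); the example grammar generates {0^n 1^n : n >= 1}.

```python
def cyk(word, start, rules):
    """Cocke-Younger-Kasami membership test for a CNF grammar.
    table[i][j] holds the variables deriving word[i : i+j+1]."""
    n = len(word)
    if n == 0:
        return False
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                    # length-1 substrings
        table[i][0] = {A for A, bodies in rules.items() if ch in bodies}
    for length in range(2, n + 1):                   # longer substrings
        for i in range(n - length + 1):
            for k in range(1, length):               # split point
                for A, bodies in rules.items():
                    for body in bodies:
                        if (len(body) == 2
                                and body[0] in table[i][k - 1]
                                and body[1] in table[i + k][length - k - 1]):
                            table[i][length - 1].add(A)
    return start in table[0][n - 1]

# CNF grammar for {0^n 1^n : n >= 1}:
#   S -> AB | AC,  C -> SB,  A -> 0,  B -> 1
rules = {
    'S': {('A', 'B'), ('A', 'C')},
    'C': {('S', 'B')},
    'A': {'0'},
    'B': {'1'},
}
print(cyk('0011', 'S', rules))  # True
print(cyk('0101', 'S', rules))  # False
```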

Summary of Chapter 8

The Turing Machine: The TM is an abstract computing machine with the power of both real computers and of other mathematical definitions of what can be computed. The TM consists of a finite-state control and an infinite tape divided into cells. Each cell holds one of a finite number of tape symbols, and one cell is the current position of the tape head. The TM makes moves based on its current state and the tape symbol at the cell scanned by the tape head. In one move, it changes state, overwrites the scanned cell with some tape symbol, and moves the head one cell left or right.

Acceptance by a Turing Machine: The TM starts with its input, a finite-length string of tape symbols, on its tape, and the rest of the tape containing the blank symbol on each cell. The blank is one of the tape symbols, and the input is chosen from a subset of the tape symbols, not including blank, called the input symbols. The TM accepts its input if it ever enters an accepting state.

Recursively Enumerable Languages: The languages accepted by TM’s are called recursively enumerable (RE) languages. Thus, the RE languages are those languages that can be recognized or accepted by any sort of computing device.

Instantaneous Descriptions of a TM: We can describe the current configuration of a TM by a finite-length string that includes all the tape cells from the leftmost to the rightmost nonblank. The state and the position of the head are shown by placing the state within the sequence of tape symbols, just to the left of the cell scanned.
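A basic TM can be simulated with a dictionary as the tape. The transition table below is my own rendering of the standard marking machine for {0^n 1^n : n >= 1}: mark a 0 with X, find and mark the matching 1 with Y, and repeat.

```python
def tm_run(delta, start, accept, tape_input, blank='B', limit=10_000):
    """Simulate a single-tape TM.  The tape is a dict from cell index
    to symbol, so it is unbounded; a missing move means halt-and-reject.
    The step limit guards against nontermination in this sketch."""
    tape = {i: ch for i, ch in enumerate(tape_input)}
    state, head = start, 0
    for _ in range(limit):
        if state == accept:
            return True
        key = (state, tape.get(head, blank))
        if key not in delta:
            return False
        state, tape[head], move = delta[key]
        head += 1 if move == 'R' else -1
    return False

delta = {
    ('q0', '0'): ('q1', 'X', 'R'), ('q0', 'Y'): ('q3', 'Y', 'R'),
    ('q1', '0'): ('q1', '0', 'R'), ('q1', 'Y'): ('q1', 'Y', 'R'),
    ('q1', '1'): ('q2', 'Y', 'L'),
    ('q2', '0'): ('q2', '0', 'L'), ('q2', 'Y'): ('q2', 'Y', 'L'),
    ('q2', 'X'): ('q0', 'X', 'R'),
    ('q3', 'Y'): ('q3', 'Y', 'R'), ('q3', 'B'): ('q4', 'B', 'R'),
}
print(tm_run(delta, 'q0', 'q4', '0011'))  # True
print(tm_run(delta, 'q0', 'q4', '0010'))  # False
```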

Storage in the Finite Control: Sometimes, it helps to design a TM for a particular language if we imagine that the state has two or more components. One component is the control component, and functions as a state normally does. The other components hold data that the TM needs to remember.

Multiple Tracks: It also helps frequently if we think of the tape symbols as vectors with a fixed number of components. We may visualize each component as a separate track of the tape.

Multitape Turing Machines: An extended TM model has some fixed number of tapes greater than one. A move of this TM is based on the state and on the vector of symbols scanned by the head on each of the tapes. In a move, the multitape TM changes state, overwrites symbols on the cells scanned by each of its tape heads, and moves any or all of its tape heads one cell in either direction. Although able to recognize certain languages faster than the conventional one-tape TM, the multitape TM cannot recognize any language that is not RE.

Nondeterministic Turing Machines: The NTM has a finite number of choices of next move (state, new symbol, and head move) for each state and symbol scanned. It accepts an input if any sequence of choices leads to an ID with an accepting state. Although seemingly more powerful than the deterministic TM, the NTM is not able to recognize any language that is not RE.

Semi-infinite-Tape Turing Machines: We can restrict a TM to have a tape that is infinite only to the right, with no cells to the left of the initial head position. Such a TM can accept any RE language.

Multistack Machines: We can restrict the tapes of a multitape TM to behave like a stack. The input is on a separate tape, which is read once from left-to-right, mimicking the input mode for a finite automaton or PDA. A one-stack machine is really a DPDA, while a machine with two stacks can accept any RE language.

Counter Machines: We may further restrict the stacks of a multistack machine to have only one symbol other than a bottom-marker. Thus, each stack functions as a counter, allowing us to store a nonnegative integer, and to test whether the integer stored is 0, but nothing more. A machine with two counters is sufficient to accept any RE language.

Simulating a Turing Machine by a Real Computer: It is possible, in principle, to simulate a TM by a real computer if we accept that there is a potentially infinite supply of a removable storage device such as a disk, to simulate the nonblank portion of the TM tape. Since the physical resources to make disks are not infinite, this argument is questionable. However, since the limits on how much storage exists in the universe are unknown and undoubtedly vast, the assumption of an infinite resource, as in the TM tape, is realistic in practice and generally accepted.

Simulating a Computer by a Turing Machine: A TM can simulate the storage and control of a real computer by using one tape to store all the locations and their contents: registers, main memory, disks, and other storage devices. Thus, we can be confident that something not doable by a TM cannot be done by a real computer.

Summary of Chapter 9

Recursive and Recursively Enumerable Languages: The languages accepted by Turing machines are called recursively enumerable (RE), and the subset of RE languages that are accepted by a TM that always halts are called recursive.

Complements of Recursive and RE Languages: The recursive languages are closed under complementation, and if a language and its complement are both RE, then both languages are actually recursive. Thus, the complement of an RE-but-not-recursive language can never be RE.

Decidability and Undecidability: "Decidable" is a synonym for "recursive," although we tend to refer to languages as "recursive" and problems (which are languages interpreted as a question) as "decidable." If a language is not recursive, then we call the problem expressed by that language "undecidable."

The Language Ld: This language is the set of strings of 0’s and 1’s that, when interpreted as a TM, are not in the language of that TM. The language Ld is a good example of a language that is not RE; i.e., no Turing machine accepts it.

The Universal Language: The language Lu consists of strings that are interpreted as a TM followed by an input for that TM. The string is in Lu if the TM accepts that input. Lu is a good example of a language that is RE but not recursive.

Rice’s Theorem: Any nontrivial property of the languages accepted by Turing machines is undecidable. For instance, the set of codes for Turing machines whose language is empty is undecidable by Rice’s theorem. In fact, this language is not RE, although its complement - the set of codes for TM’s that accept at least one string - is RE but not recursive.

Post’s Correspondence Problem: This question asks, given two lists of the same number of strings, whether we can pick a sequence of corresponding strings from the two lists and form the same string by concatenation. PCP is an important example of an undecidable problem. PCP is a good choice for reducing to other problems and thereby proving them undecidable.

Undecidable Context-Free-Language Problems: By reduction from PCP, we can show a number of questions about CFL’s or their grammars to be undecidable. For instance, it is undecidable whether a CFG is ambiguous, whether one CFL is contained in another, or whether the intersection of two CFL’s is empty.

Summary of Chapter 10

The Classes P and NP: P consists of all those languages or problems accepted by some Turing machine that runs in some polynomial amount of time, as a function of its input length. NP is the class of languages or problems that are accepted by nondeterministic TM’s with a polynomial bound on the time taken along any sequence of nondeterministic choices.

The P = NP Question: It is unknown whether or not P and NP are really the same classes of languages, although we suspect strongly that there are languages in NP that are not in P.

Polynomial-Time Reductions: If we can transform instances of one problem in polynomial time into instances of a second problem that has the same answer - yes or no - then we say the first problem is polynomial time reducible to the second.

NP-Complete Problems: A language is NP-complete if it is in NP, and there is a polynomial-time reduction from each language in NP to the language in question. We believe strongly that none of the NP-complete problems are in P, and the fact that no one has ever found a polynomial-time algorithm for any of the thousands of known NP-complete problems is mutually reinforcing evidence that none are in P.

NP-Complete Satisfiability Problems: Cook’s theorem showed the first NP-complete problem - whether a boolean expression is satisfiable - by reducing all problems in NP to the SAT problem in polynomial time. In addition, the problem remains NP-complete even if the expression is restricted to consist of a product of clauses, each of which consists of only three literals - the problem 3SAT.
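Membership of 3SAT in NP is easy to see in code: a certificate (truth assignment) can be checked in polynomial time, and the brute-force search below simply tries every certificate. The encoding of clauses as triples of signed integers (k for a variable, -k for its negation) is my own choice, borrowed from the common DIMACS convention; the search is exponential in the number of variables, as expected for a sketch.

```python
from itertools import product

def sat3(clauses, nvars):
    """Brute-force satisfiability test for a 3-CNF formula."""
    for assignment in product([False, True], repeat=nvars):
        value = lambda lit: assignment[abs(lit) - 1] ^ (lit < 0)
        # The formula is a product of clauses; each clause is a sum
        # (OR) of three literals.
        if all(any(value(l) for l in clause) for clause in clauses):
            return True
    return False

print(sat3([(1, 2, 3), (-1, -2, -3)], 3))  # True, e.g. x1 = True, x2 = False

# All 2^3 sign patterns over x1, x2, x3 together are unsatisfiable:
unsat = [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
         for signs in product([1, -1], repeat=3)]
print(sat3(unsat, 3))                      # False
```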

Other NP-Complete Problems: There is a vast collection of known NP-complete problems; each is proved NP-complete by a polynomial-time reduction from some previously known NP-complete problem. We have given reductions that show the following problems are NP-complete: independent set, node cover, directed and undirected versions of the Hamilton circuit problem, and the traveling-salesman problem.

Summary of Chapter 11

The Class co-NP: A language is said to be in co-NP if its complement is in NP. All languages in P are surely in co-NP, but it is likely that there are some languages in NP that are not in co-NP, and vice-versa. In particular, the NP-complete problems do not appear to be in co-NP.

The Class PS: A language is said to be in PS (polynomial space) if it is accepted by a deterministic TM for which there is a polynomial p(n) such that on input of length n the TM never uses more than p(n) cells of its tape.

The Class NPS: We can also define acceptance by a nondeterministic TM whose tape-usage is limited by a polynomial function of its input length. The class of these languages is referred to as NPS. However, Savitch’s theorem tells us that PS = NPS. In particular, an NTM with space bound p(n) can be simulated by a DTM using space p^2(n).

Randomized Algorithms and Turing Machines: Many algorithms use randomness productively. On a real computer, a random-number generator is used to simulate "coin-flipping." A randomized Turing machine can achieve the same random behavior if it is given an additional tape on which a sequence of random bits is written.

The Class RP: A language is accepted in random polynomial time if there is a polynomial-time, randomized Turing machine that has at least 50% chance of accepting its input if that input is in the language. If the input is not in the language, then this TM never accepts. Such a TM or algorithm is called "Monte-Carlo."

The Class ZPP: A language is in the class of zero-error, probabilistic polynomial time if it is accepted by a randomized Turing machine that always gives the correct decision regarding membership in the language; this TM must run in expected polynomial time, although the worst case may be greater than any polynomial. Such a TM or algorithm is called "Las Vegas."

Relationships Among Language Classes: The class co-RP is the set of complements of languages in RP. The following containments are known: P is contained in ZPP which is contained in (RP intersection with co-RP). Also, RP is contained in NP and therefore co-RP is contained in co-NP.

The Primes and NP: Both the primes and the complement of the language of primes - the composite numbers - are in NP. These facts make it unlikely that the primes or composite numbers are NP-complete. Since there are important cryptographic schemes based on primes, a proof of NP-completeness would have offered strong evidence of their security.

The Primes and RP: The composite numbers are in RP. The random polynomial algorithm for testing compositeness is in common use to allow the generation of large primes, or at least large numbers that have an arbitrarily small chance of being composite.
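The commonly used random polynomial-time test for compositeness is the Miller-Rabin test, a Monte-Carlo algorithm of the kind described above (the text does not name a specific algorithm; this is a standard choice, sketched with my own parameter names). A "composite" answer is always correct; a "probably prime" answer errs with probability at most 4^-rounds.

```python
import random

def probably_prime(n, rounds=20):
    """Miller-Rabin: returns False only for composites; returns True
    for primes and, with tiny probability, for some composites."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):              # handle small cases directly
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:                   # write n - 1 = d * 2^s, d odd
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)  # a random base: one coin flip
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                # a witnesses that n is composite
    return True

print(probably_prime(2**61 - 1))  # True: a Mersenne prime
print(probably_prime(2**61 + 1))  # False: divisible by 3
```

In practice one generates large primes exactly as the summary says: draw random odd numbers and keep the first that the test declares "probably prime."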