CSE 311 Lecture 20: Regular Expressions

Topics

Structural induction
A brief review of Lecture 19.
Regular expressions
Definition, examples, applications.
Context-free grammars
Syntax, semantics, and examples.

Structural induction

A brief review of Lecture 19.

Structural induction proof template

① Let $P(x)$ be [ definition of $P(x)$ ].
We will show that $P(x)$ is true for every $x\in S$ by structural induction.
② Base cases:
[ Proof of $P(s_0), \ldots, P(s_m)$. ]
③ Inductive hypothesis:
Assume that $P(y_0), \ldots, P(y_k)$ are true for some arbitrary $y_0, \ldots, y_k \in S$.
④ Inductive step:
We want to prove that $P(y)$ is true.
[ Proof of $P(y)$. The proof must invoke the structural inductive hypothesis. ]
⑤ The result follows for all $x\in S$ by structural induction.
 
Recursive definition of $S$
Basis step: $s_0\in S, \ldots, s_m\in S$.
Recursive step:
if $y_0, \ldots, y_k\in S$, then $y\in S$.

If the recursive step of $S$ includes multiple rules for constructing new elements from existing elements, then
assume $P$ for the existing elements in every rule, and
prove $P$ for the new element in every rule.

Structural induction works just like ordinary induction

① Let $P(x)$ be [ definition of $P(x)$ ].
We will show that $P(x)$ is true for every $x\in \N$ by structural induction.
② Base cases:
[ Proof of $P(0)$. ]
③ Inductive hypothesis:
Assume that $P(n)$ is true for some arbitrary $n \in \N$.
 
④ Inductive step:
We want to prove that $P(n+1)$ is true.
[ Proof of $P(n+1)$. The proof must invoke the structural inductive hypothesis. ]
⑤ The result follows for all $x\in \N$ by structural induction.
 
Recursive definition of $\N$
Basis step: $0 \in \N$.
Recursive step:
if $n\in \N$, then $n+1\in \N$.

Ordinary induction is just structural induction applied to the recursively defined set of natural numbers!

Understanding structural induction

$\rule{P(\Node); \forall L, R\in S. (P(L)\wedge P(R))\rightarrow P(\Tree(\Node,L,R))}{\forall x \in S. P(x)}$

How do we get $P(\Tree(\Node,\Node,\Tree(\Node, \Node,\Node)))$ from $P(\Node)$ and $\forall L,R\in S. (P(L)\wedge P(R))\rightarrow P(\Tree(\Node,L,R))$?

Define $S$ by
Basis: $\Node \in S$.
Recursive:
if $L, R\in S$, then
$\Tree(\Node,L,R)\in S$
1. First, we have $\forall L,R\in S. (P(L)\wedge P(R))\rightarrow P(\Tree(\Node,L,R))$.
2. Next, we have $P(\Node)$.
3. Intro $\wedge$ on 2 gives us $P(\Node)\wedge P(\Node)$.
4. Elim $\forall$ on 1 gives us $(P(\Node)\wedge P(\Node))\rightarrow P(\Tree(\Node, \Node, \Node))$.
5. Modus Ponens on 3 and 4 gives us $P(\Tree(\Node, \Node,\Node))$.
6. Intro $\wedge$ on 2 and 5 gives us $P(\Node)\wedge P(\Tree(\Node, \Node,\Node))$.
7. Elim $\forall$ on 1 gives us $(P(\Node)\wedge P(\Tree(\Node, \Node,\Node)))\rightarrow P(\Tree(\Node,\Node,\Tree(\Node, \Node, \Node)))$.
8. Modus Ponens on 6 and 7 gives us $P(\Tree(\Node,\Node,\Tree(\Node, \Node, \Node)))$.

Example: prove $\op{len}(x\bullet y) = \op{len}(x) + \op{len}(y)$ for all $x,y\in\Sigma^* $

① Let $P(y)$ be $\forall x\in\Sigma^* . \op{len}(x\bullet y) = \op{len}(x) + \op{len}(y)$.
We will show that $P(y)$ is true for every $y\in \Sigma^* $ by structural induction.
② Base case ($y=\varepsilon$):
Let $x$ in $\Sigma^* $ be arbitrary. Then, $\op{len}(x\bullet \varepsilon)$ $=$ $\op{len}(x)$ $=$ $\op{len}(x) + \op{len}(\varepsilon)$ since $\op{len}(\varepsilon) = 0$. So $P(\varepsilon)$ is true.
③ Inductive hypothesis:
Assume that $P(w)$ is true for some arbitrary $w \in \Sigma^* $.
④ Inductive step:
We want to prove that $P(wa)$ is true for every $a\in\Sigma$.
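Let $a\in\Sigma$ and $x\in\Sigma^* $ be arbitrary. Then
$\op{len}(x\bullet wa)$ $=$ $\op{len}((x\bullet w)a)$ by the definition of $\bullet$
$=$ $\op{len}(x\bullet w) + 1$ by the definition of $\op{len}$
$=$ $\op{len}(x) + \op{len}(w) + 1$ by the inductive hypothesis
$=$ $\op{len}(x) + \op{len}(wa)$ by the definition of $\op{len}$.
So $\op{len}(x\bullet wa)=\op{len}(x) + \op{len}(wa)$ for all $x\in\Sigma^* $, and $P(wa)$ is true.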
⑤ The result follows for all $y\in \Sigma^* $ by structural induction.
 
Define $\Sigma^* $ by
Basis: $\varepsilon \in \Sigma^* $.
Recursive:
if $w\in\Sigma^* $ and $a\in\Sigma$,
then $wa\in\Sigma^* $
Length
$\op{len}(\varepsilon) = 0$
$\op{len}(wa) = \op{len}(w) + 1$
Concatenation
$x\bullet \varepsilon = x$
$x\bullet (wa) = (x\bullet w)a$
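
The recursive definitions above translate almost line for line into code. The following is a minimal Python sketch, not part of the original slides: the names `EMPTY`, `snoc`, `length`, `concat`, and `from_text` are hypothetical, and a string $wa$ is modeled as the nested pair `(w, a)`.

```python
# Strings over an alphabet, built exactly as in the recursive definition.
# EMPTY is the basis element (epsilon); snoc(w, a) is the string "wa".
EMPTY = ()


def snoc(w, a):
    """Recursive step: append one character a to the string w."""
    return (w, a)


def length(y):
    """len(epsilon) = 0 and len(wa) = len(w) + 1."""
    if y == EMPTY:
        return 0
    w, _a = y
    return length(w) + 1


def concat(x, y):
    """x . epsilon = x and x . (wa) = (x . w)a, recursing on y."""
    if y == EMPTY:
        return x
    w, a = y
    return snoc(concat(x, w), a)


def from_text(text):
    """Helper (not part of the definitions): build a string from Python text."""
    s = EMPTY
    for ch in text:
        s = snoc(s, ch)
    return s
```

Both functions recurse on the second argument $y$, which is why the proof also does induction on $y$ rather than on $x$. Checking `length(concat(x, y)) == length(x) + length(y)` on a few inputs is a sanity check only; the induction proof is what covers every string.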

Example: prove $\Size{t}\leq 2^{\Height{t}+1}-1$ for any rooted binary tree $t$

① Let $P(t)$ be $\Size{t}\leq 2^{\Height{t}+1}-1$.
We will show that $P(t)$ is true for every $t\in S $ by structural induction.
② Base case ($t=\Node$):
$\Size{\Node} = 1 = 2^1 - 1 = 2^{0+1}-1 = 2^{\Height{\Node}+1}-1$ so $P(\Node)$ is true.
③ Inductive hypothesis:
Assume that $P(L)$ and $P(R)$ are true for some arbitrary $L, R \in S$.
④ Inductive step:
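We want to prove that $P(\Tree(\Node,L,R))$ is true.
$\Size{\Tree(\Node,L,R)}$ $=$ $1 + \Size{L} + \Size{R}$ by the definition of size
$\leq$ $1 + (2^{\Height{L}+1}-1) + (2^{\Height{R}+1}-1)$ by the inductive hypothesis
$\leq$ $2\cdot 2^{\max(\Height{L},\Height{R})+1} - 1$ since $\Height{L}$ and $\Height{R}$ are at most $\max(\Height{L},\Height{R})$
$=$ $2^{\Height{\Tree(\Node,L,R)}+1}-1$ by the definition of height.
So $P(\Tree(\Node,L,R))$ is true.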
⑤ The result follows for all $t\in S$ by structural induction.
 
Define $S$ by
Basis: $\Node \in S$.
Recursive:
if $L, R\in S$, then
$\Tree(\Node,L,R)\in S$
Size
$\Size{\Node} = 1$
$\Size{\Tree(\Node,L,R)} = $
$\quad 1 + \Size{L} + \Size{R}$
Height
$\Height{\Node} = 0$
$\Height{\Tree(\Node,L,R)} = $
$\quad 1 + \max(\Height{L}, \Height{R})$
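
As an illustration, the size and height definitions can be written as recursive functions and the bound checked on small trees. This is a hedged Python sketch: the tuple encoding and the names `NODE`, `tree`, `size`, `height`, and `all_trees` are assumptions made for the example, not notation from the slides.

```python
NODE = "Node"  # the basis element of S


def tree(left, right):
    """Recursive step: Tree(Node, L, R) for trees L and R."""
    return ("Tree", left, right)


def size(t):
    """size(Node) = 1 and size(Tree(Node, L, R)) = 1 + size(L) + size(R)."""
    if t == NODE:
        return 1
    _tag, left, right = t
    return 1 + size(left) + size(right)


def height(t):
    """height(Node) = 0 and height(Tree(Node, L, R)) = 1 + max of the heights."""
    if t == NODE:
        return 0
    _tag, left, right = t
    return 1 + max(height(left), height(right))


def all_trees(max_height):
    """Every tree in S whose height is at most max_height."""
    trees = [NODE]
    for _ in range(max_height):
        # A tree of height <= h is Node or Tree(Node, L, R) with L, R of height <= h - 1.
        trees = [NODE] + [tree(left, right) for left in trees for right in trees]
    return trees
```

Evaluating `all(size(t) <= 2 ** (height(t) + 1) - 1 for t in all_trees(3))` exercises the claim on every tree of height at most 3. That is evidence, not a proof; the structural induction argument is what establishes it for all of $S$.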

Regular expressions

Definition, examples, applications.

Sets of strings as languages

A language is a set of strings with specific syntax, e.g.:
Syntactically correct Java/C/C++ programs.
The set $\Sigma^* $ of all strings over the alphabet $\Sigma$.
Palindromes over $\Sigma$.
Binary strings with no 1’s before 0’s.
Regular expressions let us specify regular languages, e.g.:
All binary strings.
The strings $\{0000, 0010, 1000, 1010\}$.
All strings that contain the string “CSE311”.

Regular expressions over $\Sigma $: syntax

Basis step:
$\emptyset, \varepsilon$ are regular expressions.
$a$ is a regular expression for any $a\in\Sigma$.
Recursive step:
If $A$ and $B$ are regular expressions, then so are
$AB$, $A\cup B$, and $A^* $.
Examples: regular expressions over $\Sigma = \{0, 1\}$
Basis: $\emptyset$, $\varepsilon$, $0$, $1$.
Recursive: $01011$, $0^* 1^* $, $(0\cup 1)0(0\cup 1)0$, etc.

Regular expressions over $\Sigma $: semantics

A regular expression over $\Sigma $ represents a set of strings over $\Sigma $.
$\emptyset$ represents the set with no strings.
$\varepsilon$ represents the set $\{\varepsilon\}$.
$a$ represents the set $\{a\}$.
$AB$ represents the concatenation of the sets represented by $A$ and $B$: $\{ a\bullet b \ \vert\ a\in A, b\in B\}$.
$A\cup B$ represents the union of the sets represented by $A$ and $B$: $A\cup B$.
$A^* $ represents the concatenation of the set represented by $A$ with itself zero or more times: $A^* = \{\varepsilon\} \cup A \cup AA \cup AAA \cup AAAA \cup \ldots$

These rules are just a recursive function definition for computing the meaning of a regular expression, with one case per syntax rule.
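
One way to make the recursive semantics concrete is a small evaluator. The sketch below is an assumption-laden illustration, not the course's definition: regular expressions are encoded as tagged tuples (a hypothetical encoding), and because $A^* $ can denote an infinite set, the function `meaning` returns only the strings of length at most `max_len`.

```python
def meaning(r, max_len):
    """Strings of length <= max_len in the set represented by r.

    r is a tagged tuple (hypothetical encoding):
    ("empty",), ("eps",), ("sym", a), ("cat", A, B), ("union", A, B), ("star", A).
    """
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":
        return meaning(r[1], max_len) | meaning(r[2], max_len)
    if kind == "cat":
        left = meaning(r[1], max_len)
        right = meaning(r[2], max_len)
        return {a + b for a in left for b in right if len(a + b) <= max_len}
    if kind == "star":
        base = meaning(r[1], max_len)
        result = {""}          # zero copies of A
        frontier = {""}
        while frontier:        # add one more copy of A until nothing new fits
            frontier = {w + a for w in frontier for a in base
                        if len(w + a) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression: {r!r}")
```

For example, with `bit = ("union", ("sym", "0"), ("sym", "1"))`, evaluating the encoding of $(0\cup 1)0(0\cup 1)0$ with `max_len = 4` yields the four strings 0000, 0010, 1000, and 1010.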

Examples of regular expressions

$001^* $
Binary strings with “00” followed by any number of 1s.
$0^* 1^* $
Binary strings with any number of 0s followed by any number of 1s.
$(0\cup 1)0(0\cup 1)0$
$\{0000, 0010, 1000, 1010\}$
$(0^* 1^* )^* $
All binary strings.
$(0 \cup 1)^* 0110 (0 \cup 1)^* $
Binary strings that contain “0110”.
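
These examples can be tried out with Python's standard `re` module, where union is written `|` instead of $\cup$ and `re.fullmatch` tests whether an entire string is in the language. The helper names `matches`, `binary_strings`, and `language` below are hypothetical, introduced only for this sketch.

```python
import re
from itertools import product


def matches(pattern, text):
    """True when the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None


def binary_strings(max_len):
    """All binary strings of length 0 through max_len."""
    return ["".join(bits)
            for n in range(max_len + 1)
            for bits in product("01", repeat=n)]


def language(pattern, max_len):
    """The strings of length <= max_len that pattern matches."""
    return {w for w in binary_strings(max_len) if matches(pattern, w)}
```

For instance, `language("(0|1)0(0|1)0", 4)` gives the four-element set listed above, and `language("(0*1*)*", 3)` contains every binary string of length at most 3. Practical regex engines add many features beyond the three operations defined here, so this correspondence holds only for patterns that stay within concatenation, `|`, and `*`.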

Regular expressions in practice

Used to define the tokens in a programming language.
Legal variable names, keywords, etc.
Used in grep, a Unix program that searches for patterns in a set of files.
For example, grep "311" *.md searches for the string “311” in all Markdown files in the current directory.
Used in programs to process strings.
These slides are generated with the help of regular expressions :)
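
To illustrate the first use, here is a minimal tokenizer sketch in Python. The token categories and their patterns (`NUMBER`, `NAME`, `OP`, `SKIP`) are assumptions invented for this example and do not describe any particular programming language.

```python
import re

# Hypothetical token definitions: each token class is one regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]
# Combine the classes into one pattern with a named group per class.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("x1 = 3*(y + 42)")` produces a list that starts with `("NAME", "x1")` and `("OP", "=")`. Real lexers follow the same idea but usually compile the expressions into automata for speed.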

Context-free grammars

Syntax, semantics, and examples.

Regular expressions can specify only regular languages

But many languages aren’t regular, including simple ones such as
palindromes, and
strings with an equal number of 0s and 1s.
Many programming language constructs are also not regular, such as
expressions with matched parentheses, and
properly formed arithmetic expressions.

Context-free grammars are a more powerful formalism that lets us specify all of these example languages (i.e., sets of strings)!

Context-free grammars over $\Sigma$: syntax

A context-free grammar (CFG) is a finite set of production rules over:
An alphabet $\Sigma$ of terminal symbols.
A finite set $V$ of nonterminal symbols.
A start symbol from $V$, usually denoted by $\S$ (i.e., $\S\in V$).
A production rule for a nonterminal $\nt{A}\in V$ takes the form
$\nt{A} \to w_1 \OR w_2 \OR \ldots \OR w_k$
where each $w_i\in(V\cup\Sigma)^* $ is a string of nonterminals and terminals.

Only nonterminals can appear on the left-hand side of a production rule.

Context-free grammars over $\Sigma$: semantics

A CFG over $\Sigma $ represents a set of strings over $\Sigma $.

Compute (or generate) a string from this set as follows:

  1. Begin with the start symbol $\S$ as the current string.
  2. If the current string contains a nonterminal $\nt{A}$, apply the rule $\nt{A} \to w_1 \OR \ldots \OR w_k$ to replace $\nt{A}$ in the current string with one of the $w_i$’s.
  3. Repeat step 2 until the current string contains only terminals.

A CFG represents the set of all strings over $\Sigma$ that can be generated in this way.
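
The generation procedure can be sketched as a breadth-first search over current strings. The Python below is an illustration under stated assumptions, not a general-purpose tool: a grammar is a dictionary (hypothetical encoding) from each nonterminal to its list of alternatives, terminals are single characters, only the leftmost nonterminal is rewritten (every derivable string has such a derivation), and current strings longer than `max_len + 1` symbols are discarded so the search stays finite. That cutoff is complete for grammars with at most one nonterminal per alternative, and may miss strings for others.

```python
from collections import deque


def generate(rules, start, max_len):
    """Terminal strings of length <= max_len derivable from start.

    rules maps each nonterminal to a list of alternatives; each alternative
    is a list of symbols. A symbol is a nonterminal exactly when it is a key
    of rules.
    """
    seen = {(start,)}            # Step 1: begin with the start symbol.
    queue = deque(seen)
    strings = set()
    while queue:
        form = queue.popleft()
        # Step 2: locate the leftmost nonterminal, if there is one.
        index = next((i for i, s in enumerate(form) if s in rules), None)
        if index is None:
            # Step 3: only terminals remain, so this is a generated string.
            if len(form) <= max_len:
                strings.add("".join(form))
            continue
        for alternative in rules[form[index]]:
            new_form = form[:index] + tuple(alternative) + form[index + 1:]
            # Assumed cutoff: drop forms that are too long to matter.
            if len(new_form) <= max_len + 1 and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return strings
```

For example, `generate({"S": [["0", "S", "1"], []]}, "S", 4)` yields the empty string, `01`, and `0011`.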

Example context-free grammars

$\S\to 0\S0 \OR 1\S1 \OR 0 \OR 1 \OR \varepsilon$
The set of all binary palindromes.
$\S\to 0\S \OR \S1 \OR \varepsilon$
The set of strings denoted by the regular expression $0^* 1^* $.
$\S\to (\S) \OR \S\S \OR \varepsilon$
The set of all strings of matched parentheses.
$\S\to 0\S1\OR\varepsilon$
The set $\{ 0^n1^n : n\geq 0\}$ of strings with some number of 0s followed by the same number of 1s.
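
For two of these grammars, membership can be checked by a recursive function with one case per production. This Python sketch uses hypothetical function names and relies on the fact that each rule here fixes the first and last characters; it is not a general CFG parsing technique.

```python
def is_binary_palindrome(s):
    """Mirrors S -> 0S0 | 1S1 | 0 | 1 | epsilon."""
    if s in ("", "0", "1"):
        return True
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "01":
        # Matched an outer 0...0 or 1...1; the middle must come from S.
        return is_binary_palindrome(s[1:-1])
    return False


def is_zeros_then_ones(s):
    """Mirrors S -> 0S1 | epsilon."""
    if s == "":
        return True
    if len(s) >= 2 and s[0] == "0" and s[-1] == "1":
        # Matched an outer 0...1; the middle must come from S.
        return is_zeros_then_ones(s[1:-1])
    return False
```

Note that `is_zeros_then_ones("0110")` is false even though `0110` has equally many 0s and 1s: the grammar $\S\to 0\S1\OR\varepsilon$ also requires every 0 to precede every 1.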

Summary

To prove $\forall x\in S. P(x)$ using structural induction:
Show that $P$ holds for the elements in the basis step of $S$.
Assume $P$ for every existing element of $S$ named in the recursive step.
Prove $P$ for every new element of $S$ created in the recursive step.
A regular expression defines a set of strings over an alphabet $\Sigma $.
$\emptyset$, $\varepsilon$, and $a\in\Sigma$ are regular expressions.
If $A$ and $B$ are regular expressions, then so are $(AB), (A\cup B), A^* $.
Many practical applications, from grep to everyday programming.
Context-free grammars (CFGs) are a more expressive formalism for specifying sets of strings over an alphabet $\Sigma $.
A CFG consists of a set of terminal symbols, a set of nonterminal symbols including the distinguished start symbol, and a set of production rules that specify how to rewrite nonterminals in a string.
Used for specifying programming language syntax and for parsing.