CSE 311 Lecture 21: Context-Free Grammars

Topics

Regular expressions
A brief review of Lecture 20.
Context-free grammars
Syntax, semantics, and examples.

Regular expressions

A brief review of Lecture 20.

Sets of strings as languages

A language is a set of strings with specific syntax, e.g.:
Syntactically correct Java/C/C++ programs.
The set $\Sigma^* $ of all strings over the alphabet $\Sigma$.
Palindromes over $\Sigma$.
Binary strings with no 1’s before 0’s.
Regular expressions let us specify regular languages, e.g.:
All binary strings.
The strings $\{0000, 0010, 1000, 1010\}$.
All strings that contain the string “CSE311”.

Regular expressions over $\Sigma $: syntax

Basis step:
$\emptyset, \varepsilon$ are regular expressions.
$a$ is a regular expression for any $a\in\Sigma$.
Recursive step:
If $A$ and $B$ are regular expressions, then so are
$AB$, $A\cup B$, and $A^* $.
Examples: regular expressions over $\Sigma = \{0, 1\}$
Basis: $\emptyset$, $\varepsilon$, $0$, $1$.
Recursive: $01011$, $0^* 1^* $, $(0\cup 1)0(0\cup 1)0$, etc.

Regular expressions over $\Sigma $: semantics

A regular expression over $\Sigma $ represents a set of strings over $\Sigma $.
$\emptyset$ represents the set with no strings.
$\varepsilon$ represents the set $\{\varepsilon\}$.
$a$ represents the set $\{a\}$.
$AB$ represents the concatenation of the sets represented by $A$ and $B$: $\{ a\bullet b \ \vert\ a\in A, b\in B\}$.
$A\cup B$ represents the union of the sets represented by $A$ and $B$: $A\cup B$.
$A^* $ represents the concatenation of the set represented by $A$ with itself zero or more times: $A^* = \{\varepsilon\} \cup A \cup AA \cup AAA \cup AAAA \cup \ldots$

These rules amount to a recursive definition of a function that computes the meaning of a regular expression.

Understanding regex semantics

What is the meaning of $(0 \cup 1)^*$?
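One way to explore this question is to write the semantics as a recursive function. The sketch below is an illustration, not part of the lecture: it assumes a hypothetical nested-tuple encoding of regular expressions and a hypothetical function name `meaning`. Because languages such as $A^* $ can be infinite, it returns only the strings up to a length bound.

```python
from itertools import product

# Regular expressions as nested tuples (a hypothetical encoding):
#   ("empty",)        the expression ∅
#   ("eps",)          the expression ε
#   ("sym", "0")      a single character a ∈ Σ
#   ("cat", A, B)     AB
#   ("union", A, B)   A ∪ B
#   ("star", A)       A*

def meaning(r, max_len):
    """Return the strings of length <= max_len in the language of r."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":
        return meaning(r[1], max_len) | meaning(r[2], max_len)
    if kind == "cat":
        left, right = meaning(r[1], max_len), meaning(r[2], max_len)
        return {a + b for a, b in product(left, right) if len(a + b) <= max_len}
    if kind == "star":
        base = meaning(r[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            # Append one more copy of A; keep only new, short-enough strings.
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression: {r!r}")

bits = ("union", ("sym", "0"), ("sym", "1"))
print(sorted(meaning(("star", bits), 2)))
```

Up to length 2, the result contains the empty string and every binary string of length 1 or 2, which is consistent with reading $(0 \cup 1)^* $ as "all binary strings".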

Regular expressions in practice

Used to define the tokens in a programming language.
Legal variable names, keywords, etc.
Used in grep, a Unix program that searches for patterns in a set of files.
For example, grep "311" *.md searches for the string “311” in all Markdown files in the current directory.
Used in programs to process strings.
These slides are generated with the help of regular expressions :)
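As a small illustration of the last two uses, here is a sketch with Python's standard `re` module. The identifier rule and the helper names `matches` and `grep_like` are assumptions made for this example; real languages and the real grep define their own rules and options.

```python
import re

# The earlier example (0 ∪ 1)0(0 ∪ 1)0, in Python's syntax where | means union.
FOUR_STRINGS = re.compile(r"(0|1)0(0|1)0")

# Hypothetical token rule for variable names: a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def matches(regex, s):
    """True if the whole string s is in the language of regex."""
    return regex.fullmatch(s) is not None

def grep_like(pattern, lines):
    """Keep the lines that contain a match anywhere, roughly as grep does."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

print(matches(FOUR_STRINGS, "1010"))
print(matches(IDENTIFIER, "1count"))
print(grep_like("311", ["CSE311 notes", "CSE 331 notes"]))
```

Note the difference between `fullmatch`, which asks whether the entire string belongs to the language, and `search`, which asks whether some substring does.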

Context-free grammars

Syntax, semantics, and examples.

Regular expressions can specify only regular languages

But many languages aren’t regular, including simple ones such as
palindromes, and
strings with an equal number of 0s and 1s.
Many programming language constructs are also irregular, such as
expressions with matched parentheses, and
properly formed arithmetic expressions.

Context-free grammars are a more powerful formalism that lets us specify all of these example languages (i.e., sets of strings)!

Context-free grammars over $\Sigma$: syntax

A context-free grammar (CFG) is a finite set of production rules over:
An alphabet $\Sigma$ of terminal symbols.
A finite set $V$ of nonterminal symbols.
A start symbol from $V$, usually denoted by $\S$ (i.e., $\S\in V$).
A production rule for a nonterminal $\nt{A}\in V$ takes the form
$\nt{A} \to w_1 \OR w_2 \OR \ldots \OR w_k$
where each $w_i\in(V\cup\Sigma)^* $ is a string of nonterminals and terminals.

Only nonterminals can appear on the left-hand side of a production rule.

Context-free grammars over $\Sigma$: semantics

A CFG over $\Sigma $ represents a set of strings over $\Sigma $.

Compute (or generate) a string from this set as follows:

  1. Begin with the start symbol $\S$ as the current string.
  2. If the current string contains a nonterminal $\nt{A}$, apply the rule $\nt{A} \to w_1 \OR \ldots \OR w_k$ to replace $\nt{A}$ in the current string with one of the $w_i$’s.
  3. Repeat step 2 until the current string contains only terminals.

A CFG represents the set of all strings over $\Sigma$ that can be generated in this way.
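The three steps above can be sketched as a search over current strings. This is a minimal illustration under stated assumptions, not a general-purpose tool: nonterminals are single characters used as dictionary keys, the function name `generate` is hypothetical, and the search stops at a length bound. It always rewrites the leftmost nonterminal, which does not change the set of strings generated. The pruning limit of `max_len + 1` symbols is only guaranteed to be complete for grammars with at most one nonterminal per right-hand side; a rule such as $\S\to\S\S$ would need a larger limit.

```python
from collections import deque

def generate(rules, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start.

    rules maps each nonterminal (a single character) to a list of
    right-hand sides; any character that is not a key is a terminal.
    """
    results = set()
    seen = {start}
    queue = deque([start])          # Step 1: begin with the start symbol.
    while queue:
        current = queue.popleft()
        # Step 2: find the leftmost nonterminal in the current string.
        pos = next((i for i, c in enumerate(current) if c in rules), None)
        if pos is None:
            # Step 3: only terminals remain, so this string is generated.
            results.add(current)
            continue
        for rhs in rules[current[pos]]:
            new = current[:pos] + rhs + current[pos + 1:]
            # Prune: terminals never disappear, so too many is a dead end.
            terminals = sum(1 for c in new if c not in rules)
            if terminals <= max_len and len(new) <= max_len + 1 and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

palindromes = {"S": ["0S0", "1S1", "0", "1", ""]}
print(sorted(generate(palindromes, "S", 3)))
```

Here the empty string `""` stands for $\varepsilon$ on the right-hand side of a rule.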

Example context-free grammars

$\S\to 0\S0 \OR 1\S1 \OR 0 \OR 1 \OR \varepsilon$
The set of all binary palindromes.
$\S\to 0\S \OR \S1 \OR \varepsilon$
The set of strings denoted by the regular expression $0^* 1^* $.
$\S\to (\S) \OR \S\S \OR \varepsilon$
The set of all strings of matched parentheses.
$\S\to 0\S1\OR\varepsilon$
The set $\{ 0^n1^n : n\geq 0\}$ of strings with some number of 0s followed by the same number of 1s.
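Membership in two of these languages can be checked with short functions. The sketch below uses hypothetical function names. The first function mirrors the rule $\S\to 0\S1\OR\varepsilon$ directly; the second uses the standard depth-counter characterization of matched parentheses rather than the grammar rules themselves.

```python
def is_zeros_then_ones(s):
    """Check membership in {0^n 1^n : n >= 0} by following S → 0S1 | ε."""
    if s == "":
        return True                       # S → ε
    return (len(s) >= 2 and s[0] == "0" and s[-1] == "1"
            and is_zeros_then_ones(s[1:-1]))   # S → 0S1

def is_matched(s):
    """Check matched parentheses: depth never goes negative and ends at 0."""
    depth = 0
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                return False              # a ")" with no matching "("
        else:
            return False                  # not a parenthesis at all
    return depth == 0

print(is_zeros_then_ones("0011"), is_zeros_then_ones("0101"))
print(is_matched("(()())"), is_matched("())("))
```

The string 0101 has equally many 0s and 1s but is rejected by the first function, because the rule $\S\to 0\S1$ forces every 0 to come before every 1.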

Another example CFG: simple arithmetic expressions

$\nt{E} \to \nt{E}+\nt{E} \OR \nt{E} * \nt{E} \OR (\nt{E}) \OR x \OR y \OR z \OR 0 \OR 1 \OR 2 \OR 3 \OR 4 \OR 5 \OR 6 \OR 7 \OR 8 \OR 9$

Can this CFG generate $(2 * x) + y$?
$\nt{E}$ $\Rightarrow \nt{E} + \nt{E}$ $\Rightarrow (\nt{E}) + \nt{E}$ $\Rightarrow (\nt{E} * \nt{E}) + \nt{E}$ $\Rightarrow (2 * \nt{E}) + \nt{E}$ $\Rightarrow (2 * x) + \nt{E}$ $\Rightarrow (2 * x) + y$
Can this CFG generate $x + y * z$ in two entirely different ways?
$\nt{E} \Rightarrow \nt{E} + \nt{E} \Rightarrow x + \nt{E} \Rightarrow x + \nt{E} * \nt{E} \Rightarrow x + y * \nt{E}\Rightarrow x + y * z$
$\nt{E} \Rightarrow \nt{E} * \nt{E} \Rightarrow \nt{E} + \nt{E} * \nt{E} \Rightarrow x + \nt{E} * \nt{E} \Rightarrow x + y * \nt{E} \Rightarrow x + y * z$

Both derivations are perfectly valid according to the CFG's rules, but the second one groups the string as $(x + y) * z$, which violates operator precedence for arithmetic! How can we write our grammar to enforce operator precedence?
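To see the ambiguity concretely, we can count how many parse trees the grammar allows for a given string. The sketch below is an illustration only: the function name `count_parse_trees` and the set `ATOMS` are hypothetical, and the input is assumed to be written without spaces, one character per token.

```python
from functools import lru_cache

ATOMS = set("xyz0123456789")

def count_parse_trees(tokens):
    """Count parse trees for tokens under E → E+E | E*E | (E) | atom."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E can generate tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] in ATOMS else 0
        total = 0
        # E → (E)
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)
        # E → E+E and E → E*E: try each operator as the top-level split.
        for k in range(i + 1, j - 1):
            if tokens[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees("x+y*z"))
print(count_parse_trees("(2*x)+y"))
```

The first call reports two trees, one with $+$ at the root and one with $*$ at the root, while the parenthesized string has only one.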

Building precedence in simple arithmetic expressions

We use multiple nonterminals, each with its own production rule, to encode precedence.
$\nt{E}$ generates expressions; it’s the start symbol.
$\nt{T}$ generates terms.
$\nt{F}$ generates factors.
$\nt{I}$ generates identifiers.
$\nt{N}$ generates numbers.
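The production rules themselves are not listed at this point. Restated from the BNF grammar given later in these notes, and consistent with the derivation that follows, they can be written as:

$\nt{E} \to \nt{T} \OR \nt{E} + \nt{T}$
$\nt{T} \to \nt{F} \OR \nt{F} * \nt{T}$
$\nt{F} \to (\nt{E}) \OR \nt{I} \OR \nt{N}$
$\nt{I} \to x \OR y \OR z$
$\nt{N} \to 0 \OR 1 \OR 2 \OR 3 \OR 4 \OR 5 \OR 6 \OR 7 \OR 8 \OR 9$

Because $*$ can only be introduced inside a term and $+$ only at the expression level, every product is fully formed before it takes part in a sum.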
Example: generating $x + y * z$
$\nt{E}$ $\Rightarrow \nt{E} + \nt{T}$ $\Rightarrow \nt{T} + \nt{T}$ $\Rightarrow \nt{F} + \nt{T}$ $\Rightarrow \nt{I} + \nt{T}$ $\Rightarrow x + \nt{T}$ $\Rightarrow x + \nt{F} * \nt{T}$ $\Rightarrow x + \nt{I} * \nt{T}$ $\Rightarrow x + y * \nt{T}$ $\Rightarrow x + y * \nt{F}$ $\Rightarrow x + y * \nt{I}$ $\Rightarrow x + y * z$

Visualizing CFG derivations with parse trees

Suppose that a grammar $G$ generates a string $x$.
The sequence of steps (rule applications) that generates $x$ is called a derivation.
We represent derivations as parse trees.
The root of the tree is the start symbol.
The internal nodes are the nonterminal symbols in the derivation.
The leaves are the terminal symbols in the derivation.
Palindrome grammar
$\S\to 0\S0 \OR 1\S1 \OR 0 \OR 1 \OR \varepsilon$
Derivation of $01110$
$\nt{S} \Rightarrow 0\nt{S}0 \Rightarrow 01\nt{S}10 \Rightarrow 01110$
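The parse tree for this derivation can be built and printed with a short program. This is a sketch under assumptions: trees are nested tuples whose first element is the nonterminal, and the names `parse_palindrome` and `show` are hypothetical.

```python
def parse_palindrome(s):
    """Return a parse tree for s under S → 0S0 | 1S1 | 0 | 1 | ε, or None."""
    if s == "":
        return ("S", "ε")
    if s in ("0", "1"):
        return ("S", s)
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "01":
        inner = parse_palindrome(s[1:-1])
        if inner is not None:
            # S → 0S0 or S → 1S1: terminal, subtree, terminal.
            return ("S", s[0], inner, s[-1])
    return None

def show(tree, depth=0):
    """Render the tree as indented lines, one node per line."""
    if isinstance(tree, str):
        return ["  " * depth + tree]      # a leaf: terminal symbol
    lines = ["  " * depth + tree[0]]      # an internal node: nonterminal
    for child in tree[1:]:
        lines.extend(show(child, depth + 1))
    return lines

print("\n".join(show(parse_palindrome("01110"))))
```

In the printed tree, each level of indentation corresponds to one rule application in the derivation, and reading the leaves from top to bottom spells out 01110.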

In practice, CFGs are often given in Backus-Naur Form

Backus-Naur Form (BNF) is a notation for CFGs developed for specifying the syntax of programming languages.
Production rules use $::=$ instead of $\to$.
Nonterminals are denoted by names enclosed in angle brackets, e.g., <identifier>, <digit>, <expression>, etc.

<expression> ::= <term> | <expression> + <term>
<term>       ::= <factor> | <factor> * <term>
<factor>     ::= (<expression>) | <identifier> | <number>
<identifier> ::= x | y | z
<number>     ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
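A grammar in this form translates fairly directly into a recursive-descent parser, with one function per nonterminal. The sketch below is one possible translation, not the only one: the function name `parse` and the nested-tuple output are assumptions, tokens are single characters, and the output is a simplified tree that records only operators and operands rather than every nonterminal.

```python
def parse(text):
    """Parse text with the BNF grammar above; return a nested-tuple tree."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():
        # <expression> ::= <term> | <expression> + <term>
        # The left-recursive rule is handled with a loop, which gives
        # the same left-nested grouping: (a + b) + c.
        tree = term()
        while peek() == "+":
            advance()
            tree = ("+", tree, term())
        return tree

    def term():
        # <term> ::= <factor> | <factor> * <term>
        left = factor()
        if peek() == "*":
            advance()
            return ("*", left, term())
        return left

    def factor():
        # <factor> ::= (<expression>) | <identifier> | <number>
        tok = peek()
        if tok == "(":
            advance()
            tree = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return tree
        if tok is not None and (tok in "xyz" or tok in "0123456789"):
            advance()
            return tok
        raise SyntaxError(f"unexpected token: {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return tree

print(parse("x + y * z"))
print(parse("(x + y) * z"))
```

For x + y * z, the product y * z appears as a subtree of the sum, so multiplication binds tighter than addition without any extra rule about precedence.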

Summary

A regular expression defines a set of strings over an alphabet $\Sigma $.
$\emptyset$, $\varepsilon$, and $a\in\Sigma$ are regular expressions.
If $A$ and $B$ are regular expressions, then so are $(AB), (A\cup B), A^* $.
Many practical applications, from grep to everyday programming.
Context-free grammars (CFGs) are a more expressive formalism for specifying sets of strings over an alphabet $\Sigma $.
A CFG consists of a set of terminal symbols, a set of nonterminal symbols including the distinguished start symbol, and a set of production rules that specify how to rewrite nonterminals in a string.
Used for specifying programming language syntax and for parsing.