Context-Free Grammars

Emina Torlak and Kevin Zatloukal

- Regular expressions
  - A brief review of Lecture 20.
- Context-free grammars
  - Syntax, semantics, and examples.

A brief review of Lecture 20.

- A *language* is a set of strings with a specific syntax, e.g.:
  - Syntactically correct Java/C/C++ programs.
  - The set $\Sigma^* $ of all strings over the alphabet $\Sigma$.
  - Palindromes over $\Sigma$.
  - Binary strings with no 1’s before 0’s.

- **Regular expressions** let us specify *regular languages*, e.g.:
  - All binary strings.
  - The strings $\{0000, 0010, 1000, 1010\}$.
  - All strings that contain the string “CSE311”.

- Basis step:
  - $\emptyset, \varepsilon$ are regular expressions.
  - $a$ is a regular expression for any $a\in\Sigma$.
- Recursive step:
  - If $A$ and $B$ are regular expressions, then so are $AB$, $A\cup B$, and $A^* $.

- Examples: regular expressions over $\Sigma = \{0, 1\}$
- Basis: $\emptyset$, $\varepsilon$, $0$, $1$.
- Recursive: $01011$, $0^* 1^* $, $(0\cup 1)0(0\cup 1)0$, etc.

- A regular expression over $\Sigma $ represents a set of strings over $\Sigma $.
- $\emptyset$ represents the set with no strings.
- $\varepsilon$ represents the set $\{\varepsilon\}$.
- $a$ represents the set $\{a\}$.
- $AB$ represents the concatenation of the sets represented by $A$ and $B$: $\{ a\bullet b \ \vert\ a\in A, b\in B\}$.
- $A\cup B$ represents the union of the sets represented by $A$ and $B$: $A\cup B$.
- $A^* $ represents the concatenation of the set represented by $A$ with itself zero or more times: $A^* = \{\varepsilon\} \cup A \cup AA \cup AAA \cup AAAA \cup \ldots$

These rules are just a recursive definition of a function that computes
the meaning of a regular expression. Some examples:

- $001^* $
- Binary strings with “00” followed by any number of 1s.
- $0^* 1^* $
- Binary strings with any number of 0s followed by any number of 1s.
- $(0\cup 1)0(0\cup 1)0$
- $\{0000, 0010, 1000, 1010\}$
- $(0^* 1^* )^* $
- All binary strings.
- $(0 \cup 1)^* 0110 (0 \cup 1)^* $
- Binary strings that contain “0110”.
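The recursive semantics above can be sketched directly as a recursive function. The tuple encoding of regular expressions and the name `meaning` are assumptions made for this illustration, not part of the lecture. Because $A^* $ usually denotes an infinite set, the sketch only computes the strings up to a given length.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("empty",)        the regular expression ∅
#   ("eps",)          the regular expression ε
#   ("sym", a)        a single character a
#   ("cat", A, B)     AB
#   ("union", A, B)   A ∪ B
#   ("star", A)       A*

def meaning(regex, max_len):
    """Return the strings of length <= max_len in the set denoted by regex."""
    kind = regex[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "union":
        return meaning(regex[1], max_len) | meaning(regex[2], max_len)
    if kind == "cat":
        left = meaning(regex[1], max_len)
        right = meaning(regex[2], max_len)
        return {a + b for a in left for b in right if len(a + b) <= max_len}
    if kind == "star":
        base = meaning(regex[1], max_len)
        result = {""}        # zero copies of A
        frontier = {""}
        while frontier:      # append one more copy until nothing new fits
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node: {kind!r}")


# (0 ∪ 1)0(0 ∪ 1)0 from the examples above
bit = ("union", ("sym", "0"), ("sym", "1"))
zero = ("sym", "0")
four = ("cat", bit, ("cat", zero, ("cat", bit, zero)))
print(sorted(meaning(four, 4)))
```

The base cases mirror the basis step of the syntax, and each recursive case mirrors one clause of the semantics.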

- Used to define the *tokens* in a programming language.
  - Legal variable names, keywords, etc.
- Used in `grep`, a Unix program that searches for patterns in a set of files.
  - For example, `grep "311" *.md` searches for the string “311” in all Markdown files in the current directory.
- Used in programs to process strings.
  - These slides are generated with the help of regular expressions :)
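As one illustration of using regular expressions in a program, here is a minimal sketch with Python's standard `re` module. In its syntax, `|` plays the role of $\cup$, and concatenation and `*` are written as in the lecture. The helper name `in_language` is an assumption made for this example.

```python
import re

# (0 ∪ 1)* 0110 (0 ∪ 1)*  in Python's regex syntax
contains_0110 = re.compile(r"(0|1)*0110(0|1)*")

# (0 ∪ 1)0(0 ∪ 1)0
four_strings = re.compile(r"(0|1)0(0|1)0")

def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    # fullmatch requires the entire string to match, not just a substring.
    return pattern.fullmatch(s) is not None

print(in_language(contains_0110, "1011001"))
print(in_language(four_strings, "1011"))
```

Using `fullmatch` rather than `search` matters here: membership in a language is a property of the whole string.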

Syntax, semantics, and examples.

- But many languages aren’t regular, including simple ones such as
  - palindromes, and
  - strings with an equal number of 0s and 1s.
- Many programming language constructs are also not regular, such as
  - expressions with matched parentheses, and
  - properly formed arithmetic expressions.

Context-free grammars are a more powerful formalism that lets us specify all of these example languages (i.e., sets of strings)!

- A context-free grammar (CFG) is a finite set of *production rules* over:
  - An alphabet $\Sigma$ of *terminal symbols*.
  - A finite set $V$ of *nonterminal symbols*.
  - A *start symbol* from $V$, usually denoted by $\S$ (i.e., $\S\in V$).

- A production rule for a nonterminal $\nt{A}\in V$ takes the form
- $\nt{A} \to w_1 \OR w_2 \OR \ldots \OR w_k$
- where each $w_i\in(V\cup\Sigma)^* $ is a string of nonterminals and terminals.

Only nonterminals can appear on the left-hand side of a production rule.

A CFG over $\Sigma $ represents a set of strings over $\Sigma $.

Compute (or *generate*) a string from this set as follows:

1. Begin with the start symbol $\S$ as the current string.
2. If the current string contains a nonterminal $\nt{A}$, apply the rule $\nt{A} \to w_1 \OR \ldots \OR w_k$ to replace $\nt{A}$ in the current string with one of the $w_i$’s.
3. Repeat step 2 until the current string contains only terminals.

A CFG represents the set of all strings over $\Sigma$ that can be generated in this way.
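The generation procedure above can be sketched as a breadth-first search over the strings reachable from the start symbol. This is a minimal illustration under stated assumptions: the dictionary encoding of rules, the function name `generate`, and the `max_form_len` cap are all inventions of this sketch. Every symbol is a single character, and $\varepsilon$ is written as the empty string.

```python
from collections import deque

def generate(rules, start, max_len, max_form_len=None):
    """Return the terminal strings of length <= max_len found by the search.

    rules maps each nonterminal (one character) to a list of right-hand
    sides; any character that is not a key of rules is a terminal.
    """
    if max_form_len is None:
        # Heuristic cap on intermediate strings so the search terminates.
        # This is an assumption of the sketch: for grammars with many
        # ε-rules, a too-small cap could miss some strings.
        max_form_len = 2 * max_len + 2
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Step 2: find the leftmost nonterminal in the current string.
        pos = next((i for i, c in enumerate(current) if c in rules), None)
        if pos is None:
            results.add(current)   # only terminals remain
            continue
        for rhs in rules[current[pos]]:
            new = current[:pos] + rhs + current[pos + 1:]
            # Terminals never disappear, so too many terminals is a dead end.
            terminals = sum(1 for c in new if c not in rules)
            if (terminals <= max_len and len(new) <= max_form_len
                    and new not in seen):
                seen.add(new)
                queue.append(new)
    return results


palindromes = {"S": ["0S0", "1S1", "0", "1", ""]}
print(sorted(generate(palindromes, "S", 3)))
```

Always expanding the leftmost nonterminal loses nothing, since the order in which nonterminals are replaced does not change which strings can be generated.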

- $\S\to 0\S0 \OR 1\S1 \OR 0 \OR 1 \OR \varepsilon$
- The set of all binary palindromes.
- $\S\to 0\S \OR \S1 \OR \varepsilon$
- The set of strings denoted by the regular expression $0^* 1^* $.
- $\S\to (\S) \OR \S\S \OR \varepsilon$
- The set of all strings of matched parentheses.
- $\S\to 0\S1\OR\varepsilon$
- The set $\{ 0^n1^n : n\geq 0\}$ of strings with some number of 0s followed by the same number of 1s.
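As a worked example, the grammar for $\{ 0^n1^n : n\geq 0\}$ generates $000111$ by applying $\S\to 0\S1$ three times and then $\S\to\varepsilon$:

$\S \Rightarrow 0\S1 \Rightarrow 00\S11 \Rightarrow 000\S111 \Rightarrow 000111$

Each application adds one 0 on the left and one 1 on the right, which is how the grammar keeps the two counts equal.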

$\nt{E} \to \nt{E}+\nt{E} \OR \nt{E} * \nt{E} \OR (\nt{E}) \OR x \OR y \OR z \OR 0 \OR 1 \OR 2 \OR 3 \OR 4 \OR 5 \OR 6 \OR 7 \OR 8 \OR 9$

- Can this CFG generate $(2 * x) + y$?
- $\nt{E}$ $\Rightarrow \nt{E} + \nt{E}$ $\Rightarrow (\nt{E}) + \nt{E}$ $\Rightarrow (\nt{E} * \nt{E}) + \nt{E}$ $\Rightarrow (2 * \nt{E}) + \nt{E}$ $\Rightarrow (2 * x) + \nt{E}$ $\Rightarrow (2 * x) + y$
- Can this CFG generate $x + y * z$ in two entirely different ways?
- $\nt{E} \Rightarrow \nt{E} + \nt{E} \Rightarrow x + \nt{E} \Rightarrow x + \nt{E} * \nt{E} \Rightarrow x + y * \nt{E}\Rightarrow x + y * z$
- $\nt{E} \Rightarrow \nt{E} * \nt{E} \Rightarrow \nt{E} + \nt{E} * \nt{E} \Rightarrow x + \nt{E} * \nt{E} \Rightarrow x + y * \nt{E} \Rightarrow x + y * z$

The second derivation is perfectly valid according to the CFG rules, but it groups the string as $(x + y) * z$, which violates operator precedence for arithmetic! How can we write our grammar to enforce operator precedence?
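To make the ambiguity concrete, the following sketch counts the distinct parse trees of a parenthesis-free expression under the grammar $\nt{E} \to \nt{E}+\nt{E} \OR \nt{E} * \nt{E} \OR x \OR \ldots$. The function name `count_parse_trees` is an assumption of this example, and the $(\nt{E})$ rule is left out to keep it short.

```python
from functools import lru_cache

OPERATORS = {"+", "*"}
ATOMS = set("xyz0123456789")

def count_parse_trees(text):
    """Count parse trees of a parenthesis-free expression under
    E -> E+E | E*E | atom, where atom is one of x, y, z, 0, ..., 9."""
    tokens = tuple(c for c in text if not c.isspace())

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] in ATOMS else 0
        total = 0
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in OPERATORS:
                # Top-level rule E -> E op E, split at this operator.
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees("x + y * z"))
```

For $x + y * z$ the count is 2, one tree for each grouping; a grammar that enforces precedence should leave exactly one.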

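- We use multiple nonterminals, each with its own production rule, to encode precedence.
  - $\nt{E}$ generates expressions; it’s the start symbol.
  - $\nt{T}$ generates terms.
  - $\nt{F}$ generates factors.
  - $\nt{I}$ generates identifiers.
  - $\nt{N}$ generates numbers.
- The rules, matching the BNF version of this grammar given later:
  - $\nt{E} \to \nt{T} \OR \nt{E} + \nt{T}$
  - $\nt{T} \to \nt{F} \OR \nt{F} * \nt{T}$
  - $\nt{F} \to (\nt{E}) \OR \nt{I} \OR \nt{N}$
  - $\nt{I} \to x \OR y \OR z$
  - $\nt{N} \to 0 \OR 1 \OR 2 \OR 3 \OR 4 \OR 5 \OR 6 \OR 7 \OR 8 \OR 9$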

- Suppose that a grammar $G$ generates a string $x$.
  - The sequence of steps (rule applications) that generates $x$ is called a *derivation*.
- We represent derivations as *parse trees*.
  - The root of the tree is the start symbol.
  - The internal nodes are the nonterminal symbols in the derivation.
  - The leaves are the terminal symbols in the derivation.

- Palindrome grammar
- $\S\to 0\S0 \OR 1\S1 \OR 0 \OR 1 \OR \varepsilon$
- Derivation of $01110$
- $\nt{S} \Rightarrow 0\nt{S}0 \Rightarrow 01\nt{S}10 \Rightarrow 01110$
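The parse tree for this derivation can be written down as a small data structure. The encoding below, with `(symbol, children)` pairs and the helpers `leaf` and `tree_yield`, is an assumption made for illustration. Reading the leaves from left to right recovers the derived string.

```python
def leaf(symbol):
    """A terminal leaf: a node with no children."""
    return (symbol, [])

# Parse tree for 01110 under S -> 0S0 | 1S1 | 0 | 1 | ε.
# Each internal node is the nonterminal S; its children are the symbols
# on the right-hand side of the rule applied at that step.
tree = ("S", [leaf("0"),
              ("S", [leaf("1"),
                     ("S", [leaf("1")]),   # S -> 1
                     leaf("1")]),          # S -> 1S1
              leaf("0")])                  # S -> 0S0

def tree_yield(node):
    """Concatenate the leaves of a parse tree from left to right."""
    symbol, children = node
    if not children:
        return symbol
    return "".join(tree_yield(child) for child in children)

print(tree_yield(tree))
```

In this encoding an application of $\S\to\varepsilon$ would be an `S` node whose only child is `leaf("")`.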

- Backus-Naur Form (BNF) is a notation for CFGs developed for specifying the syntax of programming languages.
  - Production rules use $::=$ instead of $\to$.
  - Nonterminals are denoted by names enclosed in angle brackets, e.g., `<identifier>`, `<digit>`, `<expression>`, etc.

```
<expression> ::= <term> | <expression> + <term>
<term> ::= <factor> | <factor> * <term>
<factor> ::= (<expression>) | <identifier> | <number>
<identifier> ::= x | y | z
<number> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```
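One way to see how this grammar enforces precedence is a small recursive-descent parser with one function per nonterminal. This is a sketch under stated assumptions, not the lecture's method: the name `parse` and the nested-tuple output are choices made here, the output is a simplified tree that keeps only operators and operands, and the left-recursive `<expression>` rule is handled with a loop because a direct recursive call would never terminate.

```python
def parse(text):
    """Parse text with the BNF grammar above; return a nested-tuple tree."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():
        # <expression> ::= <term> | <expression> + <term>
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():
        # <term> ::= <factor> | <factor> * <term>
        node = factor()
        if peek() == "*":
            advance()
            node = ("*", node, term())
        return node

    def factor():
        # <factor> ::= (<expression>) | <identifier> | <number>
        tok = peek()
        if tok == "(":
            advance()
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is not None and tok in "xyz0123456789":
            advance()
            return tok
        raise SyntaxError(f"unexpected token: {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return tree


print(parse("x + y * z"))
```

Because `*` can only be introduced inside a `<term>`, the product $y * z$ is always grouped before the sum, so $x + y * z$ has a single tree.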

- A regular expression defines a set of strings over an alphabet $\Sigma $.
  - $\emptyset$, $\varepsilon$, and $a\in\Sigma$ are regular expressions.
  - If $A$ and $B$ are regular expressions, then so are $(AB), (A\cup B), A^* $.
  - Many practical applications, from `grep` to everyday programming.
- Context-free grammars (CFGs) are a more expressive formalism for specifying sets of strings over an alphabet $\Sigma $.
  - A CFG consists of a set of *terminal symbols*, a set of *nonterminal symbols* including the distinguished *start symbol*, and a set of *production rules* that specify how to rewrite nonterminals in a string.
  - Used for specifying programming language syntax and for parsing.