CSE 401/M501 21sp Homework 3 - LL Grammars & Parsing

Due: Thursday, May 6 by 11 pm. Please use Gradescope (gradescope.com, also linked from the CSE 401/M501 resources web page) to submit your homework online. Gradescope's web site has several instructional videos and help pages to outline the process, and this guide has specific information about scanning and uploading pdf files containing assignments.

We suggest you show your work to help us award partial credit if appropriate, and for TA sanity. You should do this assignment individually.

This assignment has two parts - some written problems involving LL grammars, and a second programming part. These will be submitted separately on Gradescope. The written (part 1) problems should be scanned and uploaded as usual. For the programming problems, drag and drop your files onto the Gradescope interface for part 2 to upload them. The two parts are separate "assignments" in Gradescope because the mechanics for submitting and processing them are different.

Part 1 - Written Problems

  1. Consider the following grammar:

    A ::= s C n g | ε
    B ::= C r | t
    C ::= B i | t

    Does this grammar satisfy the LL(1) condition? Justify your answer. If it does not, change the grammar to make it LL(1) without changing the language that it generates.

  2. Write a grammar that generates the straight-line code language given below, but that is suitable for LL(1) parsing. That is, eliminate the ambiguity, eliminate the left recursion, and (if necessary) left-factor.

    S ::= S ; S
    S ::= id := E
    S ::= print( L )
    E ::= id
    E ::= num
    E ::= E + E
    E ::= ( S , E )
    L ::= E
    L ::= L , E

When you are done with the written problems for this part of the assignment, scan your work and submit it using Gradescope, as with previous written homework assignments.

Part 2 - Programming Problem

Write a Java program to implement a small calculator. The goal of this exercise is to gain experience with top-down LL(1) parsing by writing a program whose input is specified by a small formal grammar, and writing a recursive-descent parser to process that input. The calculator language and its capabilities have been deliberately kept simple and limited to focus on top-down parsing, with only the minimal infrastructure needed to support that.

Overview

The calculator should be packaged as a small set of Java classes (hand-written scanner, parser, and main calculator program that uses them). We are providing you with a class named Calc that contains the main method for the calculator in this Calc.java file (right-click to download this file to use in your code). It should be possible to compile and run the calculator using the following terminal commands on any system that has standard Java 11 or later installed. The user enters lines containing arithmetic expressions or commands and the calculator immediately displays the value of the expression or executes the command. In the example below, user input is underlined; comments beginning with a # are for explanation in this example only and are not part of the calculator input or output:
javac Calc.java        # compile everything needed
java Calc              # start the program
1+1
2.0
x+1
x not defined
let x = 17
x+1
18.0
2  *x+ 8               # i.e., 2*x+8; whitespace is ignored
42.0
let almostpi = 22/7
almostpi
3.142857142857143
y+x*4
y not defined
let y = 2
let x = 3
y+x*4
14.0
(y+x)*4
20.0
list
x = 3.0
almostpi = 3.142857142857143
y = 2.0
quit

Calculator Input Language Specification

The input to the calculator is specified by the following grammar and associated semantic rules. The input is a program, which consists of one or more statements, each of which occupies one line of input. Input lines are processed immediately as soon as they are read by the calculator.

program ::= statement | program statement
statement ::= exp | let id = exp | list | quit
exp ::= term  |  exp + term  |   exp - term
term ::= factor  |  term * factor | term / factor
factor ::= id  |  integer  |   ( exp )

  1. A program consists of one or more statements, each written on a separate input line.
  2. A exp statement causes the given expression to be evaluated and the resulting value printed.
  3. An assignment let id = exp is executed by evaluating the expression and binding the resulting value to the given identifier. Any previous value bound to that identifier is replaced. No output is printed.
  4. If an identifier appears in an expression, it evaluates to the value most recently bound to it. If the identifier is not currently bound to a value, the error message id not defined is printed (with the actual identifier substituted for id) and the value of the expression is not defined. If an expression contains multiple occurrences of undefined identifiers, the error message is printed for each occurrence of each undefined identifier.
  5. The statement list causes all known identifiers to be printed along with the values they are currently bound to using the format id = exp, with each identifier binding printed on a separate line. The order in which the identifiers and their values are printed is not specified.
  6. The statement quit terminates execution of the calculator.
  7. There are two undefined nonterminals in the grammar: id and integer. An identifier, id, consists of one or more letters (a-z and A-Z). Upper- and lower-case letters are distinct, thus aa, AA, Aa, and aA are four different identifiers. An integer is a decimal integer constant consisting of one or more decimal digits (0-9). The implementation may restrict the magnitude of an integer constant so that it can be stored as an ordinary Java int value.
  8. The keywords in the grammar (let, list, and quit) are reserved and may not be used as identifiers.
  9. The arithmetic operations have their usual meaning as defined on double-precision floating-point values in Java. All arithmetic is done using values of type double, even though the grammar is restricted to only allow integers as numeric constants. Note that the grammar implies that all of the binary arithmetic operations are left-associative.
  10. Statements may contain an arbitrary amount of whitespace (tabs and spaces) before, after, and in between terminal symbols. The whitespace is ignored. The end of a line indicates the end of a statement.

Requirements

The goal of this assignment is to gain experience with recursive-descent parsing. To focus on that, the language has been deliberately kept very simple. For example, it includes integer constants only, even though internal arithmetic should be done using Java doubles. Also, identifiers are restricted to being sequences only of letters with no digits, underscores, or other characters, and no error handling is specified other than detecting use of undefined identifiers in expressions.

Your calculator must use recursive-descent parsing as outlined in lecture to parse and evaluate the language defined above. You will find the examples in the lecture slides to be a good starting point, and the textbook has additional discussion and examples. The parser should contain a method for each major non-terminal in the grammar. Each parser method should parse the associated non-terminal, and either evaluate or execute the language construct that it parses. After each expression statement is parsed, the resulting value should be printed using the ordinary Java System.out.println method. You will need to maintain a few additional data structures; in particular you'll want to have a hash table or dictionary that maps known identifiers to their values.

If the grammar is not suitable for a recursive-descent LL(1) parser, modify it as needed so you can write a predictive, top-down parser. However, you are not required to modify the grammar formally to create an equivalent one that satisfies the LL(1) condition if parsing can be done correctly in a straight-forward manner (i.e., this is not a "fix the grammar" question - just make any adjustments needed so you can write a correct recursive-descent parser -- the examples from lecture should be helpful).

Your code only needs to work on syntactically correct input. If possible, you will find it more convenient to test and use the calculator if it can recover from an error in one statement and continue processing with the next one on the next input line, but this is not required.

We have provided you with the main method in Calc.java. Feel free to change it, but it must be able to be compiled and executed by entering the commands javac Calc.java followed by java Calc as shown in the example transcript above. You should also create two files to contain the key classes used by the main program: Scanner.java and Parser.java. Do not add package declarations to any of these files. All of the code should be in the default, anonymous package used by Java when package declarations are not present.

Hints and Suggestions

As implied by the provided Calc.java file, you should create two classes: a simple scanner class, and the parser class. We suggest that your scanner have an appropriate method that the parser can call to get the next input token. The Scanner class should handle the details of reading input lines and delivering tokens to the parser when requested, for example, with a getNextToken() method. The parser itself should have a run() method that the main program can call to initiate parsing and processing of input lines until a quit statement is executed. This separation of input character handling (scanning) from the actual parsing should make the structure of the overall program simpler and easier to build. The Parser class should contain all of the parser methods and any data shared by them.

Keep the scanner simple and take advantage of any useful string handling classes found in the Java standard library. One possibility is to read each input line as a single string, then use a Java library split method to break the input line up into an array of strings, each of which corresponds to a single token. You will still want some sort of getToken or nextToken method in the scanner to deliver individual tokens when requested by the parser, but you don't need to (i.e., should not) create an elaborate DFA or write lengthy code to do this. It's also up to you how you represent tokens. Just remember that the parser will need some sort of ID(string) token to represent an identifier, with the actual identifier contained in a field in the token, and something similar needs to be done for integer constants.

Because individual input lines represent single input statements and there is no punctuation like a semicolon to terminate statements, you may find it useful to create some sort of EOL (end-of-line) token that the scanner can return to the parser at the end of each input line.

Although not required as part of the assignment, we suggest you write a tiny stand-alone program to test your scanner by repeatedly getting tokens from the scanner and printing them. Feel free to include more elaborate automated testing if you'd like, but it might be sufficient for this assignment just to be able to type a few input lines and look at the output produced to verify that the scanner seems to be working properly.

The parser will be simpler if there is some sort of shared "next token" variable accessible to all parser methods, and a method that can be called to advance (update) the "next token" variable by calling the appropriate scanner method as the parser does its work.

A simple way to evaluate expressions is to have each parser method that processes an expression or sub-expression return a double value giving the result of evaluating the expression or sub-expression parsed by that method. An easy way to indicate errors in an expression (if you need to do this) would be to return a Double.NaN value if an error occurs while trying to parse and evaluate part of an expression.

Advice and Perspective

Remember that the goal of this part of the assignment is to gain experience with recursive-descent parsing, not to create a Complex Industrial-Strength Production-Quality Software Artifact™. In particular, keep the scanner and token data structure as simple as possible. This problem is not the same as the compiler project, so those parts of the calculator program should be as simple as possible as long as they get the job done.

The calculator input specification is designed so that a scanner can read one input line from the terminal at a time as a string, then use library functions like split() to chop it up into an array of strings with individual operators, identifiers, and numbers in the array elements, and then deliver those one at a time to the parser as "tokens", skipping any that only contain whitespace. You do not have to do things this way, but some similar approach should do the job. No fancy DFA needed. You will need some sort of simple "token" data structure to be able to, for instance, return an integer as an integer token with an associated value, but for the most part strings from the input can be treated as tokens themselves if you want - operator tokens like "+" or keywords like "list". There's no need to create a long enumerated type with lexical classes for a fancy Token data structure if you don't need it (and you probably don't). Similarly, just put the code in the unnamed, default Java package - no need for a complex package or folder structure.

Bottom line is keep it simple. If you find yourself writing big, complex scanner or token classes, step back and take another look. Keep those as simple as possible as long as they modularize the program and allow the recursive-descent parser to request the input one token at a time and do its job.

Program Submission

When you are done with this part of the assignment, turn in your source program file(s) using Gradescope. Be sure to include your Calc.java file with any changes or additions you make there. File submission for this part of the assignment is separate from the written problems in Part 1.